This is a pre-release of the Open Access web version of Veridical Data Science. A print version of this book will be published by MIT Press in late 2024. This work and associated materials are subject to a Creative Commons CC-BY-NC-ND license.

For Bin Yu, the motivation for this book arose while teaching the PhD-level applied statistics and ML class, “Statistical Models: Theory and Application” (“STAT 215A”) at the University of California, Berkeley, beginning in the late 2000s. In taking up the role of teaching this legendary statistics class, Yu followed in the footsteps of Berkeley’s original applied statistics gurus: David Freedman, Terry Speed, Leo Breiman, and David Brillinger. Each of these individuals had a strong influence on an entire generation of statisticians who trained at Berkeley from the 1980s to the mid-2000s, including Yu herself, who in the late 1990s was inspired and encouraged by Leo Breiman to begin working on ML research. Her applied statistics training with Terry Speed, her experience at Bell Labs in the late 1990s, and her transition to ML-focused research laid the foundation for Yu’s approach to solving real data problems via collaborative interdisciplinary research and, by extension, this book.

Since this book arose in part out of Yu’s class, STAT 215A, which she has taught for 15 years, the graduate student teaching assistants (TAs) who helped teach this class played an indispensable role in shaping the veridical data science framework. So, a big thank-you is owed to the former 215A TAs: Charlotte Wickham, Karl Rohe, Yuval Benjamini, James Long, Jessica Li, Adam Bloniarz, Ryan Giordano, Karl Kumbier, Zoe Vernon, Tiffany Tang, James Duncan, Omer Ronen, Theo Saarinen, and her fellow author Rebecca Barter (who was the 215A TA in 2017). A big thank-you to all the STAT 215A students over the years too, especially those who provided helpful comments during the editing process of this book.

The Yu research group has also been instrumental in the development of the veridical data science framework and software, with special thanks to Karl Kumbier, Chinghway Lim, Siqi Wu, Sumanta Basu, Yuansi Chen, Reza Abbasi-Asl, Merle Behr, Yu (Hue) Wang, Raaz Dwivedi, Yanshuo Tan, Xiao Li, Briton Park, Mian Wei, Chandan Singh, James Duncan, Tiffany Tang, Corrine Elliott, Abhineet Agarwal, Rush Kapoor, and many more.

Much of the inspiration for the veridical data science framework arose from collaborative interdisciplinary projects that span a wide range of domains. So, we would also like to thank the Yu group’s many scientific collaborators, including Jack Gallant, Sue Celniker, Ben Brown, Erwin Frise, Prabhu Shankar, Peng Gong, Amy Braverman, Eugene Clothiaux, Aaron Kornblith, Anobel Odisho, James Priest, Euan Ashley, and Srigokul Upadhyayula (Gokul), among others.

We would like to thank the many people who have taken the time to read portions of this book and provide invaluable feedback (which we hope we have taken to heart while editing this book). Thank you to Yoav Benjamini, John Rice, Andrew Gelman, Valérie Chavez-Demoulin, Tian Zheng, Jasjeet Sekhon, Philip Stark, Anthony Davison, Peter Bickel, Aaron Kornblith, Peter Bühlmann, Nicolai Meinshausen, Terry Speed, David Madigan, Xuming He, David Goldberg, Giles Hooker, Martin Wainwright, David Wagner, David Blei, Rajen Shah, Jane-Ling Wang, Hans Mueller, Mike Waterman, and many, many others, as well as the six anonymous reviewers who gave insightful and constructive comments on an earlier draft of this book.

Finally, we would like to sincerely thank our families and friends for their unwavering support over the many years that this book was in production.

This book was partially funded by generous grant support from the National Science Foundation (NSF), Army Research Office, Office of Naval Research, the NSF Foundations of Data Science Institute (FODSI), the Chan-Zuckerberg Biohub, Weill Neurohub, the Simons Foundation, and the Center for Science of Information (CSoI), an NSF Science and Technology Center.