# Answers to True or False Exercises

This is a pre-release of the Open Access web version of Veridical Data Science. A print version of this book will be published by MIT Press in late 2024. This work and associated materials are subject to a Creative Commons CC-BY-NC-ND license.

## Chapter 1

1. **False**. Data can never perfectly capture reality.
2. **False**. Human beings will always be subject to confirmation bias.
3. **True**. Even if the data is collected using the same mechanism, the underlying real-world trends may have changed.
4. **False**. The results of your analysis correspond to just one possible answer that you could have obtained. Rather than *proof* of your conclusion, your analysis provides *evidence* of your conclusion.
5. **True**. There are many ways that you can evaluate the predictability of your results, including showing that they are in line with domain knowledge and that they reemerge in future data.

## Chapter 2

1. **True**. Every data science problem can be formulated in multiple different ways.
2. **False**. It is also important for the analyst themselves to determine whether their results are predictable and stable, even if they will not be presenting them to an external audience.
3. **True**. The DSLC is non-linear. Often you will learn things about your data in the later stages that will require you to backtrack to your cleaning or preprocessing steps or conduct further exploration.
4. **False**. There is no single correct way that an analysis should be conducted; indeed there are often many (that’s not to say that there are no *incorrect* ways).
5. **False**. While some of the stages are very important in every project, such as the data cleaning and the exploratory data analysis (EDA) stages, there are some stages, such as the prediction stage, that may not be relevant to all projects.
6. **True**. It is very common for results to be affected by the judgment calls made early in the DSLC, including at the data cleaning stage.
7. **True**. Since PCS evaluations involve conducting stress tests of your results in new scenarios (such as using future data), they can help identify when a finding was just a reflection of the coincidental patterns that exist among the specific data points used to generate it, which is another way of saying that they can help prevent data snooping.
8. **False**. PCS uncertainty quantification is an extension of traditional statistical inference that takes into consideration uncertainty arising from judgment calls, rather than just from the data sampling mechanism (as in traditional statistical inference).
9. **False**. Communication with domain experts is important throughout the entire DSLC.

## Chapter 3

1. **True**. Your `dslc_documentation` files can (and should) contain code.
2. **False**. While it is OK to edit a *copy* of your original raw data file if you absolutely must make manual edits (and be sure to document any edits that you make), you should avoid editing the original file itself.
3. **False**. Quite the opposite: you should *regularly* rerun all your code, including the code that cleans your data, to maximize the reproducibility potential of your results.
4. **False**. This is subjective, but we strongly believe that neither language is universally better than the other for data science (although each language may be better than the other for some particular types of projects).
5. **False**. There is a broad spectrum of types of reproducibility, and this definition encompasses just one of them.
6. **False**. The reproducibility crisis is a problem that is affecting a range of scientific fields.
7. **True**. This is an excellent way to increase the chance that your results are reproducible, at least in the context of ensuring that you get the same results every time that your code is run.
8. **False**. It is possible that the other person is not correctly implementing their reproduction or that they simply arrived at a different answer due to different humans making different judgment calls.
9. **False**. This is strong evidence that your results are “trustworthy,” but this doesn’t *prove* that your results are “correct.” Moreover, you may *both* be wrong!

## Chapter 4

1. **False**. Every dataset should at least be examined to determine if it needs to be cleaned.
2. **True**. This minimizes the chance that you will accidentally corrupt the original data file.
3. **True**. Clean data can contain missing values if they are properly formatted as missing (e.g., as `NA` values in R).
4. **True**. Since the preprocessing steps that you will implement depend on the algorithm/analysis that you will conduct, missing values may or may not be a problem. In most (but not all) scenarios, however, you will end up removing or imputing missing values during preprocessing.
5. **False**. Removing the offending rows is one way to handle missing values, but it is rarely the best way since it is likely to introduce bias into your data. Often, it is better to impute missing values or remove unimportant columns with large numbers of missing values.
6. **True**. The set of valid values for a variable is indeed usually determined based on domain knowledge and how the data was collected.
7. **False**. Just because a value is surprising doesn’t mean that it is incorrect or invalid. It does, however, warrant further exploration (e.g., by asking the person who collected the data how such a value might have arisen). If it is clear that the value is invalid or incorrect, then you may want to replace it with a missing value, but this should not be automatic.
8. **False**. There are many reasonable ways to clean every dataset, and everyone will likely clean the same dataset slightly differently.
9. **True**. This is because we don’t want to use information from the validation and test sets when cleaning your training data.
10. **False**. Exploring your data is *part* of cleaning it.
11. **False**. While we often find it helpful to keep the cleaning and preprocessing stages separate (so we can explore a clean version of the data that has not been modified to suit a particular algorithm), it is not necessary to do so if it feels overly complicated.
12. **False**. You can instead add different arguments to your preprocessing function to create different versions of your data for algorithms with different formatting requirements.
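The last answer for this chapter, on adding arguments to a preprocessing function, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `preprocess`, its `impute` and `reference` arguments, and the toy data are all hypothetical, and Python's `None` stands in for R's `NA`.

```python
from statistics import mean, median


def preprocess(values, impute="mean", reference=None):
    """Return a copy of `values` with missing entries (None) handled.

    impute:    "mean", "median", or "drop" (hypothetical option names).
    reference: data used to compute the imputation value. It defaults to
               `values` itself; pass the training data here when you
               preprocess validation or test data.
    """
    if impute == "drop":
        return [v for v in values if v is not None]

    source = values if reference is None else reference
    observed = [v for v in source if v is not None]
    if impute == "mean":
        fill = mean(observed)
    elif impute == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown impute option: {impute!r}")
    return [fill if v is None else v for v in values]


# One cleaned variable, three preprocessed versions of it.
train = [1.0, None, 3.0, 100.0]
mean_version = preprocess(train, impute="mean")
median_version = preprocess(train, impute="median")
dropped_version = preprocess(train, impute="drop")
```

Each argument value corresponds to one judgment call, so the same function can produce every version of the data that a later stability assessment needs.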

## Chapter 5

1. **True**. You should always evaluate the predictability and stability of every explanatory finding, especially if you are going to present the findings. It is also a good idea to evaluate the predictability and stability of any exploratory findings that seem important.
2. **True**. You will tend to make a lot of exploratory findings, but only those that are actually conveying important information that you want to communicate to an external audience need to be turned into explanatory figures.
3. **True**. Exploratory (and explanatory) findings can include numeric summaries and tables.
4. **False**. We recommend using color sparingly, and only if it adds information. Color can be added for purely aesthetic purposes, but its use should be kept as simple as possible.
5. **False**. The correlation describes the *tightness* of the points around a line that can be at any angle.
6. **True**. Correlation captures the strength only of the linear relationship between two variables.
7. **False**. Whether to use the mean or the median is context dependent. The mean will be more influenced by the outliers, which may be better in *some* scenarios, but not others.
8. **False**. The amount of detail that you should present depends on your audience. A research paper may call for more details than a presentation slide or a general public-facing information pamphlet.
9. **True**. Variables with a causal relationship will also be correlated (but not necessarily the other way around).
10. **False**. Correlated variables will not necessarily be causally related. There may be another variable that simultaneously and independently affects both variables.
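The two answers about correlation (that it measures tightness around a line rather than steepness, and that it captures only linear relationships) can be checked numerically. The sketch below computes the Pearson correlation by hand on made-up data; the function name `pearson` and the data values are assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]

# A steep and a shallow line: both are perfectly tight, so both give 1.
r_steep = pearson(x, [2 * v + 1 for v in x])
r_shallow = pearson(x, [0.1 * v for v in x])

# A perfect but non-linear (quadratic) relationship: the correlation is 0.
r_quadratic = pearson(x, [v ** 2 for v in x])
```

The quadratic case is fully determined by `x`, yet its correlation is zero, which is why a low correlation should not be read as "no relationship."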

## Chapter 6

1. **False**. SVD is a general technique used for a range of problems, one of which is principal component analysis.
2. **True**. The fact that every matrix has an SVD is what makes it such a versatile technique.
3. **False**. The principal component variable loading magnitudes are comparable only if the variables have the same *scale* (e.g., they have been SD-scaled).
4. **False**. Even if all the values in the data are positive, principal component analysis can still yield negative variable loadings. This is evident from the examples shown in this chapter.
5. **False**. While the first principal component typically contains the most information, there are many applications where you will want to consider multiple principal components beyond just the first one.
6. **True**. If you clean or preprocess your data differently, you may end up with slightly different principal components.
7. **True**. Principal component analysis can be applied only to datasets with numeric variables. If your data contains non-numeric variables, you can still apply principal component analysis if you convert the non-numeric variables to a numeric format.
8. **True**. Principal component analysis is specifically designed to compute linear summaries.
9. **False**. The reverse is true: the projected data points will exhibit greater variability when projected onto the first principal component than the second principal component.
10. **False**. Principal component analysis does not *require* that the original variables have a Gaussian distribution, but you will often find more *meaningful* components when they do.
11. **True**. This is exactly what a scree plot does.
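Two of the answers above (negative loadings can arise from all-positive data, and loading magnitudes are comparable only for variables on the same scale) can be illustrated with a two-variable example. The sketch below is an assumption-laden simplification: instead of a full SVD it uses the closed-form leading eigenvector of the 2×2 covariance matrix, it requires a non-zero covariance, and the function names and data are hypothetical.

```python
import math
from statistics import stdev


def first_pc_loadings(x, y):
    """Loadings of the first principal component of two variables.

    Uses the leading eigenvector of the 2x2 sample covariance matrix
    [[a, b], [b, c]]. Assumes the covariance b is non-zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    a = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    c = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / (n - 1)
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = b, largest - a  # eigenvector for the largest eigenvalue
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm


def sd_scale(values):
    """Divide each value by the sample standard deviation."""
    sd = stdev(values)
    return [v / sd for v in values]


# All values are positive, but the two variables move in opposite directions.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 8.0, 6.0, 4.0, 2.0]

raw_loadings = first_pc_loadings(x, y)
scaled_loadings = first_pc_loadings(sd_scale(x), sd_scale(y))
```

On the raw data the loadings have opposite signs and `y` dominates only because it has the larger variance; after SD-scaling the two magnitudes are equal. The overall sign of a principal component is arbitrary, so only the relative signs are meaningful.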

## Chapter 7

1. **False**. Clustering can be based on a range of different distance metrics, but the Euclidean distance metric is the most common.
2. **True**. Since the initial cluster positions are random, the K-means clustering results may be different each time the algorithm is run.
3. **False**. The K-means algorithm only approximately minimizes the K-means loss function.
4. **False**. Sometimes the K-means clusters will yield a higher silhouette score, but not always.
5. **True**. One definition of a good cluster is indeed one that is tight and distinct.
6. **True**. Even though the hierarchical clustering algorithm does not involve computing cluster centers, it is possible to compute the cluster centers of hierarchical clusters by computing the average value of each variable across all the observations in the cluster.
7. **False**. The Rand Index does not require that the two sets of clusters being compared contain the same number of clusters apiece, but rather that they are based on the same sets of observations.
8. **True**. Since the Rand Index involves comparing the clustering of each pair of data points, it requires that the two sets of clusters are of the same set of data points.
9. **False**. Since the WSS will generally decrease as the number of clusters \(K\) increases, it is not recommended for use when comparing sets of clusters that contain differing numbers of clusters (i.e., different values of \(K\)).
10. **True**. The silhouette score, unlike the WSS, can be used to compare sets of clusters that contain differing numbers of clusters (i.e., different values of \(K\)).
11. **True**. Cross-validation (CV) is a widely used hyperparameter selection technique for a range of data science algorithms that goes beyond just clustering.
12. **False**. Since it is still based on the training data, CV does not provide a reasonable approximation of the performance of an algorithm on external data (but it is a useful technique for hyperparameter selection).
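The two Rand Index answers follow directly from how the index is computed: it counts the pairs of observations on which two clusterings agree (both place the pair together, or both place it apart). A minimal sketch, with a hypothetical function name and toy cluster labels:

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Two clusters versus three clusters of the same four observations.
two_clusters = [0, 0, 1, 1]
three_clusters = [0, 0, 1, 2]
mixed_k = rand_index(two_clusters, three_clusters)

# The cluster labels themselves are arbitrary: relabeling changes nothing.
relabeled = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The function never looks at the number of clusters, only at pairs of observations, so differing values of \(K\) are fine while differing sets of observations are not.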

## Chapter 8

1. **True**. There are often many ways to define a response variable (e.g., some response variables can be formatted as either a binary variable or a continuous variable).
2. **False**. Labeled data is required for both training *and* evaluation. If you don’t know the true/observed response for the observations that you are using for evaluation, you cannot determine how accurate your predicted responses are.
3. **False**. There are many scenarios where a random split may not be appropriate, such as when there is a time dependence in your data or when there are repeated observations from the same person/country/group.
4. **True**. A predictive relationship does not imply a causal relationship. It may be that your predictor variable tends to have large values whenever your response variable does, but this doesn’t mean that an increase in the predictor variable will *cause* an increase in the response variable.
5. **True**. These are all different terms commonly used (both in this book and outside of it) to mean the same thing.
6. **False**. Cluster labels do not necessarily correspond to real-world groups; rather, they capture groups of similar data points.
7. **False**. While collecting more training data will *often* lead to improved predictive performance, this is not always the case.
8. **False**. Collecting additional predictor variables will improve the predictive performance only if the new variables have information that is actually predictive of the response variable.
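For the answer about random splits, one common alternative when there are repeated observations from the same person or group is to hold out whole groups. The sketch below is one possible way to do that, not a procedure prescribed by the book; the function name `group_split`, its arguments, and the toy group labels are assumptions.

```python
import random


def group_split(groups, validation_fraction=0.25, seed=0):
    """Split row indices so that no group appears in both sets.

    groups: one group label per row (e.g., a person or country identifier).
    Returns (training_indices, validation_indices).
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_groups)
    n_validation = max(1, round(len(unique_groups) * validation_fraction))
    validation_groups = set(unique_groups[:n_validation])

    train_idx = [i for i, g in enumerate(groups) if g not in validation_groups]
    val_idx = [i for i, g in enumerate(groups) if g in validation_groups]
    return train_idx, val_idx


# Two observations per person; every person lands entirely in one set.
person = ["a", "a", "b", "b", "c", "c", "d", "d"]
train_idx, val_idx = group_split(person)
```

For time-dependent data the analogous idea is to split by date, training on earlier observations and validating on later ones.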

## Chapter 9

1. **False**. Since the absolute loss function is not differentiable at its minimum, there is no explicit mathematical formula for computing the linear fit that minimizes it.
2. **True**. Since the squared loss function is differentiable at its minimum, there is a computable formula that identifies the linear fit that minimizes it.
3. **False**. You may use different loss functions for the processes of evaluation and training.
4. **False**. While the LAD algorithm will *usually* produce more accurate predictions on the validation set than the LS algorithm when using the MAE predictive performance measure, this will not always be the case.
5. **True**. Since the rMSE corresponds to the square root of the mean squared error, the rMSE and the MAE are on the same scale.
6. **True**. Since MSE and MAE are measuring different things, one algorithm may yield predictions that are closer to the “true” response in terms of average squared error (MSE) but not in absolute error (MAE).
7. **True**. If you have shown that the algorithm generates accurate predictions of data in a new context, then it is acceptable to use the algorithm in this new context.
8. **True**. Since the algorithm was trained on the patterns and relationships that exist within the training data itself, being able to generate accurate predictions for the training data is no guarantee that it will be able to do so on new data (because the new data might feature slightly different patterns and relationships).
9. **False**. One example of a predictability assessment of a predictive algorithm is to compute the MAE for the predictions computed for the *validation* data (rather than the training data).
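The answer that MSE and MAE can rank two algorithms differently is easy to verify with a small made-up example. In the sketch below the observed responses and both sets of predictions are hypothetical: one set is slightly off everywhere, the other is exact except for a single large error.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mse(observed, predicted):
    """Mean squared error."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error, on the same scale as the response."""
    return math.sqrt(mse(observed, predicted))


observed = [10.0, 10.0, 10.0, 10.0]
small_errors = [11.0, 11.0, 11.0, 11.0]  # off by 1 everywhere
one_big_error = [10.0, 10.0, 10.0, 13.0]  # exact, except one error of 3

mae_small, mse_small = mae(observed, small_errors), mse(observed, small_errors)
mae_big, mse_big = mae(observed, one_big_error), mse(observed, one_big_error)
```

Here `one_big_error` wins on MAE (0.75 versus 1) but loses on MSE (2.25 versus 1), because squaring penalizes the single large error more heavily.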

## Chapter 10

1. **False**. This is true only if either (a) the original variables were on the same scale or have been standardized to a common scale before fitting the LS algorithm, or (b) the coefficients have been standardized (e.g., using the bootstrap) before making the comparison.
2. **True**. The LS algorithm requires that all the predictor variables are numeric (binary variables are a type of numeric variable). Categorical variables thus need to be transformed to a numeric type (e.g., by creating one-hot encoded/dummy variables). Many algorithmic implementations, such as `lm()` in R, will do this conversion under the hood for you.
3. **True**. Although correlation isn’t a particularly good measure of association for binary variables, this doesn’t mean that you *can’t* use it.
4. **False**. Collecting additional predictor variables will improve the predictive performance only if the new variables contain information that is actually predictive of the response variable.
5. **False**. While the LS algorithm only captures linear relationships between the variables that are provided, if you apply a *non-linear transformation* (such as a logarithmic transformation) to either the predictor or response variables, LS can be used to capture non-linear relationships between the predictor and response variables.
6. **True**. While the preprocessing stage may seem fairly removed from the prediction stage of the DSLC, the judgment calls that you make during preprocessing can absolutely affect your predictions (which is why we assess their effect on our results using stability assessments).
7. **True**. This is the very definition of overfitting.
8. **False**. It depends on how you define the “best” regularization hyperparameter. For instance, sometimes it is “better” to choose a slightly larger value of the regularization hyperparameter (e.g., the largest value within 1 SE of the value that achieves the minimum CV error).
9. **False**. More regularization means that the coefficients are shrunk closer to 0 (away from the original LS coefficient values).
10. **False**. Although the LS algorithm involves minimizing an L2 (squared) loss function, this doesn’t necessarily mean that L2 (ridge) regularization makes more sense than L1 (lasso) regularization.
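The answer that more regularization shrinks coefficients toward 0 can be seen in the simplest possible ridge problem. The sketch below assumes a single predictor and no intercept, in which case minimizing \(\sum_i (y_i - b x_i)^2 + \lambda b^2\) has the closed-form solution \(b = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)\); the function name and data are hypothetical.

```python
def ridge_coefficient(x, y, lam):
    """Ridge coefficient for one predictor and no intercept.

    Minimizes sum((y_i - b * x_i)^2) + lam * b^2 over b.
    With lam = 0 this is the ordinary LS coefficient.
    """
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    return sum_xy / (sum_xx + lam)


# Made-up, centered data with a slope of roughly 2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 1.9, 4.1]

lambdas = [0.0, 1.0, 10.0, 100.0]
coefficients = [ridge_coefficient(x, y, lam) for lam in lambdas]
```

As the regularization hyperparameter grows, the coefficient moves steadily from the LS value (about 2.02 here) toward 0.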

## Chapter 11

1. **True**. We demonstrated in this chapter that LS can technically be used for binary prediction problems.
2. **False**. Unlike the logistic regression predictions, the LS predictions for binary responses should *not* be interpreted as class probability predictions.
3. **True**. One way to do this is to use a threshold value to convert a continuous response into a binary response.
4. **True**. The logistic function can indeed be used to transform a linear combination into a predicted positive class probability.
5. **False**. Unfortunately, there is no closed-form solution to the logistic regression optimization problem. Moreover, it cannot be directly solved using LS because we don’t ever observe the true class probabilities.
6. **False**. There are many stability analyses that we could conduct, including investigating whether the predictive performance changes across data and cleaning/preprocessing judgment call perturbations.
7. **False**. Whether a high true positive rate (sensitivity) is more or less important than a high true negative rate (specificity) will depend on the particular problem.
8. **False**. In such a case, the sensitivity of this algorithm evaluated on the validation set will be equal to the proportion of *validation* (not *training*) data points in the positive class.
9. **False**. Even if the AUC for algorithm A is higher, there may still be individual threshold choices for which algorithm B performs better than algorithm A.
10. **True**. This is particularly true for prediction problems with class imbalance.
11. **False**. Gaussianity of the predictive features is *not* a requirement for the logistic regression algorithm; however, transformations that improve the symmetry and/or Gaussianity of a feature’s distribution may lead to better predictive performance.
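Several answers in this chapter rest on the same pipeline: a linear combination is passed through the logistic function to give a probability, a threshold turns the probability into a class, and the choice of threshold trades sensitivity against specificity. The sketch below walks through that pipeline; the linear scores, observed classes, and function names are made up for illustration, and no model is actually fitted.

```python
import math


def logistic(z):
    """Map a linear combination z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(probabilities, threshold=0.5):
    """Predict the positive class (1) when the probability meets the threshold."""
    return [int(p >= threshold) for p in probabilities]


def sensitivity(observed, predicted):
    """True positive rate: correct positives / observed positives."""
    hits = sum(o == 1 and p == 1 for o, p in zip(observed, predicted))
    return hits / sum(o == 1 for o in observed)


def specificity(observed, predicted):
    """True negative rate: correct negatives / observed negatives."""
    hits = sum(o == 0 and p == 0 for o, p in zip(observed, predicted))
    return hits / sum(o == 0 for o in observed)


# Hypothetical linear-combination values and observed classes.
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, -0.5]
observed = [0, 0, 0, 1, 1, 1]
probabilities = [logistic(z) for z in scores]

low = predict_class(probabilities, threshold=0.3)
high = predict_class(probabilities, threshold=0.7)
```

With the lower threshold the sensitivity is 1 and the specificity is 2/3; with the higher threshold the two values swap. Which of these is preferable depends on the problem.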

## Chapter 12

1. **True**. This is one of the biggest benefits of the decision tree.
2. **False**. CART can be used to construct decision trees for predicting binary *and* continuous responses.
3. **True**. It is certainly possible to improve predictive performance by tuning the hyperparameters, although, in our experience, this performance improvement is usually minor.
4. **False**. In this chapter, we saw that the CART algorithm is substantially *less* stable to data perturbations relative to LS.
5. **False**. The Gini impurity is used to train the CART algorithm using the training set, not to evaluate it using the validation set.
6. **False**. Because the CART algorithm does not involve any random sampling of observations or predictive features when computing the splits, CART is not a random algorithm (it will produce the same fit every time it is trained on the same dataset).
7. **True**. The RF algorithm is random because each tree is trained using a random sample of observations, and each split considers a random subset of predictive features.
8. **False**. The RF feature importance measures do not require that you standardize the features.
9. **True**. Since the permutation feature importance measure involves identifying how the predictive performance changes when you permute each feature separately (this idea is not specific to the RF algorithm), it could be used to compute a feature importance score for any predictive algorithm.
10. **True**. While log-transforming predictive features can change the individual split threshold values, it will not affect the prediction output of the CART algorithm.
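The answers about the Gini impurity and about log-transforming features can both be illustrated with a one-split tree. The sketch below is a heavily simplified stand-in for CART, not the full algorithm: it handles one feature, one split, and binary labels, and the function names and data are hypothetical. It chooses the split that minimizes the weighted Gini impurity and compares the fit on a raw and a log-transformed feature.

```python
import math


def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p^2 - (1 - p)^2."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def best_split(x, y):
    """Threshold on x that minimizes the weighted Gini impurity."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        lower, upper = pairs[i - 1][0], pairs[i][0]
        if lower == upper:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = (lower + upper) / 2, impurity
    return best_threshold


def stump_fit_predict(x, y):
    """Fit a one-split tree; return its threshold and training predictions."""
    threshold = best_split(x, y)
    left = [label for value, label in zip(x, y) if value <= threshold]
    right = [label for value, label in zip(x, y) if value > threshold]
    left_class = int(2 * sum(left) >= len(left))  # majority vote
    right_class = int(2 * sum(right) >= len(right))
    predictions = [left_class if v <= threshold else right_class for v in x]
    return threshold, predictions


x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [0, 0, 0, 1, 1, 1]

raw_threshold, raw_predictions = stump_fit_predict(x, y)
log_threshold, log_predictions = stump_fit_predict([math.log(v) for v in x], y)
```

The two thresholds differ, but the observations fall on the same sides of the split, so the predictions are identical. This works because the logarithm preserves the ordering of the feature values, and the split search depends only on that ordering.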

## Chapter 13

1. **False**. This is the traditional approach. In the PCS framework, we instead recommend considering *several* versions of each algorithm trained on different versions of the cleaned/preprocessed data.
2. **False**. An ensemble can be created using a range of different algorithms, so long as they are all designed to predict the same response.
3. **True**. An ensemble *combines* multiple predictions for each data point into a *single* prediction output.
4. **True**. Since both the single prediction and ensemble predictions involve reporting a single response prediction for each data point, they can both be evaluated and compared using the standard performance measures. This is not true of the PPIs, however.
5. **False**. As we have seen in this chapter, it is certainly possible to create an ensemble fit that performs worse than the single predictive fit with the highest validation set performance.
6. **True**. Repeatedly using the test set to decide the final prediction (e.g., by comparing different options) means that the test set is no longer independent of the final predictions, and thus cannot be used to provide an independent assessment of predictive performance.
7. **False**. The *uncalibrated* validation set coverage will rarely be 90 percent (unless we capture *all* possible sources of uncertainty in our perturbations, which is essentially impossible). Calibration is required to achieve a coverage closer to our goal of 90 percent.
8. **False**. Unfortunately, for many projects, the intended users of your final fit will not have the technical skills required to make use of your fit if it can only be accessed with R or Python.
9. **True**. If various alternative options for a particular judgment call all lead to similar predictive performance, then this judgment call clearly doesn’t have much influence on the predictive performance.
10. **True**. Since we don’t want to include fits with poor performance in our ensemble or intervals, it is important to conduct predictability screening.
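The distinction between an ensemble prediction (one number per data point) and an interval (a range per data point, assessed by its coverage) can be made concrete with a toy example. The sketch below is a deliberate simplification and not the book's PPI procedure: the three fits and their predictions are made up, the interval is simply the minimum and maximum prediction across fits, and no calibration is applied.

```python
# Hypothetical validation set predictions from three fits that passed
# predictability screening (one list entry per validation data point).
fits = {
    "fit_a": [10.0, 20.0, 30.0, 40.0],
    "fit_b": [12.0, 19.0, 33.0, 41.0],
    "fit_c": [11.0, 24.0, 27.0, 45.0],
}
observed = [11.5, 25.0, 29.0, 39.0]

# Regroup so that each tuple holds every fit's prediction for one data point.
per_point = list(zip(*fits.values()))

# Ensemble: combine the predictions into a single value (here, the average).
ensemble = [sum(preds) / len(preds) for preds in per_point]

# Simplified interval: the range of predictions across the fits.
intervals = [(min(preds), max(preds)) for preds in per_point]

# Coverage: the proportion of observed responses that fall in their interval.
covered = [lo <= obs <= hi for (lo, hi), obs in zip(intervals, observed)]
coverage = sum(covered) / len(covered)
```

The ensemble can be scored with the usual measures such as the MAE, whereas the intervals are judged by their coverage, which here is only 0.5. Uncalibrated intervals that reflect just a few perturbations will typically fall short of the target coverage in this way.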