<Preview>

The general problem to be discussed: we have one response variable Y and a set of k predictor variables X1, X2, ..., Xk, and we want to determine the best subset of the k predictors and the corresponding best-fitting regression model for describing the relationship between Y and the X's. What exactly we mean by "best" depends in part on our overall goal for modeling.

One goal is to find a model that provides the best prediction of Y, given X1, X2, ..., Xk, for some new observation or for a batch of new observations.

Alongside the question of prediction is the question of validity, that is, of obtaining accurate estimates for one or more regression coefficient parameters in a model and then making inferences about these parameters of interest. The goal here is to quantify the relationship between one or more independent variables of interest and the dependent variable, controlling when necessary for other variables.

<Steps in Selecting the Best Regression Equation: Prediction Goal>

1. Specify the maximum model (defined in Section 16.3) to be considered.

2. Specify a criterion for selecting a model.

3. Specify a strategy for selecting variables.

4. Conduct the specified analysis.

5. Evaluate the reliability of the model chosen.

By following these steps, one can convert the global goal of finding the best predictors of Y into simple, concrete actions. Each step helps to ensure reliability and to reduce the work required.
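The five steps can be sketched in code. This is only an illustration under assumptions that the notes do not make: the data are simulated, the criterion is taken to be adjusted R-squared, the strategy is an all-possible-subsets search, and reliability is checked with a simple split-sample comparison. The helper names (`ols_fit`, `adjusted_r2`, and so on) are hypothetical.

```python
import itertools
import random

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda i: abs(A[i][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(beta, X):
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]

def r_squared(y, yhat):
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1.0 - sse / sst

def adjusted_r2(y, yhat, p):
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, yhat)) * (n - 1) / (n - p - 1)

def columns(X, idx):
    return [[r[j] for j in idx] for r in X]

# Simulated data (an assumption): Y depends on X1 and X3 only.
rng = random.Random(1)
K = 4
X = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(200)]
y = [1 + 2 * r[0] - 3 * r[2] + rng.gauss(0, 1) for r in X]
X_train, y_train = X[:100], y[:100]
X_hold, y_hold = X[100:], y[100:]

# Step 1: the maximum model contains all K predictors.
# Step 2: the criterion is adjusted R-squared (larger is better).
# Step 3: the strategy is to examine every non-empty subset.
# Step 4: conduct the analysis on the training half.
best_score, chosen, beta = None, None, None
for size in range(1, K + 1):
    for idx in itertools.combinations(range(K), size):
        Xs = columns(X_train, idx)
        b = ols_fit(Xs, y_train)
        score = adjusted_r2(y_train, predict(b, Xs), size)
        if best_score is None or score > best_score:
            best_score, chosen, beta = score, idx, b

# Step 5: evaluate reliability on the holdout half.
holdout_r2 = r_squared(y_hold, predict(beta, columns(X_hold, chosen)))
print("chosen predictors (0-based):", chosen)
print("training adjusted R2:", round(best_score, 3))
print("holdout R2:", round(holdout_r2, 3))
```

A holdout R-squared close to the training value suggests the chosen model is reliable; a large drop would be a warning sign. Other criteria or strategies could be swapped in at steps 2 and 3 without changing the overall structure.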

- Recall that overfitting a model (that is, including variables whose regression coefficients are truly zero in the population) will not introduce bias into the estimates of the population regression coefficients, provided the usual regression assumptions are met. We must be careful, however, to ensure that overfitting does not introduce harmful collinearity (ch. 14). Underfitting (i.e., leaving important predictors out of the final model), on the other hand, will introduce bias in the estimated regression coefficients.
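A small simulation can make the contrast between overfitting and underfitting concrete. It is a sketch under assumed values, none of which come from the notes: two predictors, with X2 correlated with X1, a true X1 coefficient of 2, and an X2 coefficient of either 0 (overfitting case) or 1.5 (underfitting case). The function names are hypothetical.

```python
import random

def slopes_two_predictors(x1, x2, y):
    """OLS slopes for y ~ 1 + x1 + x2, from centered sums of squares."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

def slope_one_predictor(x1, y):
    """OLS slope for y ~ 1 + x1."""
    n = len(y)
    m1, my = sum(x1) / n, sum(y) / n
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    return s1y / sum((a - m1) ** 2 for a in x1)

rng = random.Random(0)
BETA1 = 2.0          # true coefficient of X1 (assumed value)
N, REPS = 50, 500    # sample size and number of simulated samples
overfit_est, underfit_est = [], []

for _ in range(REPS):
    x1 = [rng.gauss(0, 1) for _ in range(N)]
    x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]  # X2 correlated with X1

    # Overfitting: X2 has a truly zero coefficient but is fitted anyway.
    y_a = [1 + BETA1 * a + rng.gauss(0, 1) for a in x1]
    overfit_est.append(slopes_two_predictors(x1, x2, y_a)[0])

    # Underfitting: X2 matters (coefficient 1.5) but is left out of the fit.
    y_b = [1 + BETA1 * a + 1.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    underfit_est.append(slope_one_predictor(x1, y_b))

mean_overfit = sum(overfit_est) / REPS
mean_underfit = sum(underfit_est) / REPS
print("true coefficient of X1:", BETA1)
print("average estimate, overfitted model:", round(mean_overfit, 3))
print("average estimate, underfitted model:", round(mean_underfit, 3))
```

Under these assumptions the overfitted estimates should average close to the true value of 2, while the underfitted estimates should average near 2 + 1.5 × 0.5 = 2.75, because the omitted X2 is correlated with X1 and its effect is absorbed into the X1 coefficient.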

 

posted by sergeant