Multiple Regression Practice Problems
Stat 112
1. When, in 1982, average Scholastic Achievement Test (SAT) scores were first published on a state-by-state basis in the United States, the huge variation in the scores was a source of great pride for some states and of consternation for others. Average scores ranged from a low of 790 (out of a possible 1,600) in South Carolina to a high of 1,088 in Iowa. Two researchers set out to figure out how certain variables are associated with state SAT differences.[1] The variable SAT is the average total SAT (verbal+quantitative) score in the state, and the two explanatory variables considered are the following:
TAKERS: percentage of the total eligible students (high school seniors) in the state who took the exam
EXPEND: total state expenditure on secondary schools, expressed in hundreds of dollars per student
Output from a multiple regression analysis is shown below.
Response SAT
Whole Model
[Figure: Actual by Predicted Plot. Vertical axis: SAT Actual; horizontal axis: SAT Predicted. P<.0001, RSq=0.81, RMSE=31.937]
Summary of Fit
RSquare                       0.808786
RSquare Adj                   0.800472
Root Mean Square Error        31.93721
Mean of Response              948.449
Observations (or Sum Wgts)    49
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      2    198456.79        99228.4       97.2841   <.0001
Error      46   46919.33         1020.0
C. Total   48   245376.12
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   932.41448   22.16843    42.06     <.0001
EXPEND      4.2985226   1.025343    4.19      0.0001
TAKERS      -3.07411    0.2206      -13.94    <.0001
[1] B. Powell and L.C. Steelman, “Variations in State SAT Performance: Meaningful or Misleading?,” Harvard Educational Review 54(4), 1984: 389-412.
Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio    Prob > F
EXPEND   1       1    17926.44         17.5752    0.0001
TAKERS   1       1    198071.21        194.1902   <.0001
[Figure: Residual by Predicted Plot. Vertical axis: SAT Residual; horizontal axis: SAT Predicted.]
For questions (a)-(e), assume the ideal multiple linear regression model holds.
(a) For Pennsylvania, SAT=885, TAKERS=50 and EXPEND=27.98. What would you predict Pennsylvania’s average SAT score to be based on knowing its TAKERS and EXPEND, but not knowing its SAT? What is the residual for Pennsylvania?
(b) Is there strong evidence that the multiple regression model provides better predictions of SAT than just using the sample mean of SAT to predict SAT? Use a test at the .05 level to justify your answer.
(c) Find an approximate 95% confidence interval for the coefficient on TAKERS.
(d) Is there strong evidence that total state expenditure (EXPEND) helps to predict a state’s average SAT score once TAKERS has been taken into account? Use a test at the .05 level to justify your answer.
(e) The two states with the largest Cook’s distances are Alaska and South Carolina, with Cook’s distances of 2.06 and 0.18 respectively and leverages of 0.44 and 0.09 respectively. For each state (Alaska, South Carolina), answer whether it would be justified to delete the state from the analysis, report that we omitted the state, and state that our conclusions hold only for a reduced range of the explanatory variables that does not include that state's values.
(f) Suppose we want to use either TAKERS or Log(TAKERS) in the multiple regression. On the basis of the information below, which of these two forms would you choose to use? Explain.
Bivariate Fit of SAT By TAKERS
[Figure: scatterplot of SAT versus TAKERS with two fitted curves, Linear Fit and Transformed Fit to Log.]
Linear Fit
SAT = 1020.3062 - 2.7599621 TAKERS
Summary of Fit
RSquare                       0.735838
RSquare Adj                   0.730335
Root Mean Square Error        36.79525
Mean of Response              947.94
Observations (or Sum Wgts)    50
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1    181024.09        181024        133.7066   <.0001
Error      48   64986.73         1354
C. Total   49   246010.82
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   1020.3062   8.139082    125.36    <.0001
TAKERS      -2.759962   0.238686    -11.56    <.0001
[Figure: Residual Plot for Linear Fit. Vertical axis: Residual; horizontal axis: TAKERS.]
Transformed Fit to Log
SAT = 1112.2477 - 59.018822 Log(TAKERS)
Summary of Fit
RSquare                       0.810762
RSquare Adj                   0.80682
Root Mean Square Error        31.14298
Mean of Response              947.94
Observations (or Sum Wgts)    50
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1    199456.33        199456        205.6494   <.0001
Error      48   46554.49         970
C. Total   49   246010.82
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     1112.2477   12.27496    90.61     <.0001
Log(TAKERS)   -59.01882   4.11554     -14.34    <.0001
[Figure: Residual Plot for Transformed Fit to Log. Vertical axis: Residual; horizontal axis: TAKERS.]
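For Problem 1, parts (a) and (c), the arithmetic can be sketched in a few lines of Python. This is a minimal sketch rather than an official solution: the coefficients and standard error are copied from the Parameter Estimates table of the multiple regression, the variable names are illustrative, and the interval uses the rough "estimate plus or minus 2 standard errors" rule as an assumption in place of an exact t quantile.

```python
# Coefficients from the Parameter Estimates table (SAT on EXPEND and TAKERS).
intercept = 932.41448
b_expend = 4.2985226
b_takers = -3.07411
se_takers = 0.2206

# Pennsylvania's values as given in part (a).
pa_sat, pa_takers, pa_expend = 885, 50, 27.98

# Predicted SAT and residual (actual minus predicted).
pa_predicted = intercept + b_expend * pa_expend + b_takers * pa_takers
pa_residual = pa_sat - pa_predicted

# Approximate 95% confidence interval: estimate +/- 2 standard errors
# (assumption: the multiplier 2 stands in for the t quantile with 46 df).
ci_takers = (b_takers - 2 * se_takers, b_takers + 2 * se_takers)

print(f"predicted = {pa_predicted:.2f}, residual = {pa_residual:.2f}")
print(f"approx. 95% CI for TAKERS: ({ci_takers[0]:.3f}, {ci_takers[1]:.3f})")
```

Under these assumptions the prediction comes out near 899 and the residual near -14, so Pennsylvania scored somewhat below what the fitted model predicts.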
2. The number of car accidents on a particular stretch of highway seems to be related to the number of vehicles that travel over it and the speed at which they are traveling. A city alderman has decided to ask the county sheriff to provide him with statistics covering the last few years, with the intention of examining these data statistically so that he can (if possible) introduce new speed laws that will reduce traffic accidents. Using the number of accidents as the response variable, he obtains estimates of the number of cars passing along a stretch of road (subtracted from the mean number of cars passing along that stretch of road) and their average speeds (in miles per hour, subtracted from the mean average speed) for 60 randomly selected days.
(a) JMP output from simple linear regressions of (i) Accidents on Speed and (ii) Cars on Speed is shown below. Would you expect the estimated coefficient on Speed to increase, decrease or stay the same in a multiple linear regression of Accidents on Speed and Cars, as compared to the estimated coefficient on Speed in the simple linear regression of Accidents on Speed? Justify your answer using the omitted variable bias formula.
Response Accidents
Summary of Fit
RSquare                       0.021001
RSquare Adj                   0.004122
Root Mean Square Error        2.430355
Mean of Response              7.033333
Observations (or Sum Wgts)    60
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -8.018052   13.49733    -0.59     0.5548
Speed       0.2508495   0.224888    1.12      0.2693
Response Cars
Summary of Fit
RSquare                       0.003515
RSquare Adj                   -0.01367
Root Mean Square Error        1.222004
Mean of Response              9.935
Observations (or Sum Wgts)    60
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   13.003931   6.786575    1.92      0.0603
Speed       -0.051147   0.113076    -0.45     0.6527
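The omitted variable bias formula links the two regressions: the Speed coefficient in the simple regression equals the Speed coefficient in the multiple regression plus the Cars coefficient in the multiple regression times the slope from regressing Cars on Speed. The Python sketch below rearranges that identity. The function name and the trial values of the Cars coefficient are hypothetical, since the multiple regression of Accidents on Speed and Cars alone is not shown in the output; only the two Speed slopes are taken from the simple regressions.

```python
# Slopes taken from the two simple regressions in part (a).
speed_slope_simple = 0.2508495   # Accidents on Speed
cars_on_speed_slope = -0.051147  # Cars on Speed


def implied_multiple_speed_coef(simple_coef, cars_coef, aux_slope):
    """Rearranged omitted variable bias formula.

    simple_coef = multiple_speed_coef + cars_coef * aux_slope,
    so multiple_speed_coef = simple_coef - cars_coef * aux_slope.
    """
    return simple_coef - cars_coef * aux_slope


# Hypothetical Cars coefficients (not from the output): the sign is what matters.
for cars_coef in (0.5, 1.0, 2.0):
    implied = implied_multiple_speed_coef(
        speed_slope_simple, cars_coef, cars_on_speed_slope
    )
    print(f"cars_coef = {cars_coef}: implied Speed coefficient = {implied:.4f}")
```

Because the Cars-on-Speed slope is negative, any positive Cars coefficient makes the implied Speed coefficient larger than 0.2508, while a negative Cars coefficient would make it smaller.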
(b) JMP output from a multiple linear regression of Accidents on Cars, Speed and Cars*Speed is shown below. Is there strong evidence of an interaction between Cars and Speed? Justify your answer using a test at the .05 level.
Response Accidents
Summary of Fit
RSquare                       0.743622
RSquare Adj                   0.729887
Root Mean Square Error        1.265725
Mean of Response              7.033333
Observations (or Sum Wgts)    60
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      3    260.21801        86.7393       54.1424   <.0001
Error      56   89.71533         1.6021
C. Total   59   349.93333
Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    7.1405117   0.163638    43.64     <.0001
Cars         0.4158119   0.136049    3.06      0.0034
Speed        0.0644162   0.118519    0.54      0.5889
Cars*Speed   1.0763228   0.087791    12.26     <.0001
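With a Cars*Speed term in the model, the estimated change in accidents per 1 MPH change in Speed depends on the value of Cars. The Python sketch below recomputes the interaction t ratio from the table and evaluates that Speed slope at two illustrative values of Cars. The values -1 and +1 and the function name are hypothetical choices, and the sketch assumes the product term is formed from Cars and Speed exactly as they enter the regression.

```python
# Estimates and standard error from the Parameter Estimates table.
b_speed = 0.0644162
b_interaction = 1.0763228
se_interaction = 0.087791

# t ratio for the interaction term: estimate divided by its standard error.
t_interaction = b_interaction / se_interaction


def speed_slope(cars):
    """Estimated change in accidents per 1 MPH increase in Speed at a given Cars."""
    return b_speed + b_interaction * cars


print(f"t ratio for Cars*Speed = {t_interaction:.2f}")
for cars in (-1.0, 1.0):  # hypothetical Cars values, in the units used in the fit
    print(f"Cars = {cars:+.1f}: Speed slope = {speed_slope(cars):.4f}")
```

Since the interaction estimate is positive, the Speed slope in this sketch grows as Cars increases, which is the quantity to compare across days with lighter and heavier traffic.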
(c) The alderman proposes decreasing the speed limit by 5 MPH. The number of cars on the road is higher on average on weekdays than on weekends. Assuming that the average number of cars will not be changed by decreasing the speed limit and that there are no confounding variables, would you expect the decrease in the speed limit to have a larger impact on the number of accidents on weekends or on weekdays?
3. Car designers have been experimenting with ways to improve gas mileage for many years. An important element in this research is the way in which a car’s speed affects how quickly fuel is burned. Competitions whose objective is to drive the farthest on the smallest amount of gas have determined that low speeds and high speeds are inefficient. Designers would like to know which speed burns gas most efficiently. As an experiment, 50 identical cars are driven at different speeds and the gas mileage is measured.
(a) JMP output from a simple linear regression model of Mileage on Speed is shown below. Comment on the regression diagnostics: the residual plot, the histogram of the residuals and the boxplot of the Cook’s distances. If you see any problems, suggest what you would do next in the analysis to try to address those problems.
Bivariate Fit of Mileage By Speed
[Figure: scatterplot of Mileage versus Speed with the Linear Fit line.]
Linear Fit
Mileage = 23.266776 - 0.0012701 Speed
Summary of Fit
RSquare                       0.000028
RSquare Adj                   -0.02081
Root Mean Square Error        7.102586
Mean of Response              23.202
Observations (or Sum Wgts)    50
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      1    0.0672           0.0672        0.0013    0.9710
Error      48   2421.4426        50.4467
C. Total   49   2421.5098
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   23.266776   2.039431    11.41     <.0001
Speed       -0.00127    0.034802    -0.04     0.9710
[Figure: residual plot for the linear fit. Vertical axis: Residual; horizontal axis: Speed.]
[Figure: Distributions, Residual Mileage (histogram of the residuals).]
[Figure: Distributions, Cook's D Influence Mileage (boxplot of the Cook's distances).]
(b) JMP output for a quadratic regression of mileage on speed and speed squared is shown below. Is there strong evidence that the quadratic regression provides better predictions of mileage based on speed than the simple linear regression? Justify your answer using a test at the .05 level.
Response Mileage
Summary of Fit
RSquare                       0.710249
RSquare Adj                   0.697919
Root Mean Square Error        3.863732
Mean of Response              23.202
Observations (or Sum Wgts)    50
Parameter Estimates
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       9.3413673   1.70707     5.47      <.0001
Speed           0.8021188   0.077207    10.39     <.0001
Speed Squared   -0.007876   0.000734    -10.73    <.0001
Whole Model
[Figure: Actual by Predicted Plot. Vertical axis: Mileage Actual; horizontal axis: Mileage Predicted. P<.0001, RSq=0.71, RMSE=3.8637]
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      2    1719.8740        859.937       57.6040   <.0001
Error      47   701.6358         14.928
C. Total   49   2421.5098
[Figure: Residual by Predicted Plot. Vertical axis: Mileage Residual; horizontal axis: Mileage Predicted.]
[Figure: Speed Leverage Plot. Vertical axis: Mileage Leverage Residuals; horizontal axis: Speed Leverage, P<.0001.]
[Figure: Speed Squared Leverage Plot. Vertical axis: Mileage Leverage Residuals; horizontal axis: Speed Squared Leverage, P<.0001.]
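One way to compare the quadratic and linear fits in part (b) is a partial F statistic built from the two error sums of squares. With a single added term it should agree, up to rounding, with the square of the t ratio on Speed Squared. The Python sketch below uses only numbers from the linear-fit and quadratic-fit Analysis of Variance tables; the variable names are illustrative.

```python
# Error sums of squares and degrees of freedom from the two ANOVA tables.
sse_linear, df_linear = 2421.4426, 48
sse_quadratic, df_quadratic = 701.6358, 47

# Partial F: drop in SSE per added parameter, relative to the quadratic MSE.
extra_params = df_linear - df_quadratic
partial_f = ((sse_linear - sse_quadratic) / extra_params) / (
    sse_quadratic / df_quadratic
)

# t ratio reported for Speed Squared in the quadratic fit.
t_speed_squared = -10.73

print(f"partial F = {partial_f:.2f}")
print(f"t ratio squared = {t_speed_squared ** 2:.2f}")
```

Both quantities come out near 115, which matches the P<.0001 reported for the Speed Squared term.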
(c) Suppose you are low on gas. Which speed does the quadratic regression model suggest that it is best to drive at – 20 MPH, 50 MPH or 70 MPH? Justify your answer.
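For part (c), the fitted quadratic can be evaluated at the three candidate speeds, and its peak sits at -b1/(2*b2). A minimal Python sketch, using the estimates from the quadratic Parameter Estimates table; the function and variable names are illustrative, not part of the JMP output.

```python
# Estimates from the quadratic fit: Intercept, Speed, Speed Squared.
b0, b1, b2 = 9.3413673, 0.8021188, -0.007876


def predicted_mileage(speed):
    """Predicted mileage from the fitted quadratic at a given speed (MPH)."""
    return b0 + b1 * speed + b2 * speed ** 2


# Compare the three candidate speeds from the question.
candidates = (20, 50, 70)
predictions = {s: predicted_mileage(s) for s in candidates}
best_speed = max(predictions, key=predictions.get)

# Speed at which the fitted parabola peaks (b2 is negative, so this is a maximum).
peak_speed = -b1 / (2 * b2)

for s in candidates:
    print(f"{s} MPH: predicted mileage = {predictions[s]:.2f}")
print(f"best candidate = {best_speed} MPH, fitted peak near {peak_speed:.1f} MPH")
```

Of the three candidates, 50 MPH has the highest predicted mileage, and the fitted curve peaks at roughly 51 MPH.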