PH 141 Regression Case Study 2

This exercise uses the crime data from Agresti and Finlay, from the Statistical Abstract of the US for a recent year. There are 51 observations, one for each state and the District of Columbia.

The dataset, crime.dta, is on bCourses.
Here is a brief description of the variables:
. desc

Contains data from C:\PH142B\CRIME.DTA
  obs:            51                          Agresti and Finlay crime data
 vars:             8                          14 Sep 1997 20:55
 size:         1,785 (86.0% of memory free)

  1. state     str3   %9s
  2. violent   float  %9.0g      violent crime per 100,000
  3. murder    float  %9.0g      murders per 100,000
  4. metro     float  %9.0g      percent of pop living in metro
  5. white     float  %9.0g      percent white
  6. hsgrad    float  %9.0g      percent high school grad or mor
  7. poverty   float  %9.0g      percent of families in poverty
  8. snglpar   float  %9.0g      percent of single parent famili

In class, we are using the poverty rate as an outcome variable;
for this lab, use the violent crime rate as the outcome.

Use the examples in the reader as models for the commands.

Be sure to read all the questions, as there are some Stata commands you need to plan on your own.

Fit the following regression models:

regr violent metro white hsgrad poverty snglpar

regr violent metro poverty white snglpar

regr violent metro poverty snglpar

regr violent poverty

regr violent white

regr violent hsgrad

regr violent metro

regr metro poverty

regr metro snglpar

Continue to explore the association between some of the X variables:

regr poverty hsgrad

regr poverty white

regr white metro poverty snglpar hsgrad

regr hsgrad metro poverty snglpar white

Refit the model

regr violent metro poverty snglpar

and use Stata’s predict command to calculate the fitted values (call them viofit) and the standardized residuals (call them viores) so that you can check the model assumptions.
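As a minimal sketch of the step just described, the commands below fit the model and save the two new variables. It assumes crime.dta is in Stata's current working directory; the names viofit and viores are the ones requested above.

```stata
* Assumption: crime.dta is in the current working directory
use crime, clear

* Fit the model with all 51 observations
regress violent metro poverty snglpar

* Fitted values (linear prediction)
predict viofit, xb

* Standardized residuals
predict viores, rstandard
```

Both predict commands use the most recently fitted model, so run them immediately after the regression they belong to.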

Do the assumption checking for question 1 at this point, before you drop any observations!

See question 1 to help plan your commands now!

List the observations with large standardized residuals:

list state metro poverty snglpar violent viofit viores if abs(viores) > 2

And get a summary of the variables in this model to help explore the outliers:

summ violent metro poverty snglpar, detail

To see how much influence these 3 observations have on the conclusions:

drop if state=="DC"

regr violent metro poverty snglpar

drop if abs(viores) > 2

regr violent metro poverty snglpar

Note: Once you have dropped an observation, it’s gone.

You may need to reopen the dataset to do the assumption checking for the model with all 51 observations.
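Here is a hedged sketch of three ways to get back to, or never leave, the full set of 51 observations. It assumes crime.dta is in the current working directory. The if-qualifier version is an alternative to the drop commands above, not something the handout requires.

```stata
* Option 1: reload the original file (discards changes made in memory)
use crime, clear

* Option 2: take a snapshot, drop, refit, then bring the full data back
preserve
drop if state=="DC"
regress violent metro poverty snglpar
restore

* Option 3: leave the data alone and exclude DC only for this fit
regress violent metro poverty snglpar if state != "DC"
```

If you reload the file, variables you created with predict are gone unless you saved the dataset, so rerun the regression and the predict commands before checking assumptions.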

Question 1. Using the results for all the states and DC, discuss the assumptions for the model:

regr violent metro poverty snglpar

Use the residual vs. fitted plot to discuss the functional form and the assumption of constant variance.

Use the box plot to look for outliers and to assess symmetry,
and the qnorm plot and the Shapiro-Wilk test to discuss the normality assumption.
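One possible set of commands for these checks is sketched below. It assumes viofit and viores were already created with predict from the model fitted to all 51 observations.

```stata
* Residual vs. fitted plot, with a reference line at zero
scatter viores viofit, yline(0)

* Box plot of the standardized residuals (outliers, symmetry)
graph box viores

* Normal quantile plot of the standardized residuals
qnorm viores

* Shapiro-Wilk test of normality
swilk viores
```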

No matter what you conclude here, interpret the tests with caution in the following questions.

Questions 2, 3, 4, and 5 all use the models with DC included.

Question 2. For the full model

regr violent metro white hsgrad poverty snglpar

interpret the t test for the variable white, then interpret the t test for the variable poverty.

Question 3. Set up and carry out the restricted vs. full F test to compare these two models:

regr violent metro white hsgrad poverty snglpar

regr violent metro poverty snglpar

Be sure to state the hypotheses, show the calculation of the F statistic from the SS residuals,
give the numerator and denominator degrees of freedom,
use Stata’s Ftail function to find the P value, and state your conclusion in words.
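The sketch below shows one way to do this calculation in Stata using the results that regress stores: e(rss) is the residual sum of squares and e(df_r) is the residual degrees of freedom. The statistic is F = [(SS residual restricted - SS residual full) / (df restricted - df full)] / (SS residual full / df full). With all 51 observations the residual degrees of freedom are 51 - 6 = 45 for the full model and 51 - 4 = 47 for the restricted model, so the test has 2 and 45 degrees of freedom. The scalar names are illustrative, not part of the handout.

```stata
* Restricted model: save its residual SS and residual df
quietly regress violent metro poverty snglpar
scalar rss_restr = e(rss)
scalar dfr_restr = e(df_r)

* Full model: save its residual SS and residual df
quietly regress violent metro white hsgrad poverty snglpar
scalar rss_full = e(rss)
scalar dfr_full = e(df_r)

* Numerator and denominator degrees of freedom
scalar df_num = dfr_restr - dfr_full
scalar df_den = dfr_full

* F statistic from the SS residuals
scalar fstat = ((rss_restr - rss_full)/df_num) / (rss_full/df_den)

display "F = " fstat
display "numerator df = " df_num ", denominator df = " df_den
display "P = " Ftail(df_num, df_den, fstat)

* Cross-check: joint test of the two dropped variables in the full model
test white hsgrad
```

The test command should reproduce the hand-calculated F and P value. You still need to show the calculation from the SS residuals as the question asks.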

Question 4. Compare your conclusions about the association between percent high school graduates and violent crime from the models

regr violent hsgrad

regr violent metro white hsgrad poverty snglpar

(Note: this question is not asking for a test to compare the 2 models!)

Use the regression of hsgrad on the other predictor variables metro white poverty snglpar to explain why these models lead to different conclusions about the association between percent hsgrad and violent crime. (This is an example of collinearity.)
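As a rough illustration, the R-squared from the regression of hsgrad on the other predictors can be turned into a variance inflation factor, 1/(1 - R-squared). The question does not ask for a VIF. It is shown here only as one common way to summarize how much of hsgrad the other X variables already explain.

```stata
* How well do the other X variables predict hsgrad?
quietly regress hsgrad metro white poverty snglpar
display "R-squared for hsgrad on the other X variables = " e(r2)
display "Implied variance inflation factor for hsgrad = " 1/(1 - e(r2))

* The same quantity for every predictor in the full model
quietly regress violent metro white hsgrad poverty snglpar
estat vif
```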

Question 5. Take a look at the models

regr violent metro

regr violent metro poverty snglpar

Metro is significant in both models.

Verify that the differences in the point estimate, standard error, and confidence interval
for metro are relatively large. This is an example of confounding.

For this to happen, snglpar and/or poverty
must be associated both with metro and with violent.

Check that this is the case.
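Besides the simple regressions listed earlier in the lab, a correlation matrix is a quick way to look at all of these pairwise associations at once. This is a supplementary sketch, not a required part of the answer.

```stata
* Pairwise correlations among the outcome and the three predictors
correlate violent metro poverty snglpar

* The same correlations with a P value under each one
pwcorr violent metro poverty snglpar, sig
```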

Question 6. Compare the fitted and observed values for the District of Columbia (DC), Mississippi (MS) and Florida (FL) for the model using all the observations. Do these states have unusual values on the X variables? On the outcome variable?
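A possible way to pull up these three states is sketched below. It assumes viofit and viores exist from the model with all 51 observations, and that the state variable stores the codes exactly as "DC", "MS" and "FL". Check the spelling and case in your copy of the data.

```stata
* Observed value, fitted value, residual, and X values for the three states
list state violent viofit viores metro poverty snglpar if inlist(state, "DC", "MS", "FL")

* Percentiles and extreme values of each variable, for comparison
summarize violent metro poverty snglpar, detail
```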

Question 7. Make a table of the estimated coefficients and standard errors for the model

regr violent metro poverty snglpar

for all 51 observations, for the 50 states with DC dropped,
and for the 48 states with the 2 outliers and DC dropped.

Also make a table of the R-squared values and the root MSEs for the 3 models and compare them.

Which coefficients are sensitive to the points that did not fit well, and which are not?

(That is, which variables have coefficient estimates that are similar for all 3 sets of states,
and which variables have coefficient estimates that are different?)

What changes do you see in the standard errors?

(Notice that with DC omitted, the SS total is much smaller, which is why the R-squared value is actually smaller for the model with DC dropped.)
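One way to line up the three fits side by side is sketched below, using if qualifiers instead of drop so the data stay intact. It assumes crime.dta is in the current working directory. The stored-estimate names all51, nodc and no3 are illustrative. The standardized residuals come from the 51-observation model, as in the drop commands earlier in the lab.

```stata
use crime, clear

* All 51 observations
quietly regress violent metro poverty snglpar
predict viores, rstandard
estimates store all51

* DC excluded
quietly regress violent metro poverty snglpar if state != "DC"
estimates store nodc

* DC and the observations with large standardized residuals excluded
quietly regress violent metro poverty snglpar if state != "DC" & abs(viores) <= 2
estimates store no3

* Coefficients, standard errors, N, R-squared, and root MSE in one table
estimates table all51 nodc no3, b(%9.3f) se(%9.3f) stats(N r2 rmse)
```

The N row lets you confirm that each column is based on the intended number of observations before you compare the coefficients.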