Homework #5 STAT 505 Fall 2011
Due date: October 7, 2011
Write this up as a report.
Include all computer code as an appendix.
- Download the cancer rate data from the
Guardian Data blog web site. Your goal is to find some predictors
which will explain why cancer rates differ from country to country.
This spreadsheet gives 3 different cancer rates. I suggest that we
use the overall rate as our response, but if you strongly prefer to
look at a gender-specific one, that is OK.
- You can get lots of data on countries from the same site; for
instance, they list data related to
Social Issues/Migration/Health.
Or you might try the
CIA World Factbook site
or the UN data site.
Gapminder has lots of
data on countries, but you'll have to pull off
the latest year, as everything is a time series.
- Obtain at least 5 variables which you guess are related to
cancer rates.
- Explain why you picked each and, before doing any exploration
or analysis, write down the expected sign of the coefficient for
each predictor.
- Plot each predictor against the response.
- Build a multiple regression model using all of them. Use the
suggestions in section 4.6 to evaluate the predictors. Which do
you keep? How well does your model work?
- Are there any problems with the usual assumptions?
- Does your model provide any understanding of where or how one
should live to reduce cancer risk?
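The plotting, fitting, and assumption-checking steps above can be sketched roughly as follows. This is only a minimal illustration on simulated data: the data frame dat, the predictor names x1 and x2, and all the numbers are made up for the example and have nothing to do with the real cancer data. It does not implement the suggestions from section 4.6; it only shows the basic calls.

```r
# Simulated stand-in data (hypothetical; replace with your merged dataset)
set.seed(1)
n <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$rate <- 200 + 15 * dat$x1 + rnorm(n, sd = 10)

# Scatterplot of each predictor against the response
predictors <- c("x1", "x2")
op <- par(mfrow = c(1, length(predictors)))
for (p in predictors) {
  plot(dat[[p]], dat$rate, xlab = p, ylab = "rate")
}
par(op)

# Multiple regression using all predictors
fit <- lm(rate ~ x1 + x2, data = dat)
summary(fit)   # coefficient estimates, standard errors, R-squared

# Standard diagnostic plots for checking the usual assumptions
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```

With your own data, the formula would list your five or more predictors in place of x1 and x2.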
Note: to combine datasets, use merge in R. By default it will look
for columns in the two datasets with the same name, so you need to
make sure the names match (case sensitive). If country is the only
common column, then it will match on country. You can
tell it how to handle situations where a country in set A has no match
in B and vice versa, using the all, all.x, and all.y arguments
(see the help and examples on merge).
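As a small sketch of the merge behavior described in the note, the example below uses two toy data frames. The data frame names (cancer, other), the column names, and every value are invented for illustration; your real files will look different.

```r
# Toy data (hypothetical values, for illustration only)
cancer <- data.frame(country = c("Australia", "Brazil", "Chad"),
                     rate    = c(300, 200, 100))
other  <- data.frame(Country = c("Australia", "Brazil", "Denmark"),
                     gdp     = c(40, 10, 45))

# Names are case sensitive: "Country" would not match "country",
# so rename the column before merging
names(other)[names(other) == "Country"] <- "country"

# Default: match on the shared column name, keep only countries in both sets
both <- merge(cancer, other)

# Keep every country from either set; unmatched cells are filled with NA
either <- merge(cancer, other, all = TRUE)

# Keep every country in the first set only (all.y = TRUE does the reverse)
left <- merge(cancer, other, by = "country", all.x = TRUE)

print(both)
print(either)
print(left)
```

Checking nrow() before and after a merge is a quick way to see how many countries were dropped because their names did not match exactly.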
Author: Jim Robison-Cox