Homework #5 STAT 505 Fall 2011

Due date: October 7, 2011
Write this up as a report.
Include all computer code as an appendix.
  1. Download the cancer rate data from the Guardian Datablog website. Your goal is to find some predictors that explain why cancer rates differ from country to country. This spreadsheet gives three different cancer rates. I suggest that we use the overall rate as our response, but if you strongly prefer to look at a gender-specific one, that is OK.
  2. You can get lots of data on countries from the same site; for instance, here they list data related to Social Issues/Migration/Health.
    Or you might try the CIA World Factbook site.
    Or the UN data site.

      Gapminder has lots of data on countries, but you'll have to pull off the latest year, as everything is a time series.
    1. Obtain at least 5 variables which you guess are related to cancer rates.
    2. Explain why you picked each and, before doing any exploration or analysis, write down the expected sign of the coefficient for each predictor.
  3. Plot each predictor against the response.
  4. Build a multiple regression model using all of them. Use the suggestions in section 4.6 to evaluate the predictors. Which do you keep? How well does your model work?
  5. Are there any problems with the usual assumptions?
  6. Does your model provide any understanding of where or how one should live to reduce cancer risk?
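As a rough illustration of the mechanics of steps 3 through 5, here is a minimal sketch in R. It runs on simulated data, not the real cancer data: the data frame dat, the predictor names smoking, income, and lifeexp, and the coefficients used to simulate rate are all made up for illustration. It uses three predictors to stay short, whereas the assignment asks for at least five.

```r
# Simulated stand-in for a merged country-level data set.
# All variable names and numbers here are hypothetical.
set.seed(505)
n <- 40
dat <- data.frame(
  smoking = runif(n, 10, 40),
  income  = runif(n, 1, 50),
  lifeexp = runif(n, 50, 82)
)
dat$rate <- 100 + 2 * dat$smoking + 1.5 * dat$lifeexp + rnorm(n, sd = 20)

predictors <- setdiff(names(dat), "rate")

# Step 3: one scatterplot of the response against each predictor.
op <- par(mfrow = c(1, length(predictors)))
for (p in predictors) {
  plot(dat[[p]], dat$rate, xlab = p, ylab = "cancer rate")
}
par(op)

# Step 4: multiple regression on all predictors at once.
fit <- lm(rate ~ ., data = dat)
summary(fit)

# Step 5: the standard lm diagnostic plots (residuals vs. fitted,
# normal Q-Q, scale-location, residuals vs. leverage).
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```

With real data you would replace dat with your merged data frame and compare the estimated signs in summary(fit) with the signs you wrote down beforehand.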
Note: to combine datasets, use merge in R. By default it will look for columns in the two datasets with the same name, so you need to make sure the names match (matching is case sensitive). If country is the only common column, then it will match on country. You can tell it how to handle situations where a country in set A has no match in B and vice versa, using the all, all.x, and all.y arguments (see the help and examples on merge).
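The note on merge can be sketched with toy data as follows. The country names, column names, and values are invented for illustration, and the long country/year layout of gap is an assumption; the files you download may be arranged differently. The first part also shows one way to keep only the latest year from a time series, as needed for Gapminder-style data.

```r
# Toy data; every name and value below is made up.
cancer <- data.frame(
  Country = c("Aland", "Borduria", "Cascadia"),
  rate    = c(210.5, 180.2, 250.1)
)
gap <- data.frame(
  country = c("Aland", "Aland", "Borduria", "Borduria", "Delmarva"),
  year    = c(2008, 2009, 2007, 2009, 2009),
  smoking = c(23, 22, 30, 31, 18)
)

# Keep only the most recent year for each country:
# sort by country and year, then keep the last row per country.
gap    <- gap[order(gap$country, gap$year), ]
health <- gap[!duplicated(gap$country, fromLast = TRUE), ]

# Column names are case sensitive, so make them match before merging.
names(cancer)[names(cancer) == "Country"] <- "country"

# Default: keep only countries that appear in both data sets.
inner <- merge(cancer, health)

# all.x = TRUE: keep every country in cancer, filling gaps with NA.
left <- merge(cancer, health, all.x = TRUE)

# all = TRUE: keep every country from either data set.
full <- merge(cancer, health, all = TRUE)
```

Comparing nrow(inner) with nrow(full) is a quick way to see how many countries failed to match, which often points to spelling differences in country names between sources.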


Author: Jim Robison-Cox