In this lab we'll look at examining some health statistics for every US county, from countyhealthrankings.org. We'll look for correlations between various statistics and the county's YPLL value: Years of Potential Life Lost. The latter is generally considered a good measure of overall population health; the question is what other health-related statistics are correlated with this.
Correlation is not causation. One of the stronger correlations, for example, is between YPLL and free school lunches. But nobody is suggesting that we stop free lunches; the point is that free school lunches are a good indication of poverty.
YPLL is calculated by taking the number of deaths in one year and adding up the number of years each death occurred before the person reached 75. That is, someone dying at 65 adds 10 to the sum; someone dying at 80 adds 0. The sum is divided by the population, and then multiplied by 100,000, so that it is in effect an early death rate per 100,000. The threshold of 75 is arbitrary; in some contexts, the age threshold is indicated as YPLL[75].
Note that the population here is the population of the entire county, not just those who have died. The worst YPLL value in our data is around 25,000. If 1/75 of the population dies each year, then in 100,000 people we might expect 1300 deaths, which would make a YPLL of 25,000 correspond to almost 20 years per death.
All the data is in the file healthdata.csv. The .csv means "comma-separated values"; you should be able to open this file in a spreadsheet. We can read this into a Python list of lists using the csv package (built in to python). I am also providing a file health.py, which includes the following:
def file2rows(filename):
file = open(filename, 'r', newline='')
return list(csv.reader(file))
This uses the csv package to return the file as a list of lists, one entry per line, and each line is represented as a list of twenty values each representing one of the attributes.The columns are as follows (these are defined in health.py):
FIPS=0
STATE=1
COUNTY=2
POPULATION=3
UNDER18=4
OVER65=5
AFRICANAMERICAN=6
FEMALE=7
RURAL=8
DIABETES=9
HIV=10
ACTIVITY=11
MENTAL=12
INCOME=13
HOUSING=14
FREELUNCH=15
CHILDLITERACY=16
DRIVEALONE=17
UNRELIABLE=18
YPLL=19
There is also a dictionary columnd that takes a column number and returns a string representing the name of that column. (I could have used a list also, instead of a dictionary.)
Finding the correlations can be done with a spreadsheet if the data is all organized. But there are issues. The issues we will have to deal with are:
As the data is initially read into the list healthrows, the entries all represent type string. You will have to convert them to float before you can do calculations on them. See extractcols() below; step 4 is the one where we consider the possibility that the data string is '17%'.
We will use Pearson's r statistic. The coefficient of correlation is often considered to be r2, so you should calculate that too. Of course, squaring eliminates the sign; a negative r is a measure of negative correlation. For example,there is a negative correlation between county average income and YPLL.
The value of r2, which is a number between 0 and 1, is sometimes interpreted as the fraction of the change in YPLL that is accounted for by the first statistic, though a more accurate way of putting this is that, if we draw the graph, r2 describes the portion of the change in YPLL versus the first statistic that is explained by the graph. Thus, if r2 = 0.25, as it is for CHILDLITERACY, then child literacy might be said to account for 25% of the YPLL reduction. Again, that does not mean that child literacy causes that reduction.
I am providing a function r(xlist, ylist) that takes two lists of numbers and calculates the correlation coefficient r. It relies on a helper function statsums(xlist,ylist), which takes the list and returns, via a tuple, six different calculated values. All are calculated in one pass, and the values are sufficient for r to calculate what it needs to calculate without looking at the lists again.
There are other statistical measures that can help determine the significance of the correlation; we are ignoring those.
Your main job here will be to get the lists of data values, eg xlist and ylist, extracting them from healthrows. The xlist values come from one of the columns 3-17 (Population - DriveAlone); the ylist values come from the YPLL column. You need to:
You could just get rid of all rows in which any entry is empty, but that's misleading. We want to keep, for each correlation, as much data as we can. The all-rows approach discards a third of the data.
I recommend writing extractcols(col1, col2), where col1 and col2 are the integer column tags; col1 is something in range(POPULATION, DRIVEALONE+1) and col2 is YPLL. You will have something like this, which returns two lists, via a tuple.
def
extractcols(col1, col2):
col1data=[]
col2data=[]
for row in healthrows:
val1= row[col1]
val2 = row[col2]
unreliable = row[UNRELIABLE]
# check stuff here
# if all checks out, do this:
fval1 =
float(val1) # convert to number
col1data.append(fval1)
# same for col2data
return ( col1data, col2data )
You will call this using something like this, for stat in range(POPULATION, DRIVEALONE+1):
def
getstats(stat):
(xlist,ylist) = extractcols(stat, YPLL)
# now calculate R, and R^2, and print it out
# The name of the statistic, stat, can be obtained as
columnd[stat]
I've also included a function scatterplot(stat, xlist, ylist), though it requires that you import graphics.
Overview
Cleaning and column extraction
stats
scatterplot demos