
US State Incomes and Literacy Rates

Learn to analyze the relationship between literacy rates and income levels across US states using Python. This lesson guides you through importing data, classifying incomes by quantiles, merging datasets, and creating plots. Understand why literacy may correlate with income rather than cause differences in economic groups.

US literacy rates

Now let’s look at a smaller scale and compare literacy rates and income for each US state. The data for literacy rates of US states has been taken from the ThinkImpact website. Please run the following code to see it yourself.

Python 3.5
import pandas as pd

# Load the state-level literacy data (source: ThinkImpact)
df = pd.read_csv('US_literacy_rate_by_states.csv')
# Shorten the column name for easier access
df = df.rename(columns={'Literacy Rate (%)': 'Literacy Rate'})
# Keep only the columns we need and drop states with no literacy figure
df = df[['State', 'Literacy Rate']]
df = df.dropna(subset=['Literacy Rate'])
print(df)

Incomes of each state

The income dataset for the US states is taken from Wikipedia. The table may look intimidating to import into Python, but there’s a great online tool called wikitable2csv.ggor.de that will do all the work for us. We only need to paste the Wikipedia URL and click “Convert”, and the site does the rest.

Python 3.5
import pandas as pd
import numpy as np

# Load the wage table exported from Wikipedia via wikitable2csv
dfv = pd.read_csv('US_annual_income_by_states.csv')
# The exported headers are awkward, so rename them
dfv = dfv.rename(columns={'Stateor territory': 'State',
                          'Mean wage in US$[4]': 'Mean wage'})
dfv = dfv[['State', 'Mean wage']].copy()
# Treat the "No data" placeholder as missing, then drop those rows
dfv['Mean wage'] = dfv['Mean wage'].replace("No data", np.nan)
dfv = dfv.dropna(subset=['Mean wage'])
print(dfv.head())
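
The Mean wage column is read in as text, so it has to be cleaned before any arithmetic can be done on it. The snippet below is a minimal sketch of one way to do that. It assumes the wages look like "$50,000" with a "No data" placeholder; the sample values and the helper name clean_wage are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'Mean wage' strings; not real figures.
raw = pd.Series(["$50,000", "No data", "$42,500"], name="Mean wage")

def clean_wage(series):
    """Strip '$' and ',' as literal characters, then turn anything
    that is still not a number (such as 'No data') into NaN."""
    stripped = (series.str.replace("$", "", regex=False)
                      .str.replace(",", "", regex=False))
    return pd.to_numeric(stripped, errors="coerce")

cleaned = clean_wage(raw)
print(cleaned)
```

Passing regex=False matters here because "$" would otherwise be interpreted as the end-of-string anchor in a regular expression and nothing would be removed.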

Classifying income of each state

In the income dataset, we have the mean wage of each state. Using this data, we are going to classify the states into four groups. But first, we will convert the Mean wage column to integers. Next, we will use the pandas quantile method to find the three quartile cut points (the 25th, 50th, and 75th percentiles), which divide the data into four equal parts and let us group the wages into the following classes:

  • Low income: The first quarter (0% to 25%).
  • Lower middle income: The second quarter (25% to 50%).
  • Upper middle income: The third quarter (50% to 75%).
  • High income: The fourth quarter (75% to 100%).

Python 3.5
import pandas as pd

dfv = pd.read_csv('US_annual_income_by_states.csv')
dfv = dfv.rename(columns={'Stateor territory': 'State',
                          'Mean wage in US$[4]': 'Mean wage'})
dfv = dfv[['State', 'Mean wage']].copy()
# Strip "$" and "," literally (regex=False stops "$" being read as end-of-string)
dfv['Mean wage'] = dfv['Mean wage'].str.replace("$", "", regex=False)
dfv['Mean wage'] = dfv['Mean wage'].str.replace(",", "", regex=False)
# Treat "No data" as missing, drop those rows, then convert to integers
dfv['Mean wage'] = dfv['Mean wage'].replace("No data", float("NaN"))
dfv = dfv.dropna(subset=['Mean wage'])
dfv['Mean wage'] = dfv['Mean wage'].astype(int)
# Quartile cut points of the wage column
col = 'Mean wage'
q = dfv[col].quantile([0.25, 0.50, 0.75])
# Assign each state to the quarter its mean wage falls in
dfv['Income group'] = None
for row in dfv.index:
    wage = dfv.loc[row, col]
    if wage < q.loc[0.25]:
        dfv.loc[row, 'Income group'] = 'Low income'
    elif wage < q.loc[0.50]:
        dfv.loc[row, 'Income group'] = 'Lower middle income'
    elif wage < q.loc[0.75]:
        dfv.loc[row, 'Income group'] = 'Upper middle income'
    else:
        dfv.loc[row, 'Income group'] = 'High income'
dfv = dfv[['State', 'Income group']]
print(dfv)
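
As an alternative to looping over the rows, the same four-way split can be sketched with pd.cut, which bins a numeric column in one call. This is not the lesson's method, only a possible shortcut. The wages below are invented placeholders, and the state names A to H are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wages for illustration only; not taken from the dataset.
wages = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Mean wage": [40000, 42000, 45000, 47000, 50000, 55000, 60000, 70000],
})

labels = ["Low income", "Lower middle income",
          "Upper middle income", "High income"]

# Quartile cut points of the wage column.
q = wages["Mean wage"].quantile([0.25, 0.50, 0.75])

# right=False makes each bin [lower, upper), which matches the
# ">= lower and < upper" comparisons used in the loop version.
bins = [-np.inf, q.loc[0.25], q.loc[0.50], q.loc[0.75], np.inf]
wages["Income group"] = pd.cut(wages["Mean wage"], bins=bins,
                               labels=labels, right=False)
print(wages)
```

A side benefit of pd.cut is that the result is an ordered categorical column, so the groups sort from low to high rather than alphabetically.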

Merging and plotting of the data

Just as we did with the international data earlier, we’ll create a merged_data data frame, this time joined on the State column. Run the code below to merge the data and plot literacy rate against income group.

Python 3.5
import pandas as pd
import plotly.express as px

# Literacy data
df = pd.read_csv('US_literacy_rate_by_states.csv')
df = df.rename(columns={'Literacy Rate (%)': 'Literacy Rate'})
df = df[['State', 'Literacy Rate']]
df = df.dropna(subset=['Literacy Rate'])

# Income data
dfv = pd.read_csv('US_annual_income_by_states.csv')
dfv = dfv.rename(columns={'Stateor territory': 'State',
                          'Mean wage in US$[4]': 'Mean wage'})
dfv = dfv[['State', 'Mean wage']].copy()
# Strip "$" and "," literally (regex=False stops "$" being read as end-of-string)
dfv['Mean wage'] = dfv['Mean wage'].str.replace("$", "", regex=False)
dfv['Mean wage'] = dfv['Mean wage'].str.replace(",", "", regex=False)
# Treat "No data" as missing, drop those rows, then convert to integers
dfv['Mean wage'] = dfv['Mean wage'].replace("No data", float("NaN"))
dfv = dfv.dropna(subset=['Mean wage'])
dfv['Mean wage'] = dfv['Mean wage'].astype(int)

# Classify each state by the quarter its mean wage falls in
col = 'Mean wage'
q = dfv[col].quantile([0.25, 0.50, 0.75])
dfv['Income group'] = None
for row in dfv.index:
    wage = dfv.loc[row, col]
    if wage < q.loc[0.25]:
        dfv.loc[row, 'Income group'] = 'Low income'
    elif wage < q.loc[0.50]:
        dfv.loc[row, 'Income group'] = 'Lower middle income'
    elif wage < q.loc[0.75]:
        dfv.loc[row, 'Income group'] = 'Upper middle income'
    else:
        dfv.loc[row, 'Income group'] = 'High income'
dfv = dfv[['State', 'Income group']]

# Join the two tables on the state name (inner join by default)
merged_data = pd.merge(dfv, df, on='State')
print(merged_data)

# Scatter plot of literacy rate against income group
fig = px.scatter(merged_data, x="Literacy Rate", y="Income group",
                 log_x=False,
                 hover_data=["Literacy Rate", "Income group", "State"])
# Order the categories from low to high instead of by first appearance
fig.update_yaxes(categoryorder='array',
                 categoryarray=['Low income', 'Lower middle income',
                                'Upper middle income', 'High income'])
# Saving a static image needs an image-export backend and an existing output/ folder
fig.write_image("output/graph.png")
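
Because pd.merge performs an inner join by default, any state whose name appears in only one of the two tables is dropped without warning. The sketch below shows one way to check for such rows using the indicator option of pd.merge. The two small tables and their state names are hypothetical stand-ins, not the lesson's data.

```python
import pandas as pd

# Hypothetical stand-ins for the income and literacy tables.
income = pd.DataFrame({
    "State": ["Alpha", "Beta", "Gamma"],
    "Income group": ["Low income", "High income", "Low income"],
})
literacy = pd.DataFrame({
    "State": ["Alpha", "Beta", "Delta"],
    "Literacy Rate": [90.0, 92.5, 88.0],
})

# An outer join keeps every row; indicator=True adds a '_merge' column
# saying whether a row came from the left table, the right table, or both.
check = pd.merge(income, literacy, on="State", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["State", "_merge"]])
```

If unmatched is empty for the real data, the inner join in the lesson loses nothing; otherwise the listed names are worth inspecting for spelling differences or territories present in only one source.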

We do not gain much additional insight into the relationship between a state’s income group and its literacy rate, because almost all the states have a high literacy rate. This suggests that the differences in income group among the states depend on factors other than the literacy rate. It therefore supports the idea that the relationship is a correlation rather than causation.
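
One way to back up a visual impression like this with numbers is to summarize the literacy rate within each income group. The following is a minimal sketch only: the merged_data frame here is rebuilt from placeholder values so the snippet runs on its own, and those values are invented, not results from the datasets.

```python
import pandas as pd

# Placeholder values for illustration only; substitute the real merged_data.
merged_data = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Income group": ["Low income", "Low income", "High income", "High income"],
    "Literacy Rate": [88.0, 90.0, 89.0, 91.0],
})

# Count, mean, minimum, and maximum literacy rate per income group.
summary = (merged_data.groupby("Income group")["Literacy Rate"]
           .agg(["count", "mean", "min", "max"]))
print(summary)
```

If the group means are close together and the ranges overlap heavily, that is consistent with literacy rate telling us little about which income group a state falls into.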

Jupyter notebook in action

To see the above Python scripts in a notebook, click to launch the application.
