Case Study: Seattle House Prices-I
Work through a ModernDive case study on Seattle house prices, starting with exploratory data analysis (EDA).
Kaggle is a machine-learning and predictive-modeling competition website that hosts datasets uploaded by companies, governmental organizations, and other individuals. One of their datasets is “House Sales in King County, USA.” It consists of the sale prices of homes sold between May 2014 and May 2015 in King County, Washington, USA, which includes the greater Seattle metropolitan area. This dataset is in the house_prices data frame included in the moderndive package.
The dataset consists of 21,613 houses and 21 variables describing these houses (for a full list and description of these variables, see the help file by running ?house_prices in the console). In this case study, we’ll create a multiple regression model where:
The outcome variable is the sale price of houses. There are two explanatory variables:
A numerical explanatory variable: House size, sqft_living, is measured in square feet of living space. Note that 1 square foot is about 0.09 square meters.
A categorical explanatory variable: House condition is a categorical variable with five levels, where 1 indicates poor and 5 indicates excellent.
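As a minimal sketch of the setup described above, the following R code loads the data, checks its dimensions against the counts stated earlier, and keeps only the three variables of interest. It assumes the moderndive and dplyr packages are installed, and that the sale price and condition columns are named price and condition in house_prices. The names houses_subset and sqm_living are hypothetical, introduced here only for illustration.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)       # provides select(), mutate(), and the %>% pipe

# Number of rows (houses) and columns (variables)
dim(house_prices)

# Keep the outcome and the two explanatory variables.
# sqm_living is a hypothetical helper column: living space converted
# from square feet to square meters (1 sq ft is about 0.0929 sq m).
houses_subset <- house_prices %>%
  select(price, sqft_living, condition) %>%
  mutate(sqm_living = sqft_living * 0.0929)

# The five levels of the categorical condition variable
levels(houses_subset$condition)
```

The conversion column is optional and is included only to make the square-foot unit easier to interpret.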
Exploratory data analysis
As we’ve said numerous times throughout, a crucial first step when presented with data is to perform an EDA. This can give us a sense of our data, help identify issues with our data, bring to light any outliers, and help inform model construction.
Recall the three common steps in an EDA:
Looking at raw data values
Computing summary statistics
Creating data visualizations
First, let’s look at the raw data using View() to bring up RStudio’s spreadsheet viewer and the glimpse() function from the dplyr package:
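The two calls named above might look like the following sketch. It assumes the moderndive and dplyr packages are installed. The interactive() guard is an addition for illustration, so the script still runs outside RStudio, where the spreadsheet viewer may be unavailable.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)       # provides glimpse()

# Open the spreadsheet viewer only in an interactive session
if (interactive()) {
  View(house_prices)
}

# Print each variable with its type and its first few values
glimpse(house_prices)
```

glimpse() lists one variable per line, which makes it easy to confirm that sqft_living is numeric and condition is categorical before moving on to summary statistics.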