Regression Analysis for Housing
Regression and machine learning are at the forefront of data science. While not exactly new, these tools are extremely useful for turning raw data into adaptation and prediction. Today, I want to share a project my team and I have been working on over the past couple of weeks.
We began this project by exploring a housing data set from King County, Washington. The data included 21 columns highlighting key characteristics of each house and over 20,000 rows, each representing an individual house. While the data was quite extensive, we wanted to frame a business idea around an existing real-world issue. The issue we landed on was low-income housing, and thus began our project.
We quickly identified housing price as our dependent variable, with the remaining features as our predictors. Before training a model, the data set had to be narrowed down to be usable. Just by looking at the data, we could tell that outliers and redundant values had to go. We started by trimming prices to within 2 standard deviations of the mean (roughly 95% of the data). With some further exploration and a few handy Python tools, we found plenty of outliers in the predictor variables as well and removed them using pandas' .loc indexing.
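For illustration, here is a minimal sketch of that filtering step in pandas; the file name, the 2-standard-deviation cutoff on price, and the choice of sqft_living as the example predictor are assumptions rather than our exact pipeline.

```python
import pandas as pd

# Load the King County housing data (file name is an assumption).
df = pd.read_csv("kc_house_data.csv")

# Keep prices within 2 standard deviations of the mean (~95% of the data).
mean_price, std_price = df["price"].mean(), df["price"].std()
df = df.loc[df["price"].between(mean_price - 2 * std_price,
                                mean_price + 2 * std_price)]

# Trim obvious outliers in a predictor the same way, e.g. sqft_living.
df = df.loc[df["sqft_living"] < df["sqft_living"].quantile(0.99)]
```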
To develop the most successful model, we took an iterative approach, improving it every step of the way.
Approach A: This was our baseline model and used the data cleaning performed above. We played around with a couple of predictors and performed a standard train-test split for a first vanilla model. Nothing special came of it, and it certainly set no precedent, but we drew valuable insights for future models and knew what steps to take next. To put it into perspective, our R2 value was around .10, with multiple predictors having p-values well above 0.05. Not exactly what we had been hoping for.
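A rough sketch of what that vanilla model might look like, using statsmodels so the R2 and p-values are easy to read off; the particular predictors chosen here are placeholders, not our exact feature list.

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# A couple of intuitive predictors for the vanilla model (illustrative choice).
features = ["bedrooms", "bathrooms", "sqft_living"]
X = df[features]
y = df["price"]

# Standard train-test split before fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# OLS with an intercept so we can read off R-squared and p-values.
baseline = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(baseline.rsquared)   # around .10 in our case
print(baseline.pvalues)    # several predictors well above 0.05
```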
Approach B: Our second approach was more procedural. We started by examining the data distributions, checking for normality and linearity. This helped us clearly define which predictor features we wanted to include in the model. We also used a correlation heat map to reveal collinear features that should not be included in the model together. To stay true to our problem, we examined simple summary statistics such as measures of central tendency, standard deviation, and quantile ranges, and narrowed our data set to lower-priced houses in the $154,000–$315,000 range. We also decided it would be effective to create dummy variables for categorical features and to note the impact of specific columns.
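Roughly, the heat map, price filter, and dummy-variable steps looked like the sketch below; the choice of the condition column for the dummies is illustrative, not necessarily the exact categorical feature we used.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heat map to spot collinear predictors that should not
# be used together (e.g. sqft_living vs. sqft_above).
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm")
plt.show()

# Narrow to the lower-priced segment we targeted.
df = df.loc[df["price"].between(154_000, 315_000)]

# Dummy variables for a categorical feature (condition is an example),
# dropping one level to avoid the dummy-variable trap.
df = pd.get_dummies(df, columns=["condition"], drop_first=True)
```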
The model improved, but not nearly to our delight. With an R2 of .18, we knew there was something we could rethink. Of course, we also examined the residuals to check the linear regression assumptions: homoscedasticity and relatively normal residuals (via QQ plots).
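Those residual checks can be sketched as follows, assuming model_b is the fitted OLS result from this approach (the name is a placeholder).

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Residuals from the Approach B fit (model_b is an assumed variable name).
residuals = model_b.resid

# QQ plot to check that the residuals are roughly normal.
sm.qqplot(residuals, line="45", fit=True)
plt.title("QQ plot of residuals")
plt.show()

# Residuals vs. fitted values to eyeball homoscedasticity:
# the spread should stay roughly constant across the x-axis.
plt.scatter(model_b.fittedvalues, residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```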
Approach C: There were still some options we knew we could try to improve model performance. Before that, looking at the median house price, we broadened our target interval to $134,000–$435,000.
Then we began to expand our options: log transforming and scaling some of our predictors. Several distributions were not normal, and we knew that log transforming them could improve the model. Furthermore, some predictors, such as the square-footage features (sqft_living, sqft_lot, sqft_basement, etc.), were on a completely different scale from our other features, which we concluded could distort the fit. Hence the min/max scaling for those features, mapping their values onto a 0–1 range. After double-checking our assumptions, we ran another model.
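A short sketch of the log transform and min/max scaling, assuming the square-footage columns named above; log1p is used here so that zero values (houses with no basement) do not break the transform.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Log-transform the skewed square-footage predictors.
sqft_cols = ["sqft_living", "sqft_lot", "sqft_basement"]
for col in sqft_cols:
    df[col] = np.log1p(df[col])

# Min/max scale the same columns onto a 0-1 range so they are
# on a comparable scale with the other features.
scaler = MinMaxScaler()
df[sqft_cols] = scaler.fit_transform(df[sqft_cols])
```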
The results were similar, with an R2 hovering around .20, even after removing the insignificant features with positive coefficients.
Approach D: This was our final approach, and we did some broader brainstorming. Closely examining each feature, we decided to incorporate all of them and then eliminate the insignificant predictors. To our surprise, location, specifically latitude, had a large impact on the model. We wanted to inspect this visually, so we created a scatterplot of the house coordinates that mapped out King County quite accurately.
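That map is easy to reproduce with a plain matplotlib scatterplot of the lat and long columns, colored by price; the styling choices here are arbitrary.

```python
import matplotlib.pyplot as plt

# Plot every house by longitude/latitude and color by price;
# the points trace the outline of King County.
plt.figure(figsize=(8, 8))
plt.scatter(df["long"], df["lat"], c=df["price"],
            cmap="viridis", s=2, alpha=0.5)
plt.colorbar(label="price")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()
```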
It was clear that houses at higher latitudes were more expensive and that prices dropped as latitude decreased, so incorporating latitude was important. We organized the latitudes into ranges to create dummy variables and understand the p-values a bit better. Some ranges were more significant than others and were therefore kept in the model. Similarly, the year each house was built also played a role in enhancing our model. As with latitude, we created ranges for year built, filtered out some of the older houses using .loc, and created dummy variables for the year-built ranges. Keep in mind that we removed the dummy variables from Approach B, as they seemed to have no effect on our results.
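Here is a sketch of the binning idea with pandas' cut; the cut points and the 1940 cutoff are illustrative values, not the exact ranges we settled on.

```python
import pandas as pd

# Bin latitude into ranges and turn them into dummies
# (cut points are illustrative).
df["lat_band"] = pd.cut(df["lat"],
                        bins=[47.1, 47.3, 47.5, 47.6, 47.7, 47.8],
                        labels=["s1", "s2", "s3", "s4", "s5"],
                        include_lowest=True)
lat_dummies = pd.get_dummies(df["lat_band"], prefix="lat", drop_first=True)

# Same idea for year built, after dropping the oldest houses with .loc.
df = df.loc[df["yr_built"] >= 1940]
df["yr_band"] = pd.cut(df["yr_built"],
                       bins=[1940, 1960, 1980, 2000, 2016],
                       labels=["40s-50s", "60s-70s", "80s-90s", "00s+"],
                       include_lowest=True)
yr_dummies = pd.get_dummies(df["yr_band"], prefix="yr", drop_first=True)

df = pd.concat([df, lat_dummies, yr_dummies], axis=1)
```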
With our final data frame ready, we fit the model on the training set and produced an R2 of .52, far higher than anything before, which can be interpreted as our model accounting for 52% of the variance in housing prices.
I also wanted to do further inference on the regression cost function. I ran cross-validation with 8 folds and compared the root mean squared error from the model with the mean RMSE across the folds; both came out near $49,000.
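The cross-validation step, sketched with scikit-learn; here X_train and y_train stand in for the final feature matrix and target from Approach D.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 8-fold cross-validation scored on negative MSE, converted to RMSE.
lr = LinearRegression()
neg_mse = cross_val_score(lr, X_train, y_train,
                          cv=8, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()
print(f"Mean CV RMSE: {cv_rmse:,.0f}")  # both estimates landed near $49,000
```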
So, when predicting the value of a house from specified values of each predictor, the prediction would be accurate to within roughly +/- $49,000.
This RMSE was an improvement on previous results, as it was minimized in this final model. Still, a well-fitting regression would typically show a higher R2, so it was evident that linear regression was not the best model for our data set and that other ML models could be incorporated.
In the future, I hope to perform more regression analysis to demonstrate my understanding of this prominent machine learning algorithm.