AirBnb Ratings Classification

Matt Zhang
5 min read · Jan 4, 2021

A public company's success is often determined by customer satisfaction. In the case of AirBnb, that satisfaction is directly shaped by customer experience. Today I hope to highlight some of the driving factors of customer experience through machine learning classification: more specifically, classifying a set of AirBnb listings based on their individual ratings. To accomplish this, I took a number of procedural data science steps to reach a successful model.

The data was taken from InsideAirBnb, a website that provides regularly updated AirBnb data for cities around the world. The data set includes a variety of features pertaining to the listed properties, such as host ID, number of guests accommodated, beds, communication score, and so on. These are the variables that would feed our model.

The first step before modeling was to clean the data. Like many real-world data sets, this one had null values scattered across its columns. Simply dropping those rows would have been too blunt an approach, and the remaining data set would have been too small to train an algorithm on. For columns with only a few null values, I figured the best fix was to impute the median of the present values. Columns with considerably more null values had to be handled differently. Exploring NumPy's functions, I recalled np.random.choice, which draws a random sample from a given array. To handle those nulls, I sampled from the existing distribution of values within each column and filled in the gaps with values that preserved that distribution, so the overall shape of the data was effectively unchanged. As for outliers, I examined box plots to flag values falling outside the lower and upper quartile ranges.

My final data frame needed to be entirely numerical, so this process also included converting True and False to binary and creating dummy variables for categorical columns that might impact the model. Lastly, I had to manually engineer the target classes to predict, which, given the project's purpose, were built from the ratings column. I made three splits based on measures of central tendency, which also helped me avoid class imbalance later: the Subpar (0) class had around 9,000 values, the Good (1) class around 10,000, and the Best (2) class around 11,000.
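Here is a minimal sketch of those cleaning steps. The column names ("beds", "communication", "host_is_superhost", "room_type", "rating") are hypothetical stand-ins for the InsideAirBnb fields, and pd.qcut is used as a stand-in for my central-tendency-based splits, since it produces roughly balanced classes in the same spirit:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("listings.csv")  # hypothetical file name

# Columns with few nulls: fill with the median of the observed values.
df["beds"] = df["beds"].fillna(df["beds"].median())

# Columns with many nulls: sample replacements from the observed
# distribution so filling them in does not change that distribution.
observed = df["communication"].dropna()
mask = df["communication"].isna()
df.loc[mask, "communication"] = np.random.choice(observed, size=mask.sum())

# Convert True/False columns to 0/1 and one-hot encode categoricals.
df["host_is_superhost"] = df["host_is_superhost"].astype(int)
df = pd.get_dummies(df, columns=["room_type"], drop_first=True)

# Engineer three roughly balanced target classes from the ratings column.
df["rating_class"] = pd.qcut(df["rating"], q=3, labels=[0, 1, 2])
```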

Moving on to exploratory data analysis (EDA), the main goal here was to visualize the distribution of classes across each column/feature. In particular, I wanted to see variance across the classes within each feature, since that would help the model identify each class as distinctly as possible. It would also help surface potentially important features. I used Seaborn to plot the discrete and continuous variables across my three classes.
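A sketch of what those plots could look like, continuing with the assumed column names from above (one boxplot for a continuous feature, one countplot for a discrete one, both split by class):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Continuous feature: does its distribution shift across the 3 classes?
sns.boxplot(data=df, x="rating_class", y="communication", ax=axes[0])
# Discrete feature: how are the classes spread across its values?
sns.countplot(data=df, x="host_is_superhost", hue="rating_class", ax=axes[1])
plt.tight_layout()
plt.show()
```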

With the data processed and cleaned, I could finally move on to modeling. I explored four models in depth (KNN, decision trees, and random forests among them) throughout my iterative process. Each model followed a similar workflow in Scikit-Learn: a baseline model compared against a final model with hyperparameter tuning. While each model had unique parameters to tune and produced different results, the ensemble methods performed most similarly.

K-nearest neighbors (KNN) is an algorithm that classifies a point by looking at its k nearest training points by distance. Notably, because it is distance-based, the features must be scaled; otherwise, features with larger ranges dominate the distance calculation. Out of the four models, this algorithm performed the worst, with a testing accuracy of 58%, even after running a search for the best k value among up to 25 nearest neighbors.

For decision trees, the baseline model performed quite well with a testing accuracy of 60%. However, the training accuracy was a perfect 100%, a clear sign of overfitting. I therefore tuned multiple parameters using GridSearchCV, which finds the best parameters for your model from a dictionary of candidate values while running cross-validation. Tuning hyperparameters such as max_depth, min_samples_split, min_samples_leaf, and criterion raised the testing accuracy to 68%. The most important of these is max_depth, which caps how deep the tree can grow and thereby limits overfitting.
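A sketch of that workflow, assuming the data frame from the cleaning step. The KNN search range matches the 25 neighbors mentioned above; the decision tree grid values are illustrative, not the exact ones I searched:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = df.drop(columns=["rating", "rating_class"])
y = df["rating_class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KNN is distance-based, so scale features inside a pipeline,
# then search k from 1 to 25 with cross-validation.
knn = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
knn_grid = GridSearchCV(knn, {"knn__n_neighbors": range(1, 26)}, cv=5)
knn_grid.fit(X_train, y_train)

# GridSearchCV over the decision tree hyperparameters mentioned above.
tree_params = {
    "max_depth": [5, 10, 20],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}
tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=42), tree_params, cv=5)
tree_grid.fit(X_train, y_train)
print(tree_grid.best_params_, tree_grid.score(X_test, y_test))
```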

This is where my final project and primary model come in: the random forest classifier, an ensemble of decision trees. In essence, this model combines multiple decision trees to gain resilience to overfitting and improved overall performance. Specifically, it uses bootstrap aggregation, or bagging, which resamples the data for each tree, and subspace sampling, which gives each tree a random subset of the features. My baseline model performed just as well as the tuned decision tree, with a testing accuracy of 68%. Yet, like the decision tree, it scored perfectly on the training set, still hinting at overfitting. I suspected this was due to the number of estimators the random forest used. After running GridSearch again and tuning the number of estimators down to 200, I reached a testing accuracy of 70%! Both the classification report and the confusion matrix looked great. The confusion matrix maps correctly classified and misclassified predictions into a visual grid, and the precision and recall figures in the classification report are derived from it. For my classes, 2,000/2,700 Subpar (0) listings were predicted correctly, 1,600/2,400 for Good (1), and 1,500/2,100 for Best (2).
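Continuing the sketch with the train/test split from above; the grid is illustrative, including the 200 estimators the search settled on:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

rf_params = {"n_estimators": [100, 200, 500], "max_depth": [10, 20, None]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5)
rf_grid.fit(X_train, y_train)

y_pred = rf_grid.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted
```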

In conclusion, my model classified my engineered classes reasonably well by my standards, and it is particularly useful for predicting the ratings of listings with few reviews. For my random forest, I also visualized the feature importances as percentages and found that communication, superhost status, and cleanliness were the driving factors in distinguishing the ratings. Recommendations to hosts, then, would center on improving those three features.
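A short sketch of how those importances can be pulled from the tuned forest above:

```python
import pandas as pd

# Feature importances from the best estimator, expressed as percentages.
best_rf = rf_grid.best_estimator_
importances = pd.Series(best_rf.feature_importances_ * 100, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```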

In the future, I hope to explore the impact of the types of properties and even analyze different cities.
