Predict a car's market using K-nearest neighbors

In this tutorial, we will explore the car’s market using K-nearest neighbors (KNN) regression algorithm.

KNN Overview

The k-nearest neighbors algorithm is based around the simple idea of predicting unknown values by matching them with the most similar known values. Suppose that we have 3 different types of cars. We know the name of the car, its horsepower, whether or not it has racing stripes, and whether or not it’s fast.

car,horsepower,racing_stripes,is_fast
Honda Accord,180,False,False
Yugo,500,True,True
Delorean DMC-12,200,True,True

Suppose that we now have another car, but we don’t know how fast it is:

car,horsepower,racing_stripes,is_fast
Chevrolet Camaro,400,True,Unknown

We want to figure out if the car is fast or not. In order to predict if it is with k nearest neighbors, we first find the most similar known car. The most similar is defined by the nearest Euclidean distance to that particular point.

In this case, we would compare the horsepower and racing_stripes values to find the most similar car, which is the Yugo. Since the Yugo is fast, we would predict that the Camaro is also fast. This is an example of 1-nearest neighbors – we only looked at the most similar car, giving us a k of 1.

If we performed a 2-nearest neighbors, we would end up with 2 True values (for the Delorean and the Yugo), which would average out to True. The Delorean and Yugo are the two most similar cars, giving us a k of 2.

If we did 3-nearest neighbors, we would end up with 2 True values and a False value, which would average out to True. If the predicting values are of type float rather than boolean, similarly, we would take an average of the numerical values.

The number of neighbors we use for k-nearest neighbors (k) can be any value less than the number of rows in our dataset. In practice, looking at only a few neighbors makes the algorithm perform better, because the less similar the neighbors are to our data, the worse the prediction will be.

The Data

For each car we have information about the technical aspects of the vehicle such as the motor’s displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. Read more about the data set here.

import pandas as pd
import numpy as np
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names=cols)
cars.head()
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

# numeric columns
continuous_values_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
numeric_cars = cars[continuous_values_cols]

Data Cleaning

numeric_cars = numeric_cars.replace('?', np.nan)
numeric_cars = numeric_cars.astype('float')
# dealing with missing values
numeric_cars.isnull().sum()
normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
numeric_cars['normalized-losses'].shape
(205,)
# Because `price` is the column we want to predict, let's remove any rows with missing `price` values.
numeric_cars = numeric_cars.dropna(subset=['price'])
numeric_cars.isnull().sum()
normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64
# Replace missing values in other columns using column means.
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
# Confirm that there's no more missing values!
numeric_cars.isnull().sum()
normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
# Normalize the numeric ones so all values range from 0 to 1
price_col = numeric_cars['price'] # keep price as original
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col
numeric_cars.head()
normalized-losses wheel-base length width height curb-weight bore stroke compression-rate horsepower peak-rpm city-mpg highway-mpg price
0 0.298429 0.058309 0.413433 0.324786 0.083333 0.411171 0.664286 0.290476 0.1250 0.294393 0.346939 0.222222 0.289474 13495.0
1 0.298429 0.058309 0.413433 0.324786 0.083333 0.411171 0.664286 0.290476 0.1250 0.294393 0.346939 0.222222 0.289474 16500.0
2 0.298429 0.230321 0.449254 0.444444 0.383333 0.517843 0.100000 0.666667 0.1250 0.495327 0.346939 0.166667 0.263158 16500.0
3 0.518325 0.384840 0.529851 0.504274 0.541667 0.329325 0.464286 0.633333 0.1875 0.252336 0.551020 0.305556 0.368421 13950.0
4 0.518325 0.373178 0.529851 0.521368 0.541667 0.518231 0.464286 0.633333 0.0625 0.313084 0.551020 0.138889 0.157895 17450.0

Univariate Model

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor() #instaniate
    
    # Randomize order of rows in data frame.
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Split dataframe into train and test
    train_df = df.iloc[0:last_train_row]
    test_df = df.iloc[last_train_row:]
    
    # Fit the model, predict and calculate rmse
    knn.fit(train_df[[train_col]],train_df[target_col])
    predictions = knn.predict(test_df[[train_col]])
    rmse = mean_squared_error(predictions, test_df[target_col])**0.5
    return rmse
# Use this function to train and test univariate models using the different numeric columns in the data set.
cols = numeric_cars.columns.drop('price')
rmse_results = {}

for col in cols:
    rmse = knn_train_test(col, 'price', numeric_cars)
    rmse_results[col] = rmse
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

horsepower 4267.730361 highway-mpg 4628.793094 city-mpg 4814.778015 curb-weight 5166.828581 width 7110.412630 compression-rate 8096.301512 normalized-losses 8131.436882 length 8304.189346 stroke 9334.714914 peak-rpm 9759.209970 wheel-base 9969.243292 height 10839.693636 bore 13397.091693 dtype: float64

# Modify the function to accept k as a parameter
def knn_train_test(train_col, target_col, df, k):
    knn = KNeighborsRegressor(n_neighbors=k) #instaniate
    
    # Randomize order of rows in data frame.
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Split dataframe into train and test
    train_df = df.iloc[0:last_train_row]
    test_df = df.iloc[last_train_row:]
    
    # Fit the model, predict and calculate rmse
    knn.fit(train_df[[train_col]],train_df[target_col])
    predictions = knn.predict(test_df[[train_col]])
    rmse = mean_squared_error(predictions, test_df[target_col])**0.5
    return rmse
rmse_results = {}
k_values = [1,3,5,7,9]

for col in cols:
    k_results = {}
    for i in k_values:
        rmse = knn_train_test(col, 'price', numeric_cars, i)
        rmse_results[col] = rmse
        k_results[i] = rmse
    rmse_results[col] = k_results
rmse_results

{‘bore’: {1: 16502.858944335483, 3: 13895.111787987171, 5: 13397.091693481998, 7: 11075.156453540423, 9: 10178.905997122287}, ‘city-mpg’: {1: 5347.1502616620082, 3: 5210.2611302222185, 5: 4814.7780148494103, 7: 4575.9500050566039, 9: 4770.3441789226026}, ‘compression-rate’: {1: 8085.6051421555012, 3: 8137.9697256948321, 5: 8096.3015121133867, 7: 7896.6928707790858, 9: 7823.115528549677}, ‘curb-weight’: {1: 6566.7491754043158, 3: 5635.1847483924475, 5: 5166.8285806461754, 7: 5239.6312507047951, 9: 5244.5555635847895}, ‘height’: {1: 13032.276289928392, 3: 11411.019683044135, 5: 10839.693635873846, 7: 10041.327943738908, 9: 9313.3309652812659}, ‘highway-mpg’: {1: 5188.3334702021421, 3: 4655.0814815167259, 5: 4628.7930938146865, 7: 4112.3878029567513, 9: 4029.9622707968324}, ‘horsepower’: {1: 7027.6069712651306, 3: 5400.9297932358968, 5: 4267.7303610297877, 7: 3821.3765663687641, 9: 3461.132024333479}, ‘length’: {1: 10053.579063701594, 3: 8230.0502485409743, 5: 8304.1893462645621, 7: 8483.9289137342275, 9: 7655.12304417215}, ‘normalized-losses’: {1: 11628.904782718988, 3: 9578.7932451903052, 5: 8131.4368820724876, 7: 7441.8142534672079, 9: 7644.0837748147915}, ‘peak-rpm’: {1: 10914.812292757884, 3: 11280.739834196191, 5: 9759.2099697700633, 7: 9392.8298611313967, 9: 9423.9255454391023}, ‘stroke’: {1: 10925.953215320224, 3: 11848.331671515607, 5: 9334.714914185055, 7: 8255.3431097911271, 9: 7516.8591701514206}, ‘wheel-base’: {1: 8052.050206913359, 3: 9171.1538785611046, 5: 9969.2432917001752, 7: 8938.8088091337831, 9: 8637.3043859820991}, ‘width’: {1: 8044.1444455819001, 3: 7234.5582194328254, 5: 7110.4126300451044, 7: 6621.8483583166962, 9: 6531.4176381091274}}

Multivariate Model

# Modify the function to accept k as a parameter
def knn_train_test(train_col, target_col, df, k=5):
    knn = KNeighborsRegressor(n_neighbors=k) #instaniate
    
    # Randomize order of rows in data frame.
    np.random.seed(1)
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    
    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    
    # Split dataframe into train and test
    train_df = df.iloc[0:last_train_row]
    test_df = df.iloc[last_train_row:]
    
    # Fit the model, predict and calculate rmse
    knn.fit(train_df[train_col],train_df[target_col])
    predictions = knn.predict(test_df[train_col])
    rmse = mean_squared_error(predictions, test_df[target_col])**0.5
    return rmse
k_rmse_results = {}

two_best_features = ['horsepower', 'width']
rmse_val = knn_train_test(two_best_features, 'price', numeric_cars, 5)
k_rmse_results["two best features"] = rmse_val

three_best_features = ['horsepower', 'width', 'curb-weight']
rmse_val = knn_train_test(three_best_features, 'price', numeric_cars, 5)
k_rmse_results["three best features"] = rmse_val

four_best_features = ['horsepower', 'width', 'curb-weight', 'city-mpg']
rmse_val = knn_train_test(four_best_features, 'price', numeric_cars, 5)
k_rmse_results["four best features"] = rmse_val

five_best_features = ['horsepower', 'width', 'curb-weight' , 'city-mpg' , 'highway-mpg']
rmse_val = knn_train_test(five_best_features, 'price', numeric_cars, 5)
k_rmse_results["five best features"] = rmse_val

six_best_features = ['horsepower', 'width', 'curb-weight' , 'city-mpg' , 'highway-mpg', 'length']
rmse_val = knn_train_test(six_best_features, 'price', numeric_cars, 5)
k_rmse_results["six best features"] = rmse_val

k_rmse_results

{‘five best features’: 4472.2804078385598, ‘four best features’: 4700.6472093249722, ‘six best features’: 5276.4068591201612, ‘three best features’: 4667.0583479710576, ‘two best features’: 4101.8359934580176}

# For the top 3 models in the last step, vary the hyperparameter value from 1 to 25 and plot the resulting RMSE values.
two_best_features_results = {}
five_best_features_results = {}
three_best_features_results = {}
for k in range(1,26):
    two_best_rmse = knn_train_test(two_best_features, 'price', numeric_cars, k)
    three_best_rmse = knn_train_test(three_best_features, 'price', numeric_cars, k)
    five_best_rmse = knn_train_test(five_best_features, 'price', numeric_cars, k)
    two_best_features_results[k] = two_best_rmse
    three_best_features_results[k] = three_best_rmse
    five_best_features_results[k]  = five_best_rmse
# Combined:
k_rmse_results = {}
k_rmse_results['Two Best Features'] = two_best_features_results
k_rmse_results['Three Best Features'] = three_best_features_results
k_rmse_results['Five Best Features'] = five_best_features_results
import matplotlib.pyplot as plt
%matplotlib inline
for k,v in k_rmse_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.plot(x,y)
    plt.xlabel('K values')
    plt.ylabel('RMSE')
plt.legend(['Two','Three','Five'])
plt.show()

png

Next Steps:

  • Modify the knn_train_test() function to use k-fold cross validation instead of test/train validation.
  • Modify the knn_train_test() function to perform the data cleaning as well.
Written on December 13, 2017