Predicting Medical Costs using Multivariate Linear Regression in Python

Exploring the power of pandas, numpy, and sklearn in analyzing an insurance dataset and building a predictive model

Multivariate Linear Regression

Multivariate linear regression is a statistical method used to model the relationship between multiple independent variables and a single dependent variable. It is an extension of simple linear regression, which only involves one independent variable. In multivariate linear regression, the goal is to find the equation that best predicts the value of the dependent variable based on the values of the independent variables. The equation is in the form of Y = a + b1X1 + b2X2 + ... + bnXn, where Y is the dependent variable, X1, X2, ..., Xn are the independent variables, a is the constant term, and b1, b2, ..., bn are the coefficients that represent the relationship between each independent variable and the dependent variable.
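To make the equation concrete, here is a minimal sketch that evaluates it with NumPy for a single observation. The constant term, coefficients, and feature values are made-up numbers chosen only for illustration; they are not estimates from the insurance data.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
a = 1000.0                      # constant term (intercept)
b = np.array([250.0, 300.0])    # coefficients b1, b2
x = np.array([40.0, 30.0])      # one observation: X1, X2

# Y = a + b1*X1 + b2*X2, written as a dot product
y_hat = a + b @ x
print(y_hat)  # 20000.0
```

Fitting a model means choosing a and b1, ..., bn so that predictions like this one are as close as possible to the observed values of Y.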

What are we doing here?

The goal is to predict each beneficiary's medical charges from the other columns in the dataset.

Columns present in dataset:

age: age of primary beneficiary

sex: gender of the insurance contractor (female or male)

bmi: body mass index, an objective measure of body weight relative to height (kg / m^2, weight divided by height squared); values between 18.5 and 24.9 are considered ideal

children: number of children covered by health insurance / number of dependents

smoker: whether the beneficiary smokes (yes or no)

region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)

charges: Individual medical costs billed by health insurance

Importing important libraries

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


import matplotlib.pyplot as plt

Reading files

The code below uses the read_csv() function from the pandas library to read the medical insurance data from a CSV file and assigns the resulting dataframe to a variable named df.

df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Feature engineering

Next, we apply one-hot encoding to the sex, region, and smoker columns of the dataframe and assign the resulting dataframe to a new variable, df_encoded.

# Apply one-hot encoding to the "sex", "region", and "smoker" columns
df_encoded = pd.get_dummies(df, columns=['sex', 'region', 'smoker'])
df_encoded
      age     bmi  children      charges  sex_female  sex_male  region_northeast  region_northwest  region_southeast  region_southwest  smoker_no  smoker_yes
0      19  27.900         0  16884.92400           1         0                 0                 0                 0                 1          0           1
1      18  33.770         1   1725.55230           0         1                 0                 0                 1                 0          1           0
2      28  33.000         3   4449.46200           0         1                 0                 0                 1                 0          1           0
3      33  22.705         0  21984.47061           0         1                 0                 1                 0                 0          1           0
4      32  28.880         0   3866.85520           0         1                 0                 1                 0                 0          1           0
...
1333   50  30.970         3  10600.54830           0         1                 0                 1                 0                 0          1           0
1334   18  31.920         0   2205.98080           1         0                 1                 0                 0                 0          1           0
1335   18  36.850         0   1629.83350           1         0                 0                 0                 1                 0          1           0
1336   21  25.800         0   2007.94500           1         0                 0                 0                 0                 1          1           0
1337   61  29.070         0  29141.36030           1         0                 0                 1                 0                 0          0           1

1338 rows × 12 columns

df_encoded.columns

Index(['age', 'bmi', 'children', 'charges', 'sex_female', 'sex_male', 'region_northeast', 'region_northwest', 'region_southeast', 'region_southwest', 'smoker_no', 'smoker_yes'], dtype='object')
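To see what one-hot encoding does in isolation, the sketch below runs pd.get_dummies on a tiny made-up frame (the rows are illustrative, not taken from the dataset). It also shows the optional drop_first=True argument, which is not used in this article: because sex_female is always 1 - sex_male, one column per category is redundant, and dropping it is a common way to avoid that redundancy in a linear model.

```python
import pandas as pd

# Toy frame with made-up rows, only to show what get_dummies does
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Full one-hot encoding: one 0/1 column per category
full = pd.get_dummies(toy, columns=["sex", "smoker"], dtype=int)
print(full.columns.tolist())
# ['age', 'sex_female', 'sex_male', 'smoker_no', 'smoker_yes']

# Optional variant: drop the first level of each category, which
# removes the redundant column (sex_female, smoker_no)
reduced = pd.get_dummies(toy, columns=["sex", "smoker"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
# ['age', 'sex_male', 'smoker_yes']
```

Both encodings carry the same information; the reduced one simply leaves the dropped category as the baseline.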

Feature selection

Next, the code selects the relevant columns of the encoded dataframe to use as independent variables (X) and the dependent variable (y) for the linear regression model.

X = df_encoded[['age', 'bmi', 'children', 'sex_female', 'sex_male',
       'region_northeast', 'region_northwest', 'region_southeast',
       'region_southwest', 'smoker_no', 'smoker_yes']]
y = df_encoded['charges']
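An equivalent way to build X is to drop the target column instead of listing every feature by name, which stays correct if the set of encoded columns changes. The sketch below uses a small made-up stand-in for df_encoded; the names toy_encoded, X_toy, and y_toy are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_encoded, with made-up values
toy_encoded = pd.DataFrame({
    "age": [19, 18],
    "bmi": [27.9, 33.77],
    "smoker_yes": [1, 0],
    "charges": [16884.924, 1725.5523],
})

target = "charges"
X_toy = toy_encoded.drop(columns=[target])   # every column except the target
y_toy = toy_encoded[target]                  # the dependent variable

print(X_toy.columns.tolist())  # ['age', 'bmi', 'smoker_yes']
print(y_toy.shape)             # (2,)
```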

Preparing model

The code below splits the data into training and test sets with the train_test_split function, fits the linear regression model to the training data, and records the MSE and R^2 score on both sets. It then plots the predictions, the residuals, and the recorded loss.

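# split the data into training and test sets (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create a linear regression model
model = LinearRegression()
# record the MSE on the training and test sets after each fit
train_loss = []
test_loss = []

# LinearRegression solves least squares directly, so every pass of this
# loop yields the same coefficients and therefore the same loss
for i in range(100):
    model.fit(X_train, y_train)
    train_loss.append(mean_squared_error(y_train, model.predict(X_train)))
    test_loss.append(mean_squared_error(y_test, model.predict(X_test)))
# R^2 scores on the training and test sets
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Train MSE:", train_loss[-1], "Test MSE:", test_loss[-1])
print("Train R^2:", train_score, "Test R^2:", test_score)
# predict the values for the training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)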
# Plot the prediction line
plt.scatter(y_train, y_train_pred,label='train')
plt.scatter(y_test, y_test_pred,label='test')
plt.legend()
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Prediction line")
plt.show()

Figure: prediction line (predicted vs. actual values, train and test)

# Plot the residuals
plt.scatter(y_train_pred, y_train_pred - y_train,label='train')
plt.scatter(y_test_pred, y_test_pred - y_test,label='test')
plt.legend()
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()

Figure: residuals (predicted minus actual, plotted against predicted values)
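When reading a residual plot, a useful reference point is that ordinary least squares with an intercept always produces training residuals that average to zero; what matters is whether they show a pattern around that zero line. The sketch below checks this on synthetic data (not the insurance dataset); X_demo, y_demo, and model_demo are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features, plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

model_demo = LinearRegression().fit(X_demo, y_demo)

# Same sign convention as the residual plot code: predicted minus actual
residuals = model_demo.predict(X_demo) - y_demo

# The mean is zero up to floating-point error; the spread reflects the noise
print(abs(residuals.mean()) < 1e-9)
print(residuals.std())
```

Residuals on the test set carry no such guarantee, so a test-set mean far from zero or a clear trend in the plot suggests the linear model is missing structure in the data.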

# Plot the loss
plt.plot(train_loss, label='train')
plt.plot(test_loss, label='test')
plt.legend()
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.title("Loss Plot")
plt.show()

Figure: loss (MSE on the training and test sets per iteration)
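One caveat about this plot: LinearRegression solves the least-squares problem directly rather than iterating, so refitting it in a loop returns the same coefficients each time and the loss curve is flat. A loss curve that changes requires an iterative optimizer. The sketch below contrasts the two on synthetic data, using scikit-learn's SGDRegressor with partial_fit; this is a swapped-in technique for illustration, not the model used in this article, and the learning-rate settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data with roughly standardized features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Refitting LinearRegression gives the same loss every time
ols_losses = []
for _ in range(3):
    ols = LinearRegression().fit(X_demo, y_demo)
    ols_losses.append(mean_squared_error(y_demo, ols.predict(X_demo)))

# SGDRegressor updates its weights incrementally, so the loss changes per pass
sgd = SGDRegressor(learning_rate="constant", eta0=0.001, random_state=0)
sgd_losses = []
for _ in range(20):
    sgd.partial_fit(X_demo, y_demo)   # one pass over the data
    sgd_losses.append(mean_squared_error(y_demo, sgd.predict(X_demo)))

print(np.allclose(ols_losses, ols_losses[0]))  # flat curve
print(sgd_losses[0] > sgd_losses[-1])          # loss decreases with more passes
```

Gradient-based models generally need scaled features, which is why the synthetic features here are drawn from a standard normal distribution.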

Overall, this code performs a linear regression analysis on an insurance dataset. It imports the necessary libraries, reads the data from a CSV file with pandas, applies one-hot encoding to the categorical columns, selects the columns to use in the model, splits the data into training and test sets, and fits a linear regression model to the training data. The MSE on the training and test sets serves as the measure of performance.
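Since MSE is expressed in squared units, it is often reported alongside RMSE (same units as charges) and R^2, which is what LinearRegression.score returns. The sketch below shows the arithmetic on a handful of made-up actual and predicted values; the numbers are illustrative and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted charges, only to show the arithmetic
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # back in the units of charges
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mse)   # 25000.0
print(rmse)
print(r2)
```

Here the squared errors are 10000, 10000, 40000, and 40000, so the MSE is 25000, the RMSE is about 158, and R^2 is 1 - 100000 / 5000000 = 0.98.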

Main Post: Complete-Data-Science-Bootcamp

Support Anurag Verma by becoming a sponsor. Any amount is appreciated!