One of the most common applications of both Machine Learning and Deep Learning is making predictions. This is the first of a total of 4 posts where a multivariate case will be handled from start to finish through both approaches. The final goal is to gain insight into the complexity of one approach versus the other, and into their performance.
Case introduction
The case is based on the insurance dataset provided by Miri Choi on Kaggle. The objective is to be able to predict the individual medical costs billed by health insurance based on the features of the person to be insured.
X: (aka. features or predictors)
- age: Age of the primary beneficiary
- sex: Insurance contractor gender, female, male
- bmi: Body mass index, an objective index of body weight relative to height (kg/m²) that indicates whether weight is relatively high or low; the ideal range is 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: Whether or not the person to be insured smokes on a regular basis.
- region: The beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
y: (aka. labels or response variable)
- charges: Individual medical costs billed by the health insurance company.
Exploratory Data Analysis
These are the Python libraries that will be used in this phase:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from datetime import datetime
import pandas as pd
import numpy as np
from scipy import stats
import pprint as pp
Since we will most likely want to recycle the code later, we parameterize the name of the response variable:
y_name = 'charges'
Next up, we open the CSV file and create a DataFrame from it, so we can take a quick look at the data and at how pandas is interpreting the type of each of the columns.
csvFilePath = './datasets_13720_18513_insurance.csv'
with open(csvFilePath, 'rb') as file:
    data = pd.read_csv(file)
pp.pprint(data.head())
pp.pprint(data.info())
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None
Just out of curiosity, let's see the possible values of the children variable in the data set.

data['children'].unique()

array([0, 1, 3, 2, 5, 4])
Now we know that we have 1,338 rows in our data set and that there are no missing values in any of the variables. Nevertheless, we should change the data types of the sex, smoker, and region columns to categorical, since that is what they really are. Strictly speaking, children is an integer predictor variable, but since it takes only 6 possible values (0 to 5 children) we might just as well treat it as a categorical feature. pandas automatically interprets the remaining column types correctly.
# Convert the columns identified above to the category dtype, then
# re-inspect the DataFrame and print the summary statistics.
for col in ['sex', 'smoker', 'children', 'region']:
    data[col] = data[col].astype('category')
pp.pprint(data.head())
pp.pprint(data.info())
pp.pprint(data.describe())
   age     sex     bmi children smoker     region      charges
0   19  female  27.900        0    yes  southwest  16884.92400
1   18    male  33.770        1     no  southeast   1725.55230
2   28    male  33.000        3     no  southeast   4449.46200
3   33    male  22.705        0     no  northwest  21984.47061
4   32    male  28.880        0     no  northwest   3866.85520
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   category
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   category
 4   smoker    1338 non-null   category
 5   region    1338 non-null   category
 6   charges   1338 non-null   float64
dtypes: category(4), float64(2), int64(1)
memory usage: 37.2 KB
None
               age          bmi       charges
count  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397  13270.422265
std      14.049960     6.098187  12110.011237
min      18.000000    15.960000   1121.873900
25%      27.000000    26.296250   4740.287150
50%      39.000000    30.400000   9382.033000
75%      51.000000    34.693750  16639.912515
max      64.000000    53.130000  63770.428010
Labels
In order to have an idea of how the response variable is distributed across our data set, we can use a histogram.
pfig0 = px.histogram(data, x=y_name, histnorm='probability density',
                     marginal='rug',
                     title=f'Histogram and Rug Plot of {y_name}')
pfig0.show()
It shows very high positive skewness as well as significant dispersion. An indicator of the latter is a coefficient of variation (CV) of 0.9126.
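As a quick sanity check (an addition to the post; the figure 0.9126 follows directly from the summary statistics printed above):

# Coefficient of variation of charges: standard deviation over mean.
cv = data[y_name].std() / data[y_name].mean()
print(f'CV of {y_name}: {cv:.4f}')  # ~0.9126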
Categorical features
Since we don’t have that many categorical predictors we can afford the luxury of making a plot for each of them to try to identify their effect on the response variable (charges).
categorical_predictors = ['sex', 'smoker', 'children', 'region']
for cp in categorical_predictors:
    swarm_plot = sns.swarmplot(data=data, x=y_name, y=cp, palette='muted')
    title = f'{y_name} by {cp}'
    swarm_plot.set(ylabel='', title=title)
    fig = swarm_plot.get_figure()
    file_name = title + ' ' + datetime.now().isoformat()[:19]
    fig.savefig(file_name, bbox_inches='tight')
    plt.figure()
From the graphs, it looks like the only categorical predictor that independently affects charges is smoker. Now we can try to identify whether there are second-order interactions of the categorical predictors affecting the response variable:
for cp2 in categorical_predictors:
    for cp in categorical_predictors:
        if cp != cp2:
            plt.figure(figsize=(12.8, 8.16))
            swarm_plot = sns.swarmplot(data=data, x=y_name, y=cp,
                                       hue=cp2, palette='muted')
            title = f'{y_name} by {cp} and {cp2}'
            swarm_plot.set(ylabel='', title=title)
            fig = swarm_plot.get_figure()
            file_name = title + ' ' + datetime.now().isoformat()[:19]
            fig.savefig(file_name, bbox_inches='tight')
            plt.figure()
There are no clear differences in the distribution of charges among the categories of the second dimension analyzed in each of the above graphs. Therefore, it is quite unlikely that any interaction between the categorical features is a significant predictor of our response variable.
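To back the visual impression with numbers, a quick cross-tabulation of mean charges per pair of categories can be printed (this check is an addition, not part of the original analysis); roughly parallel rows in each table would confirm the lack of interaction:

# Mean charges for every ordered pair of categorical predictors. If the
# column pattern barely changes from row to row, there is little sign of
# an interaction between the two categories.
for cp2 in categorical_predictors:
    for cp in categorical_predictors:
        if cp != cp2:
            print(f'--- mean {y_name} by {cp} x {cp2} ---')
            print(data.groupby([cp, cp2], observed=True)[y_name].mean().unstack())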
Numerical features
Moving on with the analysis, we try to identify the effects of our numerical predictors, and of their interactions, on charges.
pfig0 = px.scatter(data, x='bmi', y='charges',
                   title='Scatterplot of bmi vs charges',
                   color_discrete_sequence=px.colors.qualitative.D3)
pfig0.show()
On its own, bmi seems to have a direct relation (at least at a linear level) with charges. We continue by checking for a possible significant interaction between bmi and age. As per the following 3D graph, this looks far-fetched.
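As a rough numeric counterpart to the scatterplot (an addition to the original analysis), we can compute Pearson correlations for the numerical predictors, reusing the scipy import from the beginning:

# Pairwise Pearson correlations between the numerical columns and charges.
print(data[['age', 'bmi', y_name]].corr(method='pearson'))

# The same for bmi alone, with a p-value from scipy.
r, p_value = stats.pearsonr(data['bmi'], data[y_name])
print(f'bmi vs {y_name}: r={r:.3f}, p={p_value:.3g}')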
pfig2 = go.Figure(data=[go.Mesh3d(x=data['bmi'].to_numpy(),
                                  y=data['age'].to_numpy(),
                                  z=data['charges'].to_numpy(),
                                  opacity=0.5,
                                  color='rgba(244,22,100,0.6)')])
pfig2.update_layout(scene=dict(xaxis=dict(nticks=4, range=[0, 60]),
                               yaxis=dict(nticks=5, range=[0, 70]),
                               zaxis=dict(nticks=6, range=[0, 70_000]),
                               xaxis_title='bmi',
                               yaxis_title='age',
                               zaxis_title='charges'),
                    width=700,
                    margin=dict(r=20, l=10, b=10, t=10))
pfig2.show()
Interaction between categorical and numerical variables
numerical_predictors = ['age', 'bmi']
categorical_predictors = ['sex', 'smoker', 'children', 'region']
# Note: the loop variable must not be called 'np', or it would shadow numpy.
for num in numerical_predictors:
    for cp in categorical_predictors:
        plt.figure(figsize=(12.8, 8.16))
        scatter_plot = sns.scatterplot(data=data, x=num, y=y_name,
                                       hue=cp, palette='muted')
        title = f'{y_name} by {num} and {cp}'
        scatter_plot.set(title=title)
        fig = scatter_plot.get_figure()
        file_name = title + ' ' + datetime.now().isoformat()[:19]
        fig.savefig(file_name, bbox_inches='tight')
        plt.figure()
The above graphs suggest that the interactions age-smoker and bmi-smoker are significant predictors of charges. However, in the case of the former, the three distinct cost bands suggest that a third variable takes part in the interaction. We will check age-smoker-sex first:
pfig3 = px.scatter(data, x='age', y=y_name, color='smoker', facet_col='sex',
                   title='charges by age-smoker per sex',
                   color_discrete_sequence=px.colors.qualitative.D3)
pfig3.show()
We can hardly see any difference in the age-smoker interaction when separating males and females. Therefore, it would be better to look elsewhere for a possible third-order interaction explaining the 3 different cost groups shown in the scatter of charges by age and smoker. The interaction age-bmi-smoker could also explain them, so we will inspect it.
pfig4 = px.scatter_3d(data, x='age', y='bmi', z='charges', color='smoker',
                      title='3D scatter of age-bmi-smoker', opacity=0.4,
                      color_discrete_sequence=px.colors.qualitative.D3)
pfig4.show()
It looks like the interaction we were looking for is age-bmi-smoker. Naturally, the older a person gets, the higher the insurance cost. Additionally, smokers pay significantly more than non-smokers. We already knew both things from the scatter of charges vs age and smoker. The novelty here is that low-bmi smokers (those with a bmi of roughly 17 to 25) are much more likely to get a significantly lower insurance cost than smokers with a bmi above that range.
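A simple numeric check of that reading (the cutoff of 25 is taken from the bmi range quoted above; the check itself is an addition to the post):

# Compare mean charges for smokers on either side of the bmi range above.
smokers = data[data['smoker'] == 'yes']
low_bmi = smokers['bmi'] <= 25
print(f'mean {y_name}, smokers with bmi <= 25: {smokers.loc[low_bmi, y_name].mean():,.0f}')
print(f'mean {y_name}, smokers with bmi > 25: {smokers.loc[~low_bmi, y_name].mean():,.0f}')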
Conclusions from the Exploratory Data Analysis
After a thorough exploratory data analysis, we can conclude that children and region are irrelevant. On the other hand, the features and interactions that suggest being significant predictors of our response variable are: smoker, sex, bmi, age, bmi-smoker, age-smoker, and age-smoker-bmi.
All of the before-mentioned second- and third-order interactions proved relevant on their own, but they still leave significant room for improvement in explaining the variation of charges. Therefore, we infer that a fourth-order interaction could be an even better predictor of our response variable. Because the only variable not used in the third-order interaction was sex, that fourth-order interaction would have to be age-smoker-bmi-sex, which makes a lot of sense, since bmi differs considerably between men and women.
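As a sketch of how such a term could be encoded for the upcoming modelling posts (the dummy encodings and the column name are illustrative assumptions, not something established in this analysis):

# Hypothetical construction of the fourth-order interaction term.
interactions = data.copy()
interactions['smoker_num'] = (interactions['smoker'] == 'yes').astype(int)
interactions['sex_num'] = (interactions['sex'] == 'male').astype(int)
interactions['age_bmi_smoker_sex'] = (interactions['age'] * interactions['bmi']
                                      * interactions['smoker_num']
                                      * interactions['sex_num'])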
While it required a significant effort, this analysis has helped us determine which variables and interactions are worth considering as features in order to predict charges under both the Machine Learning regression and the Neural Network approaches.