Multivariate prediction: Machine Learning vs Neural Network – Case Introduction and Exploratory Data Analysis (Post 1 of 4)

One of the most common applications of both Machine Learning and Deep Learning is making predictions. This is the first of a total of 4 posts where a multivariate case will be dealt with from start to end through both approaches. The final goal is to gain insight into the complexity of one approach relative to the other, and into their performance.

Case introduction

The case is based on the insurance dataset provided by Miri Choi on Kaggle. The objective is to be able to predict the individual medical costs billed by health insurance based on the features of the person to be insured.

X: (aka. features or predictors)

  • age: Age of the primary beneficiary
  • sex: Insurance contractor gender: female or male
  • bmi: Body mass index (kg/m^2), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9
  • children: Number of children covered by health insurance / number of dependents
  • smoker: Whether or not the person to be insured smokes on a regular basis
  • region: The beneficiary’s residential area in the US: northeast, southeast, southwest or northwest

y: (aka. labels or response variable)

  • charges: Individual medical costs billed by the health insurance company.

Exploratory Data Analysis

These are the python libraries that will be used in this phase:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from datetime import datetime
import pandas as pd
import numpy as np
from scipy import stats
import pprint as pp

Since we will most likely want to reuse the code later, we parameterize the name of the response variable:

y_name = 'charges'

Next up, we open the CSV file and create a DataFrame from it, so we can take a quick look at the data and at how pandas interprets the type of each of the columns.

# pandas can read directly from the file path, no explicit open() needed
csvFilePath = './datasets_13720_18513_insurance.csv'
data = pd.read_csv(csvFilePath)

pp.pprint(data.head())
pp.pprint(data.info())
age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None

Just out of curiosity, let’s see the possible values for the children variable in the data set.

data['children'].unique()
array([0, 1, 3, 2, 5, 4])

Now we know that we have 1,338 rows in our data set and no missing values in any of the variables. We should, however, change the data types of the sex, smoker and region columns to categorical, since that is what they really are.

Strictly speaking, children is an integer predictor variable, but since it has only 6 possible values (0 to 5 children) we might just as well treat it as a categorical feature. The types of the remaining columns are correctly inferred by pandas automatically.

categorical_predictors = ['sex', 'smoker', 'children', 'region']

# convert the four columns in place, then verify the new dtypes and
# look at the summary statistics of the remaining numerical columns
for cp in categorical_predictors:
    data[cp] = data[cp].astype('category')

pp.pprint(data.head())
pp.pprint(data.info())
pp.pprint(data.describe())
age     sex     bmi children smoker     region      charges
0   19  female  27.900        0    yes  southwest  16884.92400
1   18    male  33.770        1     no  southeast   1725.55230
2   28    male  33.000        3     no  southeast   4449.46200
3   33    male  22.705        0     no  northwest  21984.47061
4   32    male  28.880        0     no  northwest   3866.85520
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337

Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1338 non-null   int64   
 1   sex       1338 non-null   category
 2   bmi       1338 non-null   float64 
 3   children  1338 non-null   category
 4   smoker    1338 non-null   category
 5   region    1338 non-null   category
 6   charges   1338 non-null   float64 
dtypes: category(4), float64(2), int64(1)
memory usage: 37.2 KB
None

               age          bmi       charges
count  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397  13270.422265
std      14.049960     6.098187  12110.011237
min      18.000000    15.960000   1121.873900
25%      27.000000    26.296250   4740.287150
50%      39.000000    30.400000   9382.033000
75%      51.000000    34.693750  16639.912515
max      64.000000    53.130000  63770.428010

Labels

In order to have an idea of how the response variable is distributed across our data set, we can use a histogram.

pfig0 = px.histogram(data, x=y_name,
                     histnorm='probability density',
                     marginal='rug',
                     title=f'Histogram and Rug Plot of {y_name}')
pfig0.show()

It shows a very high positive skewness as well as a significant dispersion. An indicator of the latter is a coefficient of variation (CV) of 0.9126.
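
These figures are easy to check directly from the Series; here is a minimal sketch using the scipy and pandas imports from above:

skewness = stats.skew(data[y_name])            # sample skewness
cv = data[y_name].std() / data[y_name].mean()  # coefficient of variation = std / mean
print(f'skewness of {y_name}: {skewness:.4f}')
print(f'CV of {y_name}: {cv:.4f}')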

Categorical features

Since we don’t have that many categorical predictors, we can afford the luxury of making a plot for each of them to try to identify their effect on the response variable (charges).

categorical_predictors = ['sex', 'smoker', 'children', 'region']

for cp in categorical_predictors:
    swarm_plot = sns.swarmplot(data=data, x=y_name, y=cp, palette='muted')
    title = f'{y_name} by {cp}'
    swarm_plot.set(ylabel='', title=title)
    fig = swarm_plot.get_figure()
    # timestamp the file name so reruns do not overwrite earlier plots
    file_name = title + ' ' + datetime.now().isoformat()[:19]
    fig.savefig(file_name, bbox_inches='tight')
    plt.figure()  # start a fresh figure for the next predictor

From the graphs, it looks like the only categorical predictor that independently affects the charges is smoker. Now we can try to identify whether there are second-order interactions of the categorical predictors affecting the response variable:

for cp2 in categorical_predictors:
    for cp in categorical_predictors:
        if cp != cp2:
            plt.figure(figsize=(12.8, 8.16))
            # color each swarm by a second categorical predictor to
            # expose possible pairwise interactions
            swarm_plot = sns.swarmplot(data=data, x=y_name, y=cp,
                                       hue=cp2, palette='muted')
            title = f'{y_name} by {cp} and {cp2}'
            swarm_plot.set(ylabel='', title=title)
            fig = swarm_plot.get_figure()
            file_name = title + ' ' + datetime.now().isoformat()[:19]
            fig.savefig(file_name, bbox_inches='tight')
            plt.figure()

There are no clear differences in the charges distribution across the categories of the second dimension analyzed in each of the above graphs. Therefore, it is quite unlikely that any interaction between the categorical features is a significant predictor of our response variable.

Numerical features

Moving on with the analysis, we try to identify the effects of our numerical predictors and their interactions on the charges.

pfig0 = px.scatter(data,
                   x='bmi',
                   y='charges',
                   title='Scatterplot of bmi vs charges',
                   color_discrete_sequence=px.colors.qualitative.D3)

pfig0.show()
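
To put a number on the visual impression, we can compute the (marginal) linear association between bmi and the charges; a quick sketch reusing the scipy import from above:

r, p_value = stats.pearsonr(data['bmi'], data[y_name])
print(f'Pearson r between bmi and {y_name}: {r:.4f} (p = {p_value:.2e})')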

On its own, bmi seems to have a direct relation (at least at a linear level) with the charges. We continue by checking for a possible significant interaction between bmi and age. As per the following 3D graph, such an interaction seems far-fetched.

pfig2 = go.Figure(data=[go.Mesh3d(x=data['bmi'].to_numpy(),
                                  y=data['age'].to_numpy(),
                                  z=data['charges'].to_numpy(),
                                  opacity=0.5,
                                  color='rgba(244,22,100,0.6)')])

pfig2.update_layout(scene=dict(xaxis=dict(nticks=4, range=[0, 60]),
                               yaxis=dict(nticks=5, range=[0, 70]),
                               zaxis=dict(nticks=6, range=[0, 70_000]),
                               xaxis_title='bmi',
                               yaxis_title='age',
                               zaxis_title='charges'),
                    width=700,
                    margin=dict(r=20, l=10, b=10, t=10))

pfig2.show()

Interaction between categorical and numerical variables

numerical_predictors = ['age', 'bmi']
categorical_predictors = ['sex', 'smoker', 'children', 'region']

# the loop variable is named num so it does not shadow the numpy import;
# a scatter of the numerical predictor vs charges, colored by the
# categorical predictor, surfaces possible interactions between the two
for num in numerical_predictors:
    for cp in categorical_predictors:
        plt.figure(figsize=(12.8, 8.16))
        scatter_plot = sns.scatterplot(data=data, x=num, y=y_name,
                                       hue=cp, palette='muted')
        title = f'{y_name} by {num} and {cp}'
        scatter_plot.set(title=title)
        fig = scatter_plot.get_figure()
        file_name = title + ' ' + datetime.now().isoformat()[:19]
        fig.savefig(file_name, bbox_inches='tight')
        plt.figure()

The above graphs suggest that the interactions age-smoker and bmi-smoker are significantly relevant as predictors for the charges. In the case of the former, though, it looks like there is a third variable taking part in the interaction. We will check age-smoker-sex first:

pfig3 = px.scatter(data,
                   x='age',
                   y=y_name,
                   color='smoker',
                   facet_col='sex',
                   title='charges by age-smoker per sex',
                   color_discrete_sequence=px.colors.qualitative.D3)
pfig3.show()

We can hardly see any difference in the age-smoker interaction when separating males and females. Therefore, it would be better to look elsewhere for a possible third-order interaction explaining the 3 different cost groups shown in the scatter of charges by age and smoker. The age-bmi-smoker interaction could also explain them, so we will inspect it.

pfig4 = px.scatter_3d(data, x='age', y='bmi', z='charges',
                      color='smoker',
                      title='3D scatter of age-bmi-smoker',
                      opacity=0.4,
                      color_discrete_sequence=px.colors.qualitative.D3)
pfig4.show()

It looks like the interaction we were looking for is age-bmi-smoker. Naturally, the older a person gets, the higher the insurance cost. Additionally, smokers pay significantly more than non-smokers. We already knew both things from the scatter of charges vs age and smoker. The novelty here is that low-bmi smokers (those with a bmi of roughly 17 to 25) are much more likely to get a significantly lower insurance cost than smokers with a bmi above that range.
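
As a rough numeric check of this reading, we can band the smokers by bmi and compare their typical charges. Note that the cut points below (25 and 30) are illustrative choices of ours, not values derived earlier in the analysis:

smokers = data[data['smoker'] == 'yes']
# illustrative bmi bands; 25 and 30 are assumptions, not fitted thresholds
bmi_bands = pd.cut(smokers['bmi'], bins=[0, 25, 30, 100],
                   labels=['<=25', '25-30', '>30'])
print(smokers.groupby(bmi_bands)['charges'].agg(['count', 'mean', 'median']))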

Conclusions from the Exploratory Data Analysis

After a thorough Exploratory Data Analysis, we can conclude that children and region are irrelevant. On the other hand, the features and interactions that suggest being significant predictors of our response variable are: smoker, sex, bmi, age, bmi-smoker, age-smoker and age-smoker-bmi.

All of the aforementioned second- and third-order interactions proved relevant on their own, but left significant room for improvement in explaining the variation of the charges. Therefore, we suspect that a fourth-order interaction could be an even better predictor of our response variable. Because the only variable not used in the third-order interaction is sex, that fourth-order interaction would have to be age-smoker-bmi-sex, which makes sense if the bmi distribution differs considerably between men and women.
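
That premise is a one-liner to check against the data set itself:

# does the bmi distribution actually differ between sexes in this sample?
print(data.groupby('sex')['bmi'].describe())

If the two distributions turn out to be similar, the case for the fourth-order interaction weakens accordingly.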

While it required a significant effort, the analysis has helped us determine which variables and interactions are worth considering as features to predict the charges under both the Machine Learning regression and the Neural Network approaches.
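
As a preview of the upcoming posts, here is one possible sketch, not the final pipeline, of how the selected interactions could be materialized as model inputs; the helper column names below are our own:

model_data = data.copy()
# encode smoker as 0/1 so it can enter the interaction terms
model_data['smoker_flag'] = (model_data['smoker'] == 'yes').astype(int)
model_data['age_smoker'] = model_data['age'] * model_data['smoker_flag']
model_data['bmi_smoker'] = model_data['bmi'] * model_data['smoker_flag']
model_data['age_bmi_smoker'] = (model_data['age'] * model_data['bmi']
                                * model_data['smoker_flag'])

With these candidates in place, everything is ready for the modeling posts that follow.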
