US Medical Insurance Data: Connection between BMI and Smoking?

jid · May 9, 2021, 2:42am

OrestesNZ/SmokingBMI/blob/main/Smoker-Bmi.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Is there a connection between BMI and Smoking from US Insurance data provided?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using Medical Data gathered from 2008 US Insuranace companies. We seek to break down the information into segments such as, Body Mass Index based upon the CDC guidelines for quantiles. Whether or not someone is a Smoker or Non Smoker, and which particular region they are from. We will then try to understand if there is connection or not, or a higher likely hood that someone from a particular BMI range is going to be a smoker or not. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [

This file has been truncated.

Any feedback would be great. I realize it's quite verbose in terms of the amount of code I used, and I will one day proofread it for spelling errors etc. I'm also looking into breaking it down into classes to lessen the amount of code required. That said, I'm quite happy with the results; this was more about the process of solidifying some of the skills I have learned from Codecademy so far.

Thanken ye all…

lisalisaj · May 9, 2021, 2:39pm

A thought… Rather than writing out the code to get the stats for each region (mean, std, median), you could first break out each region by passing a boolean mask to .iloc (via .values), like so:

southwest = df.iloc[(df['region']=='southwest').values]

And then just use the .describe() method on the bmi column:

southwest['bmi'].describe()

which results in:

count    325.000000
mean      30.596615
std        5.691836
min       17.400000
25%       26.900000
50%       30.300000
75%       34.600000
max       47.600000
Name: bmi, dtype: float64

See the pandas documentation for .describe().
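Going one step further, the per-region split can probably be skipped altogether by grouping on the region column. The sketch below is only an illustration: the `region` and `bmi` column names come from the snippet above, but the small DataFrame is made up so the example runs on its own, and the real notebook would load the insurance data instead.

```python
import pandas as pd

# Made-up sample rows for illustration only; not the real insurance data.
df = pd.DataFrame({
    "region": ["southwest", "southwest", "southeast",
               "southeast", "northwest", "northwest"],
    "bmi": [30.0, 34.0, 28.0, 36.0, 25.0, 27.0],
})

# Plain boolean-mask selection; selects the same rows as the
# .iloc[(...).values] form for a single region.
southwest = df[df["region"] == "southwest"]

# One call gives count/mean/std/min/quartiles/max for every region.
bmi_by_region = df.groupby("region")["bmi"].describe()
print(bmi_by_region)
```

Each row of `bmi_by_region` is one region, so a single value can be read with something like `bmi_by_region.loc["southwest", "mean"]`.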

lisalisaj · May 9, 2021, 3:10pm

imo you don’t need to create classes. Pandas is quite powerful in its own right (in addition to scipy.stats and the math library). If you can write a function, you can run a two-tailed t-test for statistical significance and also use Cohen’s d to gauge the effect size, i.e. how large the difference between the two groups is.
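As a rough sketch of that idea, the snippet below compares the BMI of two groups with `scipy.stats.ttest_ind` (two-sided by default) and a small hand-written `cohens_d` helper. The helper name, the choice of Welch's variant (`equal_var=False`), and the BMI values are all assumptions made for illustration; in the notebook the two lists would come from the smoker and non-smoker rows of the data.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled std."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1)
                  + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Made-up BMI values for illustration only; not the real insurance data.
smoker_bmi = [31.2, 28.4, 35.1, 29.9, 33.0, 27.5]
non_smoker_bmi = [30.1, 26.8, 32.4, 28.7, 34.2, 25.9]

# Two-sided test; equal_var=False selects Welch's t-test, which does
# not assume the two groups have the same variance.
t_stat, p_value = stats.ttest_ind(smoker_bmi, non_smoker_bmi, equal_var=False)
d = cohens_d(smoker_bmi, non_smoker_bmi)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, Cohen's d = {d:.3f}")
```

A small p-value suggests the difference in mean BMI is unlikely to be chance alone, while d says how big that difference is in standard-deviation units, so the two numbers are best read together.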