Hi all,
Currently working on the portfolio project of US Medical Insurance.
I’ve created a two-way table using the Pandas crosstab function:
average_cost_by_sex_and_region = pd.crosstab(index=pd_df["sex"],
columns=pd_df["region"])
average_cost_by_sex_and_region.index = ["male","female"]
average_cost_by_sex_and_region
At the moment, the values in this table are “count” values - the number of females in the Northwest, for example, and what it want it to be are the mean insurance cost for them. I’ve generated those figures here:
pd_df.groupby(["region", "sex"])["charges"].mean() # Average cost split by biological sex in area
But I can’t work out how to get those figures into the two-way table - does that make sense?
I also can’t get the syntax of “margins” to work such that I get row and column totals, although the documentation suggests this should be fairly easy - any suggestions on that?
Thanks in advance
Do you mean something like this? (Also, maybe look at median rather than mean b/c there are outliers that pull the mean).
sum in each region:
df.groupby(['region', 'sex'])['charges'].sum().round(2)
>>region sex
northeast female 2085465.71
male 2258202.88
northwest female 2046698.75
male 1989013.25
southeast female 2362442.12
male 3001247.65
southwest female 1826454.62
male 2186300.02
Or, the median charges for both sexes in each region?:
df.groupby(['region', 'sex'])['charges'].median().round(2)
>>region sex
northeast female 10197.77
male 9957.72
northwest female 9614.07
male 8413.46
southeast female 8582.30
male 9504.31
southwest female 8530.84
male 9391.35
>>
Or, the number of men & women in each region?
df.groupby(['region', 'sex'])['sex'].count()
>>region sex
northeast female 161
male 163
northwest female 164
male 161
southeast female 175
male 189
southwest female 162
male 163
1 Like
Hi,
Thanks - some really good learning points there that I can definitely use! The median point is a good one, and I’ll def incorporate that.
I still can’t work out how to get it into 2 x 4 crosstab table though, although that’s more from the point of view of a data vis point of view, as I can’t do anything like Chi squared on it if it’s not frequencies anyway (I think!).
1 Like
No, you can’t do chi squared here.
Maybe a viz? something like,
sns.barplot(x='sex', y='charges', hue='region', estimator= np.median, data=df)
1 Like