Hi @roeystern8114604813, your project is interesting and I would like to share some comments and improvements. I will go part by part to be clearer.
Import part of the code:
Not much to say, only that throughout your code you repeatedly convert the numeric elements of your lists to integers or floats. You even rebuild the children list three times (as @netmarlon already pointed out) just for this conversion. I suggest that you do the conversion once, at this point, inside the with context manager.
#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?
import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(int(row['age'])) # Convert to a number type here instead of converting every time you need it
        sex.append(row['sex'])
        bmi.append(float(row['bmi'])) # Same
        children.append(int(row['children'])) # Same
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges'])) # Same
Objective 1:
I think you should use return if the output of the function is going to be assigned to a variable; if the function only prints something to the terminal, drop the return keyword.
def average_age(lst):
    total_ages = 0
    #total_people = 0 --> It is unnecessary, you can use len(lst) to get the total people
    for x in lst:
        total_ages += x
    average_age = total_ages/len(lst)
    print('Out of {} participants, the average age was {}'.format(len(lst),round(average_age,2)))
average_age(age)
Another approach to the previous function is to use the built-in sum function.
def average_age(lst):
    average_age = sum(lst)/len(lst)
    print('Out of {} participants, the average age was {}'.format(len(lst),round(average_age,2)))
average_age(age)
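For contrast with the printing versions above, here is a minimal sketch of the return-based variant. The name mean_age and the sample ages are made up for illustration; the point is only that the function computes and the caller decides whether to print or reuse the value.

```python
# Hypothetical sample data standing in for the `age` list built from the CSV.
sample_ages = [19, 18, 28, 33, 32]


def mean_age(ages):
    """Return the average of a list of ages, rounded to 2 decimals."""
    return round(sum(ages) / len(ages), 2)


# The caller stores the result and can print it or reuse it later.
result = mean_age(sample_ages)
print('Out of {} participants, the average age was {}'.format(len(sample_ages), result))
```

A function written this way can feed other calculations, which a function that only prints cannot do.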
Objective 2:
Apart from the function for this objective, the rest of your code never uses the list r_and_c, so I think it is better to build it directly inside the function definition. Honestly, I do not know which is more efficient, but this way you do not need two parameters to call your function. In addition, zip returns an iterator that the for loop can consume directly, so there is no need to convert it to a list.
def region_average(region_name):
    region_count = 0
    region_total_charges = 0
    for x,y in zip(region,charges): # You do not need to enclose x,y in parentheses (just a visual change); Python automatically unpacks each tuple
        if x == region_name:
            region_count += 1 # Unlike total_people in the previous function, this counter is needed because it increments only when the region matches
            region_total_charges += y # Already converted to float in the with context manager
    region_average = region_total_charges/region_count
    return round(region_average,2)
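One caveat about zip is worth keeping in mind: the object it returns is an iterator, so it can be looped over only once. A small sketch with made-up values shows what happens on a second pass.

```python
# Made-up values; in the project these would be the `region` and `charges` lists.
pairs = zip(['southwest', 'southeast'], [16884.92, 1725.55])

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

This is one more reason to call zip inside the function: every call gets a fresh iterator, so the function can safely be called once per region.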
Just as there are list comprehensions, there are also dictionary comprehensions. I suggest creating the following dictionary:
region_dict = {name : None for name in region}
It is a dictionary with the regions as keys; the values do not matter, so I set them all to None. I have two reasons for this. Firstly, to print the average cost per region you manually define four variables and then write a print statement with four lines. Secondly, you need to know the number (and names) of the different regions beforehand; what if the CSV file has millions of rows?
Why a dictionary? Because its keys are unique, so it holds each region name only once, whereas the list of regions contains every region repeated many times. With the previous dictionary it is possible to write:
# Python code
region_dict = sorted(region_dict) # It is optional to sort by key in order to print the information alphabetically
for region_name in region_dict: # It iterates over the keys
    region_avg = region_average(region_name)
    print('The average insurance charges of the {} region is {}.'.format(region_name.capitalize(),region_avg))
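As a side note on collecting the unique region names, two other built-in options would give the same result as the dictionary comprehension. This is only a sketch on a made-up list; sample_regions is a stand-in for the region list read from the CSV.

```python
# Made-up stand-in for the `region` list read from the CSV.
sample_regions = ['southwest', 'southeast', 'southeast', 'northwest', 'southwest']

# A set keeps one copy of each name; sorted() returns them as an alphabetical list.
unique_sorted = sorted(set(sample_regions))

# dict.fromkeys also drops duplicates but keeps the order of first appearance.
unique_in_order = list(dict.fromkeys(sample_regions))

print(unique_sorted)
print(unique_in_order)
```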
Another approach to your function for objective 2 is below. With a list comprehension you avoid creating the other variables, because the len of the resulting list is the region count and the sum of its elements is the total charges for that region.
def region_average(region_name):
    charges_by_region = [y for x,y in zip(region,charges) if x == region_name]
    region_average = sum(charges_by_region)/len(charges_by_region)
    return round(region_average,2)
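Both versions of region_average scan the whole data once per region. If that ever became a concern with a very large file, one possible alternative is to accumulate all regions in a single pass. The sketch below uses made-up data, and averages_by_region is a hypothetical name, not something from your code.

```python
from collections import defaultdict

# Made-up stand-ins for the `region` and `charges` lists.
sample_regions = ['southwest', 'southeast', 'southeast', 'northwest']
sample_charges = [100.0, 200.0, 400.0, 50.0]


def averages_by_region(regions, charges):
    """Return {region: average charge}, reading the data only once."""
    totals = defaultdict(float)  # running sum of charges per region
    counts = defaultdict(int)    # number of rows seen per region
    for name, charge in zip(regions, charges):
        totals[name] += charge
        counts[name] += 1
    return {name: round(totals[name] / counts[name], 2) for name in sorted(totals)}


region_averages = averages_by_region(sample_regions, sample_charges)
for name, avg in region_averages.items():
    print('The average insurance charges of the {} region is {}.'.format(name.capitalize(), avg))
```

As with the other suggestions, I have not measured whether this is actually faster on your file; it just avoids the repeated scans.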
Objective 3:
Not much to say. As before, you do not need to enclose the temporary variables of the for loop in parentheses, and you do not need to convert each element of the bmi list to float, since that was done in the with context manager. You can also drop the return keyword, because the function prints to the terminal.
I think that when you use a multi-line string ''' String ''' you should not indent every line, just the line with the print call. If you indent every line of the string, the output will be printed to the terminal indented, and you may not want that; at least that is what happened to me.
s_and_bmi = zip(smoker,bmi) # The same comment as for the list r_and_c in the previous objective. This time I will leave this
# object outside the function, but I think it is better to build it inside because you do not use it again. In any case, since zip
# returns an iterator that the for loop consumes directly, it is unnecessary to convert it to a list
def smoker_bmi(lst):
    total_smokers = 0
    total_smoker_bmi = 0
    total_nonsmoker_bmi = 0
    total_nonsmokers = 0
    for x,y in lst: # I removed the parentheses
        if x == 'yes':
            total_smokers += 1
            total_smoker_bmi += y # Already converted to float in the with context manager
        elif x == 'no':
            total_nonsmokers += 1
            total_nonsmoker_bmi += y # Already converted to float in the with context manager
    average_smokers_bmi = total_smoker_bmi/total_smokers
    average_nonsmokers_bmi = total_nonsmoker_bmi/total_nonsmokers
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(total_smokers,round(average_smokers_bmi,2),total_nonsmokers,round(average_nonsmokers_bmi,2))) # Suppress return keyword
smoker_bmi(s_and_bmi)
Another way is as follows.
s_and_bmi = zip(smoker,bmi)
def smoker_bmi(lst):
    smoker_list = []
    nonsmoker_list = []
    for x,y in lst:
        if x == 'yes':
            smoker_list.append(y)
        else:
            nonsmoker_list.append(y)
    average_smokers_bmi = sum(smoker_list)/len(smoker_list)
    average_nonsmokers_bmi = sum(nonsmoker_list)/len(nonsmoker_list)
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(len(smoker_list),round(average_smokers_bmi,2),len(nonsmoker_list),round(average_nonsmokers_bmi,2)))
smoker_bmi(s_and_bmi)
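On the multi-line string indentation mentioned at the start of this objective: if you prefer to keep the string lines aligned with the surrounding code, one option is textwrap.dedent from the standard library, which strips the common leading whitespace. The sketch below is an assumption about how you might use it; format_bmi_report and the numbers passed to it are made up.

```python
import textwrap


def format_bmi_report(n_smokers, avg_smokers, n_nonsmokers, avg_nonsmokers):
    """Build the two-line report without leading spaces in the output."""
    # The backslash after ''' avoids an empty first line; dedent removes
    # the indentation shared by every line of the string.
    template = textwrap.dedent('''\
        Out of {} Smokers, the average BMI was {}.
        Out of {} Non-Smokers, the average BMI was {}.
        ''')
    return template.format(n_smokers, avg_smokers, n_nonsmokers, avg_nonsmokers)


# Made-up numbers, only to show the resulting layout.
report = format_bmi_report(2, 30.5, 3, 29.0)
print(report)
```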
Objective 4:
Here, your children list already has integer values.
children_and_charges = zip(children,charges) # zip returns an iterator, no need to convert it to a list
def at_least_1_child(lst):
    total_atleast_1 = 0
    total_atleast_1_charges = 0
    for x,y in lst:
        if x >= 1:
            total_atleast_1 += 1
            total_atleast_1_charges += y # Already converted to float in the with context manager
    average_atleast1_charges = total_atleast_1_charges/total_atleast_1
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(total_atleast_1,round(average_atleast1_charges,2))) # Suppress return keyword
at_least_1_child(children_and_charges)
And with a list comprehension we have:
children_and_charges = zip(children,charges)
def at_least_1_child(lst):
    atleast1_charges = [y for x,y in lst if x >= 1]
    average_atleast1_charges = sum(atleast1_charges)/len(atleast1_charges)
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(len(atleast1_charges),round(average_atleast1_charges,2)))
at_least_1_child(children_and_charges)
The objective 5 function calls for much the same observations, and it can be simplified with a list comprehension:
# Python
[e for a,b,c,d,e in zip(sex,smoker,children,region,charges) if a == 'female' and b == 'yes' and c == 0 and d == 'southeast']
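One thing to watch with a filter this specific: if no row matches, the list is empty and sum(...)/len(...) raises ZeroDivisionError. A minimal sketch of a guard follows; average_or_none and the sample columns are hypothetical and only there to make the example run on its own.

```python
def average_or_none(values):
    """Return the rounded mean of values, or None if the list is empty."""
    if not values:
        return None
    return round(sum(values) / len(values), 2)


# Made-up stand-ins for the lists built from the CSV.
sample_sex = ['female', 'male', 'female']
sample_smoker = ['yes', 'yes', 'yes']
sample_children = [0, 0, 0]
sample_region = ['southeast', 'southeast', 'southeast']
sample_charges = [1000.0, 2000.0, 3000.0]

matching = [e for a, b, c, d, e
            in zip(sample_sex, sample_smoker, sample_children, sample_region, sample_charges)
            if a == 'female' and b == 'yes' and c == 0 and d == 'southeast']

print(average_or_none(matching))
print(average_or_none([]))  # no matching rows: None instead of an exception
```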
To sum up, your code with some modifications would be the following 101 lines of code:
#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?
import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges']))
#Objective 1: Average Age
def average_age(lst):
    total_ages = 0
    for x in lst:
        total_ages += x
    average_age = total_ages/len(lst)
    print('Out of {} participants, the average age was {}\n'.format(len(lst),round(average_age,2)))
average_age(age)
#Objective 2: Region Averages
def region_average(region_name):
    region_count = 0
    region_total_charges = 0
    for x,y in zip(region,charges):
        if x == region_name:
            region_count += 1
            region_total_charges += y
    region_average = region_total_charges/region_count
    return round(region_average,2)
region_dict = {name : None for name in region}
region_dict = sorted(region_dict)
for region_name in region_dict:
    region_avg = region_average(region_name)
    print('The average insurance charges of the {} region is {}.'.format(region_name.capitalize(),region_avg))
#Objective 3: Average BMI of Smokers vs non Smokers.
s_and_bmi = zip(smoker,bmi)
def smoker_bmi(lst):
    total_smokers = 0
    total_smoker_bmi = 0
    total_nonsmoker_bmi = 0
    total_nonsmokers = 0
    for x,y in lst:
        if x == 'yes':
            total_smokers += 1
            total_smoker_bmi += y
        elif x == 'no':
            total_nonsmokers += 1
            total_nonsmoker_bmi += y
    average_smokers_bmi = total_smoker_bmi/total_smokers
    average_nonsmokers_bmi = total_nonsmoker_bmi/total_nonsmokers
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(total_smokers,round(average_smokers_bmi,2),total_nonsmokers,round(average_nonsmokers_bmi,2)))
smoker_bmi(s_and_bmi)
#Objective 4: Average charges for people with at least 1 child
children_and_charges = zip(children,charges)
def at_least_1_child(lst):
    total_atleast_1 = 0
    total_atleast_1_charges = 0
    for x,y in lst:
        if x >= 1:
            total_atleast_1 += 1
            total_atleast_1_charges += y
    average_atleast1_charges = total_atleast_1_charges/total_atleast_1
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(total_atleast_1,round(average_atleast1_charges,2)))
at_least_1_child(children_and_charges)
#Objective 5: Average cost of female smokers with no children from the southeast.
fscrc = zip(sex,smoker,children,region,charges)
def specifics(lst):
    total = 0
    count = 0
    for a,b,c,d,e in lst:
        if a == 'female' and b == 'yes' and c == 0 and d == 'southeast':
            count += 1
            total += e
    average = total/count
    print('There are {} female smokers with no children from the southeast. Their average insurance costs are {}'.format(count,round(average,2))) # Suppress return keyword
specifics(fscrc)
And an alternative based on your code would be the following 76 lines of code:
#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?
import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges']))
#Objective 1: Average Age
def average_age(lst):
    average_age = sum(lst)/len(lst)
    print('Out of {} participants, the average age was {}\n'.format(len(lst),round(average_age,2)))
average_age(age)
#Objective 2: Region Averages
def region_average(region_name):
    charges_by_region = [y for x,y in zip(region,charges) if x == region_name]
    region_average = sum(charges_by_region)/len(charges_by_region)
    return round(region_average,2)
region_dict = {name : None for name in region}
region_dict = sorted(region_dict)
for region_name in region_dict:
    region_avg = region_average(region_name)
    print('The average insurance charges of the {} region is {}.'.format(region_name.capitalize(),region_avg))
#Objective 3: Average BMI of Smokers vs non Smokers.
def smoker_bmi():
    smoker_list = []
    nonsmoker_list = []
    for x,y in zip(smoker,bmi):
        if x == 'yes':
            smoker_list.append(y)
        else:
            nonsmoker_list.append(y)
    average_smokers_bmi = sum(smoker_list)/len(smoker_list)
    average_nonsmokers_bmi = sum(nonsmoker_list)/len(nonsmoker_list)
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(len(smoker_list),round(average_smokers_bmi,2),len(nonsmoker_list),round(average_nonsmokers_bmi,2)))
smoker_bmi()
#Objective 4: Average charges for people with at least 1 child
def at_least_1_child():
    atleast1_charges = [y for x,y in zip(children,charges) if x >= 1]
    average_atleast1_charges = sum(atleast1_charges)/len(atleast1_charges)
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(len(atleast1_charges),round(average_atleast1_charges,2))) # Suppress return keyword
at_least_1_child()
#Objective 5: Average cost of female smokers with no children from the southeast.
def specifics():
    subtotal = [e for a,b,c,d,e in zip(sex,smoker,children,region,charges) if a == 'female' and b == 'yes' and c == 0 and d == 'southeast']
    average = sum(subtotal)/len(subtotal)
    print('There are {} female smokers with no children from the southeast. Their average insurance costs are {}'.format(len(subtotal),round(average,2))) # Suppress return keyword
specifics()
Note: Shorter code is not necessarily better or more efficient.
Objective 6?:
The reason I did not say anything about objective 6 is the following: I understand your code and what it does, but I do not think it meets the objective you set. With what you did, you conclude something that is, in a way, obvious: if you compare a person with N children to another person with the same number of children who smokes, the cost will be higher for the smoker, because smoking increases the charges. That is all the comparison shows. You conclude that smoking increases charges, but you cannot conclude whether smoking is more influential than having children.
For this analysis I think you should compare individuals whose other parameters (everything except smoker and children) are similar, though not necessarily equal, and see how smoking influences charges in contrast to not smoking but having a certain number of children. You do not know how your data are distributed: maybe the smokers in the CSV file happen (by chance) to be people with high BMI, which would also raise the charges, and we would not know. I think graphs are already necessary at this point. A rather ambitious goal, if I may say so; I do not know how to do it right away.
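As a very rough first step in that direction (not a full answer, since it does not control for BMI, age or region), one could compare the average charges of the four groups formed by smoker / non-smoker and children / no children. Everything below is a sketch on made-up numbers; mean_charges_by_group is a hypothetical helper.

```python
from collections import defaultdict

# Made-up stand-ins for the `smoker`, `children` and `charges` lists.
sample_smoker = ['yes', 'no', 'yes', 'no', 'no', 'yes']
sample_children = [0, 0, 2, 2, 1, 0]
sample_charges = [300.0, 100.0, 340.0, 120.0, 140.0, 320.0]


def mean_charges_by_group(smoker, children, charges):
    """Return {(is_smoker, has_children): mean charge} for each group present."""
    groups = defaultdict(list)
    for s, c, charge in zip(smoker, children, charges):
        groups[(s == 'yes', c >= 1)].append(charge)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


means = mean_charges_by_group(sample_smoker, sample_children, sample_charges)

# Baseline: non-smokers without children.
baseline = means[(False, False)]
smoking_gap = means[(True, False)] - baseline    # smoking only
children_gap = means[(False, True)] - baseline   # children only

print('Smoking alone adds about {}, children alone add about {}'.format(smoking_gap, children_gap))
```

Comparing each factor against the same baseline at least puts the two effects side by side; whether the difference is meaningful on the real data would still need the kind of closer look described above.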
Another way to do this would be, in the with context manager, to define a single dictionary instead of all these lists: the keys would be the names of the categories and the values would be the lists. Maybe with this approach it would be possible to write a general averaging function, as @netmarlon mentioned.
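A minimal sketch of that idea, under a few assumptions: the CSV text is a small made-up sample read through io.StringIO so the example runs on its own (with the real file you would use open("insurance.csv") instead), and converters and average are hypothetical names.

```python
import csv
import io

# Made-up sample with the same header as the project's file.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,300.0\n"
    "18,male,33.0,1,no,southeast,100.0\n"
    "28,male,33.0,3,no,southeast,200.0\n"
)

# Which columns hold numbers; every other column stays a string.
converters = {'age': int, 'bmi': float, 'children': int, 'charges': float}

# One dictionary: column name -> list of values.
columns = {}
for row in csv.DictReader(sample_csv):
    for name, value in row.items():
        convert = converters.get(name, str)
        columns.setdefault(name, []).append(convert(value))


def average(data, target, condition=lambda record: True):
    """Average the `target` column over the rows where condition(record) is true."""
    names = list(data)
    selected = []
    for values in zip(*data.values()):
        record = dict(zip(names, values))  # one row as {column: value}
        if condition(record):
            selected.append(record[target])
    return round(sum(selected) / len(selected), 2) if selected else None


print(average(columns, 'age'))
print(average(columns, 'charges', lambda r: r['region'] == 'southeast'))
print(average(columns, 'bmi', lambda r: r['smoker'] == 'yes'))
```

With something like this, objectives 1 to 5 each become a single call with a different condition.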
Congratulations on your project, it is very interesting.