Here's my Portfolio Project, appreciate any feedback! :)

Thank you for clicking on my post! I enjoyed this project a lot and it showed me that what stuck with me the most from the fundamentals were the lists and list comprehensions. Enjoy :slight_smile:

#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?

import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])
#print(len(sex))
#We have 1,338 rows.

#Objective 1:Average Age
def average_age(lst):
    total_ages = 0
    total_people = 0
    for x in lst:
        total_people += 1
        total_ages += float(x)
    average_age = total_ages/total_people
    return print('Out of {} participants, the average age was {}'.format(total_people,round(average_age,2)))
average_age(age)    

#Objective 2: Region Averages
r_and_c = list(zip(region,charges))
def region_average(lst,region_name):
    region_count = 0
    region_total_charges = 0 
    for (x,y) in lst:
        if x == region_name:
            region_count += 1
            region_total_charges += float(y)
    region_average = region_total_charges/region_count
    return round(region_average,2)

northwest_avg = region_average(r_and_c,'northwest') 
northeast_avg = region_average(r_and_c,'northeast')
southwest_avg = region_average(r_and_c,'southwest')
southeast_avg = region_average(r_and_c,'southeast')

print('''
The average insurance charges of the Northeast region is {}.
The average insurance charges of the Northwest region is {}.
The average insurance charges of the Southwest region is {}.
The average insurance charges of the Southeast region is {}.
'''.format(northeast_avg,northwest_avg,southwest_avg,southeast_avg))

#Objective 3: Average BMI of Smokers vs non Smokers.
s_and_bmi = list(zip(smoker,bmi))
def smoker_bmi(lst):
    total_smokers = 0
    total_smoker_bmi = 0
    total_nonsmoker_bmi = 0
    total_nonsmokers = 0
    for (x,y) in lst:
        if x == 'yes':
            total_smokers += 1
            total_smoker_bmi += float(y)
        elif x == 'no':
            total_nonsmokers += 1
            total_nonsmoker_bmi += float(y)
    average_smokers_bmi = total_smoker_bmi/total_smokers
    average_nonsmokers_bmi = total_nonsmoker_bmi/total_nonsmokers
    return print('''
    Out of {} Smokers, the average BMI was {}.
    Out of {} Non-Smokers, the average BMI was {}.
    '''.format(total_smokers,round(average_smokers_bmi,2),total_nonsmokers,round(average_nonsmokers_bmi,2)))
smoker_bmi(s_and_bmi)
            
#Objective 4: Average charges for people with at least 1 child
children = [int(x) for x in children]
children_and_charges = list(zip(children,charges))
def at_least_1_child(lst):
    total_atleast_1 = 0
    total_atleast_1_charges = 0
    for (x,y) in lst:
        if x >= 1:
            total_atleast_1 += 1
            total_atleast_1_charges += float(y)
    average_atleast1_charges = total_atleast_1_charges/total_atleast_1
    return print('''
    {} people have at least 1 child. Their average insurance cost is {}.
    '''.format(total_atleast_1,round(average_atleast1_charges,2)))
at_least_1_child(children_and_charges)

#5th: Average cost of female smokers with no children from the southeast.
children = [int(x) for x in children]
fscrc = list(zip(sex,smoker,children,region,charges))
def specifics(lst):
    total = 0
    count = 0
    for (a,b,c,d,e) in lst:
        if a == 'female' and b == 'yes' and c == 0 and d == 'southeast':
            count += 1
            total += float(e)
    average = total/count
    return print('There are {} female smokers with no children from the southeast. Their average insurance costs are {}'.format(count,round(average,2)))
specifics(fscrc)

#6th: What influences costs more; Number of children or smoking?
s_c_c = list(zip(children, smoker, charges))

def lst_generator(number_of_children,smoker):
    new_lst = []
    for (a,b,c) in s_c_c:
        if a == number_of_children and b == smoker:
            new_lst.append((a,b,c))
    return new_lst

non_smoker_0_children = lst_generator(0,'no')
non_smoker_1_children = lst_generator(1,'no')
non_smoker_2_children = lst_generator(2,'no')
non_smoker_3_children = lst_generator(3,'no')
non_smoker_4_children = lst_generator(4,'no')
non_smoker_5_children = lst_generator(5,'no')
smoker_0_children = lst_generator(0,'yes')
smoker_1_children = lst_generator(1,'yes')
smoker_2_children = lst_generator(2,'yes')
smoker_3_children = lst_generator(3,'yes')
smoker_4_children = lst_generator(4,'yes')
smoker_5_children = lst_generator(5,'yes')

def avg_lst(lst):
    count = len(lst)
    total = 0
    for (a,b,c) in lst:
        total += float(c)
    average = round(total/count,2)
    return average
avg_non_smoker_0_children = avg_lst(non_smoker_0_children)
avg_non_smoker_1_children = avg_lst(non_smoker_1_children)
avg_non_smoker_2_children = avg_lst(non_smoker_2_children)
avg_non_smoker_3_children = avg_lst(non_smoker_3_children)
avg_non_smoker_4_children = avg_lst(non_smoker_4_children)
avg_non_smoker_5_children = avg_lst(non_smoker_5_children)
avg_smoker_0_children = avg_lst(smoker_0_children)
avg_smoker_1_children = avg_lst(smoker_1_children)
avg_smoker_2_children = avg_lst(smoker_2_children)
avg_smoker_3_children = avg_lst(smoker_3_children)
avg_smoker_4_children = avg_lst(smoker_4_children)
avg_smoker_5_children = avg_lst(smoker_5_children)

print('''
Looking at the both values and their effects on insurance charges we see that:
For non-smokers:
People with 0 children pay on average: {}
People with 1 children pay on average: {}
People with 2 children pay on average: {}
People with 3 children pay on average: {}
People with 4 children pay on average: {}
People with 5 children pay on average: {}
For Smokers:
People with 0 children pay on average: {}
People with 1 children pay on average: {}
People with 2 children pay on average: {}
People with 3 children pay on average: {}
People with 4 children pay on average: {}
People with 5 children pay on average: {}

It is clear from the results that smoking has a bigger influence on insurance charges than a number of children. 
Smokers' costs are significantly bigger than non-smokers', and they do not fluctuate much with the increase of the 
number of children.
'''.format(
avg_non_smoker_0_children,avg_non_smoker_1_children,avg_non_smoker_2_children,avg_non_smoker_3_children,
    avg_non_smoker_4_children,avg_non_smoker_5_children, avg_smoker_0_children,avg_smoker_1_children, 
    avg_smoker_2_children,avg_smoker_3_children,avg_smoker_4_children,avg_smoker_5_children))

Hiya! Thanks for sharing your code! I’m not a very experienced coder, so I don’t feel I have the authority or understanding to make any deep criticisms of your solution. But there are a few things I really liked about your code that I wanted to feed back to you.

I found the code very readable: the comments were informative and concise, and the code itself was easy to follow. It’s clear that you were thorough during the scoping stage, as your 6 objectives are well defined and informative.
Using the zip function to combine categories into one list was something I didn’t think to do, but it actually makes a lot of sense!

A few of your functions have fairly similar structures for finding the average insurance cost within particular categories. I wonder if these could be merged into one function that takes the category as an argument and outputs the average for the different groups?
Another thing I noticed is that you defined the children list 3 times: once at the start, and then again with integer values before both the children function and the specifics function. This could have been done just once, right at the start, as I don’t think there is any need for the number of children to ever be a list of strings.
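A rough sketch of what such a merged function might look like is below. The name average_by_group and the four sample rows are made up for illustration (they are not taken from insurance.csv), and the float() call assumes the values may still be strings, as they are in the original lists.

```python
from collections import defaultdict

def average_by_group(keys, values):
    """Return {group: average of the values that belong to that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += float(value)  # works for strings and for numbers
        counts[key] += 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}

# Tiny made-up sample, only to show the call
sample_region = ['southwest', 'southeast', 'southwest', 'northeast']
sample_charges = ['100.0', '200.0', '300.0', '50.0']

print(average_by_group(sample_region, sample_charges))
# {'southwest': 200.0, 'southeast': 200.0, 'northeast': 50.0}
```

The same call would then work for any pair of lists, for example average_by_group(smoker, bmi) or average_by_group(region, charges).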

Thanks again for sharing, best wishes from Edinburgh

Hi @roeystern8114604813, your project is interesting and I would like to share some comments and improvements. I will go through it part by part to be clearer.

Import part of the code:

Not much to say, except that throughout your code you convert the elements of the numeric lists to integers or floats each time you need them. You even redefine the children list three times for this conversion (as @netmarlon already pointed out). I suggest doing the conversion once, at this point, inside the with context manager.

#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?

import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(int(row['age'])) # Convert to number type from this point instead of converting every time you need it
        sex.append(row['sex'])
        bmi.append(float(row['bmi'])) # Same
        children.append(int(row['children'])) # Same
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges'])) # Same

Objective 1:

I think you should use return if the output of the function is going to be assigned to a variable; if the function only prints something to the terminal, drop the return keyword.

def average_age(lst):
    total_ages = 0
    #total_people = 0 --> It is unnecessary, you can use len(lst) to get the total people
    for x in lst:
        total_ages += x
    average_age = total_ages/len(lst)
    print('Out of {} participants, the average age was {}'.format(len(lst),round(average_age,2)))
average_age(age)

Another approach for the previous function is to use the built-in sum function.

def average_age(lst):
    average_age = sum(lst)/len(lst)
    print('Out of {} participants, the average age was {}'.format(len(lst),round(average_age,2)))
average_age(age)
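On the return versus print point, a small sketch may help. print() itself returns None, so a function that ends with return print(...) hands None back to the caller. The two function names and the three ages below are invented for the example.

```python
def describe_average(ages):
    """Print the summary; nothing useful is returned (the result is None)."""
    average = sum(ages) / len(ages)
    print('Out of {} participants, the average age was {}'.format(len(ages), round(average, 2)))

def compute_average(ages):
    """Return the number so the caller can store and reuse it."""
    return round(sum(ages) / len(ages), 2)

sample_ages = [19, 33, 62]  # made-up values

printed = describe_average(sample_ages)
computed = compute_average(sample_ages)

print(printed)   # None
print(computed)  # 38.0
```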

Objective 2:

Apart from the function for this objective, the rest of your code never uses the list r_and_c, so I think it is better to build it directly inside the function definition. Honestly, I do not know which is more efficient, but this way you need only one parameter to call your function. In addition, zip returns an iterator that a for loop can consume directly, so there is no need to convert it to a list as long as you only loop over it once.

def region_average(region_name):
    region_count = 0
    region_total_charges = 0 
    for x,y in zip(region,charges): # The parentheses are optional (a purely visual change); Python unpacks each tuple automatically
        if x == region_name:
            region_count += 1 # Unlike total_people in the previous function, this counter is needed because it increments only when the region matches
            region_total_charges += y # Already converted to float in the with context manager
    region_average = region_total_charges/region_count
    return round(region_average,2)
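One caveat about skipping list(): a zip object is a one-shot iterator, so it is empty after the first loop over it. Calling zip inside the function, as above, creates a fresh one on every call, which avoids the problem. The sketch below uses made-up values to show the difference.

```python
smoker_sample = ['yes', 'no', 'yes']  # made-up values
bmi_sample = [27.9, 33.8, 22.7]

pairs = zip(smoker_sample, bmi_sample)
first_pass = [y for x, y in pairs if x == 'yes']
second_pass = [y for x, y in pairs if x == 'yes']  # the iterator is already exhausted

print(first_pass)   # [27.9, 22.7]
print(second_pass)  # []

# A list can be looped over as many times as needed
reusable_pairs = list(zip(smoker_sample, bmi_sample))
third_pass = [y for x, y in reusable_pairs if x == 'yes']
fourth_pass = [y for x, y in reusable_pairs if x == 'yes']
print(third_pass == fourth_pass)  # True
```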

Just as there are list comprehensions, there are also dictionary comprehensions. I suggest creating the following dictionary:

region_dict = {name : None for name in region}

It is a dictionary with the regions as keys; the values do not matter, so I set them all to None. I have two reasons for this. Firstly, to print the average cost per region you manually define four variables and then a print statement with four lines. Secondly, you need to know the number (and names) of the different regions beforehand; what if the csv file has millions of rows?

Why a dictionary? Because its keys are unique, so it holds each region name only once, whereas the region list contains every region repeated many times. With the previous dictionary it is possible to write:


region_names = sorted(region_dict) # sorted() returns a list of the keys; sorting is optional and only makes the output alphabetical
for region_name in region_names: # Iterate over the unique region names
    region_avg = region_average(region_name)
    print('The average insurance charges of the {} region is {}.'.format(region_name.capitalize(),region_avg))
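As an aside, the unique region names can also be collected without a dictionary comprehension. This is only an alternative sketch with a made-up region list: set() drops the duplicates, and dict.fromkeys() does the same while keeping the order in which the names first appear.

```python
region_sample = ['southwest', 'southeast', 'southeast', 'northwest', 'southwest']  # made-up values

unique_sorted = sorted(set(region_sample))                  # alphabetical order
unique_in_file_order = list(dict.fromkeys(region_sample))   # order of first appearance

print(unique_sorted)         # ['northwest', 'southeast', 'southwest']
print(unique_in_file_order)  # ['southwest', 'southeast', 'northwest']
```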

Another approach to your function for objective 2 is below. The list comprehension avoids creating the other variables, because the length of the list is the region count and the sum of its elements is the total charges for that region.

def region_average(region_name):
    charges_by_region = [y for x,y in zip(region,charges) if x == region_name]
    region_average = sum(charges_by_region)/len(charges_by_region)
    return round(region_average,2)

Objective 3:

Not much to say. As before, you do not need to enclose the loop variables of the for loop in parentheses, you do not need to convert each element of the bmi list to float since that was done in the with context manager, and you can drop the return keyword because the function prints to the terminal.

I think that when you use a multi-line string ''' String ''' you should not indent every line, only the line with the print call. If you indent every line of the string, that indentation is printed to the terminal too, and you may not want that; at least that is what happened to me.
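If keeping the string indented with the surrounding code matters for readability, one option from the standard library is textwrap.dedent, which removes the leading whitespace shared by all lines. The sketch below is only an illustration: smoker_report is a hypothetical helper and the numbers passed to it are made up.

```python
import textwrap

def smoker_report(total_smokers, avg_smoker_bmi, total_nonsmokers, avg_nonsmoker_bmi):
    # The string is indented to match the function body; dedent() strips
    # the whitespace common to every line before the values are filled in.
    template = textwrap.dedent('''\
        Out of {} Smokers, the average BMI was {}.
        Out of {} Non-Smokers, the average BMI was {}.
        ''')
    return template.format(total_smokers, avg_smoker_bmi, total_nonsmokers, avg_nonsmoker_bmi)

report = smoker_report(2, 25.3, 1, 33.8)  # made-up numbers
print(report, end='')
```

The backslash right after the opening quotes keeps the output from starting with an empty line.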

s_and_bmi = zip(smoker,bmi) # The same comment as for the list r_and_c in the previous objective. This time I will leave this
# object outside the function, but I think it is better to build it inside because you do not use it again. In any case, since
# the for loop can consume the zip iterator directly, it is unnecessary to convert it to a list
def smoker_bmi(lst):
    total_smokers = 0
    total_smoker_bmi = 0
    total_nonsmoker_bmi = 0
    total_nonsmokers = 0
    for x,y in lst: # Parentheses removed
        if x == 'yes':
            total_smokers += 1
            total_smoker_bmi += y # Already converted to float in the with context manager
        elif x == 'no':
            total_nonsmokers += 1
            total_nonsmoker_bmi += y # Already converted to float in the with context manager
    average_smokers_bmi = total_smoker_bmi/total_smokers
    average_nonsmokers_bmi = total_nonsmoker_bmi/total_nonsmokers
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(total_smokers,round(average_smokers_bmi,2),total_nonsmokers,round(average_nonsmokers_bmi,2))) # Suppress return keyword

smoker_bmi(s_and_bmi)

Another way is as follows.

s_and_bmi = zip(smoker,bmi)
def smoker_bmi(lst):
    smoker_list = []
    nonsmoker_list = []
    for x,y in lst:
        if x == 'yes':
            smoker_list.append(y)
        else:
            nonsmoker_list.append(y)
    average_smokers_bmi = sum(smoker_list)/len(smoker_list)
    average_nonsmokers_bmi = sum(nonsmoker_list)/len(nonsmoker_list)
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(len(smoker_list),round(average_smokers_bmi,2),len(nonsmoker_list),round(average_nonsmokers_bmi,2)))

smoker_bmi(s_and_bmi)

Objective 4:

Here, your children list already has integer values.

children_and_charges = zip(children,charges) # zip returns an iterator, so there is no need to convert it to a list
def at_least_1_child(lst):
    total_atleast_1 = 0
    total_atleast_1_charges = 0
    for x,y in lst:
        if x >= 1:
            total_atleast_1 += 1
            total_atleast_1_charges += y # Already converted to float in the with context manager
    average_atleast1_charges = total_atleast_1_charges/total_atleast_1
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(total_atleast_1,round(average_atleast1_charges,2))) # Suppress return keyword

at_least_1_child(children_and_charges)

And with a list comprehension we have:

children_and_charges = zip(children,charges)
def at_least_1_child(lst):
    atleast1_charges = [y for x,y in lst if x >= 1]
    average_atleast1_charges = sum(atleast1_charges)/len(atleast1_charges)
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(len(atleast1_charges),round(average_atleast1_charges,2)))

at_least_1_child(children_and_charges)

The Objective 5 function gets pretty much the same observations, and it can be simplified with a list comprehension:

[e for a,b,c,d,e in zip(sex,smoker,children,region,charges) if a == 'female' and b == 'yes' and c == 0 and d == 'southeast']
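One thing to watch with such a narrow filter: if no row matches, the list is empty and sum(...)/len(...) raises ZeroDivisionError. Below is a minimal sketch of a guard, assuming a hypothetical helper named safe_average and three made-up rows rather than the real file.

```python
def safe_average(values):
    """Return the rounded average, or None when there is nothing to average."""
    if not values:
        return None
    return round(sum(values) / len(values), 2)

# Made-up rows, one value per person in each list
sex_s = ['female', 'male', 'female']
smoker_s = ['yes', 'yes', 'no']
children_s = [0, 0, 2]
region_s = ['southeast', 'southeast', 'northwest']
charges_s = [16884.92, 1725.55, 4449.46]

matches = [e for a, b, c, d, e in zip(sex_s, smoker_s, children_s, region_s, charges_s)
           if a == 'female' and b == 'yes' and c == 0 and d == 'southeast']
no_matches = [e for a, b, c, d, e in zip(sex_s, smoker_s, children_s, region_s, charges_s)
              if a == 'female' and b == 'yes' and c == 0 and d == 'northeast']

print(len(matches), safe_average(matches))        # 1 16884.92
print(len(no_matches), safe_average(no_matches))  # 0 None
```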

To sum up, your code with some modifications would be the following 101 lines:

#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?

import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges']))

#Objective 1:Average Age
def average_age(lst):
    total_ages = 0
    for x in lst:
        total_ages += x
    average_age = total_ages/len(lst)
    print('Out of {} participants, the average age was {}\n'.format(len(lst),round(average_age,2)))
average_age(age)

#Objective 2: Region Averages
def region_average(region_name):
    region_count = 0
    region_total_charges = 0 
    for x,y in zip(region,charges):
        if x == region_name:
            region_count += 1
            region_total_charges += y
    region_average = region_total_charges/region_count
    return round(region_average,2)

region_dict = {name : None for name in region}
region_names = sorted(region_dict)
for region_name in region_names:
    region_avg = region_average(region_name)
    print('The average insurance charges of the {} region is {}.'.format(region_name.capitalize(),region_avg))

#Objective 3: Average BMI of Smokers vs non Smokers.
s_and_bmi = zip(smoker,bmi)
def smoker_bmi(lst):
    total_smokers = 0
    total_smoker_bmi = 0
    total_nonsmoker_bmi = 0
    total_nonsmokers = 0
    for x,y in lst:
        if x == 'yes':
            total_smokers += 1
            total_smoker_bmi += y
        elif x == 'no':
            total_nonsmokers += 1
            total_nonsmoker_bmi += y
    average_smokers_bmi = total_smoker_bmi/total_smokers
    average_nonsmokers_bmi = total_nonsmoker_bmi/total_nonsmokers
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(total_smokers,round(average_smokers_bmi,2),total_nonsmokers,round(average_nonsmokers_bmi,2)))
smoker_bmi(s_and_bmi)

#Objective 4: Average charges for people with at least 1 child 
children_and_charges = zip(children,charges)
def at_least_1_child(lst):
    total_atleast_1 = 0
    total_atleast_1_charges = 0
    for x,y in lst:
        if x >= 1:
            total_atleast_1 += 1
            total_atleast_1_charges += y
    average_atleast1_charges = total_atleast_1_charges/total_atleast_1
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(total_atleast_1,round(average_atleast1_charges,2)))
at_least_1_child(children_and_charges)

#Objective 5: Average cost of female smokers with no children from the southeast.
fscrc = zip(sex,smoker,children,region,charges)
def specifics(lst):
    total = 0
    count = 0
    for a,b,c,d,e in lst:
        if a == 'female' and b == 'yes' and c == 0 and d == 'southeast':
            count += 1
            total += e
    average = total/count
    print('There are {} female smokers with no children from the southeast. Their average insurance costs are {}'.format(count,round(average,2))) # Suppress return keyword
specifics(fscrc)

And one alternative version based on your code would be the following 76 lines:

#1st Objective : Find average age of participants
#2nd Objective: Find the average cost per region
#3rd Objective: Find the average bmi of smokers vs non-smokers
#4th Objective: Find the average charges for people with at least 1 child.
#5th: Average cost of female smokers with no children from the southeast.
#6th: What influences costs more; Number of children or smoking?

import csv
with open("insurance.csv") as insurance:
    insurance_data = csv.DictReader(insurance)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    for row in insurance_data:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges']))

#Objective 1:Average Age
def average_age(lst):
    average_age = sum(lst)/len(lst)
    print('Out of {} participants, the average age was {}\n'.format(len(lst),round(average_age,2)))
average_age(age)

#Objective 2: Region Averages
def region_average(region_name):
    charges_by_region = [y for x,y in zip(region,charges) if x == region_name]
    region_average = sum(charges_by_region)/len(charges_by_region)
    return round(region_average,2)

region_dict = {name : None for name in region}
region_names = sorted(region_dict)
for region_name in region_names:
    region_avg = region_average(region_name)
    print('The average insurance charges of the {} region is {}.'.format(region_name.capitalize(),region_avg))

#Objective 3: Average BMI of Smokers vs non Smokers.
def smoker_bmi():
    smoker_list = []
    nonsmoker_list = []
    for x,y in zip(smoker,bmi):
        if x == 'yes':
            smoker_list.append(y)
        else:
            nonsmoker_list.append(y)
    average_smokers_bmi = sum(smoker_list)/len(smoker_list)
    average_nonsmokers_bmi = sum(nonsmoker_list)/len(nonsmoker_list)
    print('''
Out of {} Smokers, the average BMI was {}.
Out of {} Non-Smokers, the average BMI was {}.
'''.format(len(smoker_list),round(average_smokers_bmi,2),len(nonsmoker_list),round(average_nonsmokers_bmi,2)))
smoker_bmi()

#Objective 4: Average charges for people with at least 1 child 
def at_least_1_child():
    atleast1_charges = [y for x,y in zip(children,charges) if x >= 1]
    average_atleast1_charges = sum(atleast1_charges)/len(atleast1_charges)
    print('''
{} people have at least 1 child. Their average insurance cost is {}.
'''.format(len(atleast1_charges),round(average_atleast1_charges,2))) # Suppress return keyword
at_least_1_child()

#Objective 5: Average cost of female smokers with no children from the southeast.
def specifics():
    subtotal = [e for a,b,c,d,e in zip(sex,smoker,children,region,charges) if a == 'female' and b == 'yes' and c == 0 and d == 'southeast']
    average = sum(subtotal)/len(subtotal)
    print('There are {} female smokers with no children from the southeast. Their average insurance costs are {}'.format(len(subtotal),round(average,2))) # Suppress return keyword
specifics()

Note: Shorter code is not necessarily better or more efficient :wink:

Objective 6?:

The reason I did not say anything about objective 6 is the following: I understand your code and what it does, but I do not think it meets the objective you set out to achieve. Think about it: what you conclude is, in a way, obvious. If you compare a person with N children with another person who has the same number of children but smokes, the cost will be higher for the one who smokes, because smoking increases the charges. And that is all you show: you conclude that smoking increases charges, but that does not let you conclude whether smoking is more influential than having children. For this analysis I think you should compare individuals who are similar (though not necessarily equal) in the other parameters, those other than smoker and children, and see how much smoking changes the charges compared with not smoking but having a certain number of children. You do not know how your data are distributed; maybe the data in the CSV file are biased and those who smoke are (by chance) people with high BMI, which would be raising the charges, but we do not know that. I think this already requires making graphs. It is a rather ambitious goal, if I may say so, and I do not know how to do it right away.
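Without graphs, one rough way to compare like with like is to split the rows into bands of a third variable and look at the gaps inside each band. The sketch below is only an illustration under loud assumptions: the eight rows are invented (so the printed gaps say nothing about the real data), the BMI cut-off of 30 is an arbitrary choice, and mean_charges_by and bmi_band are hypothetical helper names.

```python
from collections import defaultdict

def mean_charges_by(rows, key_func):
    """Group rows by key_func(row) and return {key: average charges}."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_func(row)].append(row['charges'])
    return {key: round(sum(vals) / len(vals), 2) for key, vals in groups.items()}

def bmi_band(row):
    # Arbitrary cut-off, chosen only for the illustration
    return 'bmi>=30' if row['bmi'] >= 30 else 'bmi<30'

# Invented rows, NOT taken from insurance.csv
rows = [
    {'bmi': 24.0, 'smoker': 'no',  'children': 0, 'charges': 3000.0},
    {'bmi': 25.0, 'smoker': 'no',  'children': 2, 'charges': 5000.0},
    {'bmi': 26.0, 'smoker': 'yes', 'children': 0, 'charges': 18000.0},
    {'bmi': 27.0, 'smoker': 'yes', 'children': 2, 'charges': 20000.0},
    {'bmi': 33.0, 'smoker': 'no',  'children': 0, 'charges': 4000.0},
    {'bmi': 34.0, 'smoker': 'no',  'children': 2, 'charges': 6000.0},
    {'bmi': 35.0, 'smoker': 'yes', 'children': 0, 'charges': 38000.0},
    {'bmi': 36.0, 'smoker': 'yes', 'children': 2, 'charges': 42000.0},
]

by_band_and_smoker = mean_charges_by(rows, lambda r: (bmi_band(r), r['smoker']))
by_band_and_children = mean_charges_by(rows, lambda r: (bmi_band(r), r['children']))

bands = ('bmi<30', 'bmi>=30')
# Gap between smokers and non-smokers inside each BMI band
smoking_gap = {b: by_band_and_smoker[(b, 'yes')] - by_band_and_smoker[(b, 'no')] for b in bands}
# Gap between 2 children and 0 children inside each BMI band
children_gap = {b: by_band_and_children[(b, 2)] - by_band_and_children[(b, 0)] for b in bands}

print(smoking_gap)   # {'bmi<30': 15000.0, 'bmi>=30': 35000.0}
print(children_gap)  # {'bmi<30': 2000.0, 'bmi>=30': 3000.0}
```

With the real file, bands with very few rows would give unreliable averages, so the group sizes should be checked alongside the gaps.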

Another way to do this would be, inside the with context manager, to define a single dictionary instead of all these lists: the keys would be the names of the categories and the values would be the lists. Maybe with this approach it would be possible to write a general averaging function, as @netmarlon mentioned.
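A minimal sketch of that single-dictionary idea follows. To keep it self-contained it reads from an in-memory stand-in for insurance.csv with two made-up rows; the names converters, columns and column_average are assumptions made for the example.

```python
import csv
import io

# Stand-in for open("insurance.csv"); the two data rows are made up
sample_file = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.92\n"
    "18,male,33.8,1,no,southeast,1725.55\n"
)

# How to convert each numeric column; every other column stays a string
converters = {'age': int, 'bmi': float, 'children': int, 'charges': float}

columns = {}  # category name -> list of values
for row in csv.DictReader(sample_file):
    for name, value in row.items():
        convert = converters.get(name, str)
        columns.setdefault(name, []).append(convert(value))

def column_average(name):
    """Average of any numeric column, looked up by its name."""
    values = columns[name]
    return round(sum(values) / len(values), 2)

print(column_average('age'))  # 18.5
print(columns['children'])    # [0, 1]
print(columns['region'])      # ['southwest', 'southeast']
```

With the real file, the io.StringIO object would be replaced by the file object from the with statement.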

Congratulations on your project, it is very interesting.

Thank you for your feedback @netmarlon !

I’m a beginner coder myself, so I appreciate feedback from anyone at any level; we all have different perspectives and view things differently. Your feedback was really beneficial and pointed out some things that, on a second look, I noticed myself.

You are probably right about making one function that can take the category in and generate the average insurance cost. That can be a new objective for me haha.

Also, about defining the children list numerous times:
For some reason, when I ran the code I got an error saying I couldn’t perform the action because one variable was a string and not an integer, even after I had already defined the list once. When I defined it again it worked, so I just left it as such.

Thank you again for your time and thoughts!
Hope to see some of your work soon,

Roey

@giovanni_alonso Thank you for your detailed feedback! I already feel I have improved as a coder just from reading your comments! I appreciate the time you dedicated, and I agree that I have some unnecessary lines of code that can be replaced by more efficient commands.

Regarding objective 6, my initial goal was to perform a sort of linear regression. However, I don’t yet have the knowledge to produce a graph and perform the necessary calculations, so I provided an educated guess (I should have stated that in my closing comments). Once I gain the necessary skills I will tackle this task again and perform a proper statistical significance test to generate a better result.

Thank you again for your time and wisdom!
Roey
