Reggie_Linear_Regression (Data Science)

Project: Linear Regression

Reggie is a mad scientist who has been hired by the local fast food joint to build their newest ball pit in the play area. As such, he is working on researching the bounciness of different balls so as to optimize the pit. He is running an experiment to bounce different sizes of bouncy balls, and then fitting lines to the data points he records. He has heard of linear regression, but needs your help to implement a version of linear regression in Python.

Linear Regression is when you have a group of points on a graph, and you find a line that approximately resembles that group of points. A good Linear Regression algorithm minimizes the error, or the distance from each point to the line. A line with the least error is the line that fits the data the best. We call this a line of best fit.

We will use loops, lists, and arithmetic to create a function that will find a line of best fit when given a set of data.

Part 1: Calculating Error

The line we will end up with will have a formula that looks like:

y = m*x + b

m is the slope of the line and b is the intercept, where the line crosses the y-axis.

Fill in the function called get_y() that takes in m, b, and x. It should return what the y value would be for that x on that line!

def get_y(m, b, x):
  return m*x+b

print(get_y(1, 0, 7) == 7)
print(get_y(5, 10, 3) == 25)

True
True

Reggie wants to try a bunch of different m values and b values and see which line produces the least error. To calculate error between a point and a line, he wants a function called calculate_error(), which will take in m, b, and an [x, y] point called point and return the distance between the line and the point.

To find the distance:

  1. Get the x-value from the point and store it in a variable called x_point
  2. Get the y-value from the point and store it in a variable called y_point
  3. Use get_y() to get the y-value that x_point would be on the line
  4. Find the difference between the y from get_y and y_point
  5. Return the absolute value of the difference (you can use the built-in function abs() to do this)

The distance represents the error between the line y = m*x + b and the point given.

def calculate_error(m, b, point):
    x_point = point[0]
    y_point = point[1]
    y_line = get_y(m, b, x_point)
    return abs(y_line-y_point)

Let’s test this function!

#this is a line that looks like y = x, so (3, 3) should lie on it. thus, error should be 0:
print(calculate_error(1, 0, (3, 3)))
#the point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))
#the point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))
#the point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))
0
1
1
5

Great! Reggie’s datasets will be sets of points. For example, he ran an experiment comparing the width of bouncy balls to how high they bounce:

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

The first datapoint, (1, 2), means that his 1cm bouncy ball bounced 2 meters. The 4cm bouncy ball bounced 4 meters.

As we try to fit a line to this data, we will need a function called calculate_all_error, which takes m and b that describe a line, and points, a set of data like the example above.

calculate_all_error should iterate through each point in points and calculate the error from that point to the line (using calculate_error). It should keep a running total of the error, and then return that total after the loop.

def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

Let’s test this function!

#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))

#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))

#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))


#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))
0
4
4
18

Great! It looks like we now have a function that can take in a line and Reggie’s data and return how much error that line produces when we try to fit it to the data.

Our next step is to find the m and b that minimize this error, and thus fit the data best!

Part 2: Try a bunch of slopes and intercepts!

The way Reggie wants to find a line of best fit is by trial and error. He wants to try a bunch of different slopes (m values) and a bunch of different intercepts (b values) and see which one produces the smallest error value for his dataset.

Using a list comprehension, let’s create a list of possible m values to try. Make the list possible_ms that goes from -10 to 10 inclusive, in increments of 0.1.

Hint: you can go through the values in range(-100, 101) and multiply each one by 0.1

possible_ms = [i/10-10 for i in range(201)]

Now, let’s make a list of possible_bs to check that would be the values from -20 to 20 inclusive, in steps of 0.1:

possible_bs = [i/10-20 for i in range(401)]

We are going to find the smallest error. First, we will make every possible y = m*x + b line by pairing all of the possible ms with all of the possible bs. Then, we will see which y = m*x + b line produces the smallest total error with the set of data stored in datapoints.

First, create the variables that we will be optimizing:

  • smallest_error — this should start at infinity (float("inf")) so that any error we get at first will be smaller than our value of smallest_error
  • best_m — we can start this at 0
  • best_b — we can start this at 0

We want to:

  • Iterate through each element m in possible_ms
  • For every m value, take every b value in possible_bs
  • If the value returned from calculate_all_error on this m value, this b value, and datapoints is less than our current smallest_error,
  • Set best_m and best_b to be these values, and set smallest_error to this error.

By the end of these nested loops, the smallest_error should hold the smallest error we have found, and best_m and best_b should be the values that produced that smallest error value.

Print out best_m, best_b and smallest_error after the loops.

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
smallest_error = float("inf")
best_m = 0
best_b = 0
for m in possible_ms:
    for b in possible_bs:
        error = calculate_all_error(m, b, datapoints)
        if error < smallest_error:
            best_m = m
            best_b = b
            smallest_error = error
print(best_m, best_b, smallest_error)
0.3000000000000007 1.6999999999999993 5.0

Part 3: What does our model predict?

Now we have seen that for this set of observations on the bouncy balls, the line that fits the data best has an m of 0.3 and a b of 1.7:

y = 0.3x + 1.7

This line produced a total error of 5.

Using this m and this b, what does your line predict the bounce height of a ball with a width of 6 to be?
In other words, what is the output of get_y() when we call it with:

  • m = 0.3
  • b = 1.7
  • x = 6
print(get_y(0.3, 1.7, 6))
3.5

Our model predicts that the 6cm ball will bounce 3.5m.

Now, Reggie can use this model to predict the bounce of all kinds of sizes of balls he may choose to include in the ball pit!

def get_bounce_height(x):
    return get_y(0.3, 1.7, x)
print(get_bounce_height(6))
3.5

This is wrong. I used the same code and got:
0 0 inf

Strange, I just ran the file in my Notebook and it works.
Make sure to run all the code.
This cell calls calculate_all_error from a previous cell, which calls calculate_error.
And calculate_error calls get_y.
Maybe check these two functions in your code.

Make sure your code is indented the same way too.
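One way to end up with exactly 0 0 inf (an assumption, since the code that produced it was not posted) is for possible_ms or possible_bs to be empty. The loop body then never runs and the three starting values are printed unchanged. A rough sketch of that case, where grid_search is a hypothetical wrapper around the project's nested loops:

```python
def get_y(m, b, x):
    return m * x + b

def calculate_all_error(m, b, points):
    # Sum of vertical distances between each point and the line y = m*x + b.
    return sum(abs(get_y(m, b, x) - y) for x, y in points)

def grid_search(possible_ms, possible_bs, points):
    # Hypothetical helper: the same nested loops as the project, in a function.
    smallest_error = float("inf")
    best_m = 0
    best_b = 0
    for m in possible_ms:
        for b in possible_bs:
            error = calculate_all_error(m, b, points)
            if error < smallest_error:
                best_m, best_b, smallest_error = m, b, error
    return best_m, best_b, smallest_error

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

# With an empty list of slopes, the comparison is never reached.
empty_result = grid_search([], [0.0, 1.0], datapoints)
print(empty_result)  # (0, 0, inf)

# With at least one candidate in each list, the error becomes finite.
small_result = grid_search([0.0, 1.0], [0.0], datapoints)
print(small_result)
```

Printing len(possible_ms) and len(possible_bs) just before the loops is a quick way to rule this out.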

Where did you get the 201 and 401? Also, can anyone explain to me why it is being done this way? What I came up with on my own was:

possible_ms = []
range(-10.0,10.0,.1)

Why doesn’t something like this work? It’s creating the range of values, right?

Have you tested it? You’ve probably found that it throws an error, as range cannot work with floats. As for why you often see an additional +1 on the end value, it is because range does not include the end value. Take for example trying to get a count up to 10. Using list(range(10)) would provide you with values from 0 to 9 but it does not include 10. Hence the additional +1 to get 11.

In the given example for values from -10 to 10, can you see why it is slightly higher than the end value? It’s worth noting that whilst range doesn’t work with floats, it can work with negative numbers, which may simplify the expressions being evaluated.

The numpy package also has methods of working with a range of floats if this is something you find yourself doing often.
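To make the points above concrete, here is a small sketch using only the standard library. It shows the error raised by a float range, and where counts like 201 and 401 come from. The names m_steps and b_steps are just illustrative.

```python
# range() only accepts integers, so a float argument raises TypeError.
try:
    range(-10.0, 10.0, 0.1)
    float_range_error = None
except TypeError as exc:
    float_range_error = exc

print(type(float_range_error).__name__)  # TypeError

# range(-100, 101) stops before 101, so it yields -100 ... 100: 201 integers.
m_steps = range(-100, 101)
b_steps = range(-200, 201)
print(len(m_steps), len(b_steps))  # 201 401

# Dividing each integer by 10 gives the values from -10 to 10 in steps of 0.1.
possible_ms = [i / 10 for i in m_steps]
possible_bs = [i / 10 for i in b_steps]
print(possible_ms[0], possible_ms[-1])  # -10.0 10.0
print(possible_bs[0], possible_bs[-1])  # -20.0 20.0
```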

Hi,

The solution to part 2 is built around:

possible_ms = [i/10-10 for i in range(201)]
possible_bs = [i/10-20 for i in range(401)]

I was not able to get the “correct answer” using my own method:

possible_ms = [m/10 for m in range(-100, 101)]
possible_bs = [b/10 for b in range (-200, 201)]

If you compare the lists created by the given solution and by my solution, you will see that mine provides a list of values in line with the given parameters, while the given solution provides some numbers with stray decimals that are not in line with the task at hand.

For example, the solution by Xavier gives lists that contain values such as -0.09999999999999964, 0.0, 0.09999999999999964, 0.1999999999999993, 0.3000000000000007, 0.40000000000000036. These are not increments of 0.1.

I just wanted to post this as a heads up for anyone that might be bamboozled by the project using values that are not in accordance with the project’s own requirements.

Good luck and have fun!


Unfortunately your solution would also have issues with floating point arithmetic on most systems. Sadly it’s just something you always need to be conscious of. The following is a link to the Python docs discussing it-
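As a rough illustration of that point, the sketch below compares the two list constructions from this thread. The variable names are made up for the example. The division-based list prints cleanly, but its values are still only the nearest binary floats to each tenth.

```python
from_offset = [i / 10 - 10 for i in range(201)]      # the project's construction
from_division = [m / 10 for m in range(-100, 101)]   # the alternative posted above

# Entries where the two constructions disagree in their last bits.
mismatches = [(a, b) for a, b in zip(from_offset, from_division) if a != b]
print(len(mismatches) > 0)  # True

# Even the clean-looking 0.3 is not stored as exactly three tenths.
print(format(3 / 10, ".20f"))  # 0.29999999999999998890

# Rounding the project's values to one decimal makes both lists agree.
rounded = [round(value, 1) for value in from_offset]
print(rounded == from_division)  # True
```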

Thanks for your response and for pointing out the issue. I still don’t see how that fixes the issues with this project. The final result is built around not having the correct increments in two lists. Maybe the project designer should have taken floating point arithmetic into account and adjusted the solution accordingly? To a learner, it seems that the project solutions were hastily put together, as they rely on the increments being something other than 0.1.

I’d be interested to see how to modify the code below, so that it would output exact increments of 0.1 and not something else. I am not well versed in Python but in excel function output can easily be rounded with round(). Why not do the same here?

possible_ms = [i/10-10 for i in range(201)]
possible_bs = [i/10-20 for i in range(401)]

EDIT:

If you want to use the given solution, you might want to round the values to gain the exact increments as stated in the assignment. You will be getting the correct answers but not the ones wanted by the assignment. Good luck to all :smiley:

possible_ms = [round(i/10-10, 1) for i in range(201)]

possible_bs = [round(i/10-20, 1) for i in range(401)] 

You can certainly get values with the appearance or representation of being proper floating points as per the round() function but the math will always be subject to error (a very small error but an error nonetheless). If you have the time do spend a while looking into floating point errors as it’s a far bigger issue than just Python and something you always have to consider. Wherever possible try to work with integers for this reason.

For the sake of representation, the Python docs page on floating point issues also mentions two standard library modules that can be very useful here, namely fractions and decimal (they store rational and decimal values exactly instead of as binary floats)-
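A short sketch of how those two modules could be used to build exact tenths. This is only an illustration and not part of the project, and the list names are made up.

```python
from decimal import Decimal
from fractions import Fraction

# Exact tenths as rational numbers: -10, -99/10, ..., 10.
exact_ms = [Fraction(i, 10) for i in range(-100, 101)]
print(exact_ms[103])  # 3/10

# Exact tenths as decimals, built from integers rather than from floats.
decimal_ms = [Decimal(i) / Decimal(10) for i in range(-100, 101)]
print(decimal_ms[103])  # 0.3

# Arithmetic on these types stays exact, unlike with binary floats.
print(Fraction(1, 10) * 3 == Fraction(3, 10))  # True
print(Decimal("0.1") * 3 == Decimal("0.3"))    # True
print(0.1 * 3 == 0.3)                          # False
```

Both types are slower than plain floats, so for a brute-force search like this one it may be simpler to loop over integers and divide by 10 only when a float is needed.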


I think the biggest issue with this project isn’t floating point issues (though it would be nice if the guidance for this project mentioned this fact) but that the dataset you have to fit to is terrible. Folks wind up with multiple different answers in this project because you’re trying to fit a straight line to what amounts to little more than a random scatter. A bad dataset is a bad dataset.
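To see why different answers turn up, it may help to check a few candidate lines by hand. In the sketch below, which uses a compact stand-in for calculate_all_error, four lines on the 0.1 grid all pass through the point (1, 2), and each one has a total error of 5 apart from floating point noise. With a strict < comparison, which of the tied lines "wins" can then depend on tiny rounding differences and on the order in which the lists are searched.

```python
import math

def calculate_all_error(m, b, points):
    # Total vertical distance from the points to the line y = m*x + b.
    return sum(abs(m * x + b - y) for x, y in points)

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

# Four grid lines that all pass through the point (1, 2).
candidates = [(0.3, 1.7), (0.4, 1.6), (0.5, 1.5), (0.6, 1.4)]
errors = [calculate_all_error(m, b, datapoints) for m, b in candidates]

for (m, b), error in zip(candidates, errors):
    print(m, b, error)

# Up to floating point noise, every one of them has a total error of 5.
all_tied = all(math.isclose(error, 5.0, abs_tol=1e-9) for error in errors)
print(all_tied)  # True
```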

That was an interesting way to do the possible_ms and possible_bs.

An alternative would be the following:

possible_ms = [integer*0.1 for integer in range(-100, 101)]
possible_bs = [integer*0.1 for integer in range(-200, 201)]

This way, -100 becomes -10.0, -99 becomes -9.9 and so on until you get to 100 --> 10.0

The reason you need to add 1 to the upper bound of the range is that in Python, the range() function is exclusive of the second number. For example, range(1, 10) would include 1 but exclude 10 in the range of numbers returned.
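One caveat worth checking with this approach: multiplying by 0.1 and dividing by 10 do not always give the same float. The sketch below compares the two; the list names are only for illustration.

```python
# Multiplying by 0.1 uses the inexact binary value of 0.1,
# so some results miss the nearest float to the intended tenth.
print(3 * 0.1)  # 0.30000000000000004
print(3 / 10)   # 0.3

by_multiplication = [integer * 0.1 for integer in range(-100, 101)]
by_division = [integer / 10 for integer in range(-100, 101)]

# Count the entries where the two lists disagree.
differences = sum(1 for a, b in zip(by_multiplication, by_division) if a != b)
print(differences)
```

If the stray digits are a concern, dividing by 10 or wrapping the product in round(..., 1) are two possible options.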

Thanks, and happy coding!

Please help. I have no idea why I’m getting an error here. I am using the exact code that is given in the solution (which I had already come up with on my own, but just copied and pasted from there just to see if it would work. It doesn’t). I can’t figure out why I’m getting this error. Any help greatly appreciated.

Great! Reggie's datasets will be sets of points. For example, he ran an experiment comparing the width of bouncy balls to how high they bounce:

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
The first datapoint, (1, 2), means that his 1cm bouncy ball bounced 2 meters. The 4cm bouncy ball bounced 4 meters.

As we try to fit a line to this data, we will need a function called calculate_all_error, which takes m and b that describe a line, and points, a set of data like the example above.

calculate_all_error should iterate through each point in points and calculate the error from that point to the line (using calculate_error). It should keep a running total of the error, and then return that total after the loop.

def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error
#Write your calculate_all_error function here
def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error
Let's test this function!

def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error
#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))
#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))
#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))
#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12168/846644277.py in <module>
      7 #every point in this dataset lies upon y=x, so the total error should be zero:
      8 datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
----> 9 print(calculate_all_error(1, 0, datapoints))
     10 
     11 #every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:

~\AppData\Local\Temp/ipykernel_12168/846644277.py in calculate_all_error(m, b, points)
      2     total_error = 0
      3     for point in points:
----> 4         total_error += calculate_error(m, b, point)
      5     return total_error
      6 

~\AppData\Local\Temp/ipykernel_12168/1611861288.py in calculate_error(m, b, points)
      3     total_error = 0
      4     for point in points:
----> 5         total_error += calculate_error(m, b, point)
      6     return total_error

~\AppData\Local\Temp/ipykernel_12168/1611861288.py in calculate_error(m, b, points)
      2 def calculate_error(m, b, points):
      3     total_error = 0
----> 4     for point in points:
      5         total_error += calculate_error(m, b, point)
      6     return total_error

TypeError: 'int' object is not iterable

I think you’re missing the calculate_error function in the code above.
(spoiler below)

here’s one version of it:

def calculate_error(m, b, point):
    x = point[0]
    y = point[1]   # actual y-value
    y_predicted = m*x + b
    error = y_predicted - y   # error = predicted - actual
    return abs(error)

The code above seems to run otherwise.
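Going by the last two frames of the traceback, it looks as if calculate_error itself contains the loop from calculate_all_error. That is my reading of the traceback and not something confirmed by the poster. If so, the function calls itself with a single point, then with a single number, and the loop fails on that number. A small reproduction:

```python
# Reproduction of what the traceback suggests (an assumption):
# calculate_error holds the looping body meant for calculate_all_error.
def calculate_error(m, b, points):
    total_error = 0
    for point in points:
        # The recursive call passes one coordinate (an int) as "points".
        total_error += calculate_error(m, b, point)
    return total_error

try:
    calculate_error(1, 0, (1, 1))
    message = None
except TypeError as exc:
    message = str(exc)

print(message)  # 'int' object is not iterable
```

Re-running the cell that defines the single-point version of calculate_error, and only afterwards the cell that defines calculate_all_error, should restore the intended behaviour.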

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
#The first datapoint, (1, 2), means that his 1cm bouncy ball bounced 2 meters. The 4cm bouncy ball bounced 4 meters.
#As we try to fit a line to this data, we will need a function called calculate_all_error, which takes m and b that describe a line, and points, a set of data like the example above.
#calculate_all_error should iterate through each point in points and calculate the error from that point to the line (using calculate_error). It should keep a running total of the error, and then return that total after the loop.
def calculate_error(m, b, point):
    x = point[0]
    y = point[1]
    y_predicted = m*x + b
    error = y_predicted - y
    return abs(error)

#Write your calculate_all_error function here
def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

#Let's test this function!
def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))

#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))

#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))

#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))

Hello,

I’ve run the exercise twice now and keep getting it wrong because I just can’t understand the indentation. If anyone would be so kind as to explain it to me, it would be much appreciated. Thank you.

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
smallest_error = float("inf")
best_m = 0
best_b = 0
for m in possible_ms:
    for b in possible_bs:
        error = calculate_all_error(m, b, datapoints)
        if error < smallest_error:
            best_m = m
            best_b = b
            smallest_error = error
print(best_m, best_b, smallest_error)
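Here is one way to read the indentation, as a sketch on a tiny made-up dataset (the points and candidate lists below are not the project's, and calculate_all_error is a compact stand-in). Each extra level of indentation means "this line belongs to the statement above it that ends in a colon".

```python
def calculate_all_error(m, b, points):
    # Total vertical distance from the points to the line y = m*x + b.
    return sum(abs(m * x + b - y) for x, y in points)

# Tiny illustrative inputs: three points that lie on y = 2x + 1.
points = [(0, 1), (1, 3), (2, 5)]
possible_ms = [1, 2, 3]
possible_bs = [0, 1, 2]

smallest_error = float("inf")
best_m = 0
best_b = 0

for m in possible_ms:              # no indent: runs once per slope
    for b in possible_bs:          # inside the m loop: runs once per intercept, for each slope
        # Inside both loops: runs for every (m, b) pair.
        error = calculate_all_error(m, b, points)
        if error < smallest_error:
            # Inside the if: runs only when this pair beats the best so far.
            best_m = m
            best_b = b
            smallest_error = error

# Back to no indent: runs once, after both loops have finished.
print(best_m, best_b, smallest_error)  # 2 1 0
```

If the print were indented under the if, it would print every time a better line was found. If the three assignments were not indented under the if, they would overwrite the best values on every pass.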

Not so sure how to embed the code properly. However, I have found that the list built from the range will differ, with funny digits added seemingly at random.

Here is my working code:

Task 1

Write your get_y() function here

def get_y(m,b,x):
    return m*x+b

Uncomment each print() statement to check your work. Each of the following should print True

print(get_y(1, 0, 7) == 7)
print(get_y(5, 10, 3) == 25)

Tasks 2 and 3

Write your calculate_error() function here

def calculate_error(m,b,point):
    x_point=point[0]
    y_point=point[1]
    return abs(get_y(m,b,x_point)-y_point)

Task 4

Uncomment each print() statement and check the output against the expected result

this is a line that looks like y = x, so (3, 3) should lie on it. thus, error should be 0:

print(calculate_error(1, 0, (3, 3)))

the point (3, 4) should be 1 unit away from the line y = x:

print(calculate_error(1, 0, (3, 4)))

the point (3, 3) should be 1 unit away from the line y = x - 1:

print(calculate_error(1, -1, (3, 3)))

the point (3, 3) should be 5 units away from the line y = -x + 1:

print(calculate_error(-1, 1, (3, 3)))

Task 5

Write your calculate_all_error() function here

def calculate_all_error(m,b,points):
    total_error=0
    for point in points:
        total_error+=calculate_error(m,b,point)
    return total_error

Task 6

Uncomment each print() statement and check the output against the expected result

datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]

every point in this dataset lies upon y=x, so the total error should be zero:

print(calculate_all_error(1, 0, datapoints))

every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:

print(calculate_all_error(1, 1, datapoints))

every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:

print(calculate_all_error(1, -1, datapoints))

the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be

1 + 5 + 9 + 3 = 18

print(calculate_all_error(-1, 1, datapoints))

Tasks 8 and 9

possible_ms = [m*0.1 for m in range(-100,101)]
possible_bs = [b*0.1 for b in range(-200,201)]

Task 10

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
smallest_error=float("inf")
best_m=0
best_b=0

Tasks 11 and 12

i=0

while i<len(possible_ms):
    j=0
    while j<len(possible_bs):
        current_error=calculate_all_error(possible_ms[i],possible_bs[j],datapoints)
        if current_error<smallest_error:
            best_m=possible_ms[i]
            best_b=possible_bs[j]
            smallest_error=current_error
        j+=1
    i+=1
print("Y = {}x +{} will yield with the smallest error of: {}".format(best_m,best_b,smallest_error))

Task 13

print("A 6 cm ball will bounce :"+str(get_y(best_m,best_b,6))+" m.")

Hey everyone!
I’m on part 2 of the project and I’m confused as to why the code I wrote doesn’t create the list I expected.
Here’s what I get: [-10, -9.9, -9.8, -9.700000000000001, -9.600000000000001, -9.500000000000002, -9.400000000000002, -9.300000000000002 […]]

Does anyone understand what’s up with the decimals?
Thanks!! :slight_smile:

possible_ms = [-10]
while possible_ms[-1] < 10:
    possible_ms.append(possible_ms[-1] + 0.1)
print(possible_ms)
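What seems to be happening is that 0.1 cannot be stored exactly as a binary float, and adding it over and over carries each small rounding error into the next value. A short sketch of the effect, plus one possible alternative that computes every value directly from an integer:

```python
# Repeatedly adding 0.1 carries each rounding error into the next value.
accumulated = [-10]
for _ in range(3):
    accumulated.append(accumulated[-1] + 0.1)
print(accumulated)  # [-10, -9.9, -9.8, -9.700000000000001]

# Computing each value directly from an integer avoids the build-up.
possible_ms = [i / 10 for i in range(-100, 101)]
print(possible_ms[:4])  # [-10.0, -9.9, -9.8, -9.7]
print(possible_ms[-1])  # 10.0
```

The values in the second list are still floats, so they are the closest representable numbers to each tenth and not exact tenths, but the error no longer grows along the list.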