FAQ: Linear Regression - Review

This community-built FAQ covers the “Review” exercise from the lesson “Linear Regression”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science

Machine Learning


import codecademylib3_seaborn
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Boston housing dataset
boston = load_boston()

df = pd.DataFrame(boston.data, columns = boston.feature_names)

# Set the x-values to the nitrogen oxide concentration:
X = df[['NOX']]
# Y-values are the prices:
y = boston.target

# Can we do linear regression on this?




plt.scatter(X, y, alpha=0.4)
# Plot line here:

plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration")
plt.ylabel("House Price ($)")
plt.show()
line_fitter = LinearRegression()
y_predicted = line_fitter.predict(X)
plt.plot(X,y_predicted)
plt.show()

Can anyone tell me why my linear regression line doesn’t show up in the graph? I am really confused.

Hi eleseeu,

The line line_fitter.fit(X,y) is missing before the call to the method .predict().
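To illustrate the point about calling fit() first, here is a minimal sketch. It uses a few made-up points on the line y = 2x + 1 rather than the Boston data, so the numbers are assumptions for illustration only. In scikit-learn, calling predict() on an unfitted estimator raises NotFittedError.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1 (not the Boston data).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

line_fitter = LinearRegression()

# Calling predict() before fit() raises NotFittedError.
try:
    line_fitter.predict(X)
    raised = False
except NotFittedError:
    raised = True

# After fit(), predict() returns one predicted value per row of X.
line_fitter.fit(X, y)
y_predicted = line_fitter.predict(X)
print(raised, y_predicted)
```

In the exercise environment the error message may only appear in the console, which would explain why the plot simply shows no line.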

import codecademylib3_seaborn
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Boston housing dataset
boston = load_boston()

df = pd.DataFrame(boston.data, columns = boston.feature_names)

# Set the x-values to the nitrogen oxide concentration:
X = df[['NOX']]
# Y-values are the prices:
y = boston.target

# Can we do linear regression on this?
line_fitter = LinearRegression()
line_fitter.fit(X, y)
concentration_predict = line_fitter.predict(X)


plt.scatter(X, y, alpha=0.4)
# Plot line here:
plt.plot(X, y)
plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration")
plt.ylabel("House Price ($)")
plt.show()

I got a bird’s nest of lines, not sure if I did it right.

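You need to call plt.plot(X, concentration_predict) instead of plt.plot(X, y). Plotting y connects the raw data points in the order they appear in the dataset, which is what produces the tangle of lines.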
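A small sketch of why the two calls look so different, assuming a handful of made-up, unsorted points and using np.polyfit as a stand-in for LinearRegression. A line plot draws a segment between each pair of consecutive points, so it helps to compare the slopes of those segments.

```python
import numpy as np

# Made-up, unsorted points (hypothetical values, not the Boston data).
x = np.array([0.7, 0.4, 0.6, 0.5, 0.8])
y = np.array([15.0, 30.0, 18.0, 27.0, 12.0])

# Least-squares line via NumPy; stands in for LinearRegression here.
m, b = np.polyfit(x, y, 1)
y_predict = m * x + b


def segment_slopes(xs, ys):
    """Slopes of the segments a line plot would draw, in the order given."""
    return np.diff(ys) / np.diff(xs)


raw_slopes = segment_slopes(x, y)          # differ from segment to segment
fit_slopes = segment_slopes(x, y_predict)  # every segment has slope m
print(raw_slopes)
print(fit_slopes)
```

Because every segment between predicted points has the same slope, the segments overlap into one straight line no matter what order the points are in, while the raw values zigzag.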


import codecademylib3_seaborn
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Boston housing dataset
boston = load_boston()

df = pd.DataFrame(boston.data, columns = boston.feature_names)

# Set the x-values to the nitrogen oxide concentration:
X = df[['NOX']]
# Y-values are the prices:
y = boston.target

# Can we do linear regression on this?
plt.scatter(X, y, alpha=0.4)
# Plot line here:
line_fitter = LinerRegression()
line_fitter.fit(X,y)
y_predict = line_fitter.predict(x)
plt.plot(X,y_predict)
plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration")
plt.ylabel("House Price ($)")
plt.show()

I don’t know why the line doesn’t appear! I’ve read your previous replies but I can’t seem to manage it. Please help!

It’s better to keep the plotting calls on consecutive lines; I guess that’s the issue with your code.
If you arrange these lines at the bottom of your code:
plt.scatter(X, y, alpha=0.4)
plt.plot(X, y_predict)
plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration")
plt.ylabel("House Price ($)")
plt.show()

then hopefully it will work.

Also, there’s a typo in your code where you wrote:
y_predict = line_fitter.predict(x)
The “x” should be a capital “X”.


The class is LinearRegression, and you have written LinerRegression.


Review: Introduction to Linear Regression

We have seen how to implement a linear regression algorithm in Python, and how to use the linear regression model from scikit-learn. We learned:

  • We can measure how well a line fits by measuring loss.
  • The goal of linear regression is to minimize loss.
  • To find the line of best fit, we try to find the b value (intercept) and the
    m value (slope) that minimize loss.
  • Convergence refers to when the parameters stop changing with each iteration.
  • Learning rate refers to how much the parameters are changed on each
    iteration.
  • We can use Scikit-learn’s LinearRegression() model to perform linear
    regression on a set of points.

These are important tools to have in your toolkit as you continue your exploration of data science.
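The terms in the review list above (loss, intercept b, slope m, learning rate, convergence) can be tied together in a short sketch. This is one possible gradient-descent implementation under stated assumptions, not necessarily the lesson's own code: the function names, the mean-squared-error loss, the default learning rate and the tolerance are all choices made for this example.

```python
def loss(x, y, b, m):
    """Mean squared error of the line y = m*x + b on the points (x, y)."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y)) / len(x)


def step(x, y, b, m, learning_rate):
    """One gradient-descent update of the intercept b and the slope m."""
    n = len(x)
    b_grad = -2 / n * sum(yi - (m * xi + b) for xi, yi in zip(x, y))
    m_grad = -2 / n * sum(xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    return b - learning_rate * b_grad, m - learning_rate * m_grad


def fit_line(x, y, learning_rate=0.05, tolerance=1e-9, max_iterations=100_000):
    """Repeat updates until b and m stop changing (convergence)."""
    b, m = 0.0, 0.0
    for _ in range(max_iterations):
        new_b, new_m = step(x, y, b, m, learning_rate)
        converged = abs(new_b - b) < tolerance and abs(new_m - m) < tolerance
        b, m = new_b, new_m
        if converged:
            break
    return b, m


# Made-up points lying exactly on y = 2x + 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b, m = fit_line(x, y)
print(b, m, loss(x, y, b, m))
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow; 0.05 happens to work for these particular points.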

Instructions

Find another dataset, maybe in scikit-learn’s example datasets or on Kaggle, a great resource for tons of interesting data.

Try to perform linear regression on your own! If you find any cool linear correlations, make sure to share them!

As a starter, we’ve loaded in the Boston housing dataset. We made the X values the nitrogen oxides concentration (parts per 10 million), and the y values the housing prices. See if you can perform regression on these houses!

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

matplotlib.rcdefaults()
plt.rcParams["figure.dpi"] = 140

# Boston housing dataset
boston = load_boston()

df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Set the x-values to the nitrogen oxide concentration:
X = df[["NOX"]]
# Y-values are the prices:
y = boston.target
df.head()
     CRIM       ZN    INDUS     CHAS      NOX       RM      AGE      DIS      RAD      TAX  PTRATIO        B    LSTAT
  0.00632     18.0     2.31      0.0    0.538    6.575     65.2   4.0900      1.0    296.0     15.3   396.90     4.98
  0.02731      0.0     7.07      0.0    0.469    6.421     78.9   4.9671      2.0    242.0     17.8   396.90     9.14
  0.02729      0.0     7.07      0.0    0.469    7.185     61.1   4.9671      2.0    242.0     17.8   392.83     4.03
  0.03237      0.0     2.18      0.0    0.458    6.998     45.8   6.0622      3.0    222.0     18.7   394.63     2.94
  0.06905      0.0     2.18      0.0    0.458    7.147     54.2   6.0622      3.0    222.0     18.7   396.90     5.33
df.describe()
             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000
# Can we do linear regression on this?
line_fitter = LinearRegression()
line_fitter.fit(X, y)
y_predict = line_fitter.predict(X)


plt.scatter(X, y, alpha=0.4)
# Plot line here:
plt.plot(X, y_predict, color="red", linestyle="dashed", label="Trend")
plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration")
plt.ylabel("House Price ($)")
plt.show()

print(f"R^2 score: {line_fitter.score(X, y)}")
R^2 score: 0.182603042501699
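For regressors such as LinearRegression, score() returns the coefficient of determination R^2, so the value above means that NOX alone accounts for roughly 18% of the variance in price. As a sketch of what that number measures, here is the formula written out with NumPy; the helper name r_squared and the sample values are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions leave no residual error.
perfect = r_squared(y_true, y_true)

# Always predicting the mean explains none of the variance.
baseline = r_squared(y_true, [2.5] * 4)
print(perfect, baseline)
```

A score of 1 is a perfect fit and a score of 0 is no better than always predicting the mean, so 0.18 indicates a weak linear relationship.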

Community Forums

  1. Can linear regression apply to more than two variables?

    Yes. In this lesson we only considered linear regression between one
    dependent variable and one independent variable. However, linear regression
    can also be applied between one dependent variable and two or more
    independent variables. This is known as multiple linear regression, which
    you can learn more about in later lessons in the “Machine Learning” course.

    Like simple linear regression, multiple linear regression finds a
    relationship between the dependent variable and the independent variables.

    For example, linear regression with one dependent and one independent
    variable could be used to find the relationship between the market price
    of a house and its square footage. With multiple linear regression, we
    could instead find the relationship between the market price and several
    features at once, such as square footage, number of rooms, and location.
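As a rough sketch of multiple linear regression with scikit-learn, the same LinearRegression class can be fitted on an X with more than one column. The houses below are entirely made up: the prices are generated from the assumed rule price = 100 * sqft + 5000 * rooms + 20000, so the fitted coefficients can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: square footage and number of rooms.
X = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 3.0],
    [2500.0, 4.0],
    [3000.0, 5.0],
])

# Made-up prices following price = 100*sqft + 5000*rooms + 20000.
y = 100.0 * X[:, 0] + 5000.0 * X[:, 1] + 20000.0

model = LinearRegression()
model.fit(X, y)

# One coefficient per column of X, plus a single intercept.
print(model.coef_, model.intercept_)
```

A categorical feature such as location would first need to be encoded as numbers, which this sketch leaves out.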


Why do we need to use an upper-case “X” in the code under “# Set the x-values to the nitrogen oxide concentration”?

import codecademylib3_seaborn
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

# Boston housing dataset
boston = load_boston()

df = pd.DataFrame(boston.data, columns = boston.feature_names)

# Set the x-values to the nitrogen oxide concentration:
X = df[['NOX']]
# Y-values are the prices:
y = boston.target

# Can we do linear regression on this?
line_fitter = LinearRegression()
line_fitter.fit(X, y)
NOX_predict = line_fitter.predict(X)

plt.scatter(X, y, alpha=0.4)
# Plot line here:
plt.plot(X, NOX_predict, color="red")
plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration")
plt.ylabel("House Price ($)")
plt.show()

This code works for me!!

Why didn’t we do y.reshape(-1, 1)? Thanks for your answer :)

Probably you’ve moved on from this question already, but let me try to answer this to see if I have understood everything correctly:

What reshape does is take the X values stored in a one-dimensional array, which you can think of as a single row in a table, and pivot them into a column, so that the linear regression model can use them to fit and predict y.

For example:

import numpy as np

list_rowshape = np.array([1, 2, 3, 4, 5, 6])
# Call reshape on the array itself, not on the built-in list type.
list_columnshape = list_rowshape.reshape(-1, 1)

print(list_rowshape)
# prints:
# [1 2 3 4 5 6]

print(list_columnshape)
# prints:
# [[1]
#  [2]
#  [3]
#  [4]
#  [5]
#  [6]]

In the example of the Boston Housing assignment, the X values are taken from a pandas dataframe.

X = df[['NOX']]

Since the source is already a table, and the double brackets select the X values as a single-column table, there is no need to reshape. It already has the appropriate shape!
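The shapes involved can be checked directly. This sketch builds a tiny stand-in DataFrame from the first three NOX values shown by df.head(); the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the exercise's DataFrame (first NOX values from df.head()).
df = pd.DataFrame({"NOX": [0.538, 0.469, 0.469]})

column_2d = df[["NOX"]]  # double brackets: DataFrame with shape (3, 1)
column_1d = df["NOX"]    # single brackets: Series with shape (3,)

# A 1-D selection would need reshaping before it could be used as X.
reshaped = column_1d.to_numpy().reshape(-1, 1)

print(column_2d.shape, column_1d.shape, reshaped.shape)
```

As for y, scikit-learn accepts a one-dimensional target of shape (n_samples,), which is what boston.target already is, so y does not need reshape(-1, 1).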


Thank you for your insightful answer :)