Review: Introduction to Linear Regression
We have seen how to implement a linear regression algorithm in Python, and how to use the linear regression model from scikit-learn. We learned:
- We can measure how well a line fits by measuring loss.
- The goal of linear regression is to minimize loss.
- To find the line of best fit, we try to find the `b` value (intercept) and the `m` value (slope) that minimize loss.
- Convergence refers to when the parameters stop changing with each iteration.
- Learning rate refers to how much the parameters are changed on each iteration.
- We can use scikit-learn's `LinearRegression()` model to perform linear regression on a set of points.
These are important tools to have in your toolkit as you continue your exploration of data science.
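To tie these bullet points together, here is a minimal gradient-descent sketch for simple linear regression. It is not the lesson's exact implementation (the function name and the toy data points are illustrative), but it shows where the loss, the learning rate, and the convergence check each appear.

def gradient_descent(x, y, learning_rate=0.01, num_iterations=1000, tol=1e-6):
    b, m = 0.0, 0.0          # start with a flat line: intercept 0, slope 0
    n = len(x)
    prev_loss = float("inf")
    for _ in range(num_iterations):
        y_pred = [m * xi + b for xi in x]
        # Loss: mean squared error between predictions and actual values
        loss = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred)) / n
        # Gradients of the loss with respect to b and m
        b_grad = (-2 / n) * sum(yi - yp for yi, yp in zip(y, y_pred))
        m_grad = (-2 / n) * sum(xi * (yi - yp) for xi, yi, yp in zip(x, y, y_pred))
        # The learning rate controls how far each update moves the parameters
        b -= learning_rate * b_grad
        m -= learning_rate * m_grad
        # Convergence: stop once the loss barely changes between iterations
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return b, m

# Toy usage: points that lie exactly on y = 2x
b, m = gradient_descent([1, 2, 3, 4], [2, 4, 6, 8])
print(b, m)  # m should end up close to 2 and b close to 0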
Instructions
Find another dataset, maybe from scikit-learn’s example datasets, or from Kaggle, a great resource for tons of interesting data.
Try to perform linear regression on your own! If you find any cool linear correlations, make sure to share them!
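If you want a starting point before hunting on Kaggle, any of scikit-learn's built-in datasets works the same way. As one possible (purely illustrative) choice, the diabetes dataset can be loaded and regressed like this:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
# Use a single feature (BMI, column 2) so this stays a one-variable regression
X = diabetes.data[:, [2]]
y = diabetes.target

model = LinearRegression()
model.fit(X, y)
print(model.score(X, y))  # R^2 for the BMI-only fit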
As a starter, we’ve loaded in the Boston housing dataset. We made the `X` values the nitric oxides concentration (parts per 10 million) and the `y` values the housing prices. See if you can perform regression on these houses!
text = """.. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:** \n\n :Number of Instances: 506 \n\n :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n :Attribute Information (in order):\n - CRIM per capita crime rate by town\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n - INDUS proportion of non-retail business acres per town\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n - NOX nitric oxides concentration (parts per 10 million)\n - RM average number of rooms per dwelling\n - AGE proportion of owner-occupied units built prior to 1940\n - DIS weighted distances to five Boston employment centres\n - RAD index of accessibility to radial highways\n - TAX full-value property-tax rate per $10,000\n - PTRATIO pupil-teacher ratio by town\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n - LSTAT % lower status of the population\n - MEDV Median value of owner-occupied homes in $1000's\n\n :Missing Attribute Values: None\n\n :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980. N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems. \n \n.. topic:: References\n\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n"""
print(text)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
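# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version.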
matplotlib.rcdefaults()
plt.rcParams["figure.dpi"] = 140
# Boston housing dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Set the X values to the nitric oxides concentration (NOX):
X = df[["NOX"]]
# Y-values are the prices:
y = boston.target
df.head()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
df.describe()
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 |
# Can we do linear regression on this?
line_fitter = LinearRegression()
line_fitter.fit(X, y)
y_predict = line_fitter.predict(X)
plt.scatter(X, y, alpha=0.4)
# Plot line here:
plt.plot(X, y_predict, color="red", linestyle="dashed", label="Trend")
plt.title("Boston Housing Dataset")
plt.xlabel("Nitric Oxides Concentration (parts per 10 million)")
plt.ylabel("Median House Price ($1000s)")
plt.show()
print(f"R^2 score: {line_fitter.score(X, y)}")
R^2 score: 0.182603042501699
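As a quick sanity check on what `score()` returns, R^2 can be recomputed by hand from the `y` and `y_predict` arrays defined above; this snippet simply re-derives the same number.

import numpy as np

# R^2 is the coefficient of determination: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_predict) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
print(1 - ss_res / ss_tot)  # matches line_fitter.score(X, y)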
Community Forums
- Can linear regression apply to more than two variables?

Yes. In this lesson we only considered linear regression between one dependent variable and one independent variable, but linear regression can also be applied between one dependent variable and two or more independent variables. This is known as multiple linear regression, which you can learn more about in later lessons in the “Machine Learning” course. Similar to simple linear regression, multiple linear regression finds a relationship between the dependent variable and the independent variables. For example, simple linear regression could model the relationship between the market price of a house and its square footage, whereas multiple linear regression could model the market price as a function of the square footage, number of rooms, and location.
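To make the answer concrete, here is a small multiple linear regression sketch that reuses the `df` and `y` defined earlier in this notebook; the choice of NOX, RM, and LSTAT as predictors is purely illustrative.

# Multiple linear regression: several independent variables at once
X_multi = df[["NOX", "RM", "LSTAT"]]

multi_fitter = LinearRegression()
multi_fitter.fit(X_multi, y)

print(multi_fitter.coef_)              # one slope per feature
print(multi_fitter.intercept_)         # single intercept
print(multi_fitter.score(X_multi, y))  # R^2 improves over the NOX-only model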