Continuing the discussion from What are some reasons that we square the differences?:
Hi, from my perspective, how to define a proper loss function is an important topic in statistical modeling. In linear regression, we do not have to square the differences: the key point is to define a “distance” between the predicted y and the actual y that is non-negative and symmetric. For example, the distance between y_predicted = 2 and y = 0 should equal the distance between y_predicted = 0 and y = 2. Following this idea,
- Squaring the difference is one candidate. This gives least-squares regression (also called L2 regression in a more technical context). Because squaring magnifies large differences between y_predicted and y, the fit is sensitive to outliers in the data.
- We can also use the absolute value, which penalizes large deviations less heavily and is therefore more robust to outliers. This gives least absolute deviations regression (also called L1 regression). If you are interested, please turn to the following link:
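
To make the contrast between the two options concrete, here is a minimal Python sketch. It is my own illustration, not taken from any particular library, and it deliberately uses the simplest possible "regression": predicting every y with a single constant c. In that special case the squared loss is minimized by the mean and the absolute loss by the median, so the effect of one outlier is easy to see. The helper names (`squared_loss`, `absolute_loss`, `total_loss`) and the data are made up for this example.

```python
from statistics import mean, median


def squared_loss(y_pred, y):
    """L2-style distance: grows quadratically with the difference."""
    return (y_pred - y) ** 2


def absolute_loss(y_pred, y):
    """L1-style distance: grows linearly with the difference."""
    return abs(y_pred - y)


def total_loss(loss, c, ys):
    """Sum of the per-point losses when every y is predicted by the constant c."""
    return sum(loss(c, y) for y in ys)


# Both distances are symmetric: swapping prediction and target changes nothing.
assert squared_loss(2, 0) == squared_loss(0, 2)
assert absolute_loss(2, 0) == absolute_loss(0, 2)

# Hypothetical data: four ordinary points and one outlier (100.0).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

l2_fit = mean(ys)    # best constant under squared loss
l1_fit = median(ys)  # best constant under absolute loss

print("L2 (mean) fit:  ", l2_fit)
print("L1 (median) fit:", l1_fit)
```

The outlier pulls the L2 fit far away from the bulk of the data, while the L1 fit stays among the four ordinary points. The same effect shows up in full linear regression, although there the L1 fit generally has to be computed numerically.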
Of course, we are not saying that L1 regression must be better than L2 regression just because it is more robust. There are always trade-offs, and sometimes we still prefer L2 regression (e.g. its solution can be written in closed form, which makes it nicer to work with). Furthermore, this is not the end of the story: we can also define other distances to form other regression models. Therefore, we should not restrict our imagination, and should feel free to explore these variations.
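
As one example of such an "other distance" (my own choice for illustration, not something from the original discussion), the Huber loss behaves like the squared loss for small differences and like the absolute loss for large ones. Here is a rough sketch, where the function name `huber_loss` and the threshold `delta = 1.0` are assumptions made for this example:

```python
def huber_loss(y_pred, y, delta=1.0):
    """Huber loss: quadratic when |difference| <= delta, linear beyond it."""
    r = abs(y_pred - y)
    if r <= delta:
        return 0.5 * r ** 2
    # The linear branch is shifted so the two pieces meet at r == delta.
    return delta * (r - 0.5 * delta)


# Small difference: behaves like (half of) the squared loss.
small = huber_loss(0.5, 0.0)

# Large difference: grows only linearly, like the absolute loss.
large = huber_loss(3.0, 0.0)

print(small, large)
```

The threshold delta controls where the loss switches from quadratic to linear, so it is one more trade-off to tune rather than a free win.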
Also, feel free to leave your comments and thoughts on this topic.