What is elementary linear regression?
When we say “elementary” linear regression, we mean taking the simplest principles, the building blocks of an idea (in this case, linear regression), and exploring those building blocks in their absolute simplest form.
For this reason, it’s more commonly referred to as simple linear regression.
Our application of linear regression is an attempt to not only explore but to quantify (put a number to) the relationship between a quantitative response variable and one or more explanatory variables.
Before moving forward, let’s quickly review a couple of definitions:
- Qualitative variables are non-numerical and categorical (e.g., eye color) whereas quantitative variables are numerical (e.g., height).
- Response variables are dependent output or “y” variables whereas explanatory variables are independent input or “X” variables. Typically we use the response (output) variable to assess the effect of changing our explanatory (input) variable.
When we take the application of linear regression to its elementary form, we’re considering a 1-to-1 relationship, meaning that we explore the strength and character of the relationship between our response variable and ONE explanatory variable.
As an equation, this relationship can be represented by the simple linear equation y = mX + b, where:
- y - response (output) variable,
- m - slope of line,
- X - explanatory (input) variable, and
- b - y-intercept of line.
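To make the equation concrete, here is a minimal R sketch with made-up slope and intercept values (both chosen purely for illustration):

```r
# Hypothetical slope and intercept, chosen purely for illustration
m <- 0.5  # slope: the change in y for each one-unit change in X
b <- 2    # y-intercept: the value of y when X = 0

X <- c(1, 2, 3, 4, 5)  # explanatory (input) values
y <- m * X + b         # response (output) values

y  # 2.5 3.0 3.5 4.0 4.5
```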
When do we apply linear regression?
We apply linear regression in this sort of simple manner to assess the strength of the relationship between two continuous variables. In practice, simple linear regression might be applied to understand the impact of marketing spend on sales, of rainfall amounts on erosion, or of the number of base hits on the number of wins (in baseball).
In order to gain a greater understanding of linear regression, we’ll apply it to this last example. We’re going to explore the strength of the relationship between Base Hits by Batters and Number of Wins.
Upon loading and performing light EDA on our dataset (possibly the subject of a future article?), we’re interested in visualizing and assessing the strength of correlation between Base Hits and Target Wins. For these purposes, we utilize a number of R’s built-in functions:

- plot() to generate a simple scatter plot of Base Hits (our X variable) vs. Target Wins (our y variable).
- lm() to fit a linear model to our variables (defined as y ~ X).
- abline() to add our straight regression line to the plot.
- summary() to observe summary statistics.
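For reference, here is a minimal sketch of how those calls fit together. The data frame name (moneyball) is an assumption for illustration; the column names TEAM_BATTING_H and TARGET_WINS match the summary output discussed below:

```r
# "moneyball" is an assumed data frame name; substitute your own dataset
# Scatter plot of Base Hits (X) vs. Target Wins (y)
plot(moneyball$TEAM_BATTING_H, moneyball$TARGET_WINS,
     xlab = "Base Hits by Batters", ylab = "Target Wins")

# Fit a simple linear model, defined as y ~ X
model <- lm(TARGET_WINS ~ TEAM_BATTING_H, data = moneyball)

# Overlay the fitted regression line on the scatter plot
abline(model)

# Observe summary statistics for the fitted model
summary(model)
```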
First we’ll visit the scatterplot produced from the above R code:
[Scatter plot: Base Hits by Batters vs. Target Wins, with fitted regression line]
We can make a number of observations from the plot above:
- Judging by the number of data points, this is a large dataset with quite a bit of variation in values. While basic, this is worth noting.
- Our residuals appear to have quite a range to them. A residual is the vertical distance between an actual data point and the value our line predicts for it (a quick way to inspect residuals in R is shown just after this list). While a wide range may be indicative of a poor fit, we’ll want to consider our summary statistics before jumping to any conclusions.
- The upward slope of our line (recall the “m” variable from earlier) indicates a positive correlation between Base Hits and Wins. We can loosely interpret this as “the more base hits a baseball team has, the more likely they are to win”.
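As a quick aside, R exposes the residuals of a fitted model directly; here is a minimal sketch, reusing the model object fit above:

```r
# Residuals: actual wins minus the wins predicted by our line
res <- residuals(model)

# A quick sense of their spread
summary(res)
hist(res, main = "Model Residuals", xlab = "Actual Wins - Predicted Wins")
```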
With these observations in mind, we can move on to the summary statistics:
[Output of summary(model): Coefficients table, residual standard error, and R-squared]
While the summary statistics are by no means comprehensive, they provide quite a bit of insight regarding the performance and accuracy of our simple linear regression model. From these statistics, we can draw the following:
- Our linear equation: TARGET_WINS = (0.042 * TEAM_BATTING_H) + 18.562. To find these values, we use the “Estimate” column of the “Coefficients” section (extracting them in R is sketched just after this list).
- From this equation (and our Coefficients), we see (1) there is a positive correlation between Base Hits and Wins and (2) our model is flawed. If we were to plug in 0 for TEAM_BATTING_H, which would mean 0 hits for the entire season, the team would still end up with 18.5 wins. Obviously this is impossible, so we can make an early mental note of the presence of error in our model (more on this later).
- From the p-value(s), we can infer that there is a relationship between our variables. Thus, we can reject the null hypothesis (that there is no relationship between a team’s base hits and their wins).
- While the residual standard error (RSE) is by no means mountainous, it is indicative of error. What this means, in short, is that we can improve our model’s fit and accuracy. The RSE provides a measure of our residuals’ average distance from the fitted line; values closer to zero indicate a better fit.
- Our R-squared value of 0.151 indicates that our model explains only about 15% of the variation in Wins. In other words, Base Hits alone are not enough to predict Wins for a baseball team.
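For reference, here is a minimal sketch of pulling those coefficient values out of the fitted model and checking the zero-hits scenario from the second point above (again reusing the model object from earlier):

```r
# The "Estimate" column of the "Coefficients" section
coef(model)
# (Intercept)   TEAM_BATTING_H
#   ~18.562         ~0.042

# Predicted wins for a team with zero base hits all season:
# the impossible ~18.5 wins noted above
predict(model, newdata = data.frame(TEAM_BATTING_H = 0))
```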
The points above are motivation enough for throwing out the simple linear regression model we’ve been interpreting. While there are benefits to applying simple linear regression, and there are times the simple model is enough to represent the data it was generated from, this was not one of those cases. This flows very nicely into our closing …
We see that a simple linear regression model is relatively easy to apply, interpret, and explain. This is one of the major benefits of applying linear regression.
In this case though, we were dealing with a larger, more complex dataset, and thus electing one independent variable simply did not provide a strong enough representation of the dataset it was generated from.
While simplicity for simplicity’s sake is alluring, it isn’t perfect. Just like “more” isn’t always the solution, neither is “less”. Sometimes the best option is somewhere in between.
This may not mean electing the most complex of models; it might just mean incorporating a few more independent variables. It might just mean applying multiple linear regression. And as Sherlock Holmes says when “on the trail” of a detective case, this next step is “Elementary, my dear Watson”.
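In R, that next step is a one-line change to the model formula. A minimal sketch, where TEAM_BATTING_2B and TEAM_BATTING_HR stand in as hypothetical additional predictors:

```r
# Multiple linear regression: add more explanatory variables to the formula.
# The extra column names below are hypothetical stand-ins.
multi_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR,
                  data = moneyball)
summary(multi_model)
```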