Multi Regression: The Next Step
A simple linear regression model is the easiest model to interpret.
Point. Blank. Period.
That said, fitting data on a one-to-one variable basis limits our model’s representative ability, and in many cases a single predictor simply isn’t enough.
Sometimes, to improve our model and better represent the data under consideration, we’ve got to incorporate more variables.
This leads, naturally, to the next step …
Multiple linear regression, or multi regression for short, doesn’t overcome all the weaknesses of a simple linear regression model, but it does expand our capability of representation.
Accounting for more independent variables enables us to better represent the data at hand. We’re no longer handcuffed to picking the ONE best variable, crossing our fingers, and hoping for the best.
If we recall, from Elementary Linear Regression:
Our application of linear regression is an attempt to not only explore but to quantify (put a number to) the relationship between a quantitative response variable and one or more explanatory variables.
Here, we’ll focus on the “more” portion. The difference between multi regression and simple linear regression’s equations is captured below:
- SLR: y = mX + b
- MLR: y = (m_1 * x_1) + (m_2 * x_2) + … + (m_n * x_n) + b
Both equations include a “y” and a “b” variable. These are representative of our dependent variable and y-intercept, respectively.
The difference, which is rather apparent, is that one equation is longer than the other. The SLR (simple linear regression) equation includes one “X” variable with one “m” coefficient, whereas the MLR (multiple linear regression) equation includes multiple “x” variables and corresponding “m” coefficients.
To summarize what each variable within the MLR equation means:
- y: the response (output) variable,
- m_1, m_2, … m_n: the coefficients, each associated with the explanatory variable of the same number,
- x_1, x_2, … x_n: the explanatory (input) variables, and
- b: the y-intercept of the line.
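To make the equation concrete, here’s a tiny sketch that evaluates the MLR formula by hand. The coefficients and inputs are made up purely for illustration:

```r
# Made-up values, purely to illustrate the formula
b <- 20                     # y-intercept
m <- c(0.04, 0.03, -0.02)   # m_1, m_2, m_3
x <- c(1450, 90, 110)       # x_1, x_2, x_3

# y = (m_1 * x_1) + (m_2 * x_2) + (m_3 * x_3) + b
y <- sum(m * x) + b
y
```

In practice we never compute this by hand; fitting a model means letting lm() find the m’s and b that best describe the data.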
When do we apply multiple linear regression?
We apply multi regression when we believe that our simple linear regression model could be improved by considering multiple independent variables rather than just one to predict our outcome / dependent variable.
By increasing the number of variables under consideration we increase the complexity of our model (a perceived “con”), while simultaneously increasing the breadth of its representative ability (a perceived “pro”).
Multi regression might be applied to predict a student’s GPA given their age, gender, and IQ. It might be used to predict cholesterol levels given a patient’s age, height, and weight. Or it might be applied to predict the number of wins for a given baseball team given their base hits, stolen bases, and fielding errors.
In order to gain a greater understanding of multiple linear regression, we’ll apply it to that last example. We’re going to explore the strength of the relationship between base hits, stolen bases, and fielding errors and the number of wins (in baseball).
If you read Elementary Linear Regression, you may recall that the predictive model had quite a bit of room for improvement. We’re going to attempt to improve upon that model by accounting for more variables.
Typically we’d perform EDA (exploratory data analysis) and some data preparation (dealing with NAs, outliers, etc.) prior to building our model, but we’ll bypass that code for simplicity’s sake and keep the focus on the model itself: including more variables and then comparing the resulting summary statistics with those of the simple linear regression model (linked to above).

With TARGET_WINS as our dependent variable and TEAM_BATTING_H, TEAM_BASERUN_SB, and TEAM_FIELDING_E as our independent variables, we apply regression via R’s built-in lm() function, assign the result to multi_model and observe corresponding summary statistics via R’s built-in summary() function.
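A minimal sketch of those calls follows. The data frame name baseball_df is my placeholder; substitute whatever object your data were loaded into:

```r
# Fit wins as a function of hits, stolen bases, and fielding errors
multi_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + TEAM_FIELDING_E,
                  data = baseball_df)

# Coefficient estimates, p-values, RSE, F-statistic, adjusted R-squared, etc.
summary(multi_model)
```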

Our coefficients (the first column, “Estimate”, of the “Coefficients” section) all make sense. If we think in the context of a team winning a baseball game, positive coefficients for TEAM_BATTING_H and TEAM_BASERUN_SB and a negative coefficient for TEAM_FIELDING_E mean that having more hits and stolen bases and fewer fielding errors are all indicators of a team winning. This makes perfect sense.
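If you only want the estimates themselves (and their signs) without the full summary table, coef() pulls them straight off the fitted model:

```r
coef(multi_model)  # named vector: intercept, then one slope per explanatory variable
```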
From the statistics above we can also compare our model’s summary statistics against those of our simple linear regression model (from Elementary Linear Regression):
- p-values: if we go to the “Pr(>|t|)” column and observe the value for each independent variable, we see that all values are “< 2e-16”. This is well beneath 0.05, which is often used as a selection threshold. Thus, all variables play a significant role in predicting wins.
- RSE: 14.52→12.47 is an improvement. Since the residual standard error indicates how far (on average) our residuals fall from the fit line, a lower value is better here.
- F-statistic: 404.9→288. The F-statistic measures the variability explained by the model relative to the variability left unexplained, where a value well above 1 is evidence in favor of a model’s efficacy. A drop when adding predictors is expected, since the explained variability is now spread across more terms; at 288 the model as a whole remains highly significant.
- Adj. R²: 0.1508→0.2865 is an improvement. Typically values near 1 are indicative of a strong model. Our relatively low value is concerning and indicative of a weak model, but it may be the best we can do given our dataset. Further exploration and visualization would be required to make any sort of “final ruling” on the efficacy of the model. (The sketch after this list shows how to pull each of these figures programmatically.)
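Each of those figures can be pulled out of the summary object directly, which makes the side-by-side comparison easy to script. A sketch, assuming the simple model from Elementary Linear Regression was saved as simple_model (a name I’m assuming here):

```r
multi_summary <- summary(multi_model)

multi_summary$sigma          # residual standard error (RSE)
multi_summary$fstatistic     # F-statistic and its degrees of freedom
multi_summary$adj.r.squared  # adjusted R-squared
multi_summary$coefficients   # estimates, std. errors, t values, Pr(>|t|)

# The same components exist for the simple model, e.g.:
# summary(simple_model)$adj.r.squared
```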
As a next step, we’ll introduce and explore corresponding residual plots.
A residual is a measure of the vertical distance between each data point and our fit line. Every data point has a residual. If the point sits directly on our fit line then its residual is 0. That’s a rare case though and what’s more likely is that there’s some distance between each point and our best fit line.
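In R those residuals are already stored on the fitted model, so inspecting them only takes a couple of lines:

```r
fit_vals <- fitted(multi_model)     # the model's predicted wins
resids   <- residuals(multi_model)  # observed wins minus predicted wins

summary(resids)  # ideally centered near 0 with no extreme skew
```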
Typically, we plot and interpret our residuals as a part of validating our model. Since the subject of this article is to compare the application of simple and multiple linear regression models, I’ll keep the residual interpretation “high level”. For more depth, check out this article.
Plotting residuals in R is rather simple:
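A sketch of the two calls described below, applied to the multi_model object fit earlier:

```r
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(multi_model)     # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
```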

We set the display layout of our plots to a 2 x 2 matrix, filled in a [1 2, 3 4] (row-wise) manner, and then we call R’s built-in plot() function on our model to generate the residual plots.
What follows is the result:

The residuals vs. fitted and scale-location plots suggest that heteroscedasticity (unequal variance about the fit line) may be a concern: ideally we’d see a consistent spread of points throughout the plot range, and here the dispersion isn’t quite even. With this said, there doesn’t appear to be a non-linear pattern to our data, so the output is OK and may be improved by dealing with influential outliers (noted below).
On a more positive note, our normal Q-Q and residuals vs. leverage plots are promising. The Q-Q plot shows our data following a straight line, as desired, while the leverage plot has its Cook’s distance lines (the red dashed lines) toward the edges of the plot, a positive indicator, with influential cases clearly marked (859, 1821, 2031).
Based on our summary statistics and residual plots, we see that our multi regression model is an improvement over the simple regression model. Considering more than one variable improved the performance of our model and introduced greater predictive capacity.
With that said, our adjusted R-squared value is cause for concern and evidence that this model still holds room for improvement. These improvements may come in the form of considering additional variables, dealing with outliers, or automated feature selection (forward, backward, or stepwise).
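As a sketch of that last option, R’s built-in step() function performs stepwise selection guided by AIC. Run against our three-variable model it can only consider dropping terms; starting from a richer model, direction = "both" would consider adding them back as well:

```r
# Backward elimination from the current model, guided by AIC
reduced_model <- step(multi_model, direction = "backward")
summary(reduced_model)
```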
In short, while a step in the right direction, this step also shows how much further there is to go and how important it is that the data we input is good and clean.