What is logistic regression?
It’s a classification algorithm. A means of grouping observations based on shared characteristics.
More specifically, in the case of logistic regression, we classify based on whether something does or does not happen. It’s black or white, with no room for grey. While explanatory variables may vary, the outcome variable must be binary (0/1, True/False, Yes/No, Pass/Fail, etc.).
Thus, when we classify our data, we consider an outcome variable whose values sit at two extremes and then fit an S-shaped (logistic) curve to help distinguish between them.
Note the tie between “logistic” and the logarithm: the model works on the log odds scale. When we use the term “logistic”, we aren’t using it to mean “the coordination of goods or procurement”. We’re using it to describe the shape of our regression (or fit) line: a squiggly, sort of “tamed S” shape that can end up looking much like the image on the right below:

We see an abrupt change from 0 to 1. The data’s binary.
In order to apply linear regression, the dependent variable must be continuous. In the case above, the dependent variable is discrete (not continuous), so a straight regression line is NOT the best option. This is a case where logistic regression, which is better suited to binary data, is superior.
If we look at the logistic regression line above, we may note that low X-axis values are classified as Y=0, high X-axis values are classified as Y=1, and those in between are classified by how far up the logistic curve they fall (i.e. nearness to Y=0 vs. Y=1). One way to think about it is to treat the height of the curve at our X-axis value as the probability that Y=1: higher X-axis values have a higher probability of being Y=1 while lower values have a lower probability.
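To make that mapping concrete, here’s a minimal sketch (plain base R, not code from the original post) of the logistic, or sigmoid, function that produces this S-shaped curve:

```r
# The logistic (sigmoid) function squashes any x value into a probability
# between 0 and 1 -- the height of the S-curve at that x.
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- seq(-6, 6, by = 0.1)
plot(x, sigmoid(x), type = "l",
     xlab = "X", ylab = "P(Y = 1)",
     main = "Logistic (S-shaped) curve")
abline(h = 0.5, lty = 2)  # a common 0.5 cutoff for calling an observation 0 vs. 1
```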
When do we apply logistic regression?
With regard to its application, logistic regression can be used for (email) spam detection, image segmentation and categorization, predicting mortality in injured patients, or predicting the species of a bird (for instance).
In order to gain a greater understanding of logistic regression, we’ll apply it to a super-simplified version of this last example. I’m going to load in the Palmer penguin dataset and then we’re going to apply logistic regression to predict whether or not penguins are of the `Adelie` species.
Upon loading and performing light EDA on our dataset, we’re interested in building 2 models so that we can later assess their accuracy and strength. For these purposes, we utilize a number of R’s built-in functions (sketched just after this list):

- glm() to fit a logistic model (family = binomial) to our variables (defined as y~x). “adelie~.” just means that I took adelie as our dependent variable and all other variables as independent variables.
- summary() to observe summary statistics.
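The original code isn’t reproduced here, but a minimal sketch of this step might look like the following (the `pen` object and the `adelie` 0/1 column are assumptions based on the wording in this post, not the exact original code):

```r
# Sketch of building model_1 (data-prep details are assumptions).
library(palmerpenguins)

pen <- na.omit(penguins)                           # light cleaning: drop rows with NAs
pen$adelie <- as.integer(pen$species == "Adelie")  # binary outcome: 1 = Adelie, 0 = not
pen$species <- NULL                                # drop species so "~ ." can't use it

# family = binomial tells glm() to fit a logistic model; "adelie ~ ." uses
# every remaining column as an explanatory variable.
model_1 <- glm(adelie ~ ., data = pen, family = binomial)
summary(model_1)
```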
Next we’ll visit the summary statistics produced from the above R code:

For continuous variables, the interpretation is as follows:
- For every unit increase in bill_length_mm, the log odds of being Adelie decreased by 2.251 × 10¹.
- For every unit increase in bill_depth_mm, the log odds of being Adelie increased by 2.930 × 10¹.
- For every unit increase in flipper_length_mm, the log odds of being Adelie decreased by 1.109 × 10⁰.
- For every unit increase in body_mass_g, the log odds of being Adelie increased by 3.695 × 10⁻².
- For every unit increase in year, the log odds of being Adelie increased by 7.962 × 10⁰.
For categorical variables, the interpretation is as follows:
- If the island is Dream, the log odds of being Adelie decreased by 4.543 × 10⁰.
- If the island is Torgersen, the log odds of being Adelie increased by 4.460 × 10¹.
- If the sex is male, the log odds of being Adelie increased by 1.161 × 10¹.
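Since these coefficients live on the log odds scale, one optional follow-up (not part of the original post) is to exponentiate them into odds ratios, which many readers find easier to interpret:

```r
# exp() of a log-odds coefficient gives an odds ratio: the multiplicative
# change in the odds of being Adelie for a one-unit increase in that variable.
exp(coef(model_1))
```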
In addition to the variable interpretations above, the summary statistics provide us with an AIC value of 18 and p-values of ~1.
From these statistics (which we’ll dig into in more depth later), we determine that there’s likely a better model, and so … we utilize a built-in R function to help us out:

stepAIC() (from the MASS package, which ships with R) performs stepwise model selection by optimizing the AIC value.
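A minimal sketch of that step, assuming stepwise selection in both directions (the exact arguments used in the original post aren’t shown here):

```r
# stepAIC() lives in the MASS package.
library(MASS)

# Start from model_1 and add/drop terms until the AIC stops improving.
model_2 <- stepAIC(model_1, direction = "both", trace = FALSE)
summary(model_2)

AIC(model_1)  # the original model's AIC
AIC(model_2)  # the AIC-optimized model's AIC
```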
Since the Akaike information criterion (AIC) is used as a means of estimating relative prediction error, we’ll keep model_1’s AIC score in mind when visiting the score for model_2 (which was AIC-optimized):

Optimizing based on AIC value resulted in a model that:
- was reduced from eight variables to just three (bill_length_mm, bill_depth_mm, and sexmale),
- had an AIC value improved from 18 (for model_1) to 8, and
- maintained p-values ~1.
While I was initially concerned about this model’s high p-values, I elected to proceed to assessing it because the lower AIC score points to better predictive performance.
As a final step, we dig into the accuracy of our model via a Confusion Matrix:

We add a Prediction variable to our pen (penguins) dataset, assigning each observation “+ve” (Adelie) if its fitted / predicted value from model_2 is greater than 0.5 and “-ve” (not Adelie) otherwise.
We use R’s built-in table() function to build a contingency table (our Confusion Matrix), which counts how many observations fall into each combination of predicted and actual value.
We then assess the overall model accuracy by summing our diagonal (True Positive + True Negative) and dividing by the total number of predictions. In short, this value captures how many of our predictions were correct (i.e. 99/100 → 99% accuracy).
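A sketch of those three steps (the “+ve”/“-ve” labels and the 0.5 cutoff follow the wording above; the column names are assumptions):

```r
# Classify each penguin using model_2's fitted probabilities and a 0.5 cutoff.
pen$Prediction <- ifelse(fitted(model_2) > 0.5, "+ve", "-ve")
pen$Actual     <- ifelse(pen$adelie == 1, "+ve", "-ve")

# Contingency table of actual vs. predicted class -- our confusion matrix.
conf_mat <- table(Actual = pen$Actual, Predicted = pen$Prediction)
conf_mat

# Overall accuracy: correct predictions (the diagonal) over all predictions.
sum(diag(conf_mat)) / sum(conf_mat)
```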

Based on the Confusion Matrix, our results are as follows:
- True Positive Rate (TPR): 146 / 146 = 100%
- False Positive Rate (FPR): 0 / 187 = 0%
- True Negative Rate (TNR): 187 / 187 = 100%
- False Negative Rate (FNR): 0 / 146 = 0%
We then take the sum of the diagonal of our Confusion Matrix (our correct predictions) and put that over the sum of all values. As you may have expected, the output is 1.0, or 100% accuracy.
Based on the Confusion Matrix, optimizing based on AIC score worked out in our favor. We’ve produced a model with 100% accuracy :)
The black and white of logistic regression classified the black and white coats of our Palmer penguins perfectly. It appears to have been the perfect match :)
This is not always the case though. There are times where we’ll have to dig deeper into our bag of tricks. Where we’ll have to lean on more advanced classification techniques. That’s a tale for another day though …