Introduction to Logistic Regression

Learn how logistic regression is implemented from scratch

Yassine EL KHAL
6 min read · Aug 25, 2020

Logistic regression is a classification algorithm that is quite popular in certain communities, especially in biostatistics, bioinformatics and credit scoring. It’s used to assign observations to a discrete set of classes (the target).

Why not logistic classification?

Strictly speaking, logistic regression is not a classification algorithm on its own. Because its output is a probability (a continuous number between 0 and 1), it only becomes a classification algorithm in combination with a decision boundary, which is generally fixed at 0.5.

For example, suppose a logistic regression has to predict whether an email is spam or not, and it outputs 0.2 for one email and 0.7 for another. By default, logistic regression makes it easy for us by assigning 0 (not spam) to the email that got 0.2 and 1 (spam) to the other one. The default threshold is 0.5, but we can change it.

Comparison to linear regression

Let’s suppose you have data on time spent studying, time spent playing and exam scores.

Linear regression: because it’s a regression, meaning the output is continuous, it could help us predict a student’s test score within a certain range.

Logistic regression: this one could help us predict whether the student passed the exam or not. Its output is binary, but we can also look at the probability given for each student (after all, the probability is the real output).

Types of logistic regression

  • Binary (ex: malignant or benign tumor)
  • Multinomial (ex: animal classification)
  • Ordinal (ex: low, medium, high)

Binary logistic regression

In this article we will cover binary logistic regression. In a nutshell, logistic regression is a sigmoid applied to a linear regression.

First of all, let’s see how linear regression works.

ŷ = ω₀ + ω₁x₁ + ω₂x₂ + … + ωₙxₙ

The formula of linear regression (source: Image by author)

where:

  • xᵢ are the features we have
  • ωᵢ are the coefficients of those features
  • ω₀ is the intercept (the bias term)
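As a minimal sketch (the function name and the NumPy usage are mine, not necessarily those of the article’s linked source code), the linear part can be computed like this:

```python
import numpy as np

def linear_score(X, w, w0):
    """Compute z = w0 + w1*x1 + ... + wn*xn for every row of X."""
    return w0 + X @ w
```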

The sigmoid function

Here comes the sigmoid function, which lets us switch from the linear regression principle to a logistic one.

h(x) = σ(ω₀ + ω₁x₁ + … + ωₙxₙ) = 1 / (1 + e^−(ω₀ + ω₁x₁ + … + ωₙxₙ))

The sigmoid function applied to the linear regression output (source: Image by author)

In order to map predicted values to probabilities, we use the sigmoid function: every real value is converted to a value between 0 and 1.

Sigmoid function (source: wikicommons)
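A minimal NumPy sketch of the sigmoid (the helper name is illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```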

Decision Boundary

We already talked about the decision boundary, or threshold. In order to map the output of the sigmoid, which is a value between 0 and 1, to a class, we choose a threshold that says whether the observation belongs to class A or class B. By default the boundary is 0.5, but in some fields such as bioinformatics it can be really dangerous not to lower it (when missing a positive case is much more costly than a false alarm).
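For illustration, applying the threshold to the predicted probabilities could look like the sketch below (the function name is mine; 0.5 is the default discussed above):

```python
import numpy as np

def predict_class(X, w, w0, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    proba = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))  # sigmoid of the linear score
    return (proba >= threshold).astype(int)
```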

Cost function

Unfortunately, we cannot use the mean squared error (MSE) used in linear regression. Why? Good question. Because our sigmoid function is not linear, squaring it would make the problem non-convex, so gradient descent would not work correctly: it may find a local minimum that is very far from the global one.

Instead we can use cross-entropy, also known as log-loss. It’s a simple convex function to which gradient descent can safely be applied.

J(ω) = −(1/m) · Σᵢ [ yᵢ log(hω(xᵢ)) + (1 − yᵢ) log(1 − hω(xᵢ)) ]

Mathematical formula of cross-entropy (source: Image by author)

where:

  • ω is the vector of coefficients
  • m is the number of observations
  • yᵢ is the true label of observation i and hω(xᵢ) its predicted probability (the sigmoid output)
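A sketch of this loss in NumPy, assuming `y` holds the 0/1 labels and `proba` the predicted probabilities (the epsilon clipping is my addition, to avoid log(0)):

```python
import numpy as np

def cross_entropy(y, proba, eps=1e-12):
    """Average log-loss over the m observations."""
    proba = np.clip(proba, eps, 1.0 - eps)  # keep log() finite
    return -np.mean(y * np.log(proba) + (1.0 - y) * np.log(1.0 - proba))
```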

Gradient Descent

Now that we have our loss function, we need a way to minimize it, which is equivalent to maximizing the likelihood of the classes we want to predict. For that, we use gradient descent, just like in other machine learning algorithms such as neural networks.

∂J(ω)/∂ωⱼ = (1/m) · Σᵢ (hω(xᵢ) − yᵢ) · xᵢⱼ,   then update   ωⱼ := ωⱼ − ∂J(ω)/∂ωⱼ

The computation of gradient descent (source: Image by author)

We can improve the computation by adding a step size (learning rate) α and an optimizer to our gradient updates.
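Putting the pieces together, a minimal batch gradient descent loop might look like the sketch below; the learning rate `alpha` and the number of iterations are illustrative defaults, not values from the article.

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Fit the weights w and intercept w0 by batch gradient descent on the log-loss."""
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(n_iters):
        proba = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))  # sigmoid of the linear score
        error = proba - y                            # h(x) - y
        w -= alpha * (X.T @ error) / m               # gradient step for the coefficients
        w0 -= alpha * error.mean()                   # gradient step for the intercept
    return w, w0
```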

Regularization

Regularization is a sort of calibration that keeps the model from overfitting the training set, and it can also be seen as a type of optimization. Regularization does NOT improve the performance on the training data; however, it can improve the generalization performance, which means that when we test our model on a different dataset, it has a better chance of performing well because it avoids overfitting the training data.

L2 Regularization (Ridge)

In ridge regularization, the loss function is slightly changed by adding a penalty equal to the square of the magnitude of the coefficients: J(ω) + λ · Σⱼ ωⱼ².

Here λ plays the role of the coefficients’ regulator: the higher λ is, the more large values are penalized, and if λ = 0 we are back to the plain loss function. The lower λ is, the more similar the model is to the one without regularization.
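As a sketch, only the loss (and its gradient) changes; the intercept is conventionally left unpenalized, and `lam` stands for λ:

```python
import numpy as np

def ridge_log_loss(y, proba, w, lam, eps=1e-12):
    """Cross-entropy plus lambda times the squared magnitude of the coefficients."""
    proba = np.clip(proba, eps, 1.0 - eps)
    log_loss = -np.mean(y * np.log(proba) + (1.0 - y) * np.log(1.0 - proba))
    return log_loss + lam * np.sum(w ** 2)  # in gradient descent this adds 2 * lam * w to the gradient
```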

L1 Regularization (Lasso)

In lasso regularization, the loss function is augmented with a penalty equal to the absolute value of the magnitude of the coefficients: J(ω) + λ · Σⱼ |ωⱼ|. The same explanation given for ridge applies here.

We can see below how the constraints move the true optimal point (the black one) to the point where the red curve and the blue area intersect.

On the left (lasso regularization), we can clearly see that the optimal point keeps one feature and completely penalizes the other. This is where the optimization of complexity comes in: in a higher-dimensional space, we can really save time by driving some features’ coefficients to zero.
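A sketch of the L1-penalized loss (the absolute value is not differentiable at zero, so in practice one uses subgradients, coordinate descent or proximal methods; this snippet only shows the loss itself):

```python
import numpy as np

def lasso_log_loss(y, proba, w, lam, eps=1e-12):
    """Cross-entropy plus lambda times the absolute value of the coefficients."""
    proba = np.clip(proba, eps, 1.0 - eps)
    log_loss = -np.mean(y * np.log(proba) + (1.0 - y) * np.log(1.0 - proba))
    # a subgradient of the penalty term is lam * np.sign(w)
    return log_loss + lam * np.sum(np.abs(w))
```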

Should I standardize my data?

Standardization or normalization isn’t required for logistic regression; it only makes convergence of the optimization faster, so you can run your model without any standardization. BUT if you are using lasso or ridge regularization, you SHOULD apply it first, since regularization is based on the magnitude of the coefficients, i.e. features with large coefficients will be penalized more.

The ridge/Lasso solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving.[1]

Standardization also helps us interpret the coefficients.
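A minimal standardization sketch (z-scoring each feature with statistics computed on the training set; this mirrors what scikit-learn’s StandardScaler does, but the helper itself is mine):

```python
import numpy as np

def standardize(X_train, X_test):
    """Center and scale each feature using the training set's mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std
```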

Coefficient Analysis

After scaling the data, the sign of each coefficient is simple to interpret and explain:

  • A positive coefficient means the feature is positively correlated with the target
  • A negative coefficient means it is negatively correlated with the target.

However, because of outliers, unbalanced data or sometimes the nature of the feature itself, we can get different results just by leaving out a single observation. So one way to make sure your logistic regression is robust and consistent is to use bootstrapping, so we can observe the behavior of the coefficients and check whether they are reliable or not.

If it’s a good coefficient, its bootstrap distribution will be Gaussian.

The distribution of a logReg coefficient by using bootstrap (source: Image by author)
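A sketch of this bootstrap check, assuming a `fit` function with the same signature as the gradient descent sketch above (the number of replicates is an illustrative choice):

```python
import numpy as np

def bootstrap_coefficients(X, y, fit, n_boot=500, seed=0):
    """Refit the model on resampled datasets and collect the coefficients."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, m, size=m)  # sample rows with replacement
        w, _ = fit(X[idx], y[idx])        # refit on the bootstrap sample
        coefs.append(w)
    return np.array(coefs)                # shape: (n_boot, n_features)
```

Plotting a histogram of each column of the result should show the kind of distribution pictured above.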

Remember, your coefficients should not:

  • Change their sign across bootstrap samples; otherwise something is wrong, because the coefficient is no longer interpretable.
  • Have a distribution other than a Gaussian, or at least a skewed normal, distribution.

Conclusion

What you should take away from this article:

  • Logistic regression is a sigmoid function on top of linear regression
  • The loss function for logistic regression is the cross-entropy and not the mean squared error
  • We can add regularization to perform better on a different dataset and sometimes to help the optimization
  • Do not forget to do a small analysis of your coefficients so you can be sure your model is robust

You can find the source code of logistic regression from scratch here.

[1] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2008), p. 63
