Descriptive statistics are used to describe the basic information about the data. They help us understand some features of the data by giving short summaries about the sample. It’s like the first impression of what the data shows us. In a nutshell, descriptive statistics are metrics and quantitative analysis to briefly describe our sample. And they’re broken down into measures of central tendency and measures of variability. In this article, we’ll discuss the latter and we’ll clear the fog on famous questions about variance and standard deviation.
Suppose that we have a random variable, by definition, it can take different…
Machine learning classification metrics are not that hard to think about if the data are quite clean, neat and balanced. We can just compute the accuracy with the division of the true predicted observations by the total observation. This is not the case in general. In fact, a lot of problems in machine learning have imbalanced data (spam detection, fraud detection, detection of rare diseases …).
Say we want to create a model to detect spams and our dataset has 1000 emails where 10 are spams and 990 are not. So, we have chosen Logistic Regression to do this task…
Principal Component Analysis (PCA) is a method of dimensionality reduction, it can be used for feature extraction or representation learning. It transforms the data from a d-dimensional space into a new coordinate system of p dimensions(p≤d), and extracting the most important q variables(q << d)
First of all we need to know that PCA works only with continuous variables. So if you have a mixture of categorical and continuous variables you have to select only the non-discrete ones.
We can use PCA:
Logistic regression is a classification algorithm, which is pretty popular in some communities especially in the field of biostatistics, bioinformatics and credit scoring. It’s used to assign observations a discrete set of classes(target).
Logistic regression is strictly not a classification algorithm on its own. Because its output is a probability(a continuous number between 0 and 1), it will be only a classification algorithm in combination with a decision boundary which is generally fixed at 0.5
For example, if we have a logistic regression that has to predict whether an email is a spam or not, the output of the function…
A software and machine learning engineer