Probability and Statistics in Data Science

Victoria Akintomide
3 min read · Mar 13, 2021

Data science has been ranked as the top profession by Glassdoor, and an article in Harvard Business Review described it as the sexiest job of the 21st century. So what are the most important tools for becoming a good data scientist? Apart from programming, basic mathematical concepts such as probability and statistics are equally important. A good understanding of these concepts will help in the journey of becoming a professional data scientist.

Probability

Probability measures how likely an event is to happen. It answers the question “What is the chance this event will happen?” We often use it intuitively without even realizing it. When we ask what the chances are that it will rain in March, we are seeking the probability of rain falling in March.

Uncertainty is expected in the world we live in. Probability helps us understand and quantify the chances of events occurring. By studying patterns and trends in previously collected data, we can use probability to make informed decisions about the likelihood of events happening.

Some key concepts in probability include:

  1. Conditional Probability: This is the probability of an event (event B) occurring given that another event (event A) has occurred. For example, given that it is raining today, what is the probability that NEPA would supply power?
Figure: Formula for calculating conditional probability

2. Random Variables: A random variable is a variable whose value is a numerical outcome of a random experiment. Random variables are divided into two types:

  • Continuous Random Variable: A continuous random variable can take any value within a whole interval of numbers. An example is the height of students in a statistics class.
  • Discrete Random Variable: A discrete random variable has a countable number of possible values. The number of students in a statistics class is an example of a discrete random variable.

3. Probability Distribution: A probability distribution is a function that describes all the possible values a random variable can take within a given range, together with how likely each of them is.

For a continuous random variable, the probability distribution is described by the probability density function. For a discrete random variable, it is described by the probability mass function.

There are different probability distributions, each suited to a different data-generating process and purpose. An example is the binomial distribution, which gives the probability that a particular event occurs a given number of times over a fixed number of independent trials, given the probability of the event in each trial.
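
To make the ideas above concrete, here is a minimal Python sketch of conditional probability and the binomial probability mass function. The function names and the rain and power-supply numbers are invented for illustration and are not real data. The sketch uses the standard definitions P(B | A) = P(A and B) / P(A) and P(k successes) = C(n, k) · p^k · (1 − p)^(n − k).

```python
from fractions import Fraction
from math import comb


def conditional_probability(p_a_and_b, p_a):
    """Return P(B | A) = P(A and B) / P(A). Requires P(A) > 0."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_and_b / p_a


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical numbers: it rains on 30% of days, and on 12% of days
# it both rains and power is supplied.
p_power_given_rain = conditional_probability(Fraction(12, 100), Fraction(30, 100))
print(p_power_given_rain)  # 2/5

# Probability of exactly 3 heads in 5 flips of a fair coin.
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```

Using Fraction for the first example keeps the arithmetic exact. Summing binomial_pmf(k, n, p) over every k from 0 to n should give 1, which is a quick sanity check that the function defines a valid discrete distribution.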

Statistics

Statistics is a branch of mathematics that concerns the collection, interpretation, organization, analysis and presentation of data.

Commonly used terms in statistics include:

  • Population: The complete set of data items to be investigated.
  • Sample: A subset of the population.
  • Parameter: A numerical characteristic of the population, such as its mean.

Statistics can be divided into two main categories:

  1. Descriptive Statistics: Here, the data is summarized to describe it. The characteristics of the data itself are what is being investigated in this analysis.
  • Measure of Center: These measures attempt to describe a set of data by identifying its central position. The mean is the average of the data values. The median is the middle value of the sorted data, and the mode is the most frequently occurring value in the data set.
  • Measure of Spread: These measures describe how varied the set of observations is. The range is the difference between the maximum and minimum values in the set of data. The variance is the average squared deviation of the values from their mean. The standard deviation, the square root of the variance, measures the dispersion of the data around its mean in the same units as the data.

2. Inferential Statistics: Inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population in question.

  • Hypothesis testing is an inferential statistical technique that determines whether there is enough evidence in a data sample to infer that a certain condition holds true for an entire population.
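
The descriptive measures and the idea of hypothesis testing described above can be sketched with Python's standard library. The exam scores, the hypothesized mean of 75, and the known population standard deviation of 10 are all made-up assumptions for illustration. The test shown is a simple two-sided one-sample z-test, chosen only because it needs nothing beyond the standard library. With a sample this small and an unknown population spread, a t-test would usually be preferred.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: exam scores of 10 students.
scores = [62, 70, 70, 75, 78, 80, 82, 85, 90, 98]

# Descriptive statistics: measures of center.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Descriptive statistics: measures of spread.
data_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance, divides by n - 1
std_dev = statistics.stdev(scores)      # square root of the sample variance

print(mean, median, mode, data_range, variance, std_dev)


def z_test_two_sided(sample, hypothesized_mean, population_std):
    """Return (z, p_value) for H0: population mean == hypothesized_mean,
    assuming the population standard deviation is known."""
    n = len(sample)
    standard_error = population_std / math.sqrt(n)
    z = (statistics.mean(sample) - hypothesized_mean) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed for illustration: H0 mean of 75, known population std of 10.
z, p_value = z_test_two_sided(scores, 75, 10)
alpha = 0.05  # significance level
print("reject H0" if p_value < alpha else "fail to reject H0")
```

A p-value below the chosen significance level would count as enough evidence in the sample to reject the hypothesized population mean. Otherwise the sample is considered consistent with it.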

A good comprehension of the mathematical concepts in data science helps a data scientist gain deeper insights into the data and choose the best approach to the problems at hand.
