Non-Normal Distribution in Statistics – Skewness and Kurtosis (3-9)

July 12, 2019 By Kailee Schamberger



Now that I have explained to you the
ubiquity of the normal distribution, its regular appearance in human measurements,
you may begin to hope or even expect that all of the distributions that we
will encounter will be normal curves. If that is your expectation, you will
have to get used to disappointment. (Princess Bride reference there.) Because
many curves, perhaps most curves, are not normal distributions, we need a way to
talk about the shape of distributions when they differ from normality. The
first difference that we may find is that the scores in the distribution are
more spread out than we would have expected, or we may find the scores are
more closely packed together than we expected. The name for the peakedness or
flatness of a curve is kurtosis. When the scores are very close together,
the curve becomes peaked. We call this a "leptokurtic" curve; think of the
scores leaping up: leptokurtic. When the scores are very spread out, the curve
becomes flat like a plate, and we call this platykurtic. "Plat" rhymes with flat.
A platykurtic curve is flattened, in the shape of a plate. A normal curve is
mesokurtic. Its kurtosis is medium. So kurtosis can be described as leptokurtic
(tall), platykurtic (flat), or mesokurtic (medium). Kurtosis is caused by the
variability in the distribution.

Another thing that can happen to a curve is that
the scores are pulled out in only one direction. When the scores are dragged down (or
rather, out) in only one direction, this creates a skew in our curve. Therefore,
we need to talk about the skewness of our distribution. Negatively skewed
distributions have a higher than expected frequency of high or extreme
scores on the right, and the tail is pulled out to the left end of the number
line on the x-axis. For example, if we were interested in the running speeds of
football players, we might find a lot of very fast players – high scores, but only a
few slower runners – low scores. Skewness is always caused by outliers in the
direction of the tail. In a positively skewed distribution, the higher-than-expected
frequencies are on the low end of the curve. The tail is pulled out toward
the right or positive end of the number line. If we were measuring reaction time,
we would expect to have a large number of very quick responses – low scores, and
only a few slower responses, taking more time, further up the positive end of that
scale. Skewed distributions are not normal. How can you remember which
direction is positive or negative when we talk about skewness? Stats cow tells
us that the skew is in the tail. Skewness is caused by outliers, extreme scores in
the tail of the distribution. The direction in which the tail is pulled out
(positive or negative) is the direction of the skew. Here are two curves. The first
one is positively skewed, and the second is negatively skewed. The top curve is
positively skewed because the tail is pulled out on the right, or the positive
direction of the number line. The bottom curve is negatively skewed: the tail is
pulled out on the negative, or left, end of the number line. In both of these curves,
you can see what happens to the mean and the median in the case of skewness. Both of
them are pulled in the direction of the outlier, but the mean is pulled further. That
is because the mean is more susceptible to the outlier that is causing the skewness.
Mathematically, we can calculate a measure of skewness by comparing the mean and the
median, and this will give us a value that we can use to quantify the skewness of our
curve. But there are other things that can go wrong with our normal curve!
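
Before moving on, the mean-versus-median comparison can be sketched in a few lines of Python. This assumes one common textbook formula, Pearson's second (median) skewness coefficient, 3 × (mean − median) / standard deviation. The lecture does not say which formula it has in mind, and the scores below are made up for illustration.

```python
import statistics

def pearson_median_skewness(scores):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev.

    Assumed formula for illustration; positive values mean the tail is
    pulled out to the right, negative values to the left.
    """
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    spread = statistics.stdev(scores)  # sample standard deviation
    return 3 * (mean - median) / spread

# Hypothetical scores: most are low, one outlier drags the tail to the right.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 13]
# Mirror image of the same scores: the outlier now sits on the left.
left_tailed = [14 - x for x in right_tailed]
# Symmetric scores: mean and median coincide.
symmetric = [1, 2, 3, 4, 5]

for name, scores in [("right-tailed", right_tailed),
                     ("left-tailed", left_tailed),
                     ("symmetric", symmetric)]:
    print(name,
          "mean =", statistics.mean(scores),
          "median =", statistics.median(scores),
          "skewness =", round(pearson_median_skewness(scores), 2))
```

With these made-up scores, the mean (4) sits above the median (3) in the right-tailed set, so the coefficient is positive. Mirroring the scores flips the sign, and the symmetric set gives zero.
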
Instead of having one peak, sometimes we have two peaks. This occurs when there is more than one most frequently occurring score. We call this type of curve bimodal. A curve can be bimodal when there really are two most frequently occurring scores.
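
One way to check for more than one most frequently occurring score is to ask Python for every mode rather than just one. The sketch below uses `statistics.multimode` from the standard library (Python 3.8 or later) on made-up scores. The helper name `describe_modes` is an illustrative choice, and "unimodal" is a label added here for the one-peak case.

```python
import statistics

def describe_modes(scores):
    """Return the most frequent score(s) and a label based on how many there are."""
    modes = statistics.multimode(scores)  # all values tied for highest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

# Hypothetical scores with a single most frequent value (5).
one_peak = [3, 4, 4, 5, 5, 5, 6, 6, 7]
# Hypothetical scores where 3 and 7 are tied for most frequent.
two_peaks = [2, 3, 3, 3, 4, 5, 6, 7, 7, 7, 8]

print(describe_modes(one_peak))
print(describe_modes(two_peaks))
```

This check only catches peaks that are exactly tied. Real data with two humps of unequal height would not be flagged, so a histogram is still the better first look.
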
For instance, when is the best time to go fishing? At what time of day will you
catch the most fish? Probably early in the morning, and then in the evening when the sun is going down. In the middle of the day, when the sun is at its height,
you will catch fewer fish. So if we plot the number of fish caught, we will see a
peak in the morning at dawn, and another peak in the evening at dusk. This would
be a true bimodal distribution. On the other hand, we might have a bimodal
distribution when there are actually two distributions overlying each other. When
we had both males and females on the football field and we were comparing
heights, we saw that there was a distribution for males and another
distribution for females. The distributions overlapped – some females
were taller than some males – but the average heights were taller for males.
They really were two distinct distributions that should be separated
before being analyzed.

A multimodal distribution has three or more most
frequently occurring scores. You may wonder why we don't call it a
trimodal distribution or a quadrimodal distribution (four peaks). The answer is
that when we start getting three, four, or five modes, there is something very wrong in our data set. Three or more modes is multimodal, and it's messed up. We need to figure out what is going on before we try to analyze those data.

Rectangular
distributions have the same frequency for all scores. If you roll a single die
100 times, how many times do you expect to get a one? About one-sixth of the time.
In fact, you would expect to get each of the scores, one through six, approximately one-sixth of the time. That is a rectangular distribution. Once you add a
second die, however, your distribution will begin to look more normal.
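
The single-die and two-dice cases can be checked by counting outcomes directly. The sketch below enumerates every equally likely roll instead of simulating 100 throws, so the counts are exact. `sum_distribution` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product

FACES = range(1, 7)  # a fair six-sided die

def sum_distribution(n_dice):
    """Count how many equally likely rolls produce each total for n_dice dice."""
    return Counter(sum(roll) for roll in product(FACES, repeat=n_dice))

one_die = sum_distribution(1)
two_dice = sum_distribution(2)

# Expected number of times each face appears in 100 rolls of one die.
expected_per_face = 100 / 6

print("One die (rectangular):")
for total in sorted(one_die):
    print(f"{total:2d} {'#' * one_die[total]}")

print("Two dice (peaked in the middle):")
for total in sorted(two_dice):
    print(f"{total:2d} {'#' * two_dice[total]}")
```

One die gives every score the same count, which is the rectangle. Two dice pile up in the middle: a total of 7 can be made six ways, while 2 and 12 can each be made only one way.
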
Rectangular distributions have exactly the same frequency for all scores, and do
not have tails.

Before we conclude, there is one more thing that I want to tell
you about the normal curve, and that is that the normal curve can be overlaid
with a number line, and this is where things get really interesting and quite
useful. If we have a normal curve, we can add the value of the mean right in the
middle where it belongs, and in this example we're going to imagine that our
mean is 50, so then we could lay out a number line with four point delineations.
Half of our scores will always be above the mean, or above 50. The remaining half
of the scores will always be below 50. That is what a measure of central
tendency tells us. It is the point at which half of the scores fall above and
half of the scores fall below. The next thing that we could do is measure
the proportion of the scores that fall within a certain range above or below
the mean. The
proportion is the area under the normal curve that corresponds to the
relative frequency of those scores. To better understand this, let's return to
our picture of the people standing on the football field. Remember that
everyone (100%) is standing below the rope that represents
our distribution. We want to know the proportion of people who are between
five foot six and five foot nine inches tall. We ask everyone who is in those
rows (five foot six, seven, eight, and nine) to stay where they are, and everyone else
to please leave the field. So how many people are in these four rows? Divide the
number of people in the four rows by the total number of people and you have a
proportion. This is the proportion of people who are in that range underneath
the distribution. It is also the relative frequency of
people in that range, and this is going to become a very useful technique when
we talk about z-scores. But for now, just remember what we've learned about the
frequency table, and specifically how the relative frequency relates to what we
know about the normal curve.
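
The football-field procedure amounts to a short calculation on a frequency table. The sketch below uses made-up counts for 100 people, with heights in whole inches (66 through 69 inches is five foot six through five foot nine). The numbers and the helper name `proportion_in_range` are assumptions for illustration only.

```python
# Hypothetical frequency table: height in inches -> number of people in that row.
height_counts = {
    62: 1, 63: 2, 64: 5, 65: 9, 66: 13, 67: 20,
    68: 20, 69: 13, 70: 9, 71: 5, 72: 2, 73: 1,
}

def proportion_in_range(counts, low, high):
    """Relative frequency of scores from low to high, inclusive."""
    total = sum(counts.values())
    in_range = sum(n for score, n in counts.items() if low <= score <= high)
    return in_range / total

total_people = sum(height_counts.values())
share = proportion_in_range(height_counts, 66, 69)
print(f"{share:.2f} of {total_people} people are between 5'6\" and 5'9\"")
```

Here 66 of the 100 people stand in the four rows, so the proportion, which is also the relative frequency, is 0.66.
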