Introductory Statistics – Chapter 3: Measuring data


July 24, 2019 | By Kailee Schamberger



Welcome to the video summary series for the Perdisco Introductory Statistics textbook. In addition to chapter summary videos such as this one, Introductory Statistics also offers podcasts, a virtual tutor, e-learning homework activities with anti-cheat and auto-grade functionality, and detailed instructor resources. Find out more at perdisco.com/introstats. For now, over to the author.

Hi again, I'm Michelle Thompson, and welcome to the next summary in the Perdisco Introductory Statistics series. In this one we're going over measuring data. In particular, we'll be covering measures of center, measures of variation and measures of relationship.

So far we've looked at data and how it can be presented. When you've collected a bunch of raw data, the raw data on its own probably won't help you much, but you can present the data, providing a visual summary of it. For example, raw data can be summarized in a histogram, and that's much better: we now have a better feeling for the data. But feelings can be subjective, and pictures of data, as presentable as they are, are not susceptible to rigorous statistical analysis. The upshot is that we need to measure data. Measurements of data are less subjective and can be mathematically manipulated in rigorous ways.

Measuring the center of data is all about numerically summarizing where the data values are. A measure of center is typically a number, and that number gives us some indication of where the data values lie. A very common measure of center is the mean. The mean of a set of numerical data is just the statistical average of all the numbers in that data. Say, for example, we have a sample of 5 test scores. To calculate the mean, we add the 5 values together and divide by 5, the number of data values; in this case the mean is 11. At this point it's worth noting that the mean takes into account all five data values: if you change any one data value in that set, you will definitely affect the mean. So the mean could be said to be a measure of center that is sensitive to changes in the data.

Another common measure is the median. This is the data value that is ranked in the middle when the data values are placed in numerical order. For the data set we just saw, we can rearrange the values so they are in order; the middle value is the third value, 12, and so this is the median. The final measure of center is the mode, and this is the most commonly occurring data value in the set. In the data set of five values, the value 12 occurred twice, which is more than any other value, so the mode in this case is 12.

So those are the three main measures of center: the mean, the median and the mode. The mean and the median are the most commonly used measures, and these two measures will of course provide different summaries of your data, as we just saw in the example. Sometimes they'll be very different, and this is most pronounced with skewed data. For example, for the data in the histogram shown in the video, the mean will be less than the median. This is because, as we mentioned earlier, the mean is sensitive to all the data, and in this case is sensitive to the very low values, which drag the mean down. So when you use measures of center, the measure you choose can have an impact on the sort of summary you provide for that data.
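To make the three measures concrete, here is a minimal Python sketch using the standard library's statistics module. The transcript doesn't list the five test scores (they only appear on a slide in the video), so the values below are hypothetical, chosen purely so they reproduce the quoted results: a mean of 11, a median of 12 and a mode of 12.

```python
import statistics

# Hypothetical sample of 5 test scores (the actual values are only shown on a
# slide in the video); chosen to match the quoted mean, median and mode.
scores = [3, 12, 12, 13, 15]

mean = statistics.mean(scores)      # sum of the values divided by how many there are
median = statistics.median(scores)  # middle value once the data are placed in order
mode = statistics.mode(scores)      # most frequently occurring value

print(mean, median, mode)  # 11 12 12
```

Note how changing any one of the five values would change the mean, but could leave the median and mode untouched, which is the "sensitivity" point made above.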
Whereas measures of center tell you where the data values are, measures of variation tell you how spread out they are. This matters because values that are spread out are less consistent; that is, they will vary more from sample to sample, which means we can't be as confident in the data when there is more variation. So variation tells you something about how much you can use your data to infer things.

The crudest measure of variation is the range. The range is simply the difference between the maximum and minimum values in the data set. For example, say we now have a sample of seven test scores: the maximum value is 99 and the minimum value is 41, so the range is the difference between these two values, which is 58.

The maximum and minimum are two important values when measuring variation. Three other important values are the three quartiles in the data. The three quartiles divide the data up into four quarters, hence the name. The second quartile is the value ranked in the middle; that is, it's the median, and it divides the data set into two halves. The first quartile is then the middle value in the bottom half, and the third quartile is the middle value in the top half. For the seven test scores we just saw, the quartiles split the data up in this way. The three quartiles, together with the minimum and maximum, form what is called the five-number summary. The five-number summary can be represented graphically with a thing called a box plot. Now, this talk is just a summary, so I won't go into detail here on exactly how the box plot is constructed; if you want to read more about how to construct or interpret a box plot, you can read all about them in the Perdisco Introductory Statistics textbook. The difference between the third quartile and the first quartile is known as the interquartile range, so for the seven test scores the interquartile range is the difference between 87 and 52, which is 35 in this case.

So now we've looked at ranges, interquartile ranges and box plots. They all provide a fairly rudimentary measure of variation in data, but the most common measures of variation are the variance and the standard deviation. These measures are more complicated: like the mean, their formulas take into account all the values in a data set. The standard deviation is actually just the square root of the variance, so for now we'll focus on the formula for the variance and where that formula comes from. The idea is that the variance measures the average distance from each point to the middle of the data set, and here we use the mean as our measure of the middle. However, it turns out that the best way to measure this average distance is to square the difference between each data value and the mean. So the variance is calculated by squaring the difference between each value and the mean, adding these squared differences up, and then dividing the sum by n - 1; that is, you divide by one less than the number of data values. We won't go into exactly why we divide by n - 1 instead of n to get the average distance, but rest assured that the formula works better that way. As I mentioned earlier, the standard deviation is the square root of the variance, and in terms of measures of variation, the standard deviation is the single most commonly used measure.
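Here is a rough Python sketch of these measures of variation. The transcript only quotes the minimum (41), maximum (99), first quartile (52) and third quartile (87) of the seven test scores, so the three middle values below are hypothetical fill-ins. The quartiles are computed with the convention described above (the median splits the data, and Q1 and Q3 are the middle values of the bottom and top halves), and the variance uses the n - 1 divisor.

```python
import math

# Seven test scores, sorted. Only min (41), max (99), Q1 (52) and Q3 (87) are
# quoted in the summary; the other three values are hypothetical fill-ins.
scores = sorted([41, 52, 60, 68, 75, 87, 99])
n = len(scores)  # this sketch assumes an odd number of values

# Range: difference between the maximum and minimum values.
data_range = max(scores) - min(scores)         # 99 - 41 = 58

# Quartiles, using the "middle value of each half" convention described above.
q2 = scores[n // 2]                            # median: the 4th of 7 values
lower_half = scores[: n // 2]                  # values below the median
upper_half = scores[n // 2 + 1 :]              # values above the median
q1 = lower_half[len(lower_half) // 2]          # 52
q3 = upper_half[len(upper_half) // 2]          # 87
iqr = q3 - q1                                  # 87 - 52 = 35

# Variance: squared differences from the mean, summed, divided by n - 1.
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
std_dev = math.sqrt(variance)                  # standard deviation

print(data_range, q1, q2, q3, iqr, round(std_dev, 2))
```

The five-number summary used by a box plot is just (min, q1, q2, q3, max) from this calculation.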
We've now looked at measuring the center and variation of data for one variable. Let's turn our attention to data for two variables and measuring the relationship between them. Recall that the way we present the relationship between two numerical variables is through a scatter plot, but as with data for one variable, such a presentation can be subjective and not particularly rigorous. What we'll do here is show a way of measuring the strength of any linear relationship between two numerical variables, and also a way of measuring the nature of that relationship.

The strength of a linear relationship between two variables is measured by a thing called correlation. There is a formula for the correlation between two sets of data; it looks fairly complicated, and it is, but the important thing is not to be able to calculate a correlation, since we usually get software to do that for us. The important thing is to be able to interpret what the correlation means. The key points are these: the correlation will always be somewhere between -1 and 1. If it's positive, there is a positive association between the two variables, and the scatter plot slopes up to the right. A negative correlation means a negative association, and the scatter plot slopes down to the right. The closer the correlation is to either of the two extremes, that is, the closer it is to -1 or 1, the stronger the linear relationship; a correlation close to zero indicates only a weak linear relationship.

Now, to get a feeling for the correlation as a measure of linear relationship, let's try a question from the Perdisco e-workbook. In this question there is an investigation into any relationship between the salary paid to a CEO and the productivity of that CEO. A scatter plot of collected data is given, and based on this you're asked to nominate a suitable value for the correlation. Of the options given, the only possible correlation here is 0.67, so we'll submit that, and we then get personalized feedback and an explanation for the question.

So that's measuring the strength of a linear relationship between two variables. As for the nature of such a relationship, we can always derive a linear equation called the line of best fit: a line that takes into account all the points on the scatter plot and best fits them, hence the name. You can actually work out what the line is, and there is a formula for the line of best fit (a short sketch of these calculations appears at the end of this summary). The important thing to note is that, like any straight line, the equation for the line of best fit has two coefficients, and the coefficients for this line are determined by all of your scatter plot data. To get an idea of how well the line of best fit describes the relationship, the video shows some example scatter plots and the lines fitted to them.

So that was measuring data. The key topics were measures of center, measures of variation and measures of relationship.
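To close, here is a small Python sketch of the two calculations just described: the correlation and the two coefficients of the line of best fit. The CEO salary and productivity numbers below are hypothetical (the actual data is only shown on the scatter plot in the video); the point is the shape of the calculation, in which both measures are built from sums of deviations from the means.

```python
import math

# Hypothetical paired data (the real CEO salary/productivity values are only
# shown on the scatter plot in the video).
x = [50, 60, 75, 90, 110, 130]   # e.g. CEO salary
y = [55, 52, 70, 66, 88, 90]     # e.g. productivity score

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of deviations used by both the correlation and the line of best fit.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Correlation: always between -1 and 1; positive means the scatter plot
# slopes up to the right, negative means it slopes down to the right.
r = sxy / math.sqrt(sxx * syy)

# Line of best fit y = a + b*x: its two coefficients are determined by all
# of the scatter plot data.
b = sxy / sxx            # slope
a = mean_y - b * mean_x  # intercept

print(round(r, 2), round(a, 2), round(b, 2))
```

In practice you would let statistical software do this for you, as the summary says; the sketch is only meant to show where the interpretation of the correlation and the two coefficients comes from.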