# Data distribution representation

Thread starter 5 years ago
#1
Hi, I'm not really sure how to word this, so sorry if it's quite hard to understand. I don't understand when you should use the standard deviation, mean or median to represent the distribution of data. Thank you in advance.
5 years ago
#2
(Original post by Meggy moo 1)
I don't understand when you should use the standard deviation, mean or median to represent the distribution of data. [...]

This is a difficult question both to ask and to answer precisely!

The general idea of these so-called "summary statistics" is to give a simple characterization of a data distribution using as few numbers as possible. So both the mean and the median give some idea of the location of the "centre" of the data and both the standard deviation and inter-quartile range give some idea of how spread out the data is around the centre.
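To make that concrete, here is a minimal sketch of the two competing summaries using Python's standard `statistics` module. The helper name `summarize` and the sample values are made up for illustration, and `statistics.quantiles` uses its default "exclusive" method here; other quartile conventions give slightly different numbers.

```python
import statistics

def summarize(data):
    """Return centre and spread summaries for a list of numbers."""
    # Three cut points that split the data into four equal-sized groups.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "iqr": q3 - q1,                   # inter-quartile range
    }

sample = [2, 4, 4, 5, 6, 7, 9]  # hypothetical data
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The mean and standard deviation form one pair (centre, spread); the median and inter-quartile range form the other.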

So why would we favour one simplified representation (median + inter-quartile range) over another (mean + standard deviation)?

The answer comes down to what the data looks like in the first place. Let's go through a few possibilities.

If your data looks like it has been drawn from a normal distribution, then we know that the mean and the standard deviation together exactly specify that distribution and so the sample mean and standard deviation would provide a good summary of that data sample.
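As a rough illustration of why two numbers are enough in the normal case, the sketch below fits a normal distribution to a made-up sample with `statistics.NormalDist.from_samples` and then reads a probability straight off the fitted mean and standard deviation. It assumes the data really are close to normal; nothing here checks that.

```python
from statistics import NormalDist

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]  # hypothetical measurements

# Estimate the two parameters from the sample.
fitted = NormalDist.from_samples(sample)

# Once mean and standard deviation are known, any probability follows,
# for example the share of values within one standard deviation of the mean.
within_one_sd = (
    fitted.cdf(fitted.mean + fitted.stdev)
    - fitted.cdf(fitted.mean - fitted.stdev)
)
print(f"mean={fitted.mean:.3f}, stdev={fitted.stdev:.3f}")
print(f"fraction within one stdev: {within_one_sd:.4f}")
```

The last figure comes out at about 68% whatever the sample, which is exactly the point: for a normal distribution the shape is fixed and only the location and scale vary.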

What about if your data looks as if it has come from a Poisson distribution? We know then that the mean characterizes the distribution precisely (with the standard deviation equal to the square root of the mean). You only really then need one number to characterize your data.
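The Poisson claim can be checked numerically. The sketch below computes the mean and standard deviation directly from the Poisson probability formula for an arbitrarily chosen rate of 4; the function name `poisson_pmf` and the cut-off at 100 terms are choices made for this illustration (the terms beyond that are negligibly small for this rate).

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 4.0        # assumed mean, chosen for illustration
ks = range(100)  # truncate the infinite sum; the tail is negligible here

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
sd = math.sqrt(variance)

print(f"mean={mean:.6f}, sd={sd:.6f}, sqrt(mean)={math.sqrt(mean):.6f}")
```

The standard deviation matches the square root of the mean, so quoting it separately adds no information.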

On the other hand, what if you have no clue where your data comes from (in terms of probability distributions) and it looks horrible? It might not be symmetric about the mean (i.e. it is skewed); it might have extreme outliers; it might have very fat tails (i.e. a lot of the data in the extremes of the distribution). It's then that you have to think carefully about how to provide a compact data summary!

If outliers are not a problem, then you might consider using the mean together with the standard deviation and the skewness and the kurtosis to characterize the distribution - but notice that your summary is becoming much more complex!
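For reference, here is one common way to compute those two extra numbers, sketched with plain Python. It uses the simple moment-based definitions (divide by n, no small-sample correction), which is an assumption of this example; statistical packages often apply bias corrections and so report slightly different values.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)**k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third standardized moment; 0 for perfectly symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

Positive skewness signals a longer right tail; positive excess kurtosis signals fatter tails than a normal distribution would have.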

If outliers are a problem, then notice that they can have a big effect on the mean and the standard deviation; you have to ask yourself whether the mean and the standard deviation are giving a good representation of the majority of the data. If not, then notice that the median and quartiles are not hugely affected by the values of outliers; they may give a much better data summary in this case. Equally, you may wish to use another quantile (such as a decile) to communicate what is going on in the tails of the data whilst being protected from the effect of extreme outliers.
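A tiny experiment shows the effect. The numbers below are invented: five well-behaved values, then the same values with a single extreme one appended.

```python
import statistics

clean = [10, 11, 12, 13, 14]       # hypothetical well-behaved data
with_outlier = clean + [1000]      # the same data plus one made-up extreme value

# How far each summary moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)
stdev_ratio = statistics.stdev(with_outlier) / statistics.stdev(clean)

print(f"mean moves by {mean_shift:.2f}")
print(f"median moves by {median_shift:.2f}")
print(f"standard deviation grows by a factor of {stdev_ratio:.1f}")
```

One bad value drags the mean far away from where most of the data sits and inflates the standard deviation, while the median barely moves.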

Best of all, of course, is a graphical summary of the data such as a box and whisker plot, or a rug, or a histogram.
Thread starter 5 years ago
#3
(Original post by Gregorius)
This is a difficult question both to ask and to answer precisely! [...]

Oh wow, thank you, I kind of understand.