# Data distribution representation


Hi, I'm not really sure how to word this, so sorry if it's quite hard to understand. I don't understand when you should use the standard deviation, the mean or the median to represent the distribution of data. Thank you in advance.


(Original post by **Meggy moo 1**) Hi, I'm not really sure how to word this, so sorry if it's quite hard to understand. I don't understand when you should use the standard deviation, the mean or the median to represent the distribution of data. Thank you in advance.

The general idea of these so-called "summary statistics" is to give a simple characterization of a data distribution using as few numbers as possible. So both the mean and the median give some idea of the location of the "centre" of the data and both the standard deviation and inter-quartile range give some idea of how spread out the data is around the centre.
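To make the four summary statistics above concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up purely for illustration, and the quartiles use the module's default ("exclusive") method; other software may interpolate quartiles slightly differently.

```python
import statistics

# Hypothetical sample, invented for illustration only
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)      # centre: arithmetic average
median = statistics.median(data)  # centre: middle value of the sorted data
sd = statistics.stdev(data)       # spread: sample standard deviation (n - 1 divisor)

# Quartiles split the sorted data into four equal parts
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                     # spread: inter-quartile range

print(f"mean={mean:.2f}  sd={sd:.2f}")
print(f"median={median}  IQR={iqr}")
```

The first printed line is the "mean + standard deviation" summary and the second is the "median + inter-quartile range" summary of the same data.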

So why would we favour one simplified representation (median + inter-quartile range) over another (mean + standard deviation)?

The answer comes down to what the data looks like in the first place. Let's go through a few possibilities.

If your data looks like it has been drawn from a normal distribution, then we know that the mean and the standard deviation together exactly specify that distribution and so the sample mean and standard deviation would provide a good summary of that data sample.
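As a rough check of that claim, the sketch below simulates normal data, keeps only the sample mean and standard deviation, and then recovers the quartiles from those two numbers alone. The parameters (mean 50, standard deviation 10), the sample size and the seed are arbitrary assumptions; `NormalDist` needs Python 3.8 or later.

```python
import random
from statistics import NormalDist, quantiles

random.seed(0)  # fixed seed so the run is repeatable
# Simulated sample from a normal distribution (parameters are assumptions)
sample = [random.gauss(50, 10) for _ in range(10_000)]

# Summarise the whole sample with just two numbers
fitted = NormalDist.from_samples(sample)

# Quartiles implied by the two-number summary...
model_quartiles = [fitted.inv_cdf(p) for p in (0.25, 0.5, 0.75)]
# ...versus quartiles read directly off the data
empirical_quartiles = quantiles(sample, n=4)

for m, e in zip(model_quartiles, empirical_quartiles):
    print(f"from mean and sd: {m:6.2f}   from data: {e:6.2f}")
```

If the data really are close to normal, the two columns should agree closely; a large disagreement is a hint that the mean and standard deviation are not telling the whole story.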

What about if your data looks as if it has come from a Poisson distribution? We know then that the mean characterizes the distribution precisely (with the standard deviation equal to the square root of the mean). You only really then need one number to characterize your data.
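The square-root relationship can be checked numerically from the Poisson probability formula. This is only a sketch: the rate `lam = 4.0` is an arbitrary choice, and the sums are cut off at 60 terms, where the remaining probability is negligible for a rate this small.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4.0       # assumed mean rate, chosen only for illustration
ks = range(60)  # truncation point; the tail beyond this is negligible here

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
sd = math.sqrt(variance)

print(f"mean={mean:.6f}  sd={sd:.6f}  sqrt(mean)={math.sqrt(mean):.6f}")
```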

On the other hand, what if you have no clue where your data comes from (in terms of probability distributions) and it looks horrible? It might not be symmetric about the mean (i.e. it is skewed); it might have extreme outliers; it might have very fat tails (i.e. a lot of the data lies in the extremes of the distribution). It's then that you have to think carefully about how to provide a compact data summary!

If outliers are not a problem, then you might consider using the mean together with the standard deviation and the skewness and the kurtosis to characterize the distribution - but notice that your summary is becoming much more complex!
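For anyone who has not met skewness and kurtosis before, here is one common way to compute them, as a sketch. It uses the simple moment-based definitions (population standard deviation, no small-sample correction), and the helper names and data are hypothetical; statistics packages often apply bias corrections, so their numbers may differ a little.

```python
import statistics

def skewness(xs):
    """Average cubed z-score: 0 for symmetric data, positive for a long right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Average fourth-power z-score minus 3, so a normal distribution scores 0."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]          # made-up symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # made-up data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

Positive excess kurtosis points towards fatter tails than a normal distribution, negative towards thinner ones.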

If outliers are a problem, then notice that they can have a big effect on the mean and the standard deviation; you have to ask yourself whether the mean and the standard deviation are giving a good representation of the majority of the data. If not, then notice that the median and quartiles are not hugely affected by the values of outliers; they may give a much better data summary in this case. Equally, you may wish to use another quantile (such as a decile) to communicate what is going on in the tails of the data whilst being protected from the effect of extreme outliers.
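A small made-up example shows the effect. Below, one value in an otherwise tidy data set is replaced by an extreme outlier (imagine 53 mistyped as 530); the data and the `summary` helper are hypothetical and only meant as a sketch.

```python
import statistics

clean = [48, 49, 50, 50, 51, 52, 53]
with_outlier = [48, 49, 50, 50, 51, 52, 530]  # last value replaced by an outlier

def summary(xs):
    """Return both candidate summaries of the same data."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.mean(xs),
        "sd": statistics.stdev(xs),
        "median": statistics.median(xs),
        "iqr": q3 - q1,
    }

for name, xs in (("clean", clean), ("with outlier", with_outlier)):
    s = summary(xs)
    print(f"{name:>12}: mean={s['mean']:.1f} sd={s['sd']:.1f} "
          f"median={s['median']} iqr={s['iqr']}")
```

The mean and standard deviation move a long way from where most of the data sits, while the median and inter-quartile range are the same for both lists. Calling `statistics.quantiles(xs, n=10)` instead would give deciles if you want to describe the tails.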

Best of all, of course, is a graphical summary of the data such as a box and whisker plot, a rug plot, or a histogram.
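Even without a plotting library you can get a quick look at the shape of the data. The sketch below prints a crude text histogram; the function name, the bin width and the data are all assumptions made for illustration, and a proper plotting package will give you something much nicer.

```python
from collections import Counter

def text_histogram(xs, bin_width):
    """Return one line per bin, e.g. '  20-30  | #####', from lowest to highest bin."""
    counts = Counter(int(x // bin_width) for x in xs)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        lo = b * bin_width
        # Empty bins are kept so that gaps in the data stay visible
        lines.append(f"{lo:>4}-{lo + bin_width:<4}| {'#' * counts[b]}")
    return lines

# Made-up data: a cluster in the 20s plus one value far out on the right
data = [12, 15, 21, 22, 23, 25, 28, 31, 34, 58]
lines = text_histogram(data, bin_width=10)
print("\n".join(lines))
```

The empty bin followed by a lone mark on the last line is the sort of feature (a possible outlier) that a single number like the mean would hide.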


(Original post by **Gregorius**) This is a difficult question both to ask and to answer precisely!
