The Student Room Group

Help understanding boxplots and outliers (on SPSS)

I made two boxplots on SPSS for length vs sex.
For males, I have 32 samples, and the lengths range from 3cm to 20cm, but the boxplot is showing 2 outliers that are above 30cm (the units on the axis only go up to 20cm, and there are 2 outliers above 30cm with a circle next to one of them).

Could someone explain how to find outliers on a boxplot and if this sounds right or if I've made a mistake somewhere? And when writing about the number of samples included in my boxplot, would I include the outliers and say all 32 points were used to make the boxplot or just 30?
Thanks
In reply to Petulia:

SPSS uses the following definition to detect outliers on boxplots (http://www.unige.ch/ses/sococ/cl/spss/concepts/outliers.html). However, the rule is fairly liberal, so if you have a big enough sample it will flag some points as outliers purely by chance. We really cannot tell you what to do as we don't know why you're making a box plot, what it's meant to show, etc.
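For anyone who wants to see the rule spelled out, here is a minimal sketch of the usual boxplot convention (Tukey's fences). As far as I know, SPSS draws a circle for points between 1.5 and 3 box lengths (IQRs) from the edge of the box and an asterisk for points more than 3 box lengths away. The function names, the quartile method, and the data are my own assumptions for illustration; SPSS works from Tukey's hinges, which can give slightly different quartiles in small samples.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: 'inclusive' quartiles; SPSS boxplots use Tukey's hinges,
    # so the exact cut-offs may differ slightly for small samples.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(values):
    """Split flagged points into 'outliers' (1.5 to 3 IQR) and 'extremes' (> 3 IQR)."""
    lo15, hi15 = tukey_fences(values, 1.5)
    lo3, hi3 = tukey_fences(values, 3.0)
    outliers = [v for v in values if (lo3 <= v < lo15) or (hi15 < v <= hi3)]
    extremes = [v for v in values if v < lo3 or v > hi3]
    return outliers, extremes


# Made-up lengths in cm, not the poster's data.
lengths = [3, 5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 20]
print(tukey_fences(lengths))
print(classify(lengths))
```

Note that every flagged value is one of the values in the data set; the rule only decides which of the existing points get drawn separately from the whiskers.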
Reply 2
In reply to iammichealjackson:

So are the outliers supposed to be values from within my data range or are they calculated differently? I'm confused because I don't understand where the outlier values are coming from really - are they numbers from your data samples or is there a specific way to calculate them? Sorry I'm just having trouble understanding some of the basic maths here.
In reply to Petulia:

So from Tabachnick & Fidell (2007, p. 73):


There are four reasons for the presence of an outlier. First is incorrect data entry. Cases that are extreme should be checked carefully to see that data are correctly entered. Second is failure to specify missing-value codes in computer syntax so that missing-value indicators are read as real data. Third is that the outlier is not a member of the population from which you intended to sample. If the case should not have been sampled, it is deleted once it is detected. Fourth is that the case is from the intended population but the distribution for the variable in the population has more extreme values than a normal distribution. In this event, the researcher retains the case but considers changing the value on the variable(s) so that the case no longer has as much impact. Although errors in data entry and missing values specification are easily found and remedied, deciding between alternatives three and four, between deletion and retention with alteration, is difficult.


So there isn't really a standard definition of an outlier, nor a standard way of detecting them. It's quite usual to use a criterion based on z scores, removing any score whose z value is more than a certain amount, but this isn't a good rule to always apply! There's lots of information about it on Google anyhow...
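As a rough sketch of the z-score criterion mentioned above: standardise each score and flag those whose absolute z value exceeds a cut-off. The cut-off of 3 and the data are assumptions for illustration only; the thread does not say which value to use, and the rule inherits the problem that the outlier itself inflates the mean and standard deviation.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold (threshold is an assumption)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - m) / s > threshold]


# Made-up scores with one very large value.
scores = [8, 9, 10, 10, 11, 12, 9, 10, 11, 10, 9, 11, 10, 10, 40]
print(z_score_outliers(scores))
```

With only a handful of cases the largest possible |z| is small, so a cut-off of 3 may never flag anything, which is one reason this isn't a rule to apply blindly.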
Reply 4
In reply to iammichealjackson:

Thanks for this reply. I'm reading Discovering Stats using SPSS by Andy Field now (I think it's the 4th edition), and you were right, it is really good at breaking this stuff down.
Reply 5
Finally worked out where I was going wrong. The numbers shown next to the points on the boxplot weren't the actual outlier values, they were the case numbers. So I was supposed to go back to my data set and check what value was in case (row) 32, and that value is the outlier.
Discovering Stats using SPSS is proving to be very helpful - I had been Googling this issue for over a week with no luck until I started using this book. Would definitely recommend!
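To illustrate the case-number point, here is a small sketch that reports flagged points as (case number, value) pairs, where the case number is the 1-based row position in the data set. The function name, the 1.5 IQR rule with 'inclusive' quartiles, and the data are my assumptions for illustration, not SPSS's exact computation.

```python
from statistics import quantiles


def flagged_cases(values, k=1.5):
    """Return (case_number, value) pairs for points beyond the k*IQR fences.

    Case numbers are 1-based row positions, like the row numbers in the data view.
    """
    # Quartiles are computed from ALL values, including any that end up flagged.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values, start=1) if v < lower or v > upper]


# Hypothetical lengths in cm; row order matters because the labels are row numbers.
lengths = [7, 8, 9, 9, 10, 10, 11, 3, 12, 20]
for case, value in flagged_cases(lengths):
    print(f"label on plot: {case}, actual value: {value} cm")
```

In this sketch the labels would be the case numbers, while the lengths themselves stay within the range of the data. The flagged points are still part of the data the box is computed from, so the plot is based on every case, not on the cases left over after taking the outliers out.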
