The Student Room Group

Is the mean appropriate in this situation?

I'm currently working on a research project for uni that involves a small amount of statistical analysis. To avoid being too technical, I've renamed the variables A and B.

I need to investigate the effect of variable A on variable B.
Variable A is categorical and ordinal, with 6 categories.
Variable B is numerical and discrete, with 3 intervals (0, 1 and 2).

I have 5 years' worth of data on variable B, so I'm going to compare variable A to the average of variable B for each year. However, do I use the median or the mean?

Variable B is heavily skewed for each year, with 85% of the data being 0. This means that the median average is ALWAYS 0. Is the mean appropriate to use in this situation, even though the data isn't normally distributed?

Thanks in advance!
Original post by Willisme1
I'm currently working on a research project for uni that involves a small amount of statistical analysis. To avoid being too technical, I've renamed the variables A and B.

I need to investigate the effect of variable A on variable B.
Variable A is categorical and ordinal, with 6 categories.
Variable B is numerical and discrete, with 3 intervals (0, 1 and 2).


Do you mean that B is numerical with 3 possible values (0, 1 and 2)?


I have 5 years' worth of data on variable B,


How many individual observations do you have per year?


so I'm going to compare variable A to the average of variable B for each year. However, do I use the median or the mean?


It's not clear to me why you would use a summary statistic at this point. Perhaps this depends on the research question (which it would be helpful for you to state), but something like ordinal logistic regression suggests itself to me, treating the outcome B as ordinal.

Variable B is heavily skewed for each year, with 85% of the data being 0. This means that the median average is ALWAYS 0. Is the mean appropriate to use in this situation, even though the data isn't normally distributed?


It depends strongly on the type of analysis that you go for. If you choose ordinal logistic regression, the problem goes away. If 85% of the outcome is zero, it would be very helpful to understand the meaning of the difference between the values 1 and 2 of B - would an analysis that compares zero with non-zero be more helpful, for instance?
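
To put rough numbers on the mean/median question, here is a small Python sketch. The sample is invented to match the "85% zeros" description (85 zeros, 10 ones, 5 twos); it is not the poster's data, and the name `share_nonzero` is just an illustrative label for the zero versus non-zero comparison.

```python
from statistics import mean, median

# Invented sample matching the 85% zeros described in the thread:
# 85 zeros, 10 ones, 5 twos (NOT the poster's real data).
b = [0] * 85 + [1] * 10 + [2] * 5

summary = {
    "median": median(b),                              # stuck at 0 when most values are 0
    "mean": mean(b),                                  # still moves when 1s and 2s change
    "share_nonzero": sum(v > 0 for v in b) / len(b),  # zero vs non-zero comparison
}
print(summary)
```

With data like this the median cannot distinguish between groups, whereas the mean and the share of non-zero values both can. Whether either is the right summary still depends on the analysis chosen.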
Reply 2
Thanks for the reply, I did some thinking about how I'm tackling this and I'm very confused by what I should be doing.

I'm investigating the effect of parity (litter number) on the number of piglets born alive per litter in breeding sows. One of the things I've suggested is that parity might correlate with the number of stillbirths per litter.

Parity is categorical but as there are 8 categories, I have chosen to treat it as continuous (as recommended by my lecturers).
Stillbirths per litter is numerical and discrete, with values of either 0, 1 or 2.

However, the distribution of stillbirths per litter is strongly skewed to one side as previously mentioned. If I just treat the 5 years as one block of data, and run a Kruskal-Wallis test, I get a significant result (P < 0.0001) but I'm not sure whether I've done that right.
Original post by Willisme1
Thanks for the reply, I did some thinking about how I'm tackling this and I'm very confused by what I should be doing.

I'm investigating the effect of parity (litter number) on the number of piglets born alive per litter in breeding sows. One of the things I've suggested is that parity might correlate with the number of stillbirths per litter.

Parity is categorical but as there are 8 categories, I have chosen to treat it as continuous (as recommended by my lecturers).
Stillbirths per litter is numerical and discrete, with values of either 0, 1 or 2.

However, the distribution of stillbirths per litter is strongly skewed to one side as previously mentioned. If I just treat the 5 years as one block of data, and run a Kruskal-Wallis test, I get a significant result (P < 0.0001) but I'm not sure whether I've done that right.



There are a number of approaches to data like this - and the choice of approach depends on how much data you have and whether you feel confident in using the methods!

1) The Kruskal-Wallis test is a good start. It's a non-parametric test (it doesn't assume that the outcome follows a particular distribution, such as the normal) that tells you whether the distribution of the outcome (number of stillbirths in this case) varies by the value of the independent variable. It's nice and simple to use but suffers from the weakness that it's a portmanteau (omnibus) test - it doesn't tell you how the number of stillbirths varies with parity, just that it does. If you have enough data, you could simply apply this separately for each year.

2) If you want to see how the number of stillbirths varies by parity then you might consider Poisson regression. Either take the parity as a numerical variable (which will assume the effect is linear) or as categorical, if you have enough data, to see exactly what is happening in each category. Again, if you have enough data, you could add year number as an independent variable. The downside of using this approach is that it is parametric and you should check the diagnostics of the regression to make sure that assumptions are being met.

3) If you assume that both number of stillbirths and parity are ordinal variables (seems reasonable to me) then you could fit some sort of ordinal logistic regression. This makes fewer distributional assumptions than Poisson regression, but implies a particular model for how stillbirths increase by parity.
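
As a way of checking a Kruskal-Wallis result like the one in option 1, here is a rough, standard-library-only Python sketch of the H statistic with the usual tie correction (which matters a lot when most values are 0) and a chi-squared approximation for the p-value. The function names `kruskal_wallis` and `chi2_sf` and the litter data are made up for illustration; in practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
from collections import Counter
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-y)        # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, erfc(sqrt(y))  # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_wallis(groups):
    """groups: dict of label -> non-empty list of numbers. Returns (H, p)."""
    values = [v for g in groups.values() for v in g]
    n = len(values)
    counts = Counter(values)
    if len(groups) < 2 or len(counts) < 2:
        raise ValueError("need at least two groups and two distinct values")
    # Midranks: tied values share the average of the ranks they occupy.
    rank_of, start = {}, 0
    for v in sorted(counts):
        t = counts[v]
        rank_of[v] = start + (t + 1) / 2.0
        start += t
    rank_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups.values()
    )
    h = 12.0 / (n * (n + 1)) * rank_term - 3.0 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (N^3 - N).
    ties = 1.0 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    h = max(h / ties, 0.0)  # guard against tiny negative rounding error
    return h, chi2_sf(h, len(groups) - 1)


# Invented stillbirths-per-litter counts keyed by parity (illustration only).
litters = {
    1: [0, 0, 0, 0, 0, 0, 0, 1],
    2: [0, 0, 0, 0, 0, 0, 1, 0],
    3: [0, 0, 0, 1, 0, 0, 1, 2],
    4: [0, 1, 0, 1, 2, 0, 1, 2],
}
h_stat, p_value = kruskal_wallis(litters)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

The chi-squared approximation is rough for small groups, so treat the p-value from a sketch like this as indicative only.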
