The Student Room Group

Choosing Statistical Analysis

Hi there,
I am a student taking a non-mathematical subject at university and doing a research project, and I have some data I need to analyse. The last time I did any statistics was five years ago at school, so I need some advice!

I have 2 variables which are clinical measurements taken from a number of patients in a clinic (variables called A and B); A is numerical continuous and B is numerical but has to be a whole number.

My aim is to demonstrate their correlation with one another, and to identify specific pairs of data which are discrepant, i.e. do not fit the correlation. I'm struggling to work out via Google which statistical method is best for each of these things (or whether there is one that will do both).

My specific questions are these:
1. Is B a continuous or categorical variable, given that it has to be a whole number?

2. What statistical method would be best to 1) demonstrate correlation between variables A and B (linear regression? Mann-Whitney? Pearson?) and 2) identify patients who have discrepant data pairs (no idea!)?

Many thanks for any advice given.

EDIT: can this be done with z-scores? As in, using simple regression to find the correlation coefficient, and then using that to find the predicted z-score for A from the z-score for B? I'm not sure how I would interpret how far out the true value was, though, or whether it was discrepant.
Original post by Pwhiskers
Hi there,
I am a student taking a non-mathematical subject at university and doing a research project, and I have some data I need to analyse. The last time I did any statistics was five years ago at school, so I need some advice!

I have 2 variables which are clinical measurements taken from a number of patients in a clinic (variables called A and B); A is numerical continuous and B is numerical but has to be a whole number.

My aim is to demonstrate their correlation with one another, and to identify specific pairs of data which are discrepant, i.e. do not fit the correlation. I'm struggling to work out via Google which statistical method is best for each of these things (or whether there is one that will do both).

That would be because there are a variety of ways of approaching data like this, and the best choice(s) depend on exactly what it looks like!

My specific questions are these:
1. Is B a continuous or categorical variable, given that it has to be a whole number?


It could be treated either way depending on (a) what type of measurement it is and (b) the range of values that it can take. Are you able to describe these to me?


2. What statistical method would be best to 1) demonstrate correlation between variables A and B (linear regression? Mann-Whitney? Pearson?) and 2) identify patients who have discrepant data pairs (no idea!)?


It would be extremely helpful if you were able to attach a scatter plot of the data, so that I could see the form of the relationship between the variables. Deciding between a non-parametric approach (such as Spearman's rank correlation; Mann-Whitney is for comparing two groups rather than for correlation) and a parametric approach (such as linear regression) would be greatly facilitated by this.
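As a side note for anyone following this thread in code rather than SPSS: the parametric measure of correlation here is Pearson's r, and its non-parametric (rank-based) counterpart is Spearman's rho. A rough numpy sketch on made-up data (none of these numbers are from the thread):

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman's rho is Pearson's r computed on the ranks of the data
    # (this simple version ignores tie corrections)
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

rng = np.random.default_rng(0)
a = rng.uniform(1, 10, 200)                  # continuous measurement
b = np.round(a**2 + rng.normal(0, 5, 200))   # integer-valued, non-linear in a

pearson = np.corrcoef(a, b)[0, 1]   # sensitive to the non-linearity
rho = spearman_rho(a, b)            # only assumes a monotone relationship
```

For a monotone but non-linear relationship like this simulated one, Spearman's rho will typically come out a little higher than Pearson's r.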
Original post by Gregorius

It could be treated either way depending on (a) what type of measurement it is and (b) the range of values that it can take. Are you able to describe these to me?


It is an integer from counting the number of follicles visible on a scan. It doesn't have a fixed range but in my data set the minimum is 1 and the maximum is 113.

Original post by Gregorius

It would be extremely helpful if you were able to attach a scatter plot of the data, so that I could see the form of the relationship between the variables. Deciding between a non-parametric approach (such as Spearman's rank correlation; Mann-Whitney is for comparing two groups rather than for correlation) and a parametric approach (such as linear regression) would be greatly facilitated by this.


Here is the scatter plot I generated in SPSS just now. I have a lot of data (1279 pairs) which is why it's so messy.

[Attachment: Scatter plot.png]

Thank you so much for helping
Original post by Pwhiskers
It is an integer from counting the number of follicles visible on a scan. It doesn't have a fixed range but in my data set the minimum is 1 and the maximum is 113.



Here is the scatter plot I generated in SPSS just now. I have a lot of data (1279 pairs) which is why it's so messy.

[Attachment: Scatter plot.png]

Thank you so much for helping


This looks like linear regression would work - however, there are a few features of the plot that make me want to do a bit of investigation. In particular, the way the smaller pairs of values are distributed suggests there may be some non-linearity going on. So, before proceeding, I'm going to suggest you do another plot.

This time, plot log(1 + AFC) versus log(1 + AMH). The log transform will stretch out the data so that we can see better what is going on. (As there appear to be zero values in the data, we add one to each variable so that the logarithm is defined.)

Post the plot and I'll pick up the thread in the morning.
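For readers reproducing this outside SPSS, a minimal numpy version of the suggested transform (the values here are invented, not the poster's data):

```python
import numpy as np

afc = np.array([0, 1, 5, 40, 113], dtype=float)  # invented follicle counts
amh = np.array([0.0, 0.4, 2.1, 8.7, 25.0])       # invented AMH values

# np.log1p(x) computes log(1 + x) and is exact at zero,
# which is why the "1 +" makes zero values safe to transform
log_afc = np.log1p(afc)
log_amh = np.log1p(amh)
```

Plot log_afc against log_amh; np.expm1 inverts the transform if fitted values ever need mapping back to the original scale.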
Original post by Gregorius

This time, plot log(1 + AFC) versus log(1 + AMH). The log transform will stretch out the data so that we can see better what is going on. (As there appear to be zero values in the data, we add one to each variable so that the logarithm is defined.)

Post the plot and I'll pick up the thread in the morning.


Here is the requested plot of log(1 + AFC) and log(1 + AMH).

[Attachment: Scatter plot2.png]

Thanks
Original post by Pwhiskers
Here is the requested plot of log(1 + AFC) and log(1 + AMH).

[Attachment: Scatter plot2.png]

Thanks


Ah, now that's looking about as good as it could be! What you are seeing in this plot is that there is pretty much a linear relationship between the logarithms of your two variables. (Does either of your variables actually have any zero values? From the log-log plot it looks as though they may not. If so, then you can remove the "1 +" from the logarithm and just plot log(AFC) versus log(AMH).)

So, now you can happily do linear regression of the log of one variable against the log of the other. (Which way around depends upon the meaning of the variables - do you think that one variable is in a sense explaining the other?)

The answer to your second question about discrepant data pairs is to look at the residuals from the regression. If the assumptions of linear regression are met, these should be normally distributed. Residuals that are obvious outliers then correspond to your discrepant observations.
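A sketch of this step in numpy, on simulated data (the coefficients and noise level are invented for illustration): fit log(AFC) on log(AMH) by ordinary least squares and keep the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
amh = rng.lognormal(mean=1.0, sigma=0.8, size=300)  # simulated AMH values
# simulated integer follicle counts, roughly a power law in AMH
afc = np.maximum(np.round(3 * amh**0.9 * rng.lognormal(0, 0.3, 300)), 1)

x = np.log(amh)   # explanatory variable
y = np.log(afc)   # outcome

# A degree-1 polyfit is an ordinary least-squares straight-line fit
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
```

Because the fit includes an intercept, the residuals average to zero by construction; it is their spread and outliers that carry the information about discrepant pairs.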
Original post by Gregorius

The answer to your second question about discrepant data pairs is to look at the residuals from the regression. If the assumptions of linear regression are met, these should be normally distributed. Residuals that are obvious outliers then correspond to your discrepant observations.


Wonderful, I think I've got it to work! I've taken the outliers as having a residual >2 or <-2, is that correct?

Thank you so much for your help, it's been incredibly valuable.
Original post by Pwhiskers
Wonderful, I think I've got it to work! I've taken the outliers as having a residual >2 or <-2, is that correct?

Thank you so much for your help, it's been incredibly valuable.


It's a pleasure.

Now, just a caution. For the p-values that come out of linear regression to be accurate, you need the distribution of the residuals from the regression to be normally distributed, or close to that. Your statistical software should be able to give you some means of checking this.

As for identifying outliers, if your residuals are normally distributed then you expect 5% of them to be further than 1.96 times their standard deviation away from the mean. So you could be looking for two things. The first is whether there are any residuals that are far too large to be consistent with the normality assumption when the rest are; these are genuine outliers that do not conform to the probabilistic model. The second is that you could just be interested in points with "large" residuals, irrespective of whether they are consistent with the probability model. Up to you really!
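In code, the 1.96 rule looks like this (a sketch with simulated residuals; the real ones would come from the regression):

```python
import numpy as np

rng = np.random.default_rng(2)
residuals = rng.normal(0, 0.35, 1279)  # stand-in for the regression residuals

# Standardise, then flag anything beyond 1.96 standard deviations
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
flagged = np.abs(z) > 1.96

# Under normality roughly 5% of points get flagged, so a flag on its own
# does not prove a point is a genuine outlier
proportion_flagged = flagged.mean()
```

This illustrates the caution in the post: even perfectly well-behaved normal residuals produce about 5% "flagged" points, so the flag marks candidates for inspection rather than confirmed outliers.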
Original post by Gregorius
It's a pleasure.

Now, just a caution. For the p-values that come out of linear regression to be accurate, you need the distribution of the residuals from the regression to be normally distributed, or close to that. Your statistical software should be able to give you some means of checking this.

As for identifying outliers, if your residuals are normally distributed then you expect 5% of them to be further than 1.96 times their standard deviation away from the mean. So you could be looking for two things. The first is whether there are any residuals that are far too large to be consistent with the normality assumption when the rest are; these are genuine outliers that do not conform to the probabilistic model. The second is that you could just be interested in points with "large" residuals, irrespective of whether they are consistent with the probability model. Up to you really!


OK, I have one more question (sorry!) - if I now look at those 5% whose measurements are not concordant (standardised residual > 1.96 or < -1.96), and I want to determine which of the two variables is best at predicting a third (continuous) variable in those patients, is it sufficient simply to look at the correlation coefficients for each of log(AFC) and log(AMH) against the third variable?
Original post by Pwhiskers
OK, I have one more question (sorry!) - if I now look at those 5% whose measurements are not concordant (standardised residual > 1.96 or < -1.96), and I want to determine which of the two variables is best at predicting a third (continuous) variable in those patients, is it sufficient simply to look at the correlation coefficients for each of log(AFC) and log(AMH) against the third variable?


The first thing to do would be to plot graphs of this third variable against the values of AFC and AMH separately, for this group of subjects. You'll then (as before) get an idea of whether anything needs to be transformed, and you'll also get an idea of which relationship is the stronger. Once you've got the relationship looking vaguely linear, you can use a correlation coefficient, or you could use linear regression. (The value of the explained variation R² from linear regression is the square of the correlation coefficient between the variables.)

Ramping up the sophistication a bit, a linear regression (between appropriately transformed variables) using the third variable as outcome and both AMH and AFC as explanatory variables (and possibly an interaction term between AMH and AFC) would delve more deeply into the relationship between the variables. But perhaps the science itself suggests that you only need a simple analysis here...
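The "ramped up" version might look like this in numpy, using invented simulated data and plain least squares for the multiple regression (a real analysis would of course use the actual AFC, AMH, and third-variable measurements):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
log_amh = rng.normal(1.0, 0.6, n)                # simulated log AMH
log_afc = 0.8 * log_amh + rng.normal(0, 0.4, n)  # simulated log AFC
third = 2.0 + 1.5 * log_amh + 0.5 * log_afc + rng.normal(0, 0.3, n)

# Design matrix: intercept, both explanatory variables, and their interaction
X = np.column_stack([np.ones(n), log_amh, log_afc, log_amh * log_afc])
coef, *_ = np.linalg.lstsq(X, third, rcond=None)

fitted = X @ coef
r_squared = 1 - ((third - fitted)**2).sum() / ((third - third.mean())**2).sum()
```

The interaction column lets the effect of one explanatory variable depend on the level of the other; if its coefficient is negligible, the simpler additive model (or even the single-variable correlations) may be all that is needed.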
I understand. Thank you :smile:
