
Is this data normally distributed?

Hello, I want to use regression and correlation analysis on two variables; n = 1500.

The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The data were obtained from a questionnaire using a Likert scale and re-coded numerically as above.

The K-S test and Shapiro-Wilk test both yield p < 0.05; therefore, the data is NOT normally distributed. However, the Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. There is a contradiction here. If the data IS normally distributed, I can use Pearson correlation; if not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?
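
For reference, here is a minimal sketch of the two correlation options being weighed, written in Python/SciPy rather than SPSS; the arrays x and y are random placeholders standing in for the real recoded 0-10 scores.

```python
# Minimal sketch: Pearson vs Spearman correlation on 0-10 Likert-style data.
# `x` and `y` are placeholder arrays, not the poster's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.integers(0, 11, size=1500)                       # placeholder for the real x scores
y = np.clip(x + rng.integers(-3, 4, size=1500), 0, 10)   # placeholder for the real y scores

r, p_r = stats.pearsonr(x, y)        # parametric (linear) correlation
rho, p_rho = stats.spearmanr(x, y)   # rank-based, non-parametric alternative
print(f"Pearson r = {r:.3f} (p = {p_r:.3g}), Spearman rho = {rho:.3f} (p = {p_rho:.3g})")
```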
Reply 1
Original post by studentshello
Hello, I want to use regression and correlation analysis on two variables; n = 1500.
* The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
* The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
* The data were obtained from a questionnaire using a Likert scale and re-coded numerically as above.
The K-S test and Shapiro-Wilk test both yield p < 0.05; therefore, the data is NOT normally distributed. However, the Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. There is a contradiction here. If the data IS normally distributed, I can use Pearson correlation; if not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?

Are you using SPSS? What are the values of the statistics and the significance? Can you upload a few images of the data and the results of the tests?
For the normality tests, are you testing whether x is normally distributed and also whether y is normally distributed?
Normality does not necessarily imply that two variables are correlated, and vice versa.
Original post by studentshello
Hello, I want to use regression and correlation analysis on two variables; n = 1500.

The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The data were obtained from a questionnaire using a Likert scale and re-coded numerically as above.

The K-S test and Shapiro-Wilk test both yield p < 0.05; therefore, the data is NOT normally distributed. However, the Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. There is a contradiction here. If the data IS normally distributed, I can use Pearson correlation; if not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?


There are a few comments to be made on this question.

(1) The data is obviously not normally distributed, as it doesn't come from a continuous distribution. The question is, does this matter?

(2) One will often obtain a small p-value from a large sample! The problem with normality testing (especially when you have a large sample) is that a small p-value will not tell you how badly non-normal the data is, or whether the degree of non-normality matters for the purpose you have in mind. You've very wisely looked at Q-Q plots and seen that the small p-value appears to be a symptom of the sample size rather than any gross deviation from normality.
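
To illustrate point (2), a small simulation sketch (Python/SciPy, not the poster's SPSS output): a roughly normal latent score rounded to a 0-10 grid, with n = 1500, will usually fail formal normality tests while still producing a near-straight Q-Q plot.

```python
# Illustrative simulation: with large n, normality tests reject even mild,
# practically irrelevant deviations such as discretisation to a 0-10 grid.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
latent = rng.normal(loc=5, scale=2, size=1500)   # roughly normal latent score
scores = np.clip(np.round(latent), 0, 10)        # rounded to the 0-10 Likert grid

w, p_sw = stats.shapiro(scores)                                                    # Shapiro-Wilk
d, p_ks = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))  # K-S vs fitted normal
print(f"Shapiro-Wilk p = {p_sw:.2g}, K-S p = {p_ks:.2g}")                          # both typically tiny

stats.probplot(scores, dist="norm", plot=plt)    # yet the Q-Q plot still looks near-linear
plt.show()
```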

(3) When you're doing (ordinary) regression analysis, you're not interested in whether the variables themselves are normally distributed, but whether the residuals from the regression are normally distributed.

(4) So, should this data be analyzed using ordinary linear regression? This is difficult to answer, as for data like this, it depends on the meaning of the variables. If you can guarantee that your outcome (especially) behaves as if it were an interval variable, then you can. If it's not, you're probably looking at ordinal regression or some non-parametric technique.
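
A hedged sketch of the two routes in points (3) and (4), using Python/statsmodels rather than SPSS; the variables x and y below are simulated stand-ins, not the poster's data.

```python
# Route 1: ordinary linear regression, checking the residuals (not the raw variables).
# Route 2: ordinal regression if the outcome is best treated as ordered categories.
import numpy as np
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel
from scipy import stats

rng = np.random.default_rng(2)
x = rng.integers(0, 11, size=1500).astype(float)
y = np.clip(np.round(0.6 * x + rng.normal(0, 1.5, size=1500) + 2), 0, 10)

# Route 1: fit OLS, then test normality of the residuals.
ols_fit = sm.OLS(y, sm.add_constant(x)).fit()
print(stats.shapiro(ols_fit.resid))

# Route 2: proportional-odds (ordinal logistic) model; unique y values are treated as ordered levels.
ord_fit = OrderedModel(y, x[:, None], distr="logit").fit(method="bfgs", disp=False)
print(ord_fit.summary())
```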
Original post by Gregorius
There are a few comments to be made on this question.

(1) The data is obviously not normally distributed, as it doesn't come from a continuous distribution. The question is, does this matter?

(2) One will often obtain a small p-value from a large sample! The problem with normality testing (especially when you have a large sample) is that a small p-value will not tell you how badly non-normal the data is, or whether the degree of non-normality matters for the purpose you have in mind. You've very wisely looked at Q-Q plots and seen that the small p-value appears to be a symptom of the sample size rather than any gross deviation from normality.

(3) When you're doing (ordinary) regression analysis, you're not interested in whether the variables themselves are normally distributed, but whether the residuals from the regression are normally distributed.

(4) So, should this data be analyzed using ordinary linear regression? This is difficult to answer, as for data like this, it depends on the meaning of the variables. If you can guarantee that your outcome (especially) behaves as if it were an interval variable, then you can. If it's not, you're probably looking at ordinal regression or some non-parametric technique.

Thanks for replying. Actually, the issue is not whether the dataset is normally distributed (the residuals ARE normally distributed), but whether the relationship is linear.

My dataset consists of a dependent variable that is either continuous or nominal (depending on how you choose to classify the Likert scale) and takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The independent variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.



Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption for multiple regression:


Normality, linearity, homoscedasticity, no autocorrelation, no significant outliers and absence of multicollinearity.

Scatter graph of variables (no linearity here?)

https://i.imgur.com/BvdBsrf.jpg

Histogram (can linearity be shown here?)

https://i.imgur.com/4sz8ESE.jpg
However, the residuals appear to be normally distributed and the data appears homoscedastic. There are no outliers, no autocorrelation and no multicollinearity. Therefore, is it OK to proceed with the LINEAR regression analysis? If not, should I consider ordinal regression because of this violation of linearity?
(edited 5 years ago)
Reply 4
Original post by studentshello
Thanks for replying. Actually, the issue is not whether the dataset is normally distributed (the residuals ARE normally distributed), but whether the relationship is linear.

My dataset consists of a dependent variable that is either continuous or nominal (depending on how you choose to classify the Likert scale) and takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The independent variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption for multiple regression:

Normality, linearity, homoscedasticity, no autocorrelation, no significant outliers and absence of multicollinearity.

Scatter graph of variables (no linearity here?)

https://i.imgur.com/BvdBsrf.jpg

Histogram (can linearity be shown here?)

https://i.imgur.com/4sz8ESE.jpg

However, the residuals appear to be normally distributed and the data appears homoscedastic. There are no outliers, no autocorrelation and no multicollinearity. Therefore, is it OK to proceed with the LINEAR regression analysis? If not, should I consider ordinal regression because of this violation of linearity?


There are quite a few terms you're throwing into the description which I'm not sure you understand. How can you say the residuals are normal if you're not giving the model? Why talk about autocorrelation, multiple regression ...?
* Try doing a mosaic plot (or a box-and-whisker plot for each of the 11 "x" values); the trend may give you some indication of the deterministic relationship (see the sketch after this list).
* You could give linear regression a bash, to see what happens. If it's halfway OK, do some residual analysis.
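
A minimal sketch of the box-and-whisker suggestion above (Python/matplotlib rather than SPSS); x and y are simulated placeholders for the real 0-10 scores.

```python
# One box-and-whisker plot of y for each x level; the run of medians shows the trend.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.integers(0, 11, size=1500)
y = np.clip(np.round(0.6 * x + rng.normal(0, 1.5, size=1500) + 2), 0, 10)

levels = np.arange(11)
groups = [y[x == level] for level in levels]   # one group of y values per x level

plt.boxplot(groups, positions=levels)          # one box per x value
plt.xlabel("x (Likert score)")
plt.ylabel("y (Likert score)")
plt.show()
```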
Original post by studentshello

Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption for multiple regression:

Normality, linearity, homoscedasticity, no autocorrelation, no significant outliers and absence of multicollinearity.

Scatter graph of variables (no linearity here?)

https://i.imgur.com/BvdBsrf.jpg

Because your observations only take discrete values, a simple scatter plot like this can't tell you very much. As you have 1500 observations, and there are only 121 possible combinations of the two values, many of those dots on your plot will be multiply overplotted. The standard way of plotting data like this is to "jitter" it - to add a little bit of random noise to every observation. This will then separate the plotted points, and you'll be able to see whether there's really a strong linear trend.
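
A minimal sketch of the jittering idea, assuming the recoded scores are in NumPy arrays x and y (the data below are placeholders):

```python
# Add a little uniform noise to each discrete score before plotting, so
# overplotted points separate and any linear trend becomes visible.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.integers(0, 11, size=1500)
y = np.clip(np.round(0.6 * x + rng.normal(0, 1.5, size=1500) + 2), 0, 10)

jitter = lambda v: v + rng.uniform(-0.3, 0.3, size=v.size)   # noise well under the 1-unit grid spacing

plt.scatter(jitter(x), jitter(y), s=5, alpha=0.4)
plt.xlabel("x (jittered)")
plt.ylabel("y (jittered)")
plt.show()
```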
Original post by mqb2766
There are quite a few terms you're throwing into the description which I'm not sure you understand. How can you say the residuals are normal if you're not giving the model? Why talk about autocorrelation, multiple regression ...?
* Try doing a mosaic plot (or a box-and-whisker plot for each of the 11 "x" values); the trend may give you some indication of the deterministic relationship.
* You could give linear regression a bash, to see what happens. If it's halfway OK, do some residual analysis.


These are assumptions for linear regression. Here:

https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php

The dataset complies with all those assumptions, except linearity.
(edited 5 years ago)
Original post by Gregorius
Because your observations only take discrete values, a simple scatter plot like this can't tell you very much. As you have 1500 observations, and there are only 121 possible combinations of the two values, many of those dots on your plot will be multiply overplotted. The standard way of plotting data like this is to "jitter" it - to add a little bit of random noise to every observation. This will then separate the plotted points, and you'll be able to see whether there's really a strong linear trend.


Here is the scatter graph with jitter applied:

https://i.imgur.com/w59iAoY.png

Can linearity be seen here?
Reply 8
Original post by studentshello
These are assumptions for linear regression. Here:

https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php

I comply with all those assumptions, except linearity.


I do know the assumptions made in linear regression. However, I don't know whether your data satisfies them, as I've not seen the evidence.
As in my previous post, the first thing I would do is some form of mosaic plot, which is similar to a scatter plot but for discrete data. That will give you some evidence of the data distribution, as well as some idea of what form of deterministic map may exist.
Original post by studentshello
These are assumptions for linear regression. Here:

https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php

I comply with all those assumptions, except linearity.

Oh dear. Be careful not to believe everything you read on the internet! This may be a bit beside the point for your particular problem, but linear regression is perfectly able to deal with non-linear relationships between variables, so their assumption #2 is incorrect. The "linear" in "linear regression" refers to the linearity of the regression coefficients (in the linear predictor), not the linearity of the relationship between the variables. You can happily fit a covariate together with its square and its cube and more, if you want (or use a spline basis, if you want the data to tell you what the relationship really should be between the variables).

When you look for a linear relationship between your variables in a scatter plot, you're looking for evidence that will justify only fitting linear terms in your regression equation.
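
A small sketch of this point, using Python/statsmodels with illustrative data: the quadratic model below is still "linear regression" because it is linear in its coefficients.

```python
# Fitting a curved relationship with ordinary linear regression by adding a squared term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
x = rng.integers(0, 11, size=1500).astype(float)
y = np.clip(np.round(0.05 * x**2 + rng.normal(0, 1.5, size=1500) + 2), 0, 10)
df = pd.DataFrame({"x": x, "y": y})

linear_fit = smf.ols("y ~ x", data=df).fit()                # straight-line term only
quadratic_fit = smf.ols("y ~ x + I(x**2)", data=df).fit()   # still linear in the coefficients
print(linear_fit.aic, quadratic_fit.aic)                     # compare the two fits
```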
Original post by studentshello
Here is the scatter graph with jitter applied:

https://i.imgur.com/w59iAoY.png

Can linearity be seen here?

That looks like a possibility. I would now fit a linear regression, and then take a look at the plot of the residuals versus the fitted values. This plot can tell you a lot about whether inference from your regression will be accurate. I would guess from that jittered plot that there might be some problem with the residuals around the x = 7 or 8 point.
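
A sketch of the suggested diagnostic, residuals versus fitted values with a lowess smooth superimposed (Python/statsmodels, simulated stand-in data):

```python
# Residuals-vs-fitted plot with a lowess smooth; a flat smooth near zero suggests no gross misfit.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(6)
x = rng.integers(0, 11, size=1500).astype(float)
y = np.clip(np.round(0.6 * x + rng.normal(0, 1.5, size=1500) + 2), 0, 10)
ols_fit = sm.OLS(y, sm.add_constant(x)).fit()

fitted = ols_fit.fittedvalues
resid = ols_fit.resid
smooth = lowess(resid, fitted, frac=0.3)           # sorted (fitted, smoothed residual) pairs

plt.scatter(fitted, resid, s=5, alpha=0.4)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```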
Original post by Gregorius
That looks like a possibility. I would now fit a linear regression, and then take a look at the plot of the residuals versus the fitted values. This plot can tell you a lot about whether inference from your regression will be accurate. I would guess from that jittered plot that there might be some problem with the residuals around the x = 7 or 8 point.

Just to be clear.

The linear regression model I am trying to validate includes 5 variables (1 dependent, as above, and 4 independent, including some dichotomous variables such as gender).

As you suggest, another way of testing the linearity of each of these independent variables is by partial regression plots.

Here are partial regression plots for each of the 4 independent variables:

https://i.imgur.com/qA5q55s.png
https://i.imgur.com/SjT8ZaN.png
https://i.imgur.com/f5obqgX.png
https://i.imgur.com/UwWVw2v.png
Do these plots show linearity? Remember, only one plot has to be non-linear for the linearity assumption to be violated; in that case, would ordinal regression be better suited here?
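
For anyone reproducing this outside SPSS, a sketch of added-variable (partial regression) plots with statsmodels; the predictor names below are illustrative, not the poster's variables.

```python
# Added-variable (partial regression) plots: one panel per predictor in a multiple regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1500
df = pd.DataFrame({
    "x1": rng.integers(0, 11, size=n).astype(float),
    "x2": rng.integers(0, 11, size=n).astype(float),
    "gender": rng.integers(0, 2, size=n).astype(float),   # dichotomous predictor
})
df["y"] = np.clip(np.round(0.4 * df.x1 + 0.3 * df.x2 + 0.5 * df.gender
                           + rng.normal(0, 1.5, size=n) + 1), 0, 10)

fit = smf.ols("y ~ x1 + x2 + gender", data=df).fit()
fig = plt.figure(figsize=(8, 8))
sm.graphics.plot_partregress_grid(fit, fig=fig)            # one plot per predictor
plt.show()
```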
(edited 5 years ago)
Original post by studentshello
Just to be clear.

The linear regression model I am trying to validate includes 5 variables (1 dependent, as above, and 4 independent, including some dichotomous variables such as gender).

As you suggest, another way of testing the linearity of each of these independent variables is by partial regression plots.

Here are partial regression plots for each of the 4 independent variables.

https://i.imgur.com/qA5q55s.png

https://i.imgur.com/SjT8ZaN.png

https://i.imgur.com/f5obqgX.png

https://i.imgur.com/UwWVw2v.png


Do these plots show linearity? Remember, only one plot has to be non-linear for the linearity assumption to be violated; in that case, would ordinal regression be better suited here?

Yes, these plots look OK. But as I say, a residuals versus fitted values plot (with a superimposed smooth) will reveal if there are any serious problems with the model.
Original post by Gregorius
Yes, these plots look OK. But as I say, a residuals versus fitted values plot (with a superimposed smooth) will reveal if there are any serious problems with the model.

Thank you. What makes you think the plots look OK? Can you maybe describe your reasoning here?
Original post by studentshello
Thank you. What makes you think the plots look OK? Can you maybe describe your reasoning here?

Experience. And a lot of creating and looking at simulated data designed to illustrate these sorts of issues for my students.
