# Is this data normally distributed?

#1
Hello, I want to use regression and correlation analysis on two variables. n = 1500.
• The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
• The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
• Data was obtained from a questionnaire using a Likert scale and recoded numerically as above.
The K-S and Shapiro-Wilk tests both yield p < 0.05, so the data is NOT normally distributed. However, the Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. So there is a contradiction here. If the data IS normally distributed, I can use Pearson correlation. If not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?
#2
(In reply to studentshello, #1)
Are you using SPSS? What are the values of the test statistics and their significance levels? Can you upload a few images of the data and the test results?
For the normality tests, are you testing whether x is normally distributed and, separately, whether y is normally distributed?
Normally distributed variables are not necessarily correlated, and correlated variables are not necessarily normally distributed.
#3
(In reply to studentshello, #1)

(1) The data is obviously not normally distributed, as it doesn't come from a continuous distribution. The question is, does this matter?

(2) One will often obtain a small p-value from a large sample! The problem with normality testing (especially when you have a large sample) is that a small p-value will not tell you how badly non-normal the data is, or whether the degree of non-normality matters for the purpose you have in mind. You've very wisely looked at Q-Q plots and seen that the small p-value appears to be a symptom of the sample size rather than any gross deviation from normality.

(3) When you're doing (ordinary) regression analysis, you're not interested in whether the variables themselves are normally distributed, but whether the residuals from the regression are normally distributed.

(4) So, should this data be analyzed using ordinary linear regression? This is difficult to answer, because for data like this it depends on the meaning of the variables. If you can guarantee that your outcome (especially) behaves as if it were an interval variable, then you can. If it doesn't, you're probably looking at ordinal regression or some non-parametric technique.
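
A minimal sketch of points (1) and (2), using only the Python standard library. The data here is simulated, not the poster's: it assumes a normal variable rounded onto the 0 to 10 scale, and all names (`scores`, `d_stat`, `critical`) are illustrative. It computes the Kolmogorov-Smirnov distance between the sample and a fitted normal, and compares it with the usual asymptotic 5% critical value for a fully specified null. With estimated parameters the true threshold is smaller, so the comparison is, if anything, conservative.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the questionnaire data: a normal variable
# rounded to whole numbers and clipped to the 0..10 scale.
n = 1500
scores = [min(10, max(0, round(random.gauss(5, 2)))) for _ in range(n)]

# Normal distribution fitted to the sample mean and standard deviation.
fitted = NormalDist(mean(scores), stdev(scores))

# Kolmogorov-Smirnov distance: the largest gap between the empirical CDF
# and the fitted normal CDF, checked just before and at each observed value.
counts = {v: scores.count(v) for v in sorted(set(scores))}
d_stat, below = 0.0, 0
for value, count in counts.items():
    cdf = fitted.cdf(value)
    d_stat = max(d_stat, abs(cdf - below / n), abs((below + count) / n - cdf))
    below += count

# Asymptotic 5% critical value for a fully specified null distribution.
critical = 1.358 / math.sqrt(n)
print(f"D = {d_stat:.3f}, 5% critical value = {critical:.3f}")
```

Under these assumptions the distance is driven by the jumps of the empirical CDF at each whole number, so the test rejects even though the underlying shape is bell-like. That is the sense in which a small p-value says little about how much the non-normality matters.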
#4
(In reply to Gregorius, #3)
Thanks for replying. Actually, the issue is not whether the dataset is normally distributed (the residuals ARE normally distributed), but whether the relationship between the variables is linear.

My dataset consists of a dependent variable that is either continuous or nominal (depending on how you choose to classify the Likert scale) and takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

The independent variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption, which is one of the assumptions for multiple regression:

Normality, linearity, homoscedasticity, no autocorrelation, no outliers and absence of multicollinearity

Scatter graph of variables (no linearity here?)

Histogram (can linearity be shown here?)

However, the residuals appear to be normally distributed and the data appears homoscedastic. There are no outliers, autocorrelation or multicollinearity. Therefore, is it OK to proceed with the LINEAR regression analysis? If not, should I consider ordinal regression because of this violation of linearity?
#5
(In reply to studentshello, #4. Scatter graph: https://i.imgur.com/BvdBsrf.jpg Histogram: https://i.imgur.com/4sz8ESE.jpg)
There are quite a few terms you're throwing into the description which I'm not sure you understand. How can you say the residuals are normal if you haven't given the model? Why talk about autocorrelation, multiple regression ...
* Try doing a mosaic plot (or a box-and-whisker plot for each of the 11 "x" values). The trend may give you some indication of the deterministic relationship.
* You could give linear regression a bash to see what happens. If it's half OK, do some residual analysis.
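
As a rough sketch of the per-level box plot idea, the numbers such a plot would draw can be tabulated directly. This uses only the Python standard library and simulated data. The relationship `0.6 * x + 2` plus noise is an assumption made purely for illustration, as are the names `pairs`, `groups` and `summary`.

```python
import random
from collections import defaultdict
from statistics import median, quantiles

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical data: y loosely follows x, both recoded to 0..10.
pairs = []
for _ in range(1500):
    x = random.randint(0, 10)
    y = min(10, max(0, round(0.6 * x + 2 + random.gauss(0, 1.5))))
    pairs.append((x, y))

# Group the y values by x level.
groups = defaultdict(list)
for x, y in pairs:
    groups[x].append(y)

# Quartiles of y within each x level: the numbers a box plot would draw.
summary = {}
for x in sorted(groups):
    q1, _, q3 = quantiles(groups[x], n=4)
    med = median(groups[x])
    summary[x] = (q1, med, q3)
    print(f"x={x:2d}  n={len(groups[x]):3d}  Q1={q1:4.1f}  median={med:4.1f}  Q3={q3:4.1f}")
```

If the medians climb steadily across the x levels, a straight-line trend is plausible. A plateau or a bend in the medians would suggest otherwise.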
#6
(In reply to studentshello, #4, on the scatter graph)
Because your observations only take discrete values, a simple scatter plot like this can't tell you very much. As you have 1500 observations, and there are only 121 possible combinations of the two values, many of those dots on your plot will be multiply overplotted. The standard way of plotting data like this is to "jitter" it - to add a little bit of random noise to every observation. This will then separate the plotted points, and you'll be able to see whether there's really a strong linear trend.
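
A minimal sketch of jittering, assuming simulated data on the same 0 to 10 grid. The helper `jitter`, its `width` parameter and the plotting function are illustrative names, not part of any package used in the thread. The plotting function needs matplotlib and is defined but not called here.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

def jitter(values, width=0.3):
    """Return a copy of values with uniform noise in [-width, +width] added."""
    return [v + random.uniform(-width, width) for v in values]

# Hypothetical discrete data on the 0..10 grid.
x = [random.randint(0, 10) for _ in range(1500)]
y = [min(10, max(0, round(0.6 * xi + 2 + random.gauss(0, 1.5)))) for xi in x]

x_jit, y_jit = jitter(x), jitter(y)

# Before jittering, the 1500 points sit on at most 121 distinct positions;
# afterwards nearly every point has its own position.
print(len(set(zip(x, y))), "distinct positions before jitter")
print(len(set(zip(x_jit, y_jit))), "distinct positions after jitter")

def plot_jittered(xs, ys):
    """Draw the jittered scatter plot (requires matplotlib; not called here)."""
    import matplotlib.pyplot as plt
    plt.scatter(xs, ys, s=8, alpha=0.4)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```

Keeping `width` below 0.5 means the clouds around neighbouring grid points do not overlap, so each point can still be traced back to its original value.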
#7
(In reply to mqb2766, #5)
These are assumptions for linear regression. Here:

https://statistics.laerd.com/spss-tu...statistics.php

The dataset complies with all those assumptions, except linearity.
#8
(In reply to Gregorius, #6)
Here is the scatter graph with jitter applied

https://i.imgur.com/w59iAoY.png

Can linearity be seen here?
#9
(In reply to studentshello, #7)
I do know the assumptions made in linear regression. However, I don't know whether your data satisfies them, as I've not seen the evidence.
As in my previous post, the first thing I would do is some form of mosaic plot, which is similar to a scatter plot but for discrete data. That will give you some evidence of the data distribution, as well as some idea of what form of deterministic map may exist.
#10
(In reply to studentshello, #7)
Oh dear. Be careful not to believe everything you read on the internet! This may be a bit beside the point for your particular problem, but linear regression is perfectly able to deal with non-linear relationships between variables, so their assumption #2 is incorrect. The "linear" in "linear regression" refers to linearity in the regression coefficients (in the linear predictor), not to linearity of the relationship between the variables. You can happily fit a covariate together with its square and its cube and more, if you want (or use a spline basis, if you want the data to tell you what the relationship between the variables really should be).

When you look for a linear relationship between your variables in a scatter plot, you're looking for evidence that will justify only fitting linear terms in your regression equation.
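
To make the "linear in the coefficients" point concrete, here is a small sketch using only the Python standard library. It fits a covariate together with its square by ordinary least squares. The data, the true coefficients (1, 2, -0.15) and the helper `solve` are all assumptions made for illustration.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical curved relationship: y depends on x and x squared.
xs = [random.uniform(0, 10) for _ in range(500)]
ys = [1.0 + 2.0 * x - 0.15 * x**2 + random.gauss(0, 0.5) for x in xs]

# Design matrix with columns 1, x, x^2. The model is non-linear in x
# but linear in the coefficients, so ordinary least squares applies.
rows = [[1.0, x, x**2] for x in xs]

def solve(a, b):
    """Solve the square system a * beta = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - known) / m[r][r]
    return beta

# Normal equations: (X'X) beta = X'y.
k = 3
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
b0, b1, b2 = solve(xtx, xty)
print(f"intercept={b0:.2f}, x={b1:.2f}, x^2={b2:.3f}")
```

The fitted curve bends, yet nothing beyond linear least squares was needed. Adding a cube or a spline basis only adds columns to the design matrix.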
#11
(In reply to studentshello, #8)
That looks a possibility. I would now fit a linear regression, and then take a look at the plot of the residuals versus the fitted values. This plot can tell you a lot about whether inference from your regression will be accurate. I would guess from that jittered plot that there might be some problem with the residuals around the x = 7 or 8 point.
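
A rough sketch of the residuals-versus-fitted check, again with simulated data and the Python standard library only. It assumes a single predictor, whereas the real model may have more, and it uses the mean residual at each fitted value as a crude stand-in for a smooth. All variable names are illustrative.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(4)  # fixed seed so the sketch is repeatable

# Hypothetical 0..10 data with a roughly linear trend.
x = [random.randint(0, 10) for _ in range(1500)]
y = [min(10, max(0, round(0.6 * xi + 2 + random.gauss(0, 1.5)))) for xi in x]

# Simple linear regression by least squares.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Crude "smooth": the mean residual at each fitted value. With a single
# discrete predictor there are only 11 distinct fitted values.
by_level = defaultdict(list)
for xi, ri in zip(x, residuals):
    by_level[xi].append(ri)
for xi in sorted(by_level):
    fit = intercept + slope * xi
    print(f"x={xi:2d}  fitted={fit:5.2f}  mean residual={mean(by_level[xi]):+.2f}")
```

Mean residuals that hover around zero at every level are consistent with a straight line. A run of positive or negative means at particular levels would point to a problem there.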
#12
(In reply to Gregorius, #11)
Just to be clear.

The linear regression model I am trying to validate includes 5 variables: 1 dependent (as above) and 4 independent, some of which are dichotomous, such as gender.

As you suggest, another way of testing the linearity of each of these independent variables is by partial regression plots.

Here are partial regression plots for each of the 4 independent variables.

Do these plots show linearity? Remember, only one plot has to be non-linear for the linearity assumption to be violated -- therefore, nominal regression might be better suited here?
#13
(In reply to studentshello, #12. Partial regression plots: https://i.imgur.com/qA5q55s.png https://i.imgur.com/SjT8ZaN.png https://i.imgur.com/f5obqgX.png https://i.imgur.com/UwWVw2v.png)
Yes, these plots look OK. But as I say, a residuals versus fitted values plot (with a superimposed smooth) will reveal if there are any serious problems with the model.
#14
(In reply to Gregorius, #13)
Thank you. What makes you think the plots look OK? Can you maybe describe your reasoning here?
#15
(In reply to studentshello, #14)
Experience. And a lot of creating and looking at simulated data designed to illustrate these sorts of issues for my students.
#16
No... it's generally classified.