Is this data normally distributed?
Hello, I want to use regression and correlation analysis on two variables. n = 1500.
- The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
- The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
- Data was obtained from a questionnaire using a Likert scale and re-coded numerically as above.
The K-S test and Shapiro-Wilk test both yield p < 0.05; therefore, the data is NOT normally distributed. However, Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. There is a contradiction here. If the data IS normally distributed, I can use the Pearson correlation; if not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?
#2
(Original post by studentshello)
Hello, I want to use regression and correlation analysis on two variables. n = 1500.
- The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
- The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
- Data was obtained from a questionnaire using a Likert scale and re-coded numerically as above.
The K-S test and Shapiro-Wilk test both yield p < 0.05; therefore, the data is NOT normally distributed. However, Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. There is a contradiction here. If the data IS normally distributed, I can use the Pearson correlation; if not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?
For the normal tests, are you testing to see if x is normally distributed and also testing to see if y is normally distributed?
Normal distribution does not necessarily imply two variables are correlated and vice versa.
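For illustration, this is roughly how the two tests and the Q-Q plots could be run on x and y separately. The thread appears to use SPSS; the sketch below is Python, and the DataFrame `df`, the file name and the column names "x" and "y" are assumptions rather than anything given in the thread.

```python
# Illustrative sketch only: runs Shapiro-Wilk and a K-S test on each variable
# and draws a Q-Q plot. Assumes a DataFrame `df` with columns "x" and "y"
# (hypothetical names) holding the 1500 recoded responses.
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("responses.csv")   # hypothetical data file

for col in ["x", "y"]:
    data = df[col].astype(float)
    w, p_sw = stats.shapiro(data)                        # Shapiro-Wilk
    # Standardise first and test against N(0,1); note this is not exactly the
    # Lilliefors-corrected K-S test that SPSS reports.
    d, p_ks = stats.kstest(stats.zscore(data), "norm")
    print(f"{col}: Shapiro-Wilk p = {p_sw:.4f}, K-S p = {p_ks:.4f}")

    # Q-Q plot against a fitted normal; with only 11 distinct values the
    # points form a small number of steps along the diagonal.
    sm.qqplot(data, line="s")
    plt.title(f"Q-Q plot of {col}")

plt.show()
```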
#3
(Original post by studentshello)
Hello, I want to use regression and correlation analysis on two variables. n = 1500.
- The y (dependent) variable takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
- The x (independent) variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
- Data was obtained from a questionnaire using a Likert scale and re-coded numerically as above.
The K-S test and Shapiro-Wilk test both yield p < 0.05; therefore, the data is NOT normally distributed. However, Q-Q plots show 11 points along a diagonal line, suggesting the data IS normally distributed. There is a contradiction here. If the data IS normally distributed, I can use the Pearson correlation; if not, I use a non-parametric correlation. Is this sufficient evidence to conclude the data is not normally distributed?
There are a few comments to be made on this question.
(1) The data is obviously not normally distributed, as it doesn't come from a continuous distribution. The question is, does this matter?
(2) One will often obtain a small p-value from a large sample! The problem with normality testing (especially when you have a large sample) is that a small p-value will not tell you how badly non-normal the data is, and whether the degree of non-normality matters for the purpose you have in mind. You've very wisely looked at qq plots and seen that the small p-value appears to be a symptom of the sample size rather than any gross deviation from normality.
(3) When you're doing (ordinary) regression analysis, you're not interested in whether the variables themselves are normally distributed, but whether the residuals from the regression are normally distributed (see the sketch after point (4) below).
(4) So, should this data be analyzed using ordinary linear regression? This is difficult to answer, as for data like this, it depends on the meaning of the variables. If you can guarantee that your outcome (especially) behaves as if it were an interval variable, then you can. If it's not, you're probably looking at ordinal regression or some non-parametric technique.
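As a rough illustration of point (3): fit the regression first, then look at the residuals rather than the raw variables. A Python/statsmodels sketch, where the simple formula `y ~ x`, the file name and the column names are assumptions, not the model from the thread:

```python
# Sketch of point (3): check normality of the *residuals*, not the raw variables.
# Assumes a DataFrame `df` with columns "x" and "y" (hypothetical names).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("responses.csv")        # hypothetical data file
model = smf.ols("y ~ x", data=df).fit()  # ordinary least squares

resid = model.resid
print(stats.shapiro(resid))              # with n = 1500 this will often reject anyway
sm.qqplot(resid, line="s")               # the Q-Q plot is usually more informative
plt.title("Q-Q plot of OLS residuals")
plt.show()
```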
(Original post by Gregorius)
There are a few comments to be made on this question.
(1) The data is obviously not normally distributed, as it doesn't come from a continuous distribution. The question is, does this matter?
(2) One will often obtain a small p-value from a large sample! The problem with normality testing (especially when you have a large sample) is that a small p-value will not tell you how badly non-normal the data is, and whether the degree of non-normality matters for the purpose you have in mind. You've very wisely looked at qq plots and seen that the small p-value appears to be a symptom of the sample size rather than any gross deviation from normality.
(3) When you're doing (ordinary) regression analysis, you're not interested in whether the variables themselves are normally distributed, but whether the residuals from the regression are normally distributed.
(4) So, should this data be analyzed using ordinary linear regression? This is difficult to answer, as for data like this, it depends on the meaning of the variables. If you can guarantee that your outcome (especially) behaves as if it were an interval variable, then you can. If it's not, you're probably looking at ordinal regression or some non-parametric technique.
Thanks for replying. Actually, the issue is not whether the dataset is normally distributed (the residuals ARE normally distributed), but whether the dataset is linear.
My dataset consists of a dependent variable that is either continuous or nominal (depending on how you choose to classify the Likert scale) and takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
The independent variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption for multiple regression:
Normality, linearity, homoscedasticity, autocorrelation, outliers and absence of multicollinearity.
Scatter graph of variables (no linearity here?)
https://i.imgur.com/BvdBsrf.jpg
Histogram (can linearity be shown here?)
https://i.imgur.com/4sz8ESE.jpg
However, the residuals appear to be normally distributed and the data appears homoscedastic. There are no outliers, autocorrelation or multicollinearity. Therefore, is it OK to proceed with the LINEAR regression analysis? If not, should I consider ordinal regression because of this violation of linearity?
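For reference, the homoscedasticity, autocorrelation and multicollinearity checks listed above could look something like this in Python/statsmodels. This is only a sketch: the predictors x1-x4, the formula and the file name are hypothetical placeholders, not the model from the thread.

```python
# Sketch of the assumption checks listed above. Predictor names x1..x4 and the
# file name are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("responses.csv")                        # hypothetical data
model = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()

# Homoscedasticity: Breusch-Pagan test on the residuals.
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p =", lm_p)

# Autocorrelation of residuals (mainly relevant if the observations are ordered).
print("Durbin-Watson =", durbin_watson(model.resid))

# Multicollinearity: variance inflation factor for each predictor column.
exog = model.model.exog
for i, name in enumerate(model.model.exog_names):
    if name != "Intercept":
        print(name, "VIF =", variance_inflation_factor(exog, i))
```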
#5
(Original post by studentshello)
Thanks for replying. Actually, the issue is not whether the dataset is normally distributed (the residuals ARE normally distributed), but whether the dataset is linear.
My dataset consists of a dependent variable that is either continuous or nominal (depending on how you choose to classify the Likert scale) and takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
The independent variable also takes the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption for multiple regression:
Normality, linearity, homoscedasticity, autocorrelation, outliers and absence of multicollinearity.
Scatter graph of variables (no linearity here?)
https://i.imgur.com/BvdBsrf.jpg
Histogram (can linearity be shown here?)
https://i.imgur.com/4sz8ESE.jpg
However, the residuals appear to be normally distributed and the data appears homoscedastic. There are no outliers, autocorrelation or multicollinearity. Therefore, is it OK to proceed with the LINEAR regression analysis? If not, should I consider ordinal regression because of this violation of linearity?
There are quite a few terms you're throwing into the description which I'm not sure you understand. How can you say the residuals are normal if you're not giving the model? Why talk about autocorrelation, multiple regression...?
* Try doing a mosaic plot (or a multiple box-and-whiskers plot, one for each of the "x" values). The trend may give you some indication of the deterministic relationship.
* You could give linear regression a bash, to see what happens. If it's halfway OK, do some residual analysis.
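A sketch of the box-and-whiskers idea, one box of y per observed x value. This is Python/pandas; the DataFrame, file and column names are assumptions.

```python
# Sketch of a box-and-whisker plot of y at each observed x value, as suggested
# above. Assumes a DataFrame `df` with columns "x" and "y" (hypothetical names).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("responses.csv")   # hypothetical data

# One box per x level; a roughly monotone drift of the medians across the boxes
# is the kind of trend that would support a (close to) linear relationship.
df.boxplot(column="y", by="x")
plt.xlabel("x")
plt.ylabel("y")
plt.suptitle("")          # drop pandas' automatic "Boxplot grouped by x" title
plt.title("y by x level")
plt.show()
```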
#6
(Original post by studentshello)
Therefore, given the nature of the dependent variable, I believe my dataset violates the linearity assumption for multiple regression:
Normality, linearity, homoscedasticity, autocorrelation, outliers and absence of multicollinearity.
Scatter graph of variables (no linearity here?)
https://i.imgur.com/BvdBsrf.jpg
Because your observations only take discrete values, a simple scatter plot like this can't tell you very much. As you have 1500 observations, and there are only 121 possible combinations of the two values, many of those dots on your plot will be multiply overplotted. The standard way of plotting data like this is to "jitter" it - to add a little bit of random noise to every observation. This will then separate the plotted points, and you'll be able to see whether there's really a strong linear trend.
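A sketch of how the jittering could be done in Python/matplotlib; the jitter width of 0.3 and the data/column names are assumptions.

```python
# Sketch of the jittering idea: add a little random noise to each discrete
# observation so that overplotted points separate. Names are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("responses.csv")   # hypothetical data
rng = np.random.default_rng(0)

jitter_x = df["x"] + rng.uniform(-0.3, 0.3, size=len(df))
jitter_y = df["y"] + rng.uniform(-0.3, 0.3, size=len(df))

plt.scatter(jitter_x, jitter_y, s=5, alpha=0.4)
plt.xlabel("x (jittered)")
plt.ylabel("y (jittered)")
plt.show()
```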
(Original post by mqb2766)
There are quite a few terms you're throwing into the description which I'm not sure you understand. How can you say the residuals are normal if you're not giving the model? Why talk about autocorrelation, multiple regression...?
* Try doing a mosaic plot (or a multiple box-and-whiskers plot, one for each of the "x" values). The trend may give you some indication of the deterministic relationship.
* You could give linear regression a bash, to see what happens. If it's halfway OK, do some residual analysis.
These are assumptions for linear regression. Here:
https://statistics.laerd.com/spss-tu...statistics.php
The dataset complies with all those assumptions, except linearity.
(Original post by Gregorius)
Because your observations only take discrete values, a simple scatter plot like this can't tell you very much. As you have 1500 observations, and there are only 121 possible combinations of the two values, many of those dots on your plot will be multiply overplotted. The standard way of plotting data like this is to "jitter" it - to add a little bit of random noise to every observation. This will then separate the plotted points, and you'll be able to see whether there's really a strong linear trend.
Here is the scatter graph accounting for jitter
https://i.imgur.com/w59iAoY.png
Can linearity be seen here?
#9
(Original post by studentshello)
These are assumptions for linear regression. Here:
https://statistics.laerd.com/spss-tu...statistics.php
I comply with all those assumptions, except linearity.
As in my previous post, the first thing I would do is some form of mosaic plot, which is similar to a scatter plot but for discrete data. That will give you some evidence of the data distribution as well as some idea of what form of deterministic map may exist.
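For what it's worth, statsmodels can draw a mosaic plot directly. A sketch, assuming (hypothetically) a DataFrame `df` with the two discrete columns "x" and "y":

```python
# Sketch of a mosaic plot of the two discrete variables, as suggested above.
# Assumes a DataFrame `df` with integer columns "x" and "y" (hypothetical names).
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

df = pd.read_csv("responses.csv")   # hypothetical data

# Tile areas are proportional to the number of observations in each (x, y) cell,
# so a strong linear trend shows up as large tiles running along the diagonal.
# With 11 x 11 = 121 cells the plot is busy, hence the suppressed cell labels.
mosaic(df, ["x", "y"], gap=0.01, labelizer=lambda key: "")
plt.show()
```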
#10
(Original post by studentshello)
These are assumptions for linear regression. Here:
https://statistics.laerd.com/spss-tu...statistics.php
I comply with all those assumptions, except linearity.
When you look for a linear relationship between your variables in a scatter plot, you're looking for evidence that will justify only fitting linear terms in your regression equation.
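One way of checking whether fitting only linear terms is justified is to add a higher-order term and see whether it improves the fit. A Python/statsmodels sketch; the quadratic comparison and the data/column names are assumptions, not something prescribed in the thread.

```python
# Sketch: compare a model with only a linear term against one that also includes
# a quadratic term in x. Column and file names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")   # hypothetical data

linear = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()

# If the quadratic term is not significant and adds little explained variance,
# fitting only linear terms is a reasonable simplification.
print(quadratic.summary().tables[1])
print("R^2 linear:", linear.rsquared, " R^2 quadratic:", quadratic.rsquared)
```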
#11
(Original post by studentshello)
Here is the scatter graph accounting for jitter
https://i.imgur.com/w59iAoY.png
Can linearity be seen here?
That looks a possibility. I would now fit a linear regression, and then take a look at the plot of the residuals versus the fitted values. This plot can tell you a lot about whether inference from your regression will be accurate. I would guess from that jittered plot that there might be some problem with the residuals around the x = 7 or 8 point.
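A sketch of that residuals-versus-fitted plot, with a lowess smooth superimposed (Python/statsmodels; the simple `y ~ x` formula and the data/column names are assumptions):

```python
# Sketch of a residuals-versus-fitted-values plot with a lowess smooth, as
# recommended above. Variable and file names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")   # hypothetical data
model = smf.ols("y ~ x", data=df).fit()

fitted = model.fittedvalues
resid = model.resid
# lowess returns a sorted array: column 0 = fitted values, column 1 = smoothed residuals
smooth = sm.nonparametric.lowess(resid, fitted)

plt.scatter(fitted, resid, s=5, alpha=0.4)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")
plt.axhline(0, linestyle="--", color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```

If the red smooth stays close to the zero line and the vertical spread is roughly constant, there is no strong evidence against linearity or homoscedasticity.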
(Original post by Gregorius)
That looks a possibility. I would now fit a linear regression, and then take a look at the plot of the residuals versus the fitted values. This plot can tell you a lot about whether inference from your regression will be accurate. I would guess from that jittered plot that there might be some problem with the residuals around the x = 7 or 8 point.
Just to be clear.
The linear regression model I am trying to validate includes 5 variables (1 dependent (as above), and 4 independent - some dichotomous variables like gender included).
As you suggest, another way of testing the linearity of each of these independent variables is by partial regression plots.
Here are partial regression plots for each of the 4 independent variables.
https://i.imgur.com/qA5q55s.png
https://i.imgur.com/SjT8ZaN.png
https://i.imgur.com/f5obqgX.png
https://i.imgur.com/UwWVw2v.png
Do these plots show linearity? Remember, only one plot has to be non-linear for the linearity assumption to be violated -- therefore, nominal regression might be better suited here?
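For reference, statsmodels can generate partial regression (added-variable) plots for every predictor in one call. A sketch only; the predictor names x1, x2, x3 and gender are hypothetical stand-ins for the four independent variables.

```python
# Sketch of partial regression plots for a model with four predictors.
# Predictor names (x1..x3 plus a dichotomous "gender") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")   # hypothetical data
model = smf.ols("y ~ x1 + x2 + x3 + gender", data=df).fit()

# One partial regression (added-variable) plot per predictor, on a shared grid.
fig = plt.figure(figsize=(10, 8))
sm.graphics.plot_partregress_grid(model, fig=fig)
plt.show()
```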
#13
(Original post by studentshello)
Just to be clear.
The linear regression model I am trying to validate includes 5 variables (1 dependent (as above), and 4 independent - some dichotomous variables like gender included).
As you suggest, another way of testing the linearity of each of these independent variables is by partial regression plots.
Here are partial regression plots for each of the 4 independent variables.
https://i.imgur.com/qA5q55s.png
https://i.imgur.com/SjT8ZaN.png
https://i.imgur.com/f5obqgX.png
https://i.imgur.com/UwWVw2v.png
Do these plots show linearity? Remember, only one plot has to be non-linear for the linearity assumption to be violated -- therefore, nominal regression might be better suited here?
Yes, these plots look OK. But as I say, a residuals versus fitted values plot (with a superimposed smooth) will reveal if there are any serious problems with the model.
(Original post by Gregorius)
Yes, these plots look OK. But as I say, a residuals versus fitted values plot (with a superimposed smooth) will reveal if there are any serious problems with the model.
Thank you. What makes you think the plots look OK? Can you maybe describe your reasoning here?
#15
(Original post by studentshello)
Thank you. What makes you think the plots look OK? Can you maybe describe your reasoning here?