The Student Room Group

Correlation

I have river flow rates for two separate rivers.

The data produced this scatter graph for both rivers

https://i.imgur.com/jVnWF3t.png

What is the best type of correlation to use here?

Spearman, Pearson or Kendall?

Why?

Thank you.
Reply 1
Original post by studentshello
I have river flow rates for two separate rivers.

The data produced this scatter graph for both rivers

https://i.imgur.com/jVnWF3t.png

What is the best type of correlation to use here?

Spearman, Pearson or Kendall?

Why?

Thank you.


Reasonable summary of the assumptions, pros & cons at
http://www.statisticssolutions.com/wp-content/uploads/wp-post-to-pdf-enhanced-cache/1/correlation-pearson-kendall-spearman.pdf
What do you think? Is the data normally distributed, etc.?
Original post by studentshello
I have river flow rates for two separate rivers.

The data produced this scatter graph for both rivers

https://i.imgur.com/jVnWF3t.png

What is the best type of correlation to use here?

Spearman, Pearson or Kendall?

Why?

Thank you.


I would be inclined to apply a transform to both variables before attempting any correlation analysis. What are the two variables measuring?
Original post by mqb2766
Reasonable summary of the assumptions, pros & cons at
http://www.statisticssolutions.com/wp-content/uploads/wp-post-to-pdf-enhanced-cache/1/correlation-pearson-kendall-spearman.pdf
What do you think? Is the data normally distributed, etc.?

Original post by Gregorius
I would be inclined to apply a transform to both variables before attempting any correlation analysis. What are the two variables measuring?


Looking at the graph, I would suggest Pearson's because the data appears to be normally distributed. Also, no ranking is required, so Spearman and Kendall are unsuitable. Would you agree?
Original post by studentshello
Looking at the graph, I would suggest Pearson's because the data appears to be normally distributed. Also, no ranking is required, so Spearman and Kendall are unsuitable. Would you agree?


Nope. "The data appear to be normally distributed" is meaningless; you should be looking at the distribution of each variable separately, then considering whether each needs some sort of transform (log, perhaps, or square root) that makes the distributions more nearly approximate normal. If this doesn't work, consider a non-parametric approach.
Reply 5
Original post by Gregorius
Nope. "The data appear to be normally distributed" is meaningless; you should be looking at the distribution of each variable separately, then considering whether each needs some sort of transform (log, perhaps, or square root) that makes the distributions more nearly approximate normal. If this doesn't work, consider a non-parametric approach.


What he said.

I'm not sure whether the OP is answering a book question or trying to do some actual analysis, but the data is certainly not normally distributed.
Original post by mqb2766
What he said.

I'm not sure whether the OP is answering a book question or trying to do some actual analysis, but the data is certainly not normally distributed.

How can you tell that by the graph?

Yes I'm trying to do data analysis.

Which correlation is most suitable, Spearman or Kendall?
Original post by Gregorius
Nope. "The data appear to be normally distributed" is meaningless; you should be looking at the distribution of each variable separately, then considering whether each needs some sort of transform (log, perhaps, or square root) that makes the distributions more nearly approximate normal. If this doesn't work, consider a non-parametric approach.


Spearman or Kendall, which is most suitable?


Transform should not be necessary for my task.
Reply 8
Original post by studentshello
How can you tell that by the graph?

Yes I'm trying to do data analysis.

Which correlation is most suitable, Pearson, Spearman or Kendall?


A normal distribution is a bell-shaped curve, symmetric around the mean. Just google it.
You would project the data onto each axis (just look at the distribution of each column).
As Gregorius suggests, doing a log-type transformation of the data may well give you some better insights, as it would help you see the clump of data near zero.
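
To make "look at the distribution of each column" concrete, here is a minimal Python sketch. It uses made-up, right-skewed values (not the river data from this thread), and the names `flow` and `log_flow` are purely illustrative. It compares the sample skewness of one variable before and after a log transform; a value near zero is consistent with a symmetric, roughly bell-shaped distribution, though it does not prove normality.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumption: synthetic log-normal values stand in for one column of flow rates.
rng = random.Random(0)
flow = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

# A log transform only works for strictly positive values.
log_flow = [math.log(v) for v in flow]

print("skewness of raw values:", round(skewness(flow), 2))
print("skewness of log values:", round(skewness(log_flow), 2))
```

You would repeat this for each variable separately. A histogram of each column tells the same story visually.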
Original post by studentshello
How can you tell that by the graph?


Most of the values are down near zero for both variables.
Original post by studentshello

Transform should not be necessary for my task.


That's a very strange thing to say; why do you think a transform "should not" be necessary? If you are doing real-world data analysis, you should be applying standard techniques to your data!
Original post by Gregorius
That's a very strange thing to say; why do you think a transform "should not" be necessary? If you are doing real-world data analysis, you should be applying standard techniques to your data!

We were not taught this.

They said one of three correlation methods is fine.

But I don't know which one in this situation, Spearman or Kendall?
Original post by studentshello
We were not taught this.

They said one of three correlation methods is fine.

But I don't know which one in this situation, Spearman or Kendall?


Ah, so this is an exercise in data analysis that you're doing, is it? Have you been told anything about what the data means?

The individual variables are not normally distributed, so you have a choice: either (a) transform the variables so that they do become approximately normal (which would be the standard approach with data that looks like the stuff you have) and then apply the Pearson correlation coefficient, or (b) use a rank correlation method, using the scale they're measured on to give you ranks.

If (b), you again have a choice of which rank correlation to use. Time to look at their applicability. Looking at your plot, which one of these do you think best applies here?
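
The two routes above can be sketched in plain Python. This is only an illustration under stated assumptions: the data is synthetic (`river_a` and `river_b` are invented names and values, not the thread's data), the Kendall function is the simple tau-a with no tie correction, and a real analysis would more likely call a statistics library such as `scipy.stats`.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Assumption: two made-up, right-skewed series sharing a common component.
rng = random.Random(1)
shared = [rng.gauss(0, 1) for _ in range(200)]
river_a = [math.exp(s + 0.5 * rng.gauss(0, 1)) for s in shared]
river_b = [math.exp(s + 0.5 * rng.gauss(0, 1)) for s in shared]
log_a = [math.log(v) for v in river_a]
log_b = [math.log(v) for v in river_b]

print("(a) Pearson on log-transformed values:", round(pearson(log_a, log_b), 3))
print("(b) Spearman on raw values:", round(spearman(river_a, river_b), 3))
print("(b) Kendall tau-a on raw values:", round(kendall_tau_a(river_a, river_b), 3))
```

Note that the rank-based coefficients depend only on the ordering of the values, so a monotonic transform such as the log leaves them unchanged; the transform matters only for route (a).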
Original post by Gregorius
Ah, so this is an exercise in data analysis that you're doing, is it? Have you been told anything about what the data means?

The individual variables are not normally distributed, so you have a choice: either (a) transform the variables so that they do become approximately normal (which would be the standard approach with data that looks like the stuff you have) and then apply the Pearson correlation coefficient, or (b) use a rank correlation method, using the scale they're measured on to give you ranks.

If (b), you again have a choice of which rank correlation to use. Time to look at their applicability. Looking at your plot, which one of these do you think best applies here?


From what I gather, Kendall is better suited to data with outliers, whereas Spearman's is better suited to closely grouped data with no outliers.

Looking at my graph, most of the data is grouped together (monotonic relationship?), with few (or no) outliers? So, I guess Spearman's would be the best choice here? What do you think?

OR... Perhaps Kendall is the better choice because it takes into account those data points in the top right of the graph?

It depends on whether you consider those data points in the top right to be outliers or not.


I'm not so sure!
Original post by studentshello

It depends on whether you consider those data points in the top right to be outliers or not.


An outlier is basically a data point that is very unlikely to be consistent with the probability model that you're using. So for variables like the ones you have, you'd be looking for a point (or two or three) that are well separated from the rest. So here? Nope, not outliers.
Original post by Gregorius
An outlier is basically a data point that is very unlikely to be consistent with the probability model that you're using. So for variables like the ones you have, you'd be looking for a point (or two or three) that are well separated from the rest. So here? Nope, not outliers.

So, Spearman's would be the best choice?
Original post by studentshello
So, Spearman's would be the best choice?


I'd be happy with that.

(Mind you, I wouldn't dream of analyzing data like this using a correlation coefficient!)
