Outliers and Correlation
I know outliers are often left out when fitting the slope of a line, but if they were included, how would this affect the analysis? For example, if we were measuring speed in a 30mph zone and a car was travelling at over 50mph, would this cause the graph to increase, decrease or stay the same?
Would it increase purely because of such a big difference in speed?
Following on, how would this affect the correlation? Does it become stronger, weaker or stay the same?
I would say it increases as well, only because the outlier is
I don't really know a full valid reason, so if anyone could help it would be appreciated.
Generally, including very high outliers will make the average higher (and extremely low outliers will make it lower). In your car example, including that high figure would raise the average overall.
Outliers also weaken correlation. How much depends on the number of items of data: if the sample is very small, the correlation weakens a lot; if the sample is huge, one single outlier will affect it less.
Remember, weak correlation means there is a lot of deviation within a data set: and an outlier is, by definition, a big deviation!
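The point about sample size can be checked with a quick sketch. The data here are made up purely for illustration: copies of the points (0,0) to (10,10), which sit exactly on y = x, plus one outlier at (0, 17). It uses Pearson's r written out by hand, not any particular library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_with_one_outlier(copies):
    # Hypothetical data: `copies` repeats of (0,0)...(10,10), all on y = x,
    # plus a single outlier at (0, 17).
    xs = [x for _ in range(copies) for x in range(11)] + [0]
    ys = [y for _ in range(copies) for y in range(11)] + [17]
    return pearson_r(xs, ys)

r_small = r_with_one_outlier(1)    # 11 points plus the outlier
r_large = r_with_one_outlier(100)  # 1100 points plus the same outlier
print(round(r_small, 2), round(r_large, 2))
```

With this particular outlier, r drops to about 0.31 in the small sample but stays at about 0.99 in the large one. Whether an outlier weakens r at all depends on where it sits relative to the trend, so this is one illustrative case rather than a general rule.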
You can give a full valid reason using Spearman's rank correlation coefficient, which will show how the inclusion of the outlier affects the correlation on a scale from -1 to 1.
I understand what you are saying about the outliers, but I am a little confused about the correlation. In an example I have looked at, it has a density of 68.6 cars. So, with this meaning there are a lot of cars, does this weaken the correlation?
Correlation is about the relationship between all the cars and the average: how much deviation there is (how far each car is from the average). Are all the cars pretty close to the average, or do they all vary randomly and wildly? How greatly do they differ from the mean?
A weak correlation means that the cars are all going at completely different speeds; there's no real relationship between all the data.
A strong correlation means that all the cars are pretty much hovering around the mean; maybe they are all going at between 29.5 and 30.2mph. There is a strong relationship between all the cars and the average, and there is very little deviation from the mean.
So this outlier, this car speeding along at 90mph, would weaken the correlation because it is a huge deviation: it's nowhere near the mean, so that nice, neat trend of cars all going near the average is weakened.
There are two ways you can see correlation really happening: either by plotting all the data on a graph and inspecting how the dots all hover around the same area together, or by working the correlation out mathematically with the correlation coefficient.
(I am really explaining this terribly.)
So for this question:
Traffic planners investigated the relationship between traffic density (number of cars per mile) and the average speed of the traffic on a moderately large city thoroughfare. The data were collected at the same location at 10 different times over a span of 3 months. They found a mean traffic density of 68.6 cars per mile (cpm) with standard deviation 27.07 cpm. Overall, the cars' average speed was 26.38 mph, with standard deviation 9.68 mph. These researchers found the regression line for their data to be:
Speed = 50.55 - 0.352 × Density
The data initially included the point density = 125 cpm, speed = 55 mph. This point was considered an outlier and was not included in the analysis. Will the slope increase, decrease or remain the same if we redo the analysis and include this point?
Provide full reasoning
The slope will increase because the outlier is some way above the average speed. When the outlier was not included, the average speed was 26.38mph, but as this car was travelling at 55mph (28.62mph above the average speed), it will increase. If the outlier car was travelling at 35mph, it would stay the same.
Will the correlation become stronger, weaker or stay the same if we redo the analysis and include the point (125, 55)? Provide full reasoning.
The correlation will become weaker as this outlier is travelling at a completely different speed to the average speed, so there is no relationship to the data.
I am not sure what they mean by "full reasoning". Do they want you to provide your calculations? For example, don't just say "it was above the average". Demonstrate that 125cpm is 2.08 standard deviations above the mean (and generally we hold anything over 2 standard deviations from the mean to be an outlier) and that 55mph is 2.96 standard deviations above the mean (which is quite a bit!). So yes, you're certainly right in saying that including such an abnormally high value would drag up the average quite a bit!
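For anyone who wants to check those two figures, here is a minimal sketch of the calculation using the means and standard deviations quoted in the question. The function name z_score is just an illustrative helper.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies above the mean."""
    return (value - mean) / sd

# Figures taken from the question: mean and SD of density and of speed.
z_density = z_score(125, 68.6, 27.07)
z_speed = z_score(55, 26.38, 9.68)

print(f"{z_density:.2f} {z_speed:.2f}")  # 2.08 2.96
```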
And such huge standard deviations from the mean would definitely weaken the correlation.
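The effect on the slope and the correlation can be estimated from the summary figures alone. The sketch below rests on two assumptions that the question does not state: that the fitted line was based on n = 10 points, and that the quoted standard deviations use the n - 1 divisor. It uses the standard one-point update for sums of squares; add_point is a made-up helper name.

```python
import math

def add_point(n, mean_x, sd_x, mean_y, sd_y, slope, x0, y0):
    """Slope, old r and new r after adding (x0, y0) to a fit on n points.

    Assumes sd_x and sd_y are sample standard deviations (n - 1 divisor).
    """
    sxx = (n - 1) * sd_x ** 2   # sum of squared x deviations
    syy = (n - 1) * sd_y ** 2   # sum of squared y deviations
    sxy = slope * sxx           # because slope = sxy / sxx
    w = n / (n + 1)             # weight in the one-point update
    dx, dy = x0 - mean_x, y0 - mean_y
    sxx_new = sxx + w * dx * dx
    syy_new = syy + w * dy * dy
    sxy_new = sxy + w * dx * dy
    r_old = sxy / math.sqrt(sxx * syy)
    r_new = sxy_new / math.sqrt(sxx_new * syy_new)
    return sxy_new / sxx_new, r_old, r_new

# n = 10 is an assumption; the other figures come from the question.
slope_new, r_old, r_new = add_point(10, 68.6, 27.07, 26.38, 9.68, -0.352, 125, 55)
print(round(slope_new, 3), round(r_old, 3), round(r_new, 3))
```

Under these assumptions the slope moves from -0.352 to roughly -0.09 (less steeply negative), and r moves from about -0.98 to about -0.22. Taking n = 9 instead changes the numbers slightly but not the direction.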
Imagine the set of points (0,0), ..., (10, 10). Obviously they lie perfectly on the line y = x. The average value for y is 5.
You get a new point (0, 17). Obviously this is going to push up the average value of y (to 6, as it happens). But what happens to the slope of the best fit line? Draw a sketch, and you'll see the slope actually decreases (the line is going to have to move towards the new point, and that will push the slope down).
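If you'd rather check this numerically than by sketch, here is a small least-squares calculation. It reads "(0,0), ..., (10, 10)" as the eleven integer points (0,0), (1,1), up to (10,10), which is an assumption, though it matches the stated averages of 5 and 6.

```python
def ls_slope(points):
    """Slope of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

base = [(i, i) for i in range(11)]   # assumed reading of (0,0), ..., (10,10)
with_outlier = base + [(0, 17)]

mean_y_after = sum(y for _, y in with_outlier) / len(with_outlier)
slope_before = ls_slope(base)
slope_after = ls_slope(with_outlier)
print(mean_y_after, slope_before, round(slope_after, 3))  # 6.0 1.0 0.414
```

So the average of y goes up from 5 to 6 while the slope falls from 1 to about 0.41.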
If I were the OP, I'd draw a sketch representing the regression line and data (obviously it can only be a sketch - you don't have the actual data). Then draw on the 'outlier' point, and see how it will affect the line.
So the slope for the first question will actually decrease because, as the outlier is further away from the other points, it decreases to include that point?
Does my answer to the second question look alright?
Again, for the 2nd question, you need to consider a sketch.
[For example, imagine you have 11 data points, this time (0, 1) (1,0) (2,3) (3,2) (4,5) (5,4) (6,7) (7,6) (8,9) (9,8) (10,10). The data roughly speaking follows the line y = x, with a bit of deviation. Now you get an outlier (100,100). Note that it fits the "y=x" equation perfectly. I haven't done the actual calculation, but I'm pretty sure you'll find the correlation actually goes up, even though y=100 is a mile away from the average for the rest of the data.]
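The calculation that was skipped there is short enough to do directly. This is a plain hand-written Pearson's r on the eleven points listed, with and without (100, 100).

```python
import math

def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

base = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5), (5, 4),
        (6, 7), (7, 6), (8, 9), (9, 8), (10, 10)]

r_before = pearson_r(base)
r_after = pearson_r(base + [(100, 100)])
print(round(r_before, 3), round(r_after, 4))  # 0.955 0.9994
```

So the correlation does go up, from about 0.955 to about 0.9994, because the new point lies on the same trend as the rest of the data.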
Hi, I was wondering how one would interpret correlation results when there is a significant but weak correlation between variables.
If two variables are (approximately) linearly related then the correlation between them measures the degree to which the values of one of the variables can predict the values of the other. If the correlation between them is high then knowledge of the value of one of them will predict the value of the other with a good degree of precision. If the correlation is low, then the prediction will have low precision.
The correlation will be "statistically significant" if it is unlikely to have arisen simply by chance (and you will have probably set a value of 0.05, or similar, as a threshold for something happening by chance).
So, in summary, a low but statistically significant correlation suggests a real relationship between the variables, but one that has little predictive power.
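A made-up example may help show how both things can be true at once. The values r = 0.10 and n = 1000 below are hypothetical, not from any data in this thread. The test statistic is the usual t = r * sqrt((n - 2) / (1 - r^2)), and the p-value uses a normal approximation, which is only reasonable because n is large.

```python
import math

def correlation_t_stat(r, n):
    """t statistic for testing whether a Pearson correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def two_sided_p_normal(t):
    """Two-sided p-value using a normal approximation (large n only)."""
    return math.erfc(abs(t) / math.sqrt(2))

r, n = 0.10, 1000           # hypothetical values for illustration
t = correlation_t_stat(r, n)
p = two_sided_p_normal(t)
r_squared = r * r           # share of variance accounted for by the line

print(round(t, 2), p < 0.05, round(r_squared, 2))
```

Here t comes out a little above 3, so the correlation passes a 0.05 threshold comfortably, yet r squared is only 0.01: the linear relationship accounts for about 1% of the variation in one variable given the other.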
BTW: would be a good idea in future to start a new thread rather than tacking on to an old one!