I know outliers are often left out when fitting the slope of a line, but if they were included, how would this affect the analysis? For example, if we were measuring speeds in a 30mph zone and one car was travelling at over 50mph, would this cause the slope of the graph to increase, decrease or stay the same?
Would it increase purely because of such a big difference in speed?
Following on, how would the correlation be affected? Does it become stronger, weaker or stay the same?
I would say it increases as well, only because the outlier is
I don't really know a full valid reason, so if anyone could help it would be appreciated.
It depends on how you are analysing the data. For example, the mode is not sensitive to outliers at all, the median is barely affected, the mean is sensitive depending on the size of the sample (bigger sample = less sensitive), and the midrange is extremely sensitive to outliers.
Generally, including very high outliers will make the average higher (and extremely low outliers will make it lower). In your car example, including that high figure would raise the average overall.
Outliers also weaken correlation. How much depends on the number of data items: if the sample is very small, the correlation weakens a lot; if the sample is huge, a single outlier will affect it less.
Remember, weak correlation means there is a lot of deviation within a data set, and an outlier is, by definition, a big deviation!
You can give a full valid reason using Spearman's rank correlation coefficient, which will show how the inclusion of the outlier affects the correlation on a scale from -1 to 1.
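To check how each kind of average reacts, here is a small Python sketch. The speeds are made-up numbers for illustration (not data from the question), and `midrange` and `summarise` are helper names chosen here.

```python
import statistics

# Hypothetical speeds (mph) in a 30mph zone; not real data.
speeds = [28, 29, 30, 30, 31, 32]
speeds_with_outlier = speeds + [50]  # add one car doing 50mph


def midrange(values):
    """Midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


def summarise(values):
    """Return the four summary measures for one list of speeds."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "midrange": midrange(values),
    }


before = summarise(speeds)
after = summarise(speeds_with_outlier)

for name in before:
    print(f"{name:9s} before={before[name]:.2f}  after={after[name]:.2f}")
```

With these particular numbers the mean and the midrange move up, while the median and the mode stay where they were.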
Thanks for the reply.
I understand what you are saying about the outliers, but I'm a little confused about the correlation. In an example I have looked at, there is a density of 68.6 cars. Since this means there are a lot of cars, does this weaken the correlation?
The number of cars doesn't affect the correlation directly; 100 cars can be just as strongly correlated as 10 cars or 1,000,000 cars.
Correlation is about the relationship between all the cars and the average: how much deviation there is (how far each car is from the average). Are all the cars pretty close to the average, or do they all vary randomly and wildly? How greatly do they differ from the mean?
A weak correlation means that the cars are all going at completely different speeds; there's no real relationship between all the data.
A strong correlation means that all the cars are pretty much hovering around the mean; maybe they are all going at between 29.5 and 30.2mph. There is a strong relationship between all the cars and the average, and there is very little deviation from the mean.
So this outlier, this car speeding along at 90mph, would weaken the correlation because it is a huge deviation: it's nowhere near the mean, so that nice, neat trend of cars all going near the average is weakened.
There are two ways you can see correlation really happening: either by plotting all the data on a graph and inspecting how the dots all hover around the same area together, or you can work the correlation out mathematically with the correlational coefficient.
(I am really explaining this terribly.)
Thanks, I think it's becoming clearer.
So for this question:
Traffic planners investigated the relationship between traffic density (number of cars per mile) and the average speed of the traffic on a moderately large city thoroughfare. The data were collected at the same location at 10 different times over a span of 3 months. They found a mean traffic density of 68.6 cars per mile (cpm) with standard deviation 27.07 cpm. Overall, the cars' average speed was 26.38mph, with standard deviation 9.68mph. These researchers found the regression line for their data to be:
Speed = 50.55 - 0.352 × Density
The data initially included the point density = 125cpm, speed = 55mph. This point was considered an outlier and was not included in the analysis. Will the slope increase, decrease or remain the same if we redo the analysis and include this point?
Provide full reasoning
The slope will increase because the outlier is some way above the average speed. When the outlier was not included, the average speed was 26.38mph, but as this car was travelling at 55mph (28.62mph above the average speed), it will increase. If the outlier car was travelling at 35mph, it would stay the same.
Will the correlation become stronger, weaker or stay the same if we redo the analysis and include the point (152,55)? Provide full reasoning
The correlation will become weaker as this outlier is travelling at a completely different speed to the average speed, so there is no relationship to the data.
I'm not particularly qualified in stats, but I think some of the reasoning here is fundamentally flawed.
Imagine the set of points (0,0), ..., (10, 10). Obviously they lie perfectly on the line y = x. The average value for y is 5.
You get a new point (0, 17). Obviously this is going to push up the average value of y (to 6, as it happens). But what happens to the slope of the best fit line? Draw a sketch, and you'll see the slope actually decreases (the line is going to have to move towards the new point, and that will push the slope down).
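If you'd rather check that than sketch it, here is a minimal Python version of the same example. It assumes NumPy is available and uses an ordinary least-squares fit via `np.polyfit`; the variable names are just for illustration.

```python
import numpy as np

# The eleven points (0,0), (1,1), ..., (10,10), which lie exactly on y = x.
x = np.arange(11, dtype=float)
y = x.copy()
slope_before, intercept_before = np.polyfit(x, y, 1)

# Add the new point (0, 17) and refit the least-squares line.
x_new = np.append(x, 0.0)
y_new = np.append(y, 17.0)
slope_after, intercept_after = np.polyfit(x_new, y_new, 1)

print(f"mean of y: {y.mean():.2f} -> {y_new.mean():.2f}")
print(f"slope:     {slope_before:.3f} -> {slope_after:.3f}")
```

The mean of y goes up from 5 to 6, but the fitted slope drops from 1 to about 0.41, because the new point sits high up at the left-hand end of the data.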
If I were the OP, I'd draw a sketch representing the regression line and data (obviously it can only be a sketch - you don't have the actual data). Then draw on the 'outlier' point, and see how it will affect the line.
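As a rough numerical stand-in for that sketch, the Python below uses only the summary figures quoted in the question. It assumes the outlier is at density 125 (the question also writes 152 in one place, so treat that as a guess), and it uses the standard least-squares identity r = slope × sd(x) / sd(y) to recover the correlation of the data without the outlier. The variable names are chosen here for illustration.

```python
# Figures quoted in the question (outlier excluded).
intercept = 50.55          # mph
slope = -0.352             # mph per (car per mile)
mean_density, sd_density = 68.6, 27.07
mean_speed, sd_speed = 26.38, 9.68


def predicted_speed(density):
    """Speed the fitted line predicts at a given traffic density."""
    return intercept + slope * density


# Assumed coordinates of the excluded point: density 125 cpm, speed 55 mph.
outlier_density, outlier_speed = 125.0, 55.0

predicted = predicted_speed(outlier_density)
residual = outlier_speed - predicted

# Correlation implied by the fitted slope and the two standard deviations.
r = slope * sd_density / sd_speed

print(f"line predicts {predicted:.2f} mph at density {outlier_density:.0f}")
print(f"outlier is {residual:.2f} mph above the line")
print(f"correlation without the outlier is about {r:.3f}")
```

The point lies to the right of the mean density and well above the line, which suggests it would pull the right-hand end of the line upwards, and it goes against what is otherwise a very strong negative trend.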
Oh right, I see what you mean now.
So the slope for the first question will actually decrease because, as the outlier is further away from the other points, it decreases to include that point?
Does my answer to the second question look alright?
I don't understand how you've reached that conclusion from what I said.
Again, for the 2nd question, you need to consider a sketch.
[For example, imagine you have 11 data points, this time (0, 1) (1,0) (2,3) (3,2) (4,5) (5,4) (6,7) (7,6) (8,9) (9,8) (10,10). The data roughly speaking follows the line y = x, with a bit of deviation. Now you get an outlier (100,100). Note that it fits the "y=x" equation perfectly. I haven't done the actual calculation, but I'm pretty sure you'll find the correlation actually goes up, even though y=100 is a mile away from the average for the rest of the data.]
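For anyone who wants the actual calculation, here is a short Python check of that example. It assumes NumPy and uses the Pearson correlation from `np.corrcoef`; the variable names are just for illustration.

```python
import numpy as np

# The 11 points from the example, scattered around the line y = x.
x = np.arange(11, dtype=float)
y = np.array([1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 10], dtype=float)
r_before = np.corrcoef(x, y)[0, 1]

# Add the far-away point (100, 100), which sits exactly on y = x.
x_new = np.append(x, 100.0)
y_new = np.append(y, 100.0)
r_after = np.corrcoef(x_new, y_new)[0, 1]

print(f"r without (100, 100): {r_before:.4f}")
print(f"r with (100, 100):    {r_after:.4f}")
```

The correlation rises from about 0.95 to more than 0.999, so a point far from the rest of the data can strengthen the correlation when it agrees with the trend.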
Hi, I was wondering how one would interpret correlation results when there is a significant but weak correlation between variables.
Should the title read "outliers"?