The question is:

State two ways in which better use could be made of the large data set to produce a model describing a relationship between humidity and visibility.

For this question we're shown the humidity and visibility data for Heathrow over the first 2 weeks of September 2015. You can see it by clicking this link, scrolling to the bottom and clicking Heathrow Sep 2015: https://mathsorchard.weebly.com/edex...data-sets.html

The answer given says we should increase the sample size by using the data for humidity and visibility from other months too and also taking random samples through the whole of September.
I understand the first point, but I don't get why taking random samples would be that beneficial in this case. Could someone explain this please?
1 month ago
Taking random samples means everyone/thing in the population has an equal chance of being picked.
This leads to a less biased and more representative sample of the target population which increases the validity of the data.

Hope this helps!
1 month ago
Have you got the actual model solution? There may be a subtlety in the wording. Also, in the data set there is no visibility column, only humidity?
For the first one, generally there will be a greater range in both humidity & visibility, so the model should be better in that it can predict across a wider range. But you understand that.
Saying to take random samples throughout September sounds like it's almost repeating itself, but just for one month rather than several.
1 month ago
https://qualifications.pearson.com/e...le-assessments

^^ There's a link to the official data set on that page. In the Excel spreadsheet, one of the tabs along the bottom is Heathrow 2015.

https://activeteach-prod.resource.pe...b_sm1_ex4b.pdf

^^ That's the link to the worked solution; it's the last question (5d).
1 month ago
Hmm I'm a bit confused tbh, but I kinda understand what you're saying. It's just weird though, this doesn't seem like something you'd be biased about.
1 month ago
Ok, that data set does have both humidity and visibility in it.

Obviously, using only the first two weeks is a bad idea when there is a lot more data available, otherwise the model will be biased to the conditions (higher pressure & visibility) that were experienced during that time. Again, I know you know that.
One of the results of this is that in the 15 samples, the minimum humidity is 80, the next is 87, then all the remainder are in the 90s. The single value of 80 will have a high leverage (large influence) on the model. Obviously, more data in the 80s would provide more information about those lower values.
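
To put a rough number on that leverage point, here's a minimal Python sketch. The humidity values are made up to match the description (one 80, one 87, the rest in the 90s); they are not the real Heathrow readings, and `leverages` is just a helper name chosen for illustration. It uses the standard leverage formula for a straight-line fit, h_i = 1/n + (x_i - mean)^2 / Sxx.

```python
# Hypothetical humidity readings (%): one 80, one 87, the rest in the 90s.
# These are illustrative values only, NOT the actual Heathrow data.
humidity = [80, 87, 91, 92, 93, 94, 95, 95, 96, 96, 97, 97, 98, 98, 99]


def leverages(xs):
    """Leverage of each x value in a simple linear regression of y on x."""
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1 / n + (x - mean) ** 2 / sxx for x in xs]


h = leverages(humidity)

# Print the points from lowest to highest humidity with their leverage.
for x, lev in sorted(zip(humidity, h)):
    print(f"humidity {x}: leverage {lev:.3f}")
```

With values shaped like these, the point at 80 should come out with by far the largest leverage, which is the sense in which one reading can drag the fitted line around. Adding more readings in the 80s would spread that influence out.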

In the second half of September there are more data with lower pressure and visibility, so this is perhaps the reason for stating it explicitly. Obviously, if you're only allowed to choose 15 samples (not stated), you'd be better off randomly sampling 15 from the 30 days. However, if there is no restriction, you'd be better off using all 30, or indeed all the data in the data set. Unless the question says you can only choose 15 points, there is no reason for stating it in the model solution.
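
If it helps to see the "15 from 30" idea concretely, here's a small Python sketch. It only picks which days to use; the day numbers 1 to 30 stand in for the rows of the September sheet, and the seed is an arbitrary choice so the example is repeatable. Both are assumptions for illustration, not part of the question.

```python
import random

# Days of September, 1..30 (stand-ins for the rows of the data set).
days = list(range(1, 31))

# Original approach: just take the first 15 days.
first_fifteen = days[:15]

# Alternative: a simple random sample of 15 days from the whole month,
# so every day has the same chance of being chosen.
rng = random.Random(2015)  # arbitrary fixed seed, only for repeatability
random_fifteen = sorted(rng.sample(days, 15))

print("First 15 days: ", first_fifteen)
print("Random 15 days:", random_fifteen)
```

The first list can never include anything from the second half of the month, whereas the random one can draw from both halves, so it has a chance of picking up the lower pressure / visibility days.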
