The Student Room Group

Plotting a Scatter Diagram Using AQA Large Data Set

Hello, I have been revising statistics, specifically bivariate data and the AQA A Level Mathematics Large Data Set (LDS). I have found the question below, which I am really struggling with.
I have attached a copy of the data set here.

Using the data for the purchased quantities of food in the East Midlands from the LDS, plot a scatter diagram to investigate any correlation between purchased quantities of butter and margarine using data from 2006-2014.
Plot butter on the x-axis. Give your conclusions, including comments on the suitability of data.

So I have attempted to plot the scatter diagram. My first query is whether the question intends both sets of data to go on one axis (I have plotted both on the x-axis), or whether it wants a single diagram with one variable on each axis, or two separate diagrams. I understand that in a scatter diagram the independent variable is plotted on the x-axis and the dependent variable on the y-axis. Since I did not think that the purchased quantities of butter and margarine depend on each other, I took them both to be independent variables and plotted them both on the x-axis.
Moreover, I have attempted to draw a regression line (line of best fit) for each data set to better evaluate the distribution of the data, but I do not think that I have done so accurately enough.

In the first diagram I have attached, I believe that for butter the plotted variables increase together, thereby exhibiting a positive correlation, so presumably the correlation coefficient r is positive (r > 0).
Similarly, for margarine the plotted variables predominantly increase together and exhibit a positive correlation; however, this correlation is not as strong as for butter and there are more outlying data points.
Moreover, the purchased quantities of butter continue to exceed those of margarine throughout 2006-2014, although there is a notable decline in the purchase of butter in 2009.
In terms of the suitability of the data, I believe that it is an extensive sample as it is divided among the regions of the UK and further collected to calculate an average for the purchased quantities per week, which is a very regular basis as opposed to say a month or year. The data extends from 2006-2014 which is a moderate time period to evaluate any changes in trend, although this could be extended from an earlier date to evaluate previous purchased quantities to broaden the data set and search for any further outliers. In this sense, the data is limited as it only concerns purchased quantities from 2006-2014 and excludes any previous or contemporary data.

However, should I instead plot the purchased quantities of margarine on the x-axis and the remaining variable, the purchased quantities of butter, on the y-axis, since a scatter diagram is intended to show each pair of data values as a single point and to exhibit the type and strength of the relationship between the two variables?
I have also done so and attached the graph here. In that case, I believe that the two variables exhibit a moderate positive correlation, especially discernible in the last three points on the diagram. However, would it be more suitable to state that, as a whole, the bivariate data is uncorrelated, i.e. that the correlation coefficient is r = 0?

I really want to improve my plotting of scatter diagrams and interpretation of data. How could I correct or improve my answer here? Clearly I am rather confused, but I am trying to evaluate the information given comprehensively. I would be very grateful for any response.
Original post by AN630078


You are on the right lines, but when a question says "... investigate any correlation between..." it generally means find the dependence of variable A on variable B. So it is your bivariate plot that is most relevant here. You could improve your answer by drawing a best-fit line through the data and, if you can use Excel, finding the coefficient of determination (R^2) to quantify the degree of correlation. Excel will calculate this for you automatically if you add a linear trendline.

As you can see from the attached plot, the coefficient of determination is not zero (R^2 = 0.3229, R = 0.568), which indicates a weak positive correlation. In other words, people who buy more margarine generally tend to buy more butter, and vice versa, but not always. That makes sense, because they are similar products and, to a certain extent, can be used for the same purposes (cooking, making toast, etc.).
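If you do not have Excel to hand, the same two numbers can be computed directly. Below is a minimal Python sketch, assuming the usual product-moment formula for r; the function name `pearson_r` and the butter and margarine figures are made up for illustration and are NOT taken from the Large Data Set. For a straight-line fit, R^2 is simply r squared, which is why R^2 = 0.3229 corresponds to R = 0.568 above.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL values for illustration only (not from the AQA Large Data Set).
butter = [30.0, 32.0, 31.0, 35.0, 34.0, 38.0, 37.0, 40.0, 39.0]
margarine = [20.0, 19.0, 23.0, 21.0, 25.0, 22.0, 26.0, 24.0, 27.0]

r = pearson_r(butter, margarine)
r_squared = r ** 2  # coefficient of determination for a straight-line fit
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

Replace the two lists with the nine yearly values from the data set to check your own plot.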

You could also plot the quantities of butter and margarine separately versus time (see attached plot). It's best to put time on the x-axis because this is the variable which is changing most regularly (your plots have the axes the other way around). The univariate correlations show simply that the amounts of butter and margarine that people buy both increase with time, and arguably margarine sales are increasing faster than butter sales. The correlation is reasonably strong in both cases (R^2 for butter is heavily influenced by the first few data points, which are noisy).
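To compare how fast each quantity changes with time, you can compare the gradients of the two trendlines. Here is a rough Python sketch of a least-squares line with year on the x-axis; `least_squares_line` is just an illustrative helper name and the quantities are again invented, so the printed gradients say nothing about the real data.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

years = list(range(2006, 2015))  # 2006 to 2014 inclusive, time on the x-axis

# HYPOTHETICAL quantities for illustration only (not from the AQA Large Data Set).
butter = [30.0, 32.0, 31.0, 35.0, 34.0, 38.0, 37.0, 40.0, 39.0]
margarine = [20.0, 19.0, 23.0, 21.0, 25.0, 22.0, 26.0, 24.0, 27.0]

butter_slope, butter_intercept = least_squares_line(years, butter)
margarine_slope, margarine_intercept = least_squares_line(years, margarine)
print(f"butter gradient: {butter_slope:.3f} per year")
print(f"margarine gradient: {margarine_slope:.3f} per year")
```

The larger gradient belongs to the quantity that is increasing faster over the period.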

Hope that helps.
Reply 2
Original post by lordaxil

Thank you very much for your reply, I greatly appreciate it, especially your attached plot.
I have redrawn my graph (attached here) of the bivariate scatter diagram to focus on the data given and reduce the empty space of the graph to better evaluate the relationship between the variables. Do you think that this is potentially beneficial? I have also attempted to include a line of best fit here.
Oh, OK. I have not been taught how to calculate the correlation coefficient; I just know from a textbook that the value of r satisfies -1 ≤ r ≤ 1 and is:
r=1 for perfect positive correlation
r=0 for zero correlation
r=-1 for perfect negative correlation
Since the data did not appear to exhibit any correlation I took this to mean that it was exhibiting zero correlation, hence r=0, but clearly I was too hasty in my assumptions. Thank you for the suggestion, I will try adding this with Excel.
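For anyone wanting to check those three reference values numerically, here is a small Python sketch. It assumes the standard product-moment formula for r, and the three data sets are invented purely to hit each case: points exactly on a rising line, points exactly on a falling line, and points with no linear trend.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4]

# Invented data: every point lies exactly on a rising straight line.
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])

# Invented data: every point lies exactly on a falling straight line.
r_negative = pearson_r(xs, [10 - 3 * x for x in xs])

# Invented data: the rise and fall cancel out, so there is no linear trend.
r_zero = pearson_r(xs, [1, 2, 2, 1])

print(r_positive, r_zero, r_negative)
```

Real data almost never lands exactly on these values, which is why a scatter that looks messy can still have an r well away from 0.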


Thank you for evaluating the univariate data and showing that the purchased quantities of butter and margarine both increase with time, and that margarine sales are increasing faster than butter sales.

Moreover, in regard to answering the question to arrive at a conclusion could I state:
There appears to be zero correlation, or only a very minor positive correlation (as you have shown that r is not 0 but 0.568), between the purchased quantities of butter and margarine. A weak positive correlation suggests that people who buy more margarine tend to purchase more butter, but not always, and likewise that people who purchase more butter tend to purchase more margarine. That is rather logical, as they are similar products that may be used for similar purposes, or could even be called substitutes.

Moreover, in commenting on the suitability of the data do you think that I could improve upon my previous thoughts:

"In terms of the suitability of the data, I believe that it is an extensive sample as it is divided among the regions of the UK and further collected to calculate an average for the purchased quantities per week, which is a very regular basis as opposed to say a month or year. The data extends from 2006-2014 which is a moderate time period to evaluate any changes in trend, although this could be extended from an earlier date to evaluate previous purchased quantities to broaden the data set and search for any further outliers. In this sense, the data is limited as it only concerns purchased quantities from 2006-2014 and excludes any previous or contemporary data."

Thank you very much again for your help, I tremendously appreciate it 😁👍
Glad that you found my comments useful. Your revised scatter plot and justifications look much better. Good luck with your assessment.

As a more general note, R^2 is a very crude measure of goodness-of-fit and should be used with caution. It is easy to construct data that have R^2 = 0 yet are clearly related (for example, points lying on a symmetric curve). Similarly, even R^2 = 1 does not guarantee that the best-fit line is the correct model. There are much better measures of goodness-of-fit.
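To make both points concrete, here is a minimal Python sketch with invented data. It assumes R^2 is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares for a least-squares straight line, and `r_squared` is just an illustrative helper name. The first data set lies exactly on y = x^2 yet gives R^2 = 0; the second has only two points, which always gives R^2 = 1 whatever the true relationship is.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares straight line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
curve_r2 = r_squared(xs, ys)

# Invented data: a line through two points fits perfectly, but tells us
# nothing about whether a straight line is the right model.
two_point_r2 = r_squared([1, 2], [5, 3])

print(curve_r2, two_point_r2)
```

This is why it is worth looking at the scatter diagram itself before trusting the number.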
