The Student Room Group

How do I find statistical significance with two sets of data (regression analysis)?

Hey guys, I am having trouble understanding how to find the coefficient(?) for statistical significance between two variables. I am trying to judge whether in elections the turnout rate positively correlates with the closeness of victory within it (just the example of the current data I have right now). I have been going through videos where they talk about things like "t tests" and "alpha levels"(?). What I understand is that you need to test two variables together and then you will get a coefficient, but I do not understand how to get that coefficient (e.g. in Excel?). From that coefficient, how do I then work out statistical significance? In the data that I've read before they talk about "significant at the _% level" - how do I know whether something would be significant at a certain percentage? Is this something I have to input before the calculations? I'm very sorry guys, I am no mathematician - I am in desperate need of help here
(edited 7 years ago)


kudasai?
Original post by sleepysnooze
kudasai?


shudn't it be 'onegai?' :biggrin:
Original post by rayestar
shudn't it be 'onegai?' :biggrin:


please help me with my maths sempai


Original post by sleepysnooze
please help me with my maths sempai





ohh i like that pic :biggrin:

when u ask for help for maths, its: 'suugaku wo tesudatte kudasai'
but when u say 'please' alone it's 'onegai'

sorry i can't help u there cos we haven't covered that in my skl lol
i knw most of S1 cos i did level 3 statistical methods, but regression wasn't part of that spec so yeahh
have u tried examsolutions?
Original post by sleepysnooze
Hey guys, I am having trouble understanding how to find the coefficient(?) for statistical significance between two variables. I am trying to judge whether in elections the turnout rate positively correlates with the closeness of victory within it (just the example of the current data I have right now). I have been going through videos where they talk about things like "t tests" and "alpha levels"(?). What I understand is that you need to test two variables together and then you will get a coefficient, but I do not understand how to get that coefficient (e.g. in Excel?). From that coefficient, how do I then work out statistical significance? In the data that I've read before they talk about "significant at the _% level" - how do I know whether something would be significant at a certain percentage? Is this something I have to input before the calculations? I'm very sorry guys, I am no mathematician - I am in desperate need of help here


So first off, what is the form of the data that you have? Do you have a number of observations in pairs where the first of the pair is percentage turnout and the second is margin of victory?

If you've got something like this then the first thing to do is to plot the data on a graph - turnout on the x-axis and margin of victory on the y-axis. What you do next depends on the shape of the plot that you get - the tools that you apply next depend on certain assumptions, and you may have to transform your data in some way in order to make it obey these assumptions. Post a plot of the data here if you want some advice.

But once this is done, you are probably going to do something using either (a) a correlation coefficient or (b) linear regression. With appropriate software you will automatically get some measure of statistical significance. What software do you have available, or are you doing this "by hand"?
Original post by Gregorius
So first off, what is the form of the data that you have? Do you have a number of observations in pairs where the first of the pair is percentage turnout and the second is margin of victory?


yes

If you've got something like this then the first thing to do is to plot the data on a graph - turnout on the x-axis and margin of victory on the y-axis. What you do next depends on the shape of the plot that you get - the tools that you apply next depend on certain assumptions, and you may have to transform your data in some way in order to make it obey these assumptions. Post a plot of the data here if you want some advice.


I don't have data yet though, I'll have to get back to you on this

But once this is done, you are probably going to do something using either (a) a correlation coefficient or (b) linear regression. With appropriate software you will automatically get some measure of statistical significance. What software do you have available, or are you doing this "by hand"?


is there free software out there I can download? what do you recommend?
Original post by sleepysnooze

is there free software out there I can download? what do you recommend?


R is free. https://cran.r-project.org/

It is very complex software, however, your task can be accomplished pretty quickly.
Original post by Gregorius
R is free. https://cran.r-project.org/

It is very complex software, however, your task can be accomplished pretty quickly.


how complex? too complex for a person who basically knows nothing about this level of maths?
Original post by sleepysnooze
how complex? too complex for a person who basically knows nothing about this level of maths?


It's the programme that professional statisticians use (among others) and it has a steep learning curve. That said, what you're trying to do is fairly elementary and not difficult to do in R.

But let me turn the question around: what software/calculator/computer do you have available to do this? Are you used to using spreadsheets for example?
Original post by Gregorius
It's the programme that professional statisticians use (among others) and it has a steep learning curve. That said, what you're trying to do is fairly elementary and not difficult to do in R.

But let me turn the question around: what software/calculator/computer do you have available to do this? Are you used to using spreadsheets for example?


so if I have two sets of data that I am trying to find the statistical significance for (via one variable on the other) how do I find that value in the R program?
Original post by sleepysnooze
so if I have two sets of data that I am trying to find the statistical significance for (via one variable on the other) how do I find that value in the R program?


If you set up your data in a dataframe (which we'll call "dfr") with column names "turnout" and "margin" then plotting the data is as simple as

with(dfr, plot(turnout, margin))

If you get something nice and attractive for the plot then you would issue the incantation

with(dfr, cor.test(turnout, margin))

to calculate the correlation coefficient (and there are options for the different types of correlation coefficients)

or you might fit a linear regression via

mod <- lm(margin ~ turnout, data=dfr)
summary(mod)
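
For anyone following along, here is a minimal sketch of the whole sequence with the data typed in by hand. The eight pairs of numbers are made up purely for illustration, and the file name in the commented-out `read.csv` line is hypothetical; only the names `dfr`, `turnout` and `margin` come from the post.

```r
# Made-up example data: NOT real election results.
dfr <- data.frame(
  turnout = c(55.2, 61.8, 64.0, 58.7, 70.1, 66.3, 59.9, 72.4),  # % turnout
  margin  = c(30.5, 22.1, 18.4, 27.9, 9.6, 15.2, 25.0, 7.3)     # % margin of victory
)
# With real data saved from Excel as a CSV file you might instead use
# something like:
# dfr <- read.csv("elections.csv")   # hypothetical file name

# Scatter plot: turnout on the x-axis, margin on the y-axis.
with(dfr, plot(turnout, margin))

# Pearson correlation coefficient together with its p-value.
ct <- with(dfr, cor.test(turnout, margin))
print(ct$estimate)
print(ct$p.value)

# Simple linear regression of margin on turnout.
mod <- lm(margin ~ turnout, data = dfr)
print(summary(mod))
```

Each column of the dataframe holds one variable and each row holds one election, which is the same layout as two columns in a spreadsheet.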
Original post by Gregorius
If you set up your data in a dataframe (which we'll call "dfr") with column names "turnout" and "margin" then plotting the data is as simple as

with(dfr, plot(turnout, margin))

If you get something nice and attractive for the plot then you would issue the incantation

with(dfr, cor.test(turnout, margin))

to calculate the correlation coefficient (and there are options for the different types of correlation coefficients)

or you might fit a linear regression via

mod <- lm(margin ~ turnout, data=dfr)
summary(mod)


I'm really sorry that made very little sense to me - I've never used that program before so you'll have to explain it more simply
if I have, in excel for instance, two rows of results that I am wanting to compare, how would I paste those results into R and then set up these tests? you said "dfr" but is that me literally pasting all my data without anything separating them from each other (i.e. turnout from margin)?
or when you said "dfr, plot(turnout, margin)", does "margin" simply mean all those results, and then "turnout" mean the turnout results? so basically margin and turnout are actually (in your formula) meant to mean that I am just inputting numbers? could you please give me an example or something? I'm so sorry
Original post by sleepysnooze
I'm really sorry that made very little sense to me - I've never used that program before so you'll have to explain it more simply
if I have, in excel for instance, two rows of results that I am wanting to compare, how would I paste those results into R and then set up these tests? you said "dfr" but is that me literally pasting all my data without anything separating them from each other (i.e. turnout from margin)?
or when you said "dfr, plot(turnout, margin)", does "margin" simply mean all those results, and then "turnout" mean the turnout results? so basically margin and turnout are actually (in your formula) meant to mean that I am just inputting numbers? could you please give me an example or something? I'm so sorry


No worries - but if you have excel and are used to using it, you could do the analysis there rather than having to faff around with a new (and scary!) piece of software.

So in Excel, arrange your data in two columns, and then make sure that you have the Data Analysis toolpak installed in your copy of excel. (Google how to do this if it is not).

Then click on the "data analysis" button on the "Data" tab and choose "regression" as the analysis to do. You'll be prompted for the range of Y values (which should be the margin of victory numbers) and X values (the turnout numbers).

In the results, the value of "R squared" is the square of the correlation coefficient and the thing labelled "Significance F" is your p-value.
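
As a rough illustration of those two statements, the sketch below fits the same kind of one-variable regression in R and checks that R squared equals the squared correlation and that the F-test p-value matches the p-value of the slope. The numbers are made up for illustration, and the variable names are assumptions rather than anything Excel produces.

```r
# Made-up numbers for illustration only.
turnout <- c(55.2, 61.8, 64.0, 58.7, 70.1, 66.3, 59.9, 72.4)
margin  <- c(30.5, 22.1, 18.4, 27.9, 9.6, 15.2, 25.0, 7.3)

s <- summary(lm(margin ~ turnout))

# "R Square" in Excel's output: the squared correlation coefficient.
r_squared   <- s$r.squared
cor_squared <- cor(turnout, margin)^2

# "Significance F": the upper-tail probability of the F statistic.
f <- s$fstatistic  # named vector with entries value, numdf, dendf
significance_f <- unname(pf(f["value"], f["numdf"], f["dendf"],
                            lower.tail = FALSE))

# With a single X variable this equals the p-value of the slope's t test.
slope_p <- s$coefficients["turnout", "Pr(>|t|)"]

print(c(r_squared = r_squared, cor_squared = cor_squared))
print(c(significance_f = significance_f, slope_p = slope_p))
```

The agreement between the F test and the slope's t test holds only when there is a single X variable; with several predictors the two answer different questions.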

Do also plot the values you have using a scatter plot to make sure that there is a roughly linear relationship between the two variables.

If you want to attach your data here, I'm quite happy to have a look at it and guide through the analysis.
Original post by Gregorius
No worries - but if you have excel and are used to using it, you could do the analysis there rather than having to faff around with a new (and scary!) piece of software.

So in Excel, arrange your data in two columns, and then make sure that you have the Data Analysis toolpak installed in your copy of excel. (Google how to do this if it is not).


okay, I've just got the toolpak

Then click on the "data analysis" button on the "Data" tab and choose "regression" as the analysis to do. You'll be prompted for the range of Y values (which should be the margin of victory numbers) and X values (the turnout numbers).

In the results, the value of "R squared" is the square of the correlation coefficient and the thing labelled "Significance F" is your p-value.


what's a p-value? I've heard of that but there's never been an explanation for what it represents

Do also plot the values you have using a scatter plot to make sure that there is a roughly linear relationship between the two variables.


how do I set up the scatter graph? is it via regression or something totally different? I'm sorry - I'm probably asking a very stupid question right now - I'm just extremely lost and wanting to know this for certain

If you want to attach your data here, I'm quite happy to have a look at it and guide through the analysis.


okay I'm sorry if I've gone off from my original idea here but now I am wanting to test if a change of an institution will cause significant changes in the results of elections. I have 4 results sets: (1 = the first country and 2 = the second) 1a) before the institutional change, and 1b) after the institutional change. and I have 2a) a similar system without institutional change, and 2b) after 1's institutional change and still with no institutional change (so I'm comparing how two similar systems can compare to each other when one gets a change of its electoral institution, basically). what should I do to prove that the institution was the determining factor of the change of results? I have found that there was a change for 1 but no change for 2 in terms of the average results over the time periods (the "effective number of political parties" increased by 1.5~, for instance) but how do I show that it wasn't a coincidence? I'm sorry, you sound extremely knowledgeable about statistical analysis and this would help me beyond words if you could guide me
(edited 7 years ago)
also:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.022903566
R Square            0.000524573
Adjusted R Square   -0.001017827
Standard Error      7.590635054
Observations        650

ANOVA
             df    SS            MS         F            Significance F
Regression   1     19.59590527   19.59591   0.340101939  0.559973287
Residual     648   37336.29585   57.61774
Total        649   37355.89176

              Coefficients   Standard Error  t Stat    P-value      Lower 95%     Upper 95%    Lower 95.0%  Upper 95.0%
Intercept     51.76244313    3.355788785     15.42482  5.93377E-46  45.17291013   58.35197612  45.17291     58.35198
X Variable 1  -2.954555179   5.066260901     -0.58318  0.559973287  -12.90282532  6.993714963  -12.9028     6.993715
(woops, I thought it would appear as a table)

I just inserted my X data and Y data (X being majority of a candidate and Y being the turnout in an election) - if the confidence level is 95% (I don't even know what this implies, being a stats novice) and the significance F is 0.55997~ does that mean that the relationship between the two variables is significant? my guess is yes but this is quite important work I'm doing here so if you could explain it that would be absolutely fantastic
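
As a rough check on the posted output, the sketch below recomputes "Significance F" from the F statistic and degrees of freedom given in the table and compares it with the conventional 0.05 cut-off. The figures are copied from the post; the 0.05 threshold is the usual convention and an assumption here, not something Excel chose.

```r
# Values copied from the Excel SUMMARY OUTPUT in the post.
f_stat  <- 0.340101939
df_reg  <- 1
df_res  <- 648
t_slope <- -0.58318

# Upper-tail probability of the F statistic ("Significance F").
p_from_f <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)

# Two-sided p-value of the slope's t statistic; it should agree with the
# F-based value because there is only one X variable.
p_from_t <- 2 * pt(-abs(t_slope), df_res)

alpha <- 0.05  # conventional significance level (an assumption, not from Excel)
significant <- p_from_f < alpha

print(p_from_f)
print(p_from_t)
print(significant)
```

A p-value of about 0.56 is far above 0.05, so on these numbers there is no evidence of a linear relationship between the two variables. The "95%" in the output refers to the confidence interval for each coefficient; for "X Variable 1" it runs from about -12.9 to 7.0, which includes zero and tells the same story.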
Would it help just to find the covariance between the two factors?
Original post by Big Weiner
Would it help just to find the covariance between the two factors?


That's closely related to what we're doing in linear regression: the fitted slope is the covariance between the two variables divided by the variance of the x variable.
Original post by sleepysnooze

what's a p-value? I've heard of that but there's never been an explanation for what it represents

The classical way of doing a statistical test - to see if something is a "statistically significant" result is to put forward a "null hypothesis" - usually a hypothesis that denies the connection that you're looking for - and then to see whether the data that you have is consistent with that null hypothesis.

So in your case, you would frame the null hypothesis "there is no correlation between voter turnout and margin of victory".

A p-value expresses how consistent your data is with the null hypothesis by calculating something (called a test statistic - in your case a correlation coefficient) from the observed data and working out the probability of getting as large a value of the test statistic as you have under the assumption that the null hypothesis is true.

So for your case, say you observe a value of 0.56 for the correlation coefficient between voter turnout and margin of victory, and that your software says the p-value for this is 0.035. What this says is that, if there were really no correlation, the probability of obtaining a correlation of 0.56 or higher purely by chance would be 0.035.

A very small p-value (conventionally 0.05 or less) is taken to indicate evidence against the null hypothesis - that is, you can conclude that there is a correlation between turnout and margin of victory.
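
One way to see what "purely by chance" means is a permutation test, sketched below. This is a different technique from the t-based calculation that Excel and `cor.test` use, swapped in here only because it makes the logic visible; the numbers are made up for illustration.

```r
set.seed(1)  # make the shuffles reproducible

# Made-up numbers for illustration only.
turnout <- c(55.2, 61.8, 64.0, 58.7, 70.1, 66.3, 59.9, 72.4)
margin  <- c(20.5, 25.1, 12.4, 27.9, 14.6, 22.2, 18.0, 10.3)

observed_r <- cor(turnout, margin)

# Under the null hypothesis the pairing is arbitrary, so shuffle one variable
# many times and record the correlation obtained each time.
n_shuffles <- 5000
shuffled_r <- replicate(n_shuffles, cor(turnout, sample(margin)))

# Two-sided p-value: the share of shuffles whose correlation is at least as
# far from zero as the one actually observed.
p_value <- mean(abs(shuffled_r) >= abs(observed_r))

print(observed_r)
print(p_value)
```

If only a small share of the shuffled data sets give a correlation as strong as the real one, the real one is unlikely to be a fluke, and that share is the p-value.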


how do I set up the scatter graph? is it via regression or something totally different? I'm sorry - I'm probably asking a very stupid question right now - I'm just extremely lost and wanting to know this for certain


In Excel, choose the "Insert" tab and then choose "charts". I suggest you use a "scatter" chart - which on my installation is the first chart type at the top left of the selection.
