The Student Room Group

further maths statistics

I don't get why the df is 2. I think pooling the last two columns of the table leaves 4 columns, but the answer book says v-2.

I am not sure why it's -2.
Reply 2

You lose two degrees of freedom because the observed and expected totals are the same and the mean has been calculated from the data.
It does help to upload the full question so we know the context is a chi-squared test.
Reply 3
I can't see that any means have been calculated. Where? Here are both parts.
Reply 4
..
Reply 5
"she calculates the actual proportion"

I had to search for the other part to understand the OP.
Reply 6
So why 2 df? Why 4-2? Where's the mean? Thank you. I appreciate your time.
Reply 7
I don't understand? You lose one dof because the observed and expected totals are equal, and the other dof because the means are also the same: the binomial p used to generate the expected frequencies has been calculated from the observed frequencies. It's similar to the (normal) unbiased variance calculation where you divide by n-1 rather than n: because the mean is estimated/calculated from the data, you lose one dof.
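To make the n-1 analogy concrete, here is a minimal sketch in Python. The five data values are made up purely for illustration; nothing here comes from the question. Once the mean has been calculated from the data, the deviations from it must sum to zero, so only n-1 of them are free.

```python
import statistics

# Hypothetical sample, invented for this illustration only.
data = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(data)

mean = sum(data) / n
deviations = [x - mean for x in data]

# Because the mean came from the same data, the deviations sum to zero:
# knowing any n-1 of them fixes the last one.
print("sum of deviations:", sum(deviations))

sum_sq = sum(d ** 2 for d in deviations)
biased = sum_sq / n          # divides by n
unbiased = sum_sq / (n - 1)  # divides by n-1, one dof lost to the mean

print("divide by n:  ", biased)
print("divide by n-1:", unbiased)
print("statistics.variance:", statistics.variance(data))
```

The standard library's `statistics.variance` uses the n-1 divisor, so it should agree with the `unbiased` value.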
Reply 8
Thank you. I am a bit confused about part b). How did she get the Ei (expected frequencies) if not from a binomial?
Reply 9
I've not gone through the question carefully, but in part a) the expected frequencies are generated from a binomial with p=0.05. That p does not depend on the observed data, so you lose only one dof when comparing the frequencies: the totals are the same, but the means (p) are independent. As there seem to be 150, the expected value would be ~0.75, and that seems about right from the expected frequencies.

In b) the expected frequencies are again generated from a binomial, but the p is calculated from the observed data, so the expected value is about 1.1. So there is a closer match with the observed data, and hence you lose one (extra) degree of freedom when comparing, because the p is calculated from the data. It's not independent.

Edit - For both question parts, why not generate the expected frequencies yourself to understand them?
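A rough sketch of how you might generate them in Python. The question itself isn't reproduced in the thread, so treat the numbers as assumptions: 15 trials per sample is only a guess that makes 15 x 0.05 = 0.75, and the observed counts are invented so that their mean comes out at 1.1. Swap in the real values from the question.

```python
from math import comb

N_SAMPLES = 150  # total frequency mentioned in the thread
N_TRIALS = 15    # ASSUMPTION: trials per sample, guessed so that 15 * 0.05 = 0.75

# HYPOTHETICAL observed counts of samples with 0, 1, 2, 3, 4 successes.
observed = {0: 50, 1: 55, 2: 30, 3: 10, 4: 5}

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def expected_cells(p):
    """Expected frequencies for the 4 cells: 0, 1, 2 and '3 or more' (pooled)."""
    first = [N_SAMPLES * binom_pmf(k, N_TRIALS, p) for k in range(3)]
    return first + [N_SAMPLES - sum(first)]

def chi_sq(obs, exp):
    """Pearson chi-squared statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Pool the observed counts into the same 4 cells.
obs_cells = [observed[0], observed[1], observed[2], observed[3] + observed[4]]

# Part a) style: p is given in advance, so only the total is matched (4 - 1 = 3 dof).
exp_a = expected_cells(0.05)

# Part b) style: p is estimated from the observed mean, so total and mean
# are both matched (4 - 2 = 2 dof).
mean_obs = sum(k * f for k, f in observed.items()) / N_SAMPLES
p_hat = mean_obs / N_TRIALS
exp_b = expected_cells(p_hat)

print("fixed p = 0.05:", [round(e, 2) for e in exp_a], "X^2 =", round(chi_sq(obs_cells, exp_a), 3))
print("estimated p =", round(p_hat, 4), [round(e, 2) for e in exp_b], "X^2 =", round(chi_sq(obs_cells, exp_b), 3))
```

With made-up data like this, the expected row built from the estimated p sits much closer to the observed row than the p=0.05 row does, which is exactly the closer match that the extra lost dof compensates for.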
Reply 10
I don't get this. Why would knowing the value of p reduce your dof? I need a tutor.
Reply 11
Because the 4 cells in the observed and expected frequencies are not independent.
They are similar because
* They both sum to 150
* They both have the same mean (p)
So really, there are only two degrees of freedom of difference between the four cells.

I'll try and dig out a tutorial, but what does your textbook say?
Reply 12
Why is the mean p? I thought the mean is np. As n is the same, the mean must be the same. Thanks for trying to help. How does the mean help you work out the missing values? Do you mean that once you have worked out two values in the E row, you can work out the mean (like reverse engineering)?
Reply 13
Sure, the mean is np. n is the same, so saying p is the same in both distributions is equivalent to saying the mean is the same.

OK, let's try an illustrative example. I was looking for a good tutorial but have been out, so I will do that later.

Let's say there are 4 cells each for the observed and expected, and you're using a chi-squared test to compare them. This is obviously equivalent to this problem. When the numbers in each cell are arbitrary counts and you compare using chi-squared, there are 4 degrees of freedom of difference between the observed and expected, as the expected numbers can be anything. You have 4 arbitrary numbers/cells, so 4 dof.

Now let's assume both are frequencies, so they sum to 100 (or whatever). Then for both the observed and expected, the last cell is
100 - sum of previous 3 cells
So the last cell is not independent and is determined by the other 3 cells, so there are 4-1=3 dof in the chi-squared test. Obviously we're assuming the sum of the previous 3 cells is <= 100. We'd expect the observed and expected to be a closer match because of the sum-to-100 constraint, so the reduced dof compensates for this.

In practice, we don't set data like this; rather, we simply require the cells to sum to 100, but there are still 3 dof. In part a), where the observations are tested against the expected frequencies generated by a (normalized) binomial with an arbitrary p, we'll have 3 dof. In a sense, the missing dof is shared out across each cell.

Now for part b), the expected data is again generated by a (normalized) binomial (n,p) so the data sums to 100, but the p is set from the observed data. Again we'd expect this to be a closer match to the observed data, as both the observed and expected data sum to 100 and have the same mean (or p). This matching of the observed and expected data removes an extra dof from the chi-squared test.
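In symbols, the counting so far is the usual textbook rule for a chi-squared goodness-of-fit test (this is the standard form, not something quoted from the question): with k cells after pooling and m parameters estimated from the observed data,

```latex
X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad \nu = k - 1 - m
```

With k = 4 cells, part a) has m = 0 (p = 0.05 is given), so ν = 3, while part b) has m = 1 (p is estimated from the data), so ν = 4 - 2 = 2.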

OK, so why do you lose one extra dof when you match the p? Imagine you have two cells, so something like
O: 40 60
E: ? ?
If we require the expected to sum to 100, we'd have something like:
O: 40 60
E: 30 100-30
Now matching the mean (or p=0.4) with a binomial, we'd have an exact match
O: 40 60
E: 40 60
Originally there were 2 dof between the observed and expected. The frequency sum means there is really just one difference (repeated in the 2nd column), so dof=1. Matching p means that the rows are identical and there are 2-2=0 dof of difference. There is no difference between the two rows simply because we are using frequency data with a known/estimated p.
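The two-cell case is small enough to check directly. A minimal sketch, assuming the cells are the two outcomes of a single trial with the first cell's share taken as p (that reading is an assumption for illustration); the O row and the fixed p=0.3 guess are the numbers from the example.

```python
observed = [40, 60]    # the O row from the example
total = sum(observed)  # 100

def chi_sq(obs, exp):
    """Pearson chi-squared statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Only the total is matched: p is a fixed guess of 0.3, giving E: 30 70.
p_fixed = 0.3
exp_fixed = [total * p_fixed, total * (1 - p_fixed)]

# Total and p are both matched: p is estimated from the observed row.
p_hat = observed[0] / total
exp_matched = [total * p_hat, total * (1 - p_hat)]

print("fixed p:  ", exp_fixed, "X^2 =", chi_sq(observed, exp_fixed))
print("matched p:", exp_matched, "X^2 =", chi_sq(observed, exp_matched))
```

With the matched p the expected row reproduces the observed row, so the statistic is zero up to rounding whatever the data are. There is nothing left to test, which is what 2-2=0 dof means.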

With three cells you have 3 dof. Using frequencies means you have 3-1=2 dof. Matching the mean or p means you can match two of the columns but not the third, hence 3-2=1 dof. Again, this 1 dof is shared out across the 3 cells, rather than having two columns matching and the 3rd being independent.

Note in part a) you're testing the observations against a 5% binomial distribution. In part b) you're simply testing whether the data follows a binomial distribution, where the p is estimated from the observed data. They're different tests. Obviously, the second one must give a better (not worse) match to the observed data, and reducing the dof by 1 compensates for this in the test.
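If you want to see the compensation happen, a simulation is one way to do it. This is only a sketch under made-up settings (3 trials per sample so there are 4 cells with no pooling, true p=0.4, 150 samples per run); none of it comes from the question. It relies on the fact that the mean of a chi-squared distribution equals its dof, so the average test statistic over many runs should sit near 3 when p is fixed and near 2 when p is estimated.

```python
import random
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def pearson(observed, expected):
    """Pearson chi-squared statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate(reps=2000, n_samples=150, n_trials=3, p_true=0.4, seed=1):
    """Average X^2 with p fixed at its true value and with p estimated per run."""
    rng = random.Random(seed)
    cells = range(n_trials + 1)  # 0..3 successes gives 4 cells
    total_fixed = total_est = 0.0
    for _rep in range(reps):
        # Draw one data set: the number of successes in each of n_samples samples.
        counts = [0] * (n_trials + 1)
        for _sample in range(n_samples):
            successes = sum(rng.random() < p_true for _trial in range(n_trials))
            counts[successes] += 1
        # Estimate p from the observed mean, as in part b).
        p_hat = sum(k * c for k, c in enumerate(counts)) / (n_samples * n_trials)
        exp_fixed = [n_samples * binom_pmf(k, n_trials, p_true) for k in cells]
        exp_est = [n_samples * binom_pmf(k, n_trials, p_hat) for k in cells]
        total_fixed += pearson(counts, exp_fixed)
        total_est += pearson(counts, exp_est)
    return total_fixed / reps, total_est / reps

mean_fixed, mean_est = simulate()
print("average X^2, p fixed:    ", round(mean_fixed, 2), "(compare with 4 - 1 = 3)")
print("average X^2, p estimated:", round(mean_est, 2), "(compare with 4 - 2 = 2)")
```

The averages will wobble a little from run to run (change `seed` to see), but the gap of about one between them is the extra dof used up by estimating p.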

This guy does something similar
https://www.youtube.com/watch?v=O7wy6iBFdE8&ab_channel=jbstatistics
Reply 14
I will look into this later as there is loads of other stuff I need to learn. I think this is making sense. I will reply later, maybe even in 2 weeks.
