# Does a confidence interval have to be based on a normal distribution?

#1
Let's imagine I have a function Trader(param), which takes a matrix of market data as a parameter, and then chooses whether to buy or sell. I feed it a bunch of historical data, and it spits out a series of numbers representing the value of each trade, sometimes making money, sometimes losing. These numbers belong to a dataset called Sample.

I have a hypothesis: giving real money to Trader() has a positive expected value. I want to find out if my dataset Sample is large enough to confirm this hypothesis according to some confidence interval Z, within a margin of error M. My required sample size is N. I'm led to believe that this is the appropriate formula:

N >= ((Z * σ) ÷ M)^2

Let's plug in an actual value for Z. I want to be 95% confident. 95% of the data points within a normal distribution are within 1.96 standard deviations of the mean, therefore Z == 1.96, for 95% confidence.
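
For concreteness, here is a minimal sketch of how the formula above could be evaluated, assuming the normal approximation holds. The function name `required_sample_size` and the figures `sigma=50.0` and `margin=5.0` are made up for illustration and do not come from any real trading data.

```python
import math
from statistics import NormalDist

def required_sample_size(confidence, sigma, margin):
    """Smallest N with N >= ((Z * sigma) / margin)^2 for a two-sided interval."""
    # Z is the standard normal quantile that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * sigma / margin) ** 2)

# Z for 95% confidence, recovered from the standard normal distribution.
z_95 = NormalDist().inv_cdf(0.975)

# Hypothetical inputs: per-trade standard deviation of 50, margin of error of 5.
n = required_sample_size(0.95, sigma=50.0, margin=5.0)

print(f"Z = {z_95:.2f}, required N = {n}")
```

Note that Z here always comes from the standard normal distribution; the data only enter through σ.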

But does my null hypothesis necessarily conform to a normal distribution? If I calculated the value of every possible trade within my historical market data - that's to say buy on every day, and sell on every day in the future of the buying day, giving me a very large amount of potential trades - would *that* be my null hypothesis? And if so, would I then want to compute how many standard deviations from the mean 95% of these data points fall?
4 years ago
#2
(Original post by NingNong247)
>Let's imagine I have a function Trader(param), which takes a matrix of market data as a parameter, and then chooses whether to buy or sell. I feed it a bunch of historical data, and it spits out a series of numbers representing the value of each trade, sometimes making money, sometimes losing. These numbers belong to a dataset called Sample.
>
>I have a hypothesis: giving real money to Trader() has a positive expected value. I want to find out if my dataset Sample is large enough to confirm this hypothesis according to some confidence interval Z, within a margin of error M. My required sample size is N. I'm led to believe that this is the appropriate formula:
>
>N >= ((Z * σ) ÷ M)^2
>
>Let's plug in an actual value for Z. I want to be 95% confident. 95% of the data points within a normal distribution are within 1.96 standard deviations of the mean, therefore Z == 1.96, for 95% confidence.

So if I’m understanding you correctly, you have (a) a huge pile of historical data on all trades (i.e. buy and sell prices) made in a market over a period of time; (b) a function that looks at market prices on a particular day and which takes a decision as to whether to trade a (particular vector of?) market goods. You apply the function to the historical data and for each trade that it makes, it gives you the actual return (positive or negative) on each trade.

Your null hypothesis is that the future expected return for this function is positive (do you want the hypothesis to simply be that it is positive, or do you want to bound it away from zero?)

So, from a statistical point of view, in order to come to some conclusions about the return, we need to understand what the sampling distribution of the returns is, assuming the null hypothesis to be true. I observe that your return is a sum of many returns, one from each trade – which immediately makes me think that the central limit theorem is going to come in useful. Provided that the probability distribution of each individual return is not too pathological (finite variance, and trades that are not strongly dependent on one another; this is something you’ll need to check) then the distribution of your return (being a sum of all those individual returns) will be approximately normally distributed.
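
As a rough illustration of the central limit theorem argument above, the sketch below simulates per-trade returns from a deliberately skewed distribution and compares the skewness of single trades with the skewness of 50-trade averages. The shifted exponential distribution, the seed, and the sample sizes are all assumptions picked for the demonstration, not properties of any real market.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Sample skewness: average cubed z-score (near 0 for a symmetric distribution)."""
    mean = fmean(values)
    spread = pstdev(values)
    return fmean([((v - mean) / spread) ** 3 for v in values])

rng = random.Random(0)  # fixed seed so the run is repeatable

def trade_return():
    # Hypothetical per-trade return: right-skewed, with true mean +0.1.
    return rng.expovariate(1.0) - 0.9

# Individual trades keep the skew of the underlying distribution.
single_trades = [trade_return() for _ in range(5000)]

# Averages of 50 trades each should look much closer to normal.
mean_of_50 = [fmean([trade_return() for _ in range(50)]) for _ in range(2000)]

print("skewness of single trades:", round(skewness(single_trades), 2))
print("skewness of 50-trade means:", round(skewness(mean_of_50), 2))
```

If the averaged returns were still heavily skewed at your actual sample size, that would be a sign the normal approximation is not yet trustworthy.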

>But does my null hypothesis necessarily conform to a normal distribution? If I calculated the value of every possible trade within my historical market data - that's to say buy on every day, and sell on every day in the future of the buying day, giving me a very large amount of potential trades - would *that* be my null hypothesis? And if so, would I then want to compute how many standard deviations from the mean 95% of these data points fall?

You’re talking about something being your “null hypothesis” whereas I rather think you mean “null distribution” (i.e. sampling distribution assuming the null hypothesis to be true). So are you asking here about exactly what calculations you should make on your historical data? Is it a choice between all potential trades or some subset of them?
#3
>Your null hypothesis is that the future expected return for this function is positive

My understanding of the concept of the null hypothesis is that it describes "Nothing is happening, your function is just random", and therefore my null hypothesis would be that I break even over a large enough sample (ignoring, for simplicity's sake, that actual stock exchanges charge a small fee per trade, so the real null hypothesis would be that I lose one trading fee per trade).

Isn't the idea that I make a small positive per trade return my "alternative hypothesis", ie, the one I hope to confirm by disproving my null hypothesis?

I think my confusion comes from the fact that I have three possible Z's to put in to my formula. One Z is based on my null hypothesis (ie every possible trade), another is based on my alternate hypothesis, ie the one calculated from the distribution of the trades my function made, and the third possible Z is taken from a normal distribution.

A different way of asking my question could be: where do I get the Z in the formula above? Is that a confidence interval calculated from the sample distribution, the null distribution, or just a normal distribution?

Yet another way of asking my question might be: What if my null distribution is not a normal distribution? How do I calculate required sample size for Z confidence within M margin, given sigma standard deviation?
#4
I'm kind of muddling through with all this and I feel like I'm confusingly misusing a lot of terminology, so let me break the problem right back down to its root:

I have a function which chooses when to make a trade. I've set it loose on a bunch of historical market data, and it's generated n trades. I have the profit/loss figures for each trade, and therefore a mean profit/loss across all of my function's trades. How can I calculate how many trades I would need to make before I can be Z confident that my function is showing a win/loss rate within M of its "true" win rate? That's to say the win rate it would exhibit on an infinite timescale. How does the normal distribution play into this?
4 years ago
#5
(Original post by NingNong247)
>My understanding of the concept of the null hypothesis is that it describes "Nothing is happening, your function is just random", and therefore my null hypothesis would be that I break even over a large enough sample (ignoring, for simplicity's sake, that actual stock exchanges charge a small fee per trade, so the real null hypothesis would be that I lose one trading fee per trade).

No, not quite. You can make a null hypothesis pretty much whatever you want that makes sense. The basic point is that the null hypothesis is the one that you're taking as a sort of baseline position - and you are collecting evidence in order to show that the null hypothesis is untenable or not. Briefly, you choose a "test statistic" (such as mean return) around which you will base your probability model; you set up a probability model that assumes that the null hypothesis is true - this will give you the "null distribution" for the test statistic; you collect data and calculate your test statistic on that data. If the value of the test statistic is "too far" out in the tails of the null distribution, you reject the null hypothesis.
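
The procedure just described can be sketched as follows, taking mean return as the test statistic, "expected return is zero" as the null hypothesis, and a normal approximation as the null distribution. The function `one_sided_z_test` and the list of returns are hypothetical; ten trades is far too few for the approximation to be reliable and is used only to keep the example short.

```python
import math
from statistics import NormalDist, fmean, stdev

def one_sided_z_test(returns, null_mean=0.0):
    """Return (z, p) for H0: mean == null_mean against H1: mean > null_mean."""
    n = len(returns)
    # Standard error of the mean, estimated from the sample itself.
    standard_error = stdev(returns) / math.sqrt(n)
    # How many standard errors the observed mean lies above the null value.
    z = (fmean(returns) - null_mean) / standard_error
    # Upper-tail probability under the (assumed normal) null distribution.
    p = 1.0 - NormalDist().cdf(z)
    return z, p

# Hypothetical profit/loss figures, one per trade.
sample = [12.0, -5.0, 3.5, 8.0, -2.0, 6.5, -7.0, 10.0, 1.5, 4.0]

z, p = one_sided_z_test(sample)
print(f"z = {z:.2f}, one-sided p-value = {p:.3f}")
```

A small p-value means the observed mean sits far out in the upper tail of the null distribution, which is the "too far" that leads to rejecting the null hypothesis.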

>Isn't the idea that I make a small positive per trade return my "alternative hypothesis", ie, the one I hope to confirm by disproving my null hypothesis?

You could do it that way - then you would be collecting evidence to show that the hypothesis of zero/small gain is untenable. Equally you could have a null hypothesis that posits a particular value of the return (other than zero) or a null hypothesis that the return is greater than or equal to some amount...

>I think my confusion comes from the fact that I have three possible Z's to put in to my formula. One Z is based on my null hypothesis (ie every possible trade), another is based on my alternate hypothesis, ie the one calculated from the distribution of the trades my function made, and the third possible Z is taken from a normal distribution.

OK, so what you're talking about is not different hypotheses, but different probability models for your null distribution. In one case you're taking all possible trades; in the next, only those trades actually made by the function; and finally you're using a normal approximation to either of the first two. It sounds to me that you want to take the distribution of returns on the trades that the function actually made (as the decision as to whether to trade or not is key to the set-up here) or the normal approximation to it.

>I have a function which chooses when to make a trade. I've set it loose on a bunch of historical market data, and it's generated n trades. I have the profit/loss figures for each trade, and therefore a mean profit/loss across all of my function's trades. How can I calculate how many trades I would need to make before I can be Z confident that my function is showing a win/loss rate within M of its "true" win rate? That's to say the win rate it would exhibit on an infinite timescale. How does the normal distribution play into this?

OK, this is nice and clear now. From the historical data you get the distribution of the individual profits/losses from each trade. There's no particular reason to suspect that this will be even approximately normally distributed. If you work with this distribution, I would suggest trying a variety of transformations until you find one that makes the distribution approximately normal. You can then use this transformed distribution to do your sample size calculations.

Another approach would be to not take the individual profits/losses on each trade, but to bundle them up together into, for example, a day's worth of profit/loss. If you've got enough trades in a day (say 20 or more) then you can apply the central limit theorem, using the mean and standard deviation of these bundled trade returns as a normal approximation in the sample size formula.
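
A minimal sketch of that bundling approach, assuming each trade is recorded as a `(day, profit_or_loss)` pair. The simulated trades, the 25 trades per day, and the margin of 1.0 are invented for the example; with real data you would replace `trades` with your own records.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist, stdev

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical data: 60 trading days with 25 trades each.
trades = [(day, rng.expovariate(1.0) - 0.9)
          for day in range(60) for _ in range(25)]

# Bundle individual trades into one profit/loss figure per day.
daily_totals = defaultdict(float)
for day, pnl in trades:
    daily_totals[day] += pnl
daily_pnl = list(daily_totals.values())

def days_needed(daily_pnl, confidence, margin):
    """Apply N >= ((Z * sigma) / M)^2 with sigma taken from the daily totals."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    sigma = stdev(daily_pnl)
    return math.ceil((z * sigma / margin) ** 2)

n_days = days_needed(daily_pnl, 0.95, margin=1.0)
print("trading days needed:", n_days)
```

The result is a number of days rather than a number of trades, and the margin is in units of daily profit/loss, because the bundled daily total is now the quantity being averaged.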