Hey there! Sign in to join this conversationNew here? Join for free
x Turn on thread page Beta
    • Thread Starter
    Offline

    15
    ReputationRep:
    i have a project to do, wondering how people would approach it/whether there's some magic mathematical technique i don't know about.

    I have a load of data. several thousand pieces. let's say for simplicity that each piece of data has three numerical properties describing it: A, B, C.

    i want to find a set of a specific size N (of reasonably size) such that the sums or averages of A, B, C are in some way 'close' to some values I input, and also I get 'close' to a specified standard deviation for the set of one of the values, say A.

    any ideas?






    I had two, not sure how well they will work...

    i) build the set up from a small N, at each step adding an extra piece of data to the set that doesn't pull us away from our target values.

    ii) start with a random set of size N, switch elements in and out. if they result in an improvement, keep the switch. if not, move on.

    neither is very clever, and both have obvious flaws..
    Online

    17
    ReputationRep:
    If you know nothing about the data, simulated annealing is probably one of the better ways of getting an approximate solution. (It's basically a slightly more advanced version of (ii)).

    If the data is reasonably uniformly spread, and N is 'small' relative to the size of the data, then you might be able to 'cheat'. For example, choose A1, A2 so that if you have N/2 items equalling A1 and N/2 items equalling A2, you get the desired standard deviation. Then pick N/2 items close to {A1, B, C} and N/2 close to {A2, B, C}.
    Online

    17
    ReputationRep:
    OK, did a little experiment.

    5000 random vectors (a, b, c) with a, b, c all uniformly distributed on [0,1], looking for a subset of size 50 satisfying \bar{a} = 0.5, \bar{b} = 0.6, \bar{c} = 0.7 and s.d. of a = 0.2.

    Using simulated annealing with 10000000 iterations, I can find a subset which meets those constraints to around 1 part in 10000 in all 4 conditions. It takes under a second.
    • Thread Starter
    Offline

    15
    ReputationRep:
    (Original post by DFranklin)
    OK, did a little experiment.

    5000 random vectors (a, b, c) with a, b, c all uniformly distributed on [0,1], looking for a subset of size 50 satisfying \bar{a} = 0.5, \bar{b} = 0.6, \bar{c} = 0.7 and s.d. of a = 0.2.

    Using simulated annealing with 10000000 iterations, I can find a subset which meets those constraints to around 1 part in 10000 in all 4 conditions. It takes under a second.
    under a second!???? given that my code I've written so far, which just 'cleans' the data (there are only a few thousand pieces out of tens of thousands of data points that i'm interested in) takes a good 15 minutes (the data is in excel, and is 200 cells deep...), i'm expecting my program to take hours. hmm. also because i am a very lazy programmer.

    can i ask, using the terms that are in the wikipedia article, what you used as your measure of the 'energy'?

    also, did you use the 'temperature' funtion given on wikipedia?
    Online

    17
    ReputationRep:
    (Original post by Chewwy)
    under a second!???? given that my code I've written so far, which just 'cleans' the data (there are only a few thousand pieces out of tens of thousands of data points that i'm interested in) takes a good 15 minutes (the data is in excel, and is 200 cells deep...), i'm expecting my program to take hours. hmm. also because i am a very lazy programmer.
    Yeah well, I think we can say that Excel is not going to be competitive with C++ for this task.

    can i ask, using the terms that are in the wikipedia article, what you used as your measure of the 'energy'?

    also, did you use the 'temperature' funtion given on wikipedia?
    What I did was 'dirtier' than that (though as far as I know, it works out roughly the same in the end).

    My "scoring" function is just the sum of the errors for \bar{a}, \bar{b}, \bar{c} and the standard deviation.

    I accept a change if "new_score < old_score + epsilon", where epsilon is 10 * exp(-Iter * .000002), where Iter is the number of iterations (trials) so far.

    Which is somewhat "trial and error", but to be honest the exact numbers probably don't matter too much - the process seems to converge quite nicely.
    • Thread Starter
    Offline

    15
    ReputationRep:
    (Original post by DFranklin)
    Yeah well, I think we can say that Excel is not going to be competitive with C++ for this task.

    What I did was 'dirtier' than that (though as far as I know, it works out roughly the same in the end).

    My "scoring" function is just the sum of the errors for \bar{a}, \bar{b}, \bar{c} and the standard deviation.

    I accept a change if "new_score < old_score + epsilon", where epsilon is 10 * exp(-Iter * .000002), where Iter is the number of iterations (trials) so far.

    Which is somewhat "trial and error", but to be honest the exact numbers probably don't matter too much - the process seems to converge quite nicely.
    cool. so, to further prod you on the intricacies of your code... what was your transition step thing? i.e. how did it decide which element out of the 50 to throw out each time, and which to switch it with? I was thinking about, say, switching the Mn (modM)th element on the nth iteration, where M is coprime to N, something like that. But my M is variable, so that means i need to write some 'find a coprime' code. oh, woe is me.

    edit: maybe you just used M=1? oh yeah. for complicated reasons that could cost my employers billions of dollars, i can't do that.
    Online

    17
    ReputationRep:
    I just pick a random element out of the 50, and replace it with a random element of the 5000.
    (I also keep track of which elements I've already got, so as to make sure I don't end up with two elements the same).
    Offline

    11
    ReputationRep:
    http://www.amazon.co.uk/Programming-.../dp/0596529325
    This book has a few very nice chapters on various optimisation techniques (simulated annealing, genetic algorithms etc), and uses python, which is an awesome language.
 
 
 
Reply
Submit reply
Turn on thread page Beta
Updated: July 15, 2009
Poll
Do you agree with the proposed ban on plastic straws and cotton buds?
Useful resources

Make your revision easier

Maths

Maths Forum posting guidelines

Not sure where to post? Read the updated guidelines here

Equations

How to use LaTex

Writing equations the easy way

Student revising

Study habits of A* students

Top tips from students who have already aced their exams

Study Planner

Create your own Study Planner

Never miss a deadline again

Polling station sign

Thinking about a maths degree?

Chat with other maths applicants

Can you help? Study help unanswered threads

Groups associated with this forum:

View associated groups

The Student Room, Get Revising and Marked by Teachers are trading names of The Student Room Group Ltd.

Register Number: 04666380 (England and Wales), VAT No. 806 8067 22 Registered Office: International House, Queens Road, Brighton, BN1 3XE

Write a reply...
Reply
Hide
Reputation gems: You get these gems as you gain rep from other members for making good contributions and giving helpful advice.