Dissertation - data sampling

Denzel89
#1
Thread starter, 2 weeks ago
Struggling with this. We have been taught how to sample data, but it just doesn't work with my data set.

I have procured the data from an FOI request which basically lists all the required info. The problem is it's all disjointed; some entries are false inputs and don't check out when cross-referenced against other databases.

The information pertains to delisting applications for historic buildings. The FOI response could only supply basic details of each delisting application, hence having to check the data against another database to find further details.

Choosing which entries off the list to put through the database is difficult. Basically I have settled on every tenth entry; if it lands on a dud, I go to the next entry.
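For what it's worth, that every-tenth-with-skip rule can be sketched in a few lines of Python (the is_valid check here is just a placeholder for the real cross-referencing step):

```python
def systematic_sample(entries, step=10, is_valid=lambda e: e is not None):
    # Take every `step`-th entry; if it lands on a dud,
    # walk forward to the next valid entry before moving on.
    sample = []
    i = 0
    while i < len(entries):
        j = i
        while j < len(entries) and not is_valid(entries[j]):
            j += 1
        if j < len(entries):
            sample.append(entries[j])
        i += step
    return sample

entries = ["a", None, "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"]
print(systematic_sample(entries, step=5))
```

One caveat with this approach: if dud entries cluster, the "skip to the next entry" rule can pull your sample points closer together, so the sample is no longer strictly every tenth record.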

Sounds pretty rudimentary, but I can't think of any other way. I can't copy over (from Word) to Excel as the format they have used is a right mess. I did start by attempting to do every entry for a complete dataset, but I managed 5 pages out of 350 in 8 hours.

Any ideas?

Mr Wednesday
#2
2 weeks ago
(Original post by Denzel89)
I have procured the data from an FOI request which basically lists all the required info. The problem is it's all disjointed; some entries are false inputs and don't check out when cross-referenced against other databases.

I can't copy over (from Word) to Excel as the format they have used is a right mess. I did start by attempting to do every entry for a complete dataset, but I managed 5 pages out of 350 in 8 hours.
Best to show a bit of example data and highlight what is "useful" and what is "junk" if you want technical advice. It's very likely that a bit of simple search and replace in the original Word doc can cut out a lot of junk, leaving you something more compact, but you then need to come up with an algorithmic approach to pulling out the good stuff.

How long have you got, BTW? Sometimes it's actually quicker to bite the bullet and crunch through messy data by hand, as writing code to do this and then proving it works perfectly can take longer.
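To illustrate the search-and-replace idea: once you know what the junk lines look like, you can filter them out programmatically. A minimal sketch (the data and junk patterns here are entirely made up; the real patterns would come from inspecting the actual document):

```python
import re

raw = """Application: 12 High Street
-- page break --
Application: 34 Mill Lane
(blank)
Application: 56 Church Road"""

# Drop any line matching a known junk pattern (hypothetical patterns).
junk = re.compile(r"^(-- page break --|\(blank\))$")
clean = [line for line in raw.splitlines() if not junk.match(line)]
print(clean)
```

The same logic works as a chain of find-and-replace operations in Word itself if you'd rather avoid code.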
Denzel89
#3
Thread starter, 2 weeks ago
(Original post by Mr Wednesday)
Best to show a bit of example data and highlight what is "useful" and what is "junk" if you want technical advice. It's very likely that a bit of simple search and replace in the original Word doc can cut out a lot of junk, leaving you something more compact, but you then need to come up with an algorithmic approach to pulling out the good stuff.

How long have you got, BTW? Sometimes it's actually quicker to bite the bullet and crunch through messy data by hand, as writing code to do this and then proving it works perfectly can take longer.
Thanks for your reply.

So this is what I have come up with:
Copied the data from Word to Excel, which filled 8,000 rows.
Un-merged the cells and deleted all superfluous/empty rows.
Then filtered and deleted rows with dud entries.
Filtered and deleted rows with unrelated items.
Got it down to 1,500 entries.
Added a new column, assigned each row a random number, then sorted into ascending order.
First 100 is my sample (supervisor said 100 was fine).
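Incidentally, the random-number-column-then-sort trick is a simple random sample without replacement. A sketch of the same thing in Python, assuming the cleaned entries are in a list:

```python
import random

def excel_style_sample(entries, n, seed=None):
    # Assign each entry a random key, sort ascending, take the first n.
    # This mirrors the RAND()-helper-column approach in Excel and is
    # equivalent in distribution to random.Random(seed).sample(entries, n).
    rng = random.Random(seed)
    keyed = [(rng.random(), e) for e in entries]
    keyed.sort()
    return [e for _, e in keyed[:n]]

entries = [f"entry {i}" for i in range(1500)]
sample = excel_style_sample(entries, 100, seed=42)
print(len(sample))
```

One Excel gotcha worth knowing: RAND() recalculates every time the sheet changes, so paste the random column as values before sorting, or the ordering shifts under you.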

Sound OK? Basically I'm trying to get an idea of the rate of delisting of buildings relative to property type.

Now I've got to put 100 addresses through the heritage database one at a time lol
Last edited by Denzel89; 2 weeks ago
Mr Wednesday
#4
2 weeks ago
(Original post by Denzel89)
Thanks for your reply.

So this is what I have come up with:
Copied the data from Word to Excel, which filled 8,000 rows.
Un-merged the cells and deleted all superfluous/empty rows.
Then filtered and deleted rows with dud entries.
Filtered and deleted rows with unrelated items.
Got it down to 1,500 entries.
Added a new column, assigned each row a random number, then sorted into ascending order.
First 100 is my sample (supervisor said 100 was fine).

Sound OK? Basically I'm trying to get an idea of the rate of delisting of buildings relative to property type.

Now I've got to put 100 addresses through the heritage database one at a time lol
You don't show the "before" and "after" data, so it's hard to advise in detail, but it does sound like you have come up with something your supervisor thinks works, so that's good progress.

Sounds like it's time to do the data grind by hand and do the first 100 addresses - painful, but often the quickest way out of a messy problem like this. I would be a bit careful selecting what you work with here: you might want to do the first 20, then take another 20 from somewhere else in the database (not in the first 100) and do a quick comparison. Are they telling you broadly the same thing? If so, push on with using the first 100; if not, you need to worry about how something might be changing over time / space to give you that feature in the data.
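That sanity check boils down to comparing category proportions between the two batches. A rough sketch (the field name and data here are invented for illustration):

```python
from collections import Counter

def category_proportions(records, field="property_type"):
    # Share of each category within one batch of processed records.
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Two hypothetical batches of 20 processed entries each.
batch_a = [{"property_type": "residential"}] * 14 + [{"property_type": "agricultural"}] * 6
batch_b = [{"property_type": "residential"}] * 13 + [{"property_type": "agricultural"}] * 7

pa = category_proportions(batch_a)
pb = category_proportions(batch_b)
# If the proportions differ wildly between batches, the sample may not
# be stable and you'd want to look for a time/space trend in the data.
print(pa, pb)
```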
Denzel89
#5
Thread starter, 1 week ago
Thought I'd update this as you so kindly took the time to respond.

I started processing the entries and it really wasn't that difficult. Got it down to 650-odd remaining entries after filtering out what turned out to be dud entries. I've done 150 in a sitting today.

I had made the mistake of not properly reading the email explaining what the document contained; once I did, it made the filtering much easier.

Regards, Daniel.
Mr Wednesday
#6
6 days ago
(Original post by Denzel89)
Thought I'd update this as you so kindly took the time to respond.

I started processing the entries and it really wasn't that difficult. Got it down to 650-odd remaining entries after filtering out what turned out to be dud entries. I've done 150 in a sitting today.

I had made the mistake of not properly reading the email explaining what the document contained; once I did, it made the filtering much easier.
Glad to hear it's going well. Sometimes that brute-force approach ("I've done 150 in a sitting today") gets the job done faster than trying to write elegant filtering code. There is a reason we don't all get hit by cars or eaten by tigers on the way to work: the human brain is very, very good at pattern recognition when looking at a big pile of messy, noisy data.