Fastest way to mass-delete characters in a HTML file?

    • Thread Starter
    Offline

    2
    ReputationRep:
    I want to write a program that loads x number of HTML files from a website (TSR) and extracts the key information from them (the number of views for a thread). The only way I can see this working is to loop through all the characters of the file, pause at a unique identifier that comes before the number of views, and then get that information out. However, it would go a long way towards solving my problem if I could just delete everything except the views from the file, and then it would be simple and fast. At the moment it will take forever because of how much other content there is in a HTML file and how long it takes to loop through all those characters.

    Can anyone think of a better way of doing this? Eventually I want to be able to see the most viewed threads of all time on TSR.
    • Very Important Poster
    • Study Helper
    • Welcome Squad
    Offline

    20
    ReputationRep:
    You could use a spider to scrape the forums for the information (https://scrapy.org). Don't ask me for technical details about it though, because I'm not advanced enough to be able to, haha! I'm also not sure whether it's actually against TSR's T&C to scrape!
    • TSR Group Staff
    Offline

    18
    ReputationRep:
    You could use regular expressions (regex) to extract the data from the HTML. Read this before you do so though, and be prepared for a lot of trouble trying to effectively parse a HTML page.
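
    A minimal sketch of the regex idea in Python, assuming the view count appears in the page as something like "Views: 36,000". That marker text, the sample HTML and the extract_views helper are all made up for illustration; the real TSR markup will differ, and a pattern like this breaks as soon as the markup changes.

```python
import re

# Hypothetical pattern: assumes the count follows the literal text "Views:".
VIEWS_RE = re.compile(r"Views:\s*(\d[\d,]*)")


def extract_views(html):
    """Return the first view count found in html as an int, or None."""
    match = VIEWS_RE.search(html)
    if match is None:
        return None
    # Strip thousands separators such as "36,000" before converting.
    return int(match.group(1).replace(",", ""))


# Made-up sample markup, not taken from a real TSR page.
sample = '<div class="stats"><span>Replies: 12</span><span>Views: 36,000</span></div>'
print(extract_views(sample))
```

    The regex engine does the scanning, so there is no hand-written character loop, and nothing has to be deleted from the file first.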

    However, the view count on a thread is an estimate. It doesn't reflect the actual data because of our caching systems and because certain actions (e.g. thread merge/split) can muck up the numbers. If you were to go off those (highly inaccurate) numbers, then this is the highest viewed thread of all time, but I highly doubt the view count on that thread is accurate. Based on the data available I would guess that this has the highest real view count on TSR.
    • Thread Starter
    Offline

    2
    ReputationRep:
    Does it take into account views by non-registered users? I'm guessing when a TSR thread asks something useful that hasn't been answered on Google yet then that should generate lots of views. This thread: https://www.thestudentroom.co.uk/sho....php?t=1394604 for example has 36,000 views (though it is 7 years old), but it means it may have been useful to lots of members who accessed TSR via search engines rather than just registered users.
    Offline

    3
    ReputationRep:
    You should navigate the DOM rather than attempting to parse the HTML yourself.

    jQuery or BeautifulSoup should both do the job pretty well.
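
    As a rough illustration of letting a parser handle the HTML, here is a sketch that uses Python's built-in html.parser as a stand-in for BeautifulSoup, so it runs without installing anything. The span with class "views", the sample markup and the ViewCountParser and parse_views names are assumptions for the example, not TSR's actual page structure.

```python
from html.parser import HTMLParser


class ViewCountParser(HTMLParser):
    """Pick out the number inside the first <span class="views"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_views = False
        self.views = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "views" in classes:
            self._in_views = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_views = False

    def handle_data(self, data):
        if self._in_views and self.views is None:
            digits = data.replace(",", "").strip()
            if digits.isdigit():
                self.views = int(digits)


def parse_views(html):
    """Return the view count as an int, or None if the element is missing."""
    parser = ViewCountParser()
    parser.feed(html)
    return parser.views


# Made-up sample markup, not taken from a real TSR page.
sample = '<div><span class="replies">12</span><span class="views">36,000</span></div>'
print(parse_views(sample))
```

    With BeautifulSoup the same lookup would be a single search by tag and class, but the idea is the same: the parser tracks the tags, and your code only sees the element it asked for.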
    Offline

    15
    ReputationRep:
    Deleting everything that isn't views is even more intensive than just reading up to the views. If you read until a certain delimiter, read and store the views, and then move on, you're only reading the content before the views. If you try to delete everything that isn't related to views, then you need to read and compare everything (both before and after the views) and then perform an operation (delete). And of course you probably want to keep a little bit of other information around that makes the view count identifiable and linkable to the original thread. Then of course there's the issue of actually downloading every single thread in the first place.
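
    The read-until-a-delimiter approach described above could look something like this in Python. The "Views: " marker, the padded sample page and the views_after_marker helper are hypothetical; the point is only that the search stops at the marker and nothing is deleted.

```python
def views_after_marker(html, marker="Views: "):
    """Find the marker, then read the digits (and commas) that follow it."""
    start = html.find(marker)  # the scan runs inside str.find, not in a Python loop
    if start == -1:
        return None
    pos = start + len(marker)
    end = pos
    # Advance only while we are still inside the number.
    while end < len(html) and html[end] in "0123456789,":
        end += 1
    digits = html[pos:end].replace(",", "")
    return int(digits) if digits else None


# Made-up page: lots of filler before the part we care about.
page = "<html><body>" + "<p>filler</p>" * 1000 + "<span>Views: 36,000</span></body></html>"
print(views_after_marker(page))
```

    Everything after the number is never touched, which is the saving over the delete-everything-else plan.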

    Can someone enlighten me as to how, and in what format, thread views are stored?
    Offline

    11
    ReputationRep:
    If you're doing this with Python (which is usually the best option), just use BeautifulSoup's find_all function (findAll in older versions). It's very speedy.
    • TSR Group Staff
    Offline

    18
    ReputationRep:
    (Original post by Mistletoe)
    Does it take into account views by non-registered users? I'm guessing when a TSR thread asks something useful that hasn't been answered on Google yet then that should generate lots of views. This thread: https://www.thestudentroom.co.uk/sho....php?t=1394604 for example has 36,000 views (though it is 7 years old), but it means it may have been useful to lots of members who accessed TSR via search engines rather than just registered users.
    View count includes some guest user hits, but not all. We've got some more accurate data with Google Analytics, but that only goes back as far as 2007 so trying to determine the most visited thread is always going to be partly guesswork.
    • TSR Group Staff
    Offline

    18
    ReputationRep:
    The "Andy Gray SACKED!" thread has so many pageviews because it's used for some sort of load testing, I believe (which doesn't count as real pageviews in Google Analytics (GA)).

    The highest viewed thread in the last 24 months is this one: https://www.thestudentroom.co.uk/sho....php?t=4120523 according to GA, though I haven't got raw data back to 2007! The thread view number on site is only around a third of the GA number because, as Dez said, guests view cached versions, so their visits don't add to the site view count each time.
 
 
 
Updated: January 12, 2017