Best language for Web Scraping?

  1. Yung Mon£y
    • Banned
    • Posts: 565
    Best language for Web Scraping?
    I'm looking to set up a web scraper that can handle cookies as well as redirects. There are plenty of languages which can do this in one way or another, but I'm wondering if anyone has a particular favourite?
    The idea is to scrape some API data and then, via PHP, put it into an SQL database which can later be queried. Anyone have any tips or tricks? :cool:
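    For reference, the store-and-query half of this plan can be sketched in a few lines. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than PHP, and the `items` table, its columns and the helper names are all made up for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists.

    The table name and columns are hypothetical, chosen for this sketch.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " url TEXT PRIMARY KEY,"
        " title TEXT NOT NULL)"
    )
    return conn


def save_items(conn, items):
    """Insert or update (url, title) pairs using bound parameters."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO items (url, title) VALUES (?, ?)", items
        )


def titles_matching(conn, pattern):
    """Return the stored titles that match an SQL LIKE pattern, sorted."""
    rows = conn.execute(
        "SELECT title FROM items WHERE title LIKE ? ORDER BY title", (pattern,)
    )
    return [title for (title,) in rows]
```

    Keying the table on the URL means that re-scraping a page overwrites its old row instead of piling up duplicates, and the `?` placeholders keep scraped text from being interpreted as SQL.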
  2. JGR
    • Exalted and Worshipped Member
    • Posts: 1,239
    Re: Best language for Web Scraping?
    Well, you've pretty much said that you're going to use PHP in the question.

    I would be inclined to use that, or depending on what sort of processing you intend to do, possibly Perl.
  3. j.smith1981
    • Full Member
    • Location: UK
    • Posts: 101
    Re: Best language for Web Scraping?
    Sorry, I don't mean to bump a new thread, but have you considered Python? The only reason I say Python is that, firstly, it's a lot more straightforward than Perl and PHP in many respects, but it can also touch on some aspects of what C can do, in the sense of memory clearing and things like that.

    I'd look into it. I mean, Google use it for their web crawler, and technically their web crawler is a web scraper, which is what you will be using it for.
  4. Chrosson
    • PS Helper
    • Vengeful, Imperial Overlord of The Student Room
    • Posts: 4,211
    Re: Best language for Web Scraping?
    (Original post by j.smith1981)
    Sorry, I don't mean to bump a new thread, but have you considered Python? The only reason I say Python is that, firstly, it's a lot more straightforward than Perl and PHP in many respects, but it can also touch on some aspects of what C can do, in the sense of memory clearing and things like that.

    I'd look into it. I mean, Google use it for their web crawler, and technically their web crawler is a web scraper, which is what you will be using it for.
    Memory clearing?
  5. roblee
    • Exalted and Worshipped Member
    • Posts: 1,062
    Re: Best language for Web Scraping?
    (Original post by Chrosson)
    Memory clearing?
    exit(0); // frees up all memory in use by program

    I'm going to second Perl for web scraping. It was developed mainly for text processing rather than HTML processing, but CPAN has a bunch of useful stuff to make this easier.
    Last edited by roblee; 22-06-2012 at 06:54.
  6. j.smith1981
    • Full Member
    • Location: UK
    • Posts: 101
    Re: Best language for Web Scraping?
    (Original post by Chrosson)
    Memory clearing?
    Yes, you can access memory used by other programs too, just like you can in C and of course assembly. Python is a very powerful language, which is why it's in use by ILM, Google and a few of the other bigger companies. Autodesk use Python for MotionBuilder (their SDK, to be precise), their software for allowing better animation without motion capture. I think they included an SDK for Maya too, but I can't quite remember.

    I personally like Perl, but really Python's got a great ability to handle strings, though I suppose Perl was aimed at being a reporting tool, which is why Larry Wall developed it in the first place.

    I will rephrase "clearing memory", as that sounds way too brief: going into your system memory and clearing things out is what Python has the capability of doing, like C and of course assembly (I've never actually bothered trying it in Python, though it is doable). exit(0) is just a way of exiting an application (more of a script than an application). It's not really memory management, because you're leaving the memory management up to the Python interpreter, just as in C when you exit an application.
    Last edited by j.smith1981; 22-06-2012 at 10:24.
  7. JGR
    • Exalted and Worshipped Member
    • Posts: 1,239
    Re: Best language for Web Scraping?
    Freeing memory by terminating the process is not remarkable and any language can do that.
    For a web-server application, you would typically want to avoid doing that unnecessarily, to avoid the overhead associated with process fork/exec and teardown.

    Python is a GC language (like just about every other language of its type), and so typically requires a different strategy to languages like C/C++ and assembly, where you need to be on the ball about memory.
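    To make the garbage-collection point concrete, here is a small sketch. It assumes the standard CPython interpreter, where reference counting frees most objects straight away and a separate cycle collector (the `gc` module) deals with objects that refer to each other. The `Node` class and the function name are made up for the illustration.

```python
import gc
import weakref


class Node:
    """Trivial object that can point at another Node (hypothetical example type)."""

    def __init__(self):
        self.other = None


def cycle_is_collected():
    """Return (alive_before, alive_after) for a two-object reference cycle."""
    was_enabled = gc.isenabled()
    gc.disable()  # pause automatic cycle collection so the demo is deterministic
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a      # a and b now keep each other alive
        probe = weakref.ref(a)       # lets us observe a without keeping it alive
        del a, b                     # no names left, but the cycle remains
        alive_before = probe() is not None
        gc.collect()                 # ask the cycle collector to run now
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()
```

    On CPython this should return `(True, False)`: the cycle survives `del` and is reclaimed by the collector. At no point does the program free memory by hand, which is the difference in strategy from C/C++ and assembly.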

    The main thing that Python has on Perl is a more verbose/strict (and hence easier to read) syntax.
    Just about everything else you've said is largely common to all interpreted/JIT languages of that ilk.

    I find your comment about Perl and string handling bizarre, since string handling is Perl's forte and raison d'être.

    (Caveat: I've developed in Perl, PHP, C/C++, x86 assembly and various others, but not Python)
    Last edited by JGR; 22-06-2012 at 12:35.
  8. Chrosson
    • PS Helper
    • Vengeful, Imperial Overlord of The Student Room
    • Posts: 4,211
    Re: Best language for Web Scraping?
    (Original post by j.smith1981)
    Yes you can access memory used by other programs too just like you can in C and of course Assembly
    I'm having trouble getting past this statement. Python is a garbage-collected language, and low-level memory access is (afaik) beyond the native capabilities of Python. Could you point me somewhere I can learn more?
  9. studiousgeek
    • New Member
    • Posts: 19
    Re: Best language for Web Scraping?
    (Original post by Yung Mon£y)
    I'm looking to set up a web scraper that can handle cookies as well as redirects. There are plenty of languages which can do this in one way or another, but I'm wondering if anyone has a particular favourite?
    The idea is to scrape some API data and then, via PHP, put it into an SQL database which can later be queried. Anyone have any tips or tricks? :cool:
    I use Python (because it's awesome) in conjunction with two third-party modules.

    Mechanize, which implements a programmatic web browser that supports cookies, forms and referrers.

    BeautifulSoup, which is a programmer-friendly HTML parser.

    Using these tools I find I can get a web scraping application up and running in a few minutes.
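    As a rough idea of the shape of such a scraper, here is a minimal sketch. Note the assumptions: it deliberately uses only Python's standard library (`urllib.request`, `http.cookiejar`, `html.parser`) as stand-ins for Mechanize and BeautifulSoup, so it is not the code described in the post above, and the helper names (`make_opener`, `LinkCollector`, `extract_links`, `scrape`) are invented for the example.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


def make_opener():
    """Return (opener, jar): an opener that keeps cookies and follows redirects."""
    jar = CookieJar()
    # build_opener() installs HTTPRedirectHandler by default; adding the
    # cookie processor makes stored cookies go out with later requests.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def scrape(url):
    """Fetch url (any page you are allowed to scrape) and return its links."""
    opener, _jar = make_opener()
    with opener.open(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html)
```

    Reusing one opener for a whole session is what lets a login cookie set by the first request carry over to the following ones. The third-party libraries add conveniences on top of this, such as form filling and more forgiving HTML parsing.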
  10. mfaxford
    • Overlord in Training
    • Location: Southampton
    • Posts: 2,121
    Re: Best language for Web Scraping?
    (Original post by JoeDevise)
    I think the best language for web scraping is JavaScript.
    This thread was over three months old; it seems a bit pointless answering it now. The same seems to be true of most places you've posted.
  11. davros
    • Overlord in Training
    • Location: Skaro
    Re: Best language for Web Scraping?
    (Original post by mfaxford)
    This thread was over three months old; it seems a bit pointless answering it now. The same seems to be true of most places you've posted.
    I've just pointed out the same thing somewhere else. It's mega-confusing when you're trying to find a recent discussion to contribute to and somebody's just dragged up a four-month-old post for no reason.
  12. ericlewis107
    • New Member
    • Location: London
    • Posts: 11
    Re: Best language for Web Scraping?
    I think Java, because there are many libraries and tools for scraping data from websites on the Java platform.