Best language for Web Scraping?
I'm looking to set up a web scraper that can handle cookies as well as redirects. There are plenty of languages which can do this in one way or another, but I'm wondering if anyone has a particular favourite?
The idea is to scrape some API data and then, via PHP, put it into an SQL database which can later be queried. Does anyone have any tips or tricks?
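For what it's worth, the "scrape, store, query later" pipeline can be sketched in a few lines. This is only an illustration under assumptions: it uses Python and an in-memory SQLite database as stand-ins for the PHP and SQL setup mentioned above, and the payload, the `items` table and the `store_items` helper are all hypothetical.

```python
import json
import sqlite3

# Hypothetical API payload; a real scraper would fetch this over HTTP.
SAMPLE_RESPONSE = (
    '[{"id": 1, "name": "alpha", "price": 9.5},'
    ' {"id": 2, "name": "beta", "price": 12.0}]'
)

def store_items(conn, payload):
    """Parse a JSON array of items and insert or replace them in the items table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    rows = [(item["id"], item["name"], item["price"]) for item in json.loads(payload)]
    # Parameterised queries keep scraped text from being interpreted as SQL.
    conn.executemany(
        "INSERT OR REPLACE INTO items (id, name, price) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")  # assumption: swap for a real database in practice
stored = store_items(conn, SAMPLE_RESPONSE)
cheap = conn.execute(
    "SELECT name FROM items WHERE price < 10 ORDER BY name"
).fetchall()
```

Keying the table on the API's own `id` means re-running the scraper updates existing rows instead of duplicating them.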
-
Re: Best language for Web Scraping?
Sorry, I don't mean to bump a new thread, but have you considered Python? The only reason I say Python is that, firstly, it's a lot more straightforward than Perl and PHP in many respects, and it can also touch on some of what C can do in terms of memory clearing and things like that.
I'd look into it. Google use it for their web crawler, and technically a web crawler is a web scraper, which is what you'll be using it for.
-
Re: Best language for Web Scraping?
(Original post by j.smith1981)
Sorry, I don't mean to bump a new thread, but have you considered Python? The only reason I say Python is that, firstly, it's a lot more straightforward than Perl and PHP in many respects, and it can also touch on some of what C can do in terms of memory clearing and things like that.
I'd look into it. Google use it for their web crawler, and technically a web crawler is a web scraper, which is what you'll be using it for.

Memory clearing?
-
Re: Best language for Web Scraping?
(Original post by Chrosson)
Memory clearing?

exit(0); // frees up all memory in use by program
I'm going to second Perl for web scraping. It was developed mainly for text processing rather than HTML processing, but CPAN has a bunch of useful stuff to make this easier.
Last edited by roblee; 22-06-2012 at 06:54.
-
Re: Best language for Web Scraping?
(Original post by Chrosson)
Memory clearing?

Yes, you can access memory used by other programs too, just like you can in C and of course assembly. Python is a very powerful language, which is why it's in use by ILM, Google and a few of the other bigger companies. Autodesk use Python for MotionBuilder (specifically its SDK), their software for allowing better animation without motion capture. I think they included an SDK for Maya too, but I can't quite remember.
I personally like Perl, but Python has a great ability to handle strings. Then again, Perl was aimed at being a reporting tool, which is why Larry Wall developed it in the first place.
I will rephrase "clearing memory", since that sounds way too brief: going into your system memory and clearing things out is what Python has the capability of doing, like C and of course assembly (I've never actually bothered trying it in Python, though it is doable). exit(0) is just a way of exiting an application (more of a script than an application). It isn't really memory management, because you're leaving the memory management up to the Python interpreter, much as in C when you exit an application.
Last edited by j.smith1981; 22-06-2012 at 10:24.
-
Re: Best language for Web Scraping?
Freeing memory by terminating the process is not remarkable and any language can do that.
For a web-server application typically you would want to avoid doing that unnecessarily to avoid the overhead associated with process forks/exec and deletion.
Python is a garbage-collected (GC) language (like just about every other language of its type), and so typically requires a different strategy from languages like C/C++ and assembly, where you need to be on the ball about memory.
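To make the garbage-collection point concrete, here is a small sketch. It assumes CPython, whose reference counting is backed by a cycle collector exposed through the standard `gc` module; the `Node` class and `make_cycle` helper are made up for illustration. The programmer never frees anything by hand: the collector reclaims a reference cycle that nothing else points to.

```python
import gc
import weakref

class Node:
    """Hypothetical object that can point at another Node."""
    pass

def make_cycle():
    # Two objects referencing each other keep their reference counts above zero
    # even after the local names a and b go out of scope.
    a = Node()
    b = Node()
    a.other = b
    b.other = a
    # A weak reference lets us observe a without keeping it alive.
    return weakref.ref(a)

gc.disable()                      # pause automatic collection so the effect is visible
ref = make_cycle()
alive_before = ref() is not None  # the cycle is unreachable but not yet reclaimed
gc.collect()                      # run the cycle collector explicitly
alive_after = ref() is not None   # the weak reference is cleared once the cycle is freed
gc.enable()
```

On CPython, `alive_before` should be true and `alive_after` false. Other Python implementations may reclaim objects on a different schedule.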
The main thing that Python has on Perl is a more verbose/strict (and hence easier to read) syntax.
Just about everything else you've said is largely common to all interpreted/JIT languages of that ilk.
I find your comment about Perl and string handling bizarre, since string handling is Perl's forte and raison d'être.
(Caveat: I've developed in Perl, PHP, C/C++, x86 assembly and various others, but not Python)
Last edited by JGR; 22-06-2012 at 12:35.
-
Re: Best language for Web Scraping?
(Original post by j.smith1981)
Yes you can access memory used by other programs too just like you can in C and of course Assembly

I'm having trouble getting past this statement. Python is a garbage-collected language, and low-level memory access is (afaik) beyond the native capabilities of Python. Could you point me somewhere I can learn more?
-
Re: Best language for Web Scraping?
(Original post by Yung Mon£y)
I'm looking to set up a web scraper that can handle cookies as well as redirects. There are plenty of languages which can do this in one way or another, but I'm wondering if anyone has a particular favourite?
The idea is to scrape some API data and then, via PHP, put it into an SQL database which can later be queried. Does anyone have any tips or tricks?

I use Python (because it's awesome) in conjunction with two third-party modules:
Mechanize, which implements a programmatic web browser that supports cookies, forms and referrers.
BeautifulSoup, which is a programmer-friendly HTML parser.
Using these tools I find I can get a web scraping application up and running in a few minutes.
-
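If the two modules above aren't installed, the same cookie-and-redirect plumbing can be roughed out with Python's standard library alone. This is only a sketch under assumptions, not how Mechanize or BeautifulSoup work internally: `http.cookiejar` and `urllib.request` stand in for the browser, `html.parser` stands in for the HTML parser, and `build_cookie_opener`, `LinkCollector`, `extract_links` and the sample markup are hypothetical names and data for illustration.

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser

def build_cookie_opener():
    """Return an opener that stores cookies; build_opener also adds redirect handling by default."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

opener, jar = build_cookie_opener()
# A real run would fetch a page here, for example:
#   html = opener.open("http://example.com/").read().decode("utf-8", "replace")
# The request is left commented out so the sketch runs without network access.
SAMPLE_HTML = '<p><a href="/page1">One</a> <a href="/page2">Two</a> <a>no href</a></p>'
links = extract_links(SAMPLE_HTML)
```

The third-party modules remain the more comfortable choice for real pages, since they cope with forms and messy markup that this sketch ignores.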
Re: Best language for Web Scraping?
(Original post by mfaxford)
This thread was over three months old, it seems a bit pointless answering it now. The same seems to be true of most places you've posted.

I've just pointed out the same thing somewhere else. It's mega-confusing when you're trying to find a recent discussion to contribute to and somebody's just dragged up a four-month-old post for no reason.