Rules for Scraping a Website: Crawl Using Perl

Perl is one of the best languages for this purpose: it has very strong regex support compared to many other languages, and it has several libraries built for crawling. Here is a simple example:


 use strict;
 use warnings;
 use LWP::UserAgent ();

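 # The URL to fetch is taken from the command line.
 my $url = shift @ARGV or die "Usage: $0 URL\n";

 my $ua = LWP::UserAgent->new;

 my $response = $ua->get($url);

 if ($response->is_success) {
     # decoded_content undoes any Content-Encoding and charset encoding
     print $response->decoded_content;
 }
 else {
     die $response->status_line;
 }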

For this kind of work there are many packages, such as:

1. WWW::Mechanize: This is the best module for this kind of task. You can use it for almost all kinds of websites except heavily dynamic ones, where pages are generated by requests at run time. Even then, if you can find the input parameters of those requests, you can still use this module. Believe me, I've used WWW::Mechanize for all my tasks so far.

2. LWP: WWW::Mechanize is a subclass of LWP::UserAgent. The only difference is that with LWP you have to write your own code for some operations that WWW::Mechanize already provides, but since LWP is the parent you can use it in any way you need.

3. WWW::Mechanize::Firefox: Best for dynamic websites, that is, sites where JavaScript is widely used to generate pages and automation with plain WWW::Mechanize becomes very complex.

4. WWW::Scripter: Another WWW::Mechanize workalike with JavaScript support.

Please go through CPAN and you might find more modules for your task, but if you ask anyone, these four are the packages most often recommended for this kind of work. I hope this helps; let me know if you have any more queries.
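To illustrate point 2, here is a minimal sketch of one operation you have to write yourself with plain LWP: collecting the links of a page. It assumes HTML::LinkExtor (from the HTML-Parser distribution) is installed; the helper name extract_links is my own and not part of any module.

```perl
use strict;
use warnings;
use LWP::UserAgent ();
use HTML::LinkExtor ();

# Hypothetical helper: return the href of every <a> tag in $html,
# made absolute against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # With a base URL the attribute values are URI objects,
            # so stringify them before storing.
            push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Runs only when a URL is given on the command line.
if (@ARGV) {
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($ARGV[0]);
    die $response->status_line unless $response->is_success;
    print "$_\n" for extract_links($response->decoded_content, $response->base);
}
```

With WWW::Mechanize roughly the same list is available without extra code: after $mech->get($url), calling $mech->links returns link objects whose url_abs method gives the absolute URL (it also reports a few non-anchor tags such as frames).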

1. Static Content Crawling:

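The idea is to crawl through a site, follow all of its links, fetch their data, and write it to a database. Once we have all the data in the database, we just have to display it as per our requirement. The snippet below shows the starting point, fetching a page with WWW::Mechanize: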


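use strict;
use warnings;
use WWW::Mechanize;
use DBI;    # needed once the fetched pages are written to a database

# The URL to fetch is taken from the command line.
my $url = shift @ARGV or die "Usage: $0 URL\n";

# autocheck => 0 so that a failed request reaches the else branch below
# instead of dying inside get()
my $mech = WWW::Mechanize->new( autocheck => 0 );

my $response = $mech->get($url);

if ($response->is_success) {
    print $mech->content;
}
else {
    die $response->status_line;
}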
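The link-following and database steps described above could look roughly like the sketch below. It is only one way to do it, under assumptions of my own: the data goes to a local SQLite file named crawl.db (so DBD::SQLite must be installed), the table is called pages, only links on the starting host are followed, and the crawl stops after 50 pages. The function names crawl_site and same_host_urls are hypothetical.

```perl
use strict;
use warnings;
use WWW::Mechanize ();
use DBI ();
use URI ();

# Keep only http(s) URLs on $host, with any #fragment removed.
sub same_host_urls {
    my ($host, @uris) = @_;
    my @keep;
    for my $uri (@uris) {
        my $abs = URI->new($uri);    # copies, so the caller's object is untouched
        next unless ($abs->scheme // '') =~ /^https?$/;
        next unless $abs->host eq $host;
        $abs->fragment(undef);       # page#a and page#b are the same page
        push @keep, "$abs";
    }
    return @keep;
}

# Breadth-first crawl from $start_url; stores each HTML page in "pages".
# Returns the number of pages stored.
sub crawl_site {
    my ($start_url, $dbh, $max_pages) = @_;

    my $mech   = WWW::Mechanize->new( autocheck => 0 );
    my $host   = URI->new($start_url)->host;
    my $insert = $dbh->prepare(
        'INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)');

    my @queue  = ($start_url);
    my %seen   = ($start_url => 1);
    my $stored = 0;

    while (@queue && $stored < $max_pages) {
        my $url      = shift @queue;
        my $response = $mech->get($url);
        next unless $response->is_success && $mech->is_html;

        $insert->execute($url, $mech->content);
        $stored++;

        my @found = map { $_->url_abs } $mech->links;
        push @queue, grep { !$seen{$_}++ } same_host_urls($host, @found);

        sleep 1;    # small pause between requests
    }
    return $stored;
}

# Runs only when a start URL is given on the command line.
if (@ARGV) {
    my $dbh = DBI->connect('dbi:SQLite:dbname=crawl.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)');
    my $count = crawl_site($ARGV[0], $dbh, 50);
    print "Stored $count pages\n";
}
```

Keeping a %seen hash avoids fetching the same page twice, and the page limit keeps a first run from wandering over a large site. Displaying the data afterwards is then an ordinary SELECT on the pages table.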

2. Dynamic Content Crawling:

This example goes through a dynamic website using WWW::Mechanize::Firefox and works through the data state by state, in sorted order.


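use strict;
use warnings;
use WWW::Mechanize::Firefox;

# The URL of the page holding the form is taken from the command line.
my $url = shift @ARGV or die "Usage: $0 URL\n";

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);

# State code => IDs of the institutions to look up in that state.
my %institutions_by_state = (
    AL => [1795, 1276, 795, 1719, 1363, 1145, 961, 17, 18, 1995, 977, 1910, 1691, 21, 1660, 1768],
    AK => [1145, 961, 1995, 977, 1781, 1704],
    AZ => [1873, 872, 1145, 690, 1162, 961, 918, 528, 811, 704, 529, 1983, 931, 40, 1995, 977, 597, 1157, 530, 598, 886, 782, 42, 691, 1945],
);

foreach my $state (sort keys %institutions_by_state) {
    print "$state :: @{$institutions_by_state{$state}} \n";

    # Select the state in the form.
    $mech->field( stateUSAId => $state );

    foreach my $id (@{$institutions_by_state{$state}}) {
        # Select each institution of that state in turn.
        $mech->field( institutionUSAId => $id );

        # Pause so the page can update and the site is not hammered.
        sleep(2);
    }
}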
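As mentioned in the module list, if you can find the input parameters, plain WWW::Mechanize may be enough and no browser is needed. The sketch below assumes that the two fields used above, stateUSAId and institutionUSAId, belong to an ordinary HTML form on the page; I can't guarantee that for any particular site. The helper name state_institution_pairs is hypothetical, and the data is just a small subset of the IDs above.

```perl
use strict;
use warnings;
use WWW::Mechanize ();

# Flatten (state => [ids...]) into a list of [state, id] pairs,
# ordered by state code; ids keep their original order.
sub state_institution_pairs {
    my (%ids_by_state) = @_;
    return map {
        my $state = $_;
        map { [ $state, $_ ] } @{ $ids_by_state{$state} };
    } sort keys %ids_by_state;
}

# Runs only when the URL of the form page is given on the command line.
if (@ARGV) {
    my $url          = shift @ARGV;
    my %ids_by_state = ( AL => [1795, 1276], AK => [1145, 961] );

    my $mech = WWW::Mechanize->new;    # autocheck is on: failed requests die

    for my $pair ( state_institution_pairs(%ids_by_state) ) {
        my ($state, $institution) = @$pair;

        $mech->get($url);
        # with_fields picks the first form that has both fields,
        # fills them in and submits it.
        $mech->submit_form(
            with_fields => {
                stateUSAId       => $state,
                institutionUSAId => $institution,
            },
        );
        print "$state / $institution: ", length( $mech->content ), " bytes\n";

        sleep 2;    # pause between submissions
    }
}
```

If the second drop-down is filled in by JavaScript only after a state is chosen, this will not work as written, and WWW::Mechanize::Firefox remains the safer choice.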


Monday, 26 June 2017
