Monday, 26 June 2017

Crawling/Scraping a website

Let's discuss crawling:
Here, we're going to discuss all the steps for crawling the web using any language or technology. Crawler, scraper, spider, and bot are all names for the same thing: a program that copies content from a site.
Q.1 Do you think crawling is legal?
Yes, it's legal as long as you're not copying data without the website admin's consent and permission. (Check your local laws and regulations before proceeding.)
Q.2 My company and I don't belong to the software field, so how can it help my business?
It can help you create a comparison site, where your product can easily be compared with similar products. Other uses include:


1. Online presence can be tracked - This is also an important aspect of web scraping: business profiles and reviews on websites can be scraped. This can be used to see the performance of the product and users' behavior and reactions.
2. Custom analysis and curation - This one is mainly for new websites/channels, where the scraped data helps the channel understand viewer behavior.
3. Online reputation - In this digital world, companies are bullish about spending on online reputation management, so web scraping is essential here as well.
4. Detect fraudulent reviews - It has become common practice for people to read online opinions and reviews for different purposes, so it's important to detect opinion spamming: "illegal" activities such as writing fake reviews on portals. It is also called shilling, and it tries to mislead readers. Web scraping can help by crawling the reviews and detecting which ones to block or verify, streamlining the experience.
5. Provide better-targeted ads to your customers - Scraping gives you not only numbers but also sentiment and behavioral analytics, so you know the audience types and the ads they would want to see.
6. Business-specific scraping - Taking doctors as an example: you can scrape physicians or doctors from their clinic websites to provide a catalog of available doctors by specialization, region, or any other criterion.
7. Gather public opinion - Monitor specific company pages on social networks to gather updates on what people are saying about certain companies and their products. Data collection is always useful for the product's growth.
8. Search engine results for SEO tracking - By scraping organic search results you can quickly find your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting. This gives you an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, and what kind of resources it will take to rank your site.
9. Price competitiveness - A scraper can track the stock availability and prices of products at frequent intervals and send notifications whenever competitors' prices or the market change. In e-commerce, retailers and marketplaces use web scraping not only to monitor competitor prices but also to improve their product attributes. To stay on top of their direct competitors, e-commerce sites now monitor their counterparts closely.
10. Scrape leads - This is another important use for sales-driven organizations, where lead generation is done. Sales teams are always hungry for data, and with web scraping you can scrape leads from directories such as Yelp, Sulekha, Just Dial, Yellow Pages, etc. and then contact them to make a sales introduction.
11. Event organization - You can scrape events from thousands of event websites in the US to create an application that consolidates all of the events together.
12. Job scraping sites - Job sites also use scraping to list all the data in one place. They scrape different company websites or job sites to create a central job board and to build a list of companies that are currently hiring to contact.

For more details, please visit my article. This link has all types of code as well.
In the other thread you'll find crawler implementations in all the major technologies. Let's check out an example that collects IP:PORT proxy addresses for a crawler to use, so it can get around anti-robot algorithms and gather cross-browser data as well:

use strict;
use warnings;
use WWW::Mechanize;
use Try::Tiny;

# File with one proxy-list URL per line, passed as the first argument
my $source_file = shift;
open(my $input_fh, '<', $source_file) or die "Can't open $source_file: $!\n";
chomp(my @sources = <$input_fh>);
close($input_fh);

my $crawler = WWW::Mechanize->new();
foreach my $url (@sources) {
    try {
        $crawler->get($url);
        # hunt for IP:PORT combinations
        my @ips = $crawler->text() =~ /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5})/g;
        print "$_\n" foreach @ips;
    }
    catch {
        # inside catch, $_ holds the error raised by get()
        warn "[!] Error fetching $url: $_";
    };
}
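
The same IP:PORT hunt can also be sketched in Python with only the standard library. This is a rough illustration rather than the code used above: the function name extract_proxies and the sample text are made up for the example, and unlike the Perl regex it also drops matches whose octets are above 255 or whose port is outside 1-65535.

```python
import re

# Same idea as the Perl pattern: four dotted numbers, a colon, then a port
PROXY_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})\b")


def extract_proxies(text):
    """Return unique 'IP:PORT' strings found in text, in order of appearance."""
    found = []
    seen = set()
    for match in PROXY_RE.finditer(text):
        *octets, port = (int(group) for group in match.groups())
        # Skip things that look like IP:PORT but are out of range
        if any(octet > 255 for octet in octets) or not 0 < port <= 65535:
            continue
        proxy = "{}.{}.{}.{}:{}".format(*octets, port)
        if proxy not in seen:
            seen.add(proxy)
            found.append(proxy)
    return found


if __name__ == "__main__":
    # Made-up sample text standing in for a downloaded proxy-list page
    sample = "fast 10.0.0.1:8080, dup 10.0.0.1:8080, bad 999.1.1.1:80, ok 192.168.1.5:3128"
    print(extract_proxies(sample))
```

A harvested entry could then be handed to an HTTP client, for example through the proxies argument of requests.get. Free proxies are often unreliable, so each one would probably need checking before use.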

Crawl Using Perl

Perl is one of the best languages for this purpose: its regex support is very strong compared to many other languages, and it has multiple libraries for crawling, which sets it apart for this kind of work. Check out a simple example:

use strict;
use warnings;
use LWP::UserAgent ();

# URL to fetch, passed as the first command-line argument
my $url = shift or die "Usage: $0 URL\n";

my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);

if ($response->is_success) {
    print $response->decoded_content;
}
else {
    die $response->status_line;
}
For this kind of work there are lots of packages, such as:

1. WWW::Mechanize: This is the best module for this kind of task. You can use it for almost all kinds of websites except heavily dynamic ones, where pages are generated by requests at run time; even then, if you can find the input parameters, you can still use this module. Believe me, I've used WWW::Mechanize for all my tasks till now.


2. LWP: This is the parent of WWW::Mechanize (WWW::Mechanize is a subclass of LWP::UserAgent). The only difference is that here you have to write your own code for some of the operations that WWW::Mechanize already provides, but since this is the parent, you can use it in any way.


3. WWW::Mechanize::Firefox: Best for dynamic websites, meaning sites where JavaScript is widely used to generate pages and where automation is very complex with plain WWW::Mechanize.


4. WWW::Scripter: Another WWW::Mechanize workalike with JavaScript support.


Please go through CPAN and you might find more modules for your task, but if you ask anyone, these four are the packages most often recommended for this kind of work. I hope this helps; let me know if you have any more queries.

1. Static Content Crawling:

This code crawls through a site, follows all its links, fetches their data, and writes it to a database. Once we have all the data in the database, we just have to display it as required. The snippet below shows the starting point, fetching a single page:

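use strict;
use warnings;
use WWW::Mechanize;
use DBI;

# URL to fetch, passed as the first command-line argument
my $url = shift or die "Usage: $0 URL\n";

# autocheck is switched off so a failed request reaches the else branch
# instead of dying inside get()
my $mech = WWW::Mechanize->new( autocheck => 0 );
my $response = $mech->get($url);

if ($response->is_success) {
    print $mech->content;
}
else {
    die $response->status_line;
}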
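
The full flow described for static content (follow every link, save each page to a database) can be sketched in Python. Treat this as a minimal sketch under stated assumptions, not the actual implementation: the names LinkParser, fetch_url, and crawl and the pages table are invented for the example, SQLite stands in for whatever database you use, and the fetch function is passed in so the loop can be tried on canned pages before pointing it at a real site.

```python
import sqlite3
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_url(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, conn, fetch=fetch_url, max_pages=50):
    """Breadth-first crawl of one site, storing each page in the pages table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    stored = 0
    while queue and stored < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception as error:  # skip pages that fail to download
            print("skipping {}: {}".format(url, error))
            continue
        conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
        stored += 1
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # absolute URL, fragment dropped
            # Stay on the starting site and never queue a URL twice
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    conn.commit()
    return stored
```

For example, crawl("http://example.com/", sqlite3.connect("crawl.db")) would store up to 50 pages from that placeholder site. A real crawler would likely also want a delay between requests.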
2. Dynamic Content Crawling:

This example walks through a dynamic website using WWW::Mechanize::Firefox and fetches the data in sorted order.

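use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;

# URL to open, passed as the first command-line argument
my $url = shift or die "Usage: $0 URL\n";

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);

# State code => list of institution IDs to select for that state
my %arr_ref = (
    AL => [1795, 1276, 795, 1719, 1363, 1145, 961, 17, 18, 1995, 977, 1910, 1691, 21, 1660, 1768],
    AK => [1145, 961, 1995, 977, 1781, 1704],
    AZ => [1873, 872, 1145, 690, 1162, 961, 918, 528, 811, 704, 529, 1983, 931, 40, 1995, 977, 597, 1157, 530, 598, 886, 782, 42, 691, 1945],
);

foreach my $key (sort keys %arr_ref) {
    print "$key :: @{$arr_ref{$key}} \n";
    # Select the state, then each of its institutions, in the form
    $mech->field( stateUSAId => $key );
    foreach my $id (@{$arr_ref{$key}}) {
        $mech->field( institutionUSAId => $id );
    }
}

# Earlier variant, disabled in the original (a line starting with "=" begins a POD block):
# foreach (@array) {
#     $mech->field( stateUSAId => $_ );
#     $mech->field(institutionUSAId);
#     sleep(2);
# }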

Crawl Using Python

lxml is a pretty extensive library written for parsing XML and HTML documents very quickly, even handling messed-up tags in the process. We will also be using the Requests module instead of the built-in urllib2 module (urllib.request in Python 3) because of its improvements in speed and readability. You can easily install both using pip install lxml and pip install requests. The example also uses pandas (pip install pandas) to turn the table rows into a data frame.

import requests
from lxml.html import fromstring
from pandas.io.parsers import TextParser


def _unpack(row, kind='td'):
    # Text of every <td> (or <th>) cell in one table row;
    # text_content() also picks up text nested inside child tags
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]


def parse_options_data(table):
    # First row holds the headers, the remaining rows hold the data
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()


if __name__ == '__main__':
    # Alternative pages:
    #url = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
    #url = 'http://www-rohan.sdsu.edu/~gawron'
    #url = 'http://www.lajollasurf.org/cgi-bin/plottide.pl'
    url = 'http://www.ezfshn.com/tides/usa/california/san%20diego'
    response = requests.get(url)
    response.raise_for_status()
    doc = fromstring(response.content)
    #id="ctl00_ctl00_Content_MCC_RadDatePicker_calendar_Top"

    # Sample a few links from the page
    links = doc.findall('.//a')
    links_sub_list = links[15:20]
    lnk = links_sub_list[0]
    sample_url = lnk.get('href')
    sample_display_text = lnk.text_content()

    ## Look at tables, find a table of interest
    tables = doc.findall('.//table')
    #puts = tables[9]
    ## Ditto
    #calls = tables[13]
    dt = tables[0]
    rows = dt.findall('.//tr')
    headers = _unpack(rows[0], kind='th')
    row_vals = _unpack(rows[1], kind='td')
    #call_data = parse_options_data(calls)
    tide_data = parse_options_data(dt)
    print(tide_data[:10])
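
To see what the header-row-plus-data-rows unpacking does without hitting a live site, here is a small self-contained sketch. It deliberately swaps lxml and pandas for the standard library's html.parser so it runs with no extra packages. The names TableParser and parse_table are invented for the example, the tide values are made up, and it assumes every row and cell has a closing tag.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by table row.

    Assumes rows and cells are closed with </tr>, </th>, and </td>.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    # Made-up sample table, not real tide data
    sample = ("<table><tr><th>Time</th><th>Height</th></tr>"
              "<tr><td>03:12</td><td>1.2 ft</td></tr></table>")
    print(parse_table(sample))
```

Each data row comes back as a dict keyed by the header cells, which is roughly the shape the pandas data frame gives you in the lxml version.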
Go through the following URLs to learn in detail about scraping techniques in Python:

Crawl Using Java

For crawling in Java, you'll need the jsoup library. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
A simple example of Jsoup:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Add it to your project in pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

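import java.io.IOException;
import java.util.HashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Method of a crawler class. urlInfos (assumed here to hold HashMap<String, String>
// entries), urlQueue, dbUtils, alreadyInList(), retrieveRelatedStory() and
// crawURLsFromHomePage() are defined elsewhere in that class.
public void getmoreURL() throws IOException {
 Document doc = null;
 String currentURL;
 // urlInfos grows inside this loop, so it needs a collection that tolerates
 // modification during iteration (a plain ArrayList would throw
 // ConcurrentModificationException)
 for (HashMap<String, String> tmpMap: urlInfos) {
  currentURL = tmpMap.get("url");
  doc = Jsoup.connect(currentURL).get();
  Elements articleLead = doc.select("article.lead-story");
  if (currentURL.contains("xyz")) {
   // Lead stories: take the second link, falling back to the third for the title
   for (Element elem: articleLead) {
    Element anchor = elem.select("a").get(1);
    String title = anchor.text();
    String URL = anchor.attr("href");
    if (title.length() == 0) title = elem.select("a").get(2).text();
    if (!dbUtils.find("url", URL)) urlQueue.add(URL);
    if (!alreadyInList(URL)) {
     HashMap<String, String> info = new HashMap<String, String>();
     info.put("url", URL);
     urlInfos.add(info);
     retrieveRelatedStory(URL);
    } else System.out.println("Title - " + title + " already in the list.");
   }
   // Teaser articles: same extraction, but queued only if not yet in the database
   Elements articleElement = doc.select("article.teaser");
   for (Element elem: articleElement) {
    Element anchor = elem.select("a").get(1);
    String title = anchor.text();
    String URL = anchor.attr("href");
    if (title.length() == 0) title = elem.select("a").get(2).text();
    if (!dbUtils.find("url", URL)) {
     urlQueue.add(URL);
     crawURLsFromHomePage(title, URL);
     retrieveRelatedStory(URL);
    } else System.out.println("Title - " + title + " already in the list.");
   }
  } else {
   if (!alreadyInList(currentURL)) {
    HashMap<String, String> tempRelated = new HashMap<String, String>();
    tempRelated.put("url", currentURL);
    urlInfos.add(tempRelated);
    retrieveRelatedStory(currentURL);
   }
   retrieveRelatedStory(currentURL);
  }
 }
}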
Refer to the following URLs:
https://ksah.in/introduction-to-web-scraping-with-java/
http://jaunt-api.com/