QScraper: an Hpricot-like interface to scrapi


QScraper is a wrapper over scrapi that provides an Hpricot-like interface.

Motivation: The Hpricot interface is simple and easy to use, while scrapi is more powerful thanks to bundle scraping and anonymous scraper classes. I was using Hpricot for quick testing and exploration but scrapi for project implementation. To avoid juggling two HTML scrapers, I wrote this wrapper over scrapi.

Bundle scraping: extracting multiple attributes of an element from a web page in a single parse. Most screen-scraping tools extract multiple elements, but not multiple attributes of a single element. Take blog scraping as an example: each blog post is an element, and I want to extract several attributes of each post, such as the author, the publication date, the title and the content. Rather than making individual calls like doc.search(author_selector), doc.search(published_selector) and so on, I would like to do doc.find(author_selector, date_selector, title ...).
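As a rough sketch of what that bundled call could look like with QScraper (the URL and CSS selectors below are invented for an imaginary blog layout, and the add_rules/parse_rules calls are the ones defined further down):

  require 'rubygems'
  require 'open-uri'
  require 'qscraper'   # assuming the QScraper class below is saved as qscraper.rb

  # One pass over the page pulls out every attribute of each post.
  blog = QScraper.new(open("http://example.com/blog") { |f| f.read })
  blog.add_rules({"div.post h2.title"  => {:title     => :text},
                  "div.post span.by"   => {:author    => :text},
                  "div.post span.date" => {:published => :text},
                  "div.post div.body"  => {:content   => :element}})
  posts = blog.parse_rules
  # posts.title, posts.author, posts.published and posts.content are arrays,
  # one entry per matched element.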

Here comes QScraper :)
require 'rubygems'
require 'scrapi'
require 'open-uri'

class QScraper
  attr_accessor :document
  attr_accessor :rule_list

  # Constructor.
  # Takes a source (local file path, raw HTML string, URI or HTML::Node) as argument.
  # Example: qscraper = QScraper.new("index.html")
  def initialize source, local_file=true
    case source
      when URI, HTML::Node
        @document = source
      when String
        @document = (local_file && File.exists?(source)) ?  File.read(source) : source
      else
        raise ArgumentError, "Can only scrape URI, String or HTML::Node"
    end
  end

  # Searches for the given CSS selector in the source and
  # returns all matching elements as HTML::Tag objects.
  # Example:
  # qscraper.search("div.highlight")
  def search selector
    output = (Scraper.define do
      array :elems
      process selector, :elems =>:element
      result :elems
    end).scrape(@document)
    output
  end

  # search the first node matching this selector
  def search_first selector
    output = (Scraper.define do
      process_first selector, :elem=>:element
      result :elem
    end).scrape(@document)
    output
  end

  # Adds a single rule for parsing; you can add many rules and
  # then run them all in one pass with parse_rules.
  # Example:
  # q.add_rule("br#lgpd", {:name=>"@id"})
  def add_rule selector, extractors
    @rule_list ||= {}
    @rule_list.merge!({selector=>extractors})
  end

  # Example:
  # q.add_rules({"table.searchResults2>tbody>tr>td:nth-of-type(3)>a"=>
  #                               {:href=>"@href",:name=>:text},
  #              "table.searchResults2>tr>td:nth-of-type(3)>a.cat"=>
  #                               {:reviewcount=>:text}})
  def add_rules selector_extractors_hash
    @rule_list ||= {}
    @rule_list.merge!(selector_extractors_hash)
  end

  # Parses the whole rule set (built up via add_rule/add_rules) in a single
  # scrape and returns scrapi's result object.
  def parse_rules parser=nil
    # Collect every extractor name into "result :a, :b, ..." and a matching
    # "array :a, :b, ..." declaration.
    result_string = "result "
    @rule_list.values.each do |extractors|
      extractors.keys.each { |name| result_string << ":" << name.to_s << ", " }
    end
    result_string = result_string[0..-3]          # drop the trailing ", "
    array_string = result_string.sub(/result/, "array")

    # Build one "process" line per selector => extractors rule.
    create_process_string = ""
    @rule_list.each do |selector, extractors|
      create_process_string << "process \"" << selector << "\", "
      create_process_string << extractors.inspect.gsub(/[{}]/, "") << "\n"
    end

    # Evaluate the generated definition into an anonymous scrapi scraper class.
    res = String.module_eval %{Scraper.define do
      #{array_string}
      #{create_process_string}
      #{result_string}
    end}
    parser ||= :tidy
    res.scrape(@document, :parser => parser)
  end
end

# Small extension to scrapi's HTML::Tag class 
# for "inner_html" method.
#
module HTML
  class Tag < Node
    def inner_html 
      data = ""
      children.each {|e| data += e.to_s}
      data
    end
  end
end
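
To make the metaprogramming in parse_rules a little more concrete: for the single rule used in the usage example below, it builds and evaluates (roughly, since the key order follows the hash) a definition equivalent to:

  # For @rule_list == {"a.l" => {:url => "@href", :title => :text}},
  # parse_rules generates and evaluates something like:
  Scraper.define do
    array :url, :title
    process "a.l", :url => "@href", :title => :text
    result :url, :title
  end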
Usage
  # Example page to parse : http://www.google.co.in/search?q=scrapi
  doc = QScraper.new(open("http://www.google.co.in/search?q=scrapi"){|f| f.read})

  # Extract the title elements of the search results;
  # the returned objects are of type HTML::Tag.
  result = doc.search("a.l")

  # now extracting title and link for each of the search results.
  doc.add_rules({"a.l"=>{:url=>"@href",:title=>:text}})
  result = doc.parse_rules
  #=> result.url returns an array of 10 url links
  #=> result.title returns the array of titles
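
Assuming every matched a.l link yields both attributes, the two arrays line up and can be walked together, for example:

  # Pair each title with its link (just a sanity check, not part of QScraper).
  result.url.zip(result.title).each do |url, title|
    puts "#{title} => #{url}"
  end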
Comparison of the Hpricot and scrapi (with QScraper) interfaces

Hpricot                                             | scrapi (with QScraper)
----------------------------------------------------|---------------------------------------------
doc = Hpricot(string/file/uri)                      | doc = QScraper.new(string/file/uri)
doc.search("css selector") [or doc/"css selector"]  | doc.search("css selector")
doc.at("css selector")                              | doc.search_first("css selector")
(doc/"css selector").inner_html                     | doc.search_first("css selector").inner_html
(doc/"css selector").to_html                        | doc.search_first("css selector").to_s
allows HTML page editing                            | --N/A--
--N/A--                                             | allows bundled scraping (see usage above)
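
As a quick side-by-side illustration of the first few rows (the URL is a placeholder and the QScraper class above is assumed to be loaded):

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'

  html = open("http://example.com/") { |f| f.read }

  # Hpricot
  hdoc = Hpricot(html)
  hdoc.at("p").inner_html

  # QScraper: same idea, slightly different method names
  qdoc = QScraper.new(html, false)   # false => treat the string as raw HTML, not a file path
  qdoc.search_first("p").inner_html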

TODO: Add XPath support in scrapi.
