HTML::Tag class in scrapi



Filed in ruby · Posted on 28 August · Updated on 09 September

QScraper : hpricot interface to scrapi


Filed in our tools, ruby · Posted on 23 August · Updated on 05 November

archive

HTML::Tag class in scrapi

While working with scrapi, I found that there is no external documentation for the HTML::Tag class. This article is meant to ensure no one can say that again.

In scrapi, HTML::Node represents an HTML node, which can be of two types: HTML::Text for a text node and HTML::Tag for an HTML tag node. Here is a code snippet in which scrapi returns the HTML node as an HTML::Tag object.

require 'rubygems'
require 'scrapi'
require 'open-uri'

scraper = (Scraper.define do
   process_first "span.metaspan", :node => :element
   result :node
end).scrape(open("http://www.quarkruby.com") { |f| f.read })

scraper.class
# => HTML::Tag

Methods of an HTML::Tag object

METHOD        DESCRIPTION
tag?          true if the node is an HTML::Tag. HTML::Tag and HTML::Text have many different methods, so this is an important check.
name          returns the name of the tag
attributes    returns a hash of attribute name/value pairs
children      returns an array of child nodes, which may contain both HTML::Text and HTML::Tag objects
next_sibling  returns the next adjacent node, which can be HTML::Text or HTML::Tag
next_element  returns the next HTML::Tag node
to_s          returns the HTML content of this node and its children
find          matches on the subtree of this node (not the root node)
position      byte index of the matched node in the HTML content
line          line number of the matched node in the HTML content
parent        returns the parent HTML::Tag
match         matches the given node against the given arguments (explained with examples below)

The next two methods are not part of the HTML::Tag class, but I frequently find them useful. Their code is available below.

inner_html    JavaScript's innerHTML :)
text          returns the text content of this node and its children


Some examples on how to use match
#test if the node is a "span" tag
  node.match :tag => "span"

#test if the node's parent is a "div"
  node.match :parent => { :tag => "div" }
#test if any of the node's ancestors are "table" tags
  node.match :ancestor => { :tag => "table" }

#test if any of the node's immediate children are "em" tags
  node.match :child => { :tag => "em" }
  
#test if any of the node's descendants are "strong" tags
  node.match :descendant => { :tag => "strong" }

#test if the node has between 2 and 4 span tags as immediate children
  node.match :children => { :count => 2..4, :only => { :tag => "span" } }

#get funky: test to see if the node is a "div", has a "ul" ancestor
#and an "li" parent (with "class" = "enum"), and whether or not it has
#a "span" descendant that contains text matching /hello world/:
  node.match :tag => "div",
             :ancestor => { :tag => "ul" },
             :parent => { :tag => "li",
             :attributes => { :class => "enum" } },
             :descendant => { :tag => "span",
             :child => /hello world/ }

# Sometimes you can get an HTML::Text node in return.
# If it is a text node, the following are ways of using match on it.
  node.match(conditions)
# * if +conditions+ is a string, it must be a substring of the node's  content
# * if +conditions+ is a regular expression, it must match the node's content
# * if +conditions+ is a hash, it must contain a <tt>:content</tt> key that
#   is either a string or a regexp, and which is interpreted as described above.
(above examples are shamelessly copied from html/node.rb, scrapi)

Code for the new methods text and inner_html:
module HTML
   class Tag < Node
      # Returns the text content of this node and its children,
      # skipping the contents of <script> tags.
      def text
         return @content if respond_to?(:content)
         data = ""
         children.each do |child|
            next if child.respond_to?(:name) && child.name == "script"
            data << child.text
         end
         data
      end

      # JavaScript-style innerHTML: the HTML of this node's children.
      def inner_html
         data = ""
         children.each { |e| data << e.to_s }
         data
      end
   end
end


Assaf Arkin has written a well-documented cheat sheet on scrapi. Surprisingly, I found a few selectors left unexplained:

  1. E,F : alternation (matches any of E or F)
  2. E:content('hello world') : content match
  3. E[foo=bar][foo2=bar2] : element E whose attributes foo and foo2 have the values bar and bar2 respectively

QScraper : hpricot interface to scrapi

QScraper is a wrapper over scrapi that provides an Hpricot-like interface.

Motivation: The Hpricot interface is simple and easy to use, while scrapi is more powerful because of bundle scraping and anonymous classes. I was using Hpricot for quick testing and checking, but scrapi for project implementation. To avoid working with two HTML scrapers, I wrote this wrapper over scrapi.

Bundle Scraping: This refers to the extraction of multiple attributes of an element from a web page in a single parse. Most screen-scraping tools extract only multiple elements, not multiple attributes of an element. Let's take blog scraping as an example: each blog post would be an element, and I would like to extract multiple attributes of the post, such as its author, publication date, title and content. Rather than making individual calls like doc.search(author_selector), doc.search(published_selector), etc., I would like to do doc.find(author_selector, date_selector, title ...).

Here comes QScraper :)
require 'rubygems'
require 'scrapi'
require 'open-uri'

class QScraper
  attr_accessor :document
  attr_accessor :rule_list

  # Constructor for this class.
  # Takes a source (file name, HTML string, URI or HTML::Node) as argument.
  # Example: qScraper = QScraper.new("index.html")
  def initialize source, local_file=true
    case source
      when URI, HTML::Node
        @document = source
      when String
        @document = (local_file && File.exists?(source)) ?  File.read(source) : source
      else
        raise ArgumentError, "Can only scrape URI, String or HTML::Node"
    end
  end

  # Searches for the selector in the given source and
  # returns all matching elements as HTML::Tag objects.
  # The selector is a CSS selector.
  # Example:
  # qScraper.search("div.highlight")
  def search selector
    output = (Scraper.define do
      array :elems
      process selector, :elems =>:element
      result :elems
    end).scrape(@document)
    output
  end

  # search the first node matching this selector
  def search_first selector
    output = (Scraper.define do
      process_first selector, :elem=>:element
      result :elem
    end).scrape(@document)
    output
  end

  # Adds a rule for parsing; you can add many rules and
  # then run them all with a call to parse_rules.
  # Example:
  # q.add_rule("br#lgpd", {:name=>"@id"})
  def add_rule selector, extractors
    @rule_list ||= {}
    @rule_list.merge!({selector=>extractors})
  end

  # Example:
  # q.add_rules({"table.searchResults2>tbody>tr>td:nth-of-type(3)>a"=>
  #                               {:href=>"@href",:name=>:text},
  #              "table.searchResults2>tr>td:nth-of-type(3)>a.cat"=>
  #                               {:reviewcount=>:text}})
  def add_rules selector_extractors_hash
    @rule_list ||= {}
    @rule_list.merge!(selector_extractors_hash)
  end

  # Parses the rule set built up by the add_rule/add_rules methods.
  def parse_rules parser=nil
    result_string = "result "
    @rule_list.values.each do |j| 
        j.keys.each {|i| 
           result_string << ":" << i.to_s << ", "
        }
    end
    result_string = result_string[0..-3]
    array_string = result_string.sub(/result/,"array")
    create_process_string = ""
    @rule_list.keys.each {|i|
      create_process_string << "process \"" << i << "\", " 
      create_process_string << @rule_list[i].inspect.gsub(/[{}]/,"") << "\n"
    }
    res = String.module_eval %{Scraper.define do
      #{array_string}
      #{create_process_string}
      #{result_string}
    end}
    parser ||= :tidy
    res.scrape(@document, :parser => parser)
  end
end

# Small extension to scrapi's HTML::Tag class 
# for "inner_html" method.
#
module HTML
  class Tag < Node
    def inner_html 
      data = ""
      children.each {|e| data += e.to_s}
      data
    end
  end
end
Usage
  # Example page to parse : http://www.google.co.in/search?q=scrapi
  doc = QScraper.new(open("http://www.google.co.in/search?q=scrapi"){|f| f.read})

  # Extract the title elements of the search results;
  # the returned objects are of HTML::Tag type.
  result = doc.search("a.l")

  # now extracting title and link for each of the search results.
  doc.add_rules({"a.l"=>{:url=>"@href",:title=>:text}})
  result = doc.parse_rules
  #=> result.url returns an array of 10 url links
  #=> result.title returns the array of titles
Comparison of the Hpricot and scrapi (+QScraper) interfaces

HPRICOT                                              SCRAPI (WITH QSCRAPER)
doc = Hpricot(string/file/uri)                       doc = QScraper.new(string/file/uri)
doc.search("css selector") [or doc/"css selector"]   doc.search("css selector")
doc.at("css selector")                               doc.search_first("css selector")
(doc/"css selector").inner_html                      doc.search_first("css selector").inner_html
(doc/"css selector").to_html                         doc.search_first("css selector").to_s
allows HTML page editing                             --N/A--
--N/A--                                              allows bundled scraping (example above)

TODO: Add XPath support in scrapi.