HTML::Tag class in scrapi


While working with scrapi, I found that there is no external documentation for the HTML::Tag class. This article is here to make sure no one can say that again.

In scrapi, HTML::Node represents an HTML node, which can be one of two types: HTML::Text for a text node and HTML::Tag for an HTML tag node. Here is a code snippet in which scrapi returns the HTML node as an HTML::Tag object.

require 'rubygems'
require 'scrapi'
require 'open-uri'

# Grab the first <span class="metaspan"> on the page as an element node.
# Newer Rubies need URI.open here instead of open.
node = (Scraper.define do
   process_first "span.metaspan", :node => :element
   result :node
end).scrape(open("http://www.quarkruby.com") { |f| f.read })

node.class
# => HTML::Tag

Methods of HTML::Tag object

METHOD        DESCRIPTION
tag?          true if this is an HTML::Tag. HTML::Tag and HTML::Text have many different methods, so it's an important check.
name          returns the name of the tag
attributes    returns a hash of attribute name/value pairs
children      returns an array of child nodes, which may contain both HTML::Text and HTML::Tag objects
next_sibling  returns the next adjacent node, which can be an HTML::Text or an HTML::Tag
next_element  returns the next HTML::Tag node
to_s          returns the HTML content of this node and its children
find          searches the subtree of this node (not of the root node) for a match
position      byte index of the matched node in the HTML content
line          line number of the matched node in the HTML content
parent        returns the parent HTML::Tag
match         tests this node against the given conditions (explained with examples below)

The next two methods are not part of the HTML::Tag class, but I frequently find them useful. Their code is available below.
inner_html    like JavaScript's innerHTML :)
text          returns the text content of this node and its children
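
To see how these accessors combine, here is a minimal sketch that walks a node tree and collects tag names. It does not load scrapi: SketchText and SketchTag are hypothetical stand-ins that expose only tag?, name, attributes and children as described in the table, so the snippet runs on its own. The helper name collect_tag_names is also my own. With scrapi loaded, the same helper should work on a real HTML::Tag, but treat that as an assumption rather than a guarantee.

```ruby
# Hypothetical stand-ins for HTML::Text and HTML::Tag (not scrapi's classes).
SketchText = Struct.new(:content) do
  def tag?; false; end
end

SketchTag = Struct.new(:name, :attributes, :children) do
  def tag?; true; end
end

# Depth-first walk: record the name of every tag node, skip text nodes.
def collect_tag_names(node, names = [])
  return names unless node.tag?   # text stand-ins have no name or children
  names << node.name
  node.children.each { |child| collect_tag_names(child, names) }
  names
end

tree = SketchTag.new("div", { "class" => "post" }, [
  SketchText.new("Hello "),
  SketchTag.new("em", {}, [SketchText.new("world")]),
  SketchTag.new("span", { "class" => "metaspan" }, [])
])

p collect_tag_names(tree)
# => ["div", "em", "span"]
```

The tag? guard is the point of the sketch: calling name or attributes on a text node would fail, so check the node type first.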


Some examples of how to use match:
# Test if the node is a "span" tag
  node.match :tag => "span"

# Test if the node's parent is a "div"
  node.match :parent => { :tag => "div" }

# Test if any of the node's ancestors are "table" tags
  node.match :ancestor => { :tag => "table" }

# Test if any of the node's immediate children are "em" tags
  node.match :child => { :tag => "em" }

# Test if any of the node's descendants are "strong" tags
  node.match :descendant => { :tag => "strong" }

# Test if the node has between 2 and 4 span tags as immediate children
  node.match :children => { :count => 2..4, :only => { :tag => "span" } }

# Get funky: test to see if the node is a "div", has a "ul" ancestor
# and an "li" parent (with "class" = "enum"), and whether or not it has
# a "span" descendant that contains text matching /hello world/:
  node.match :tag => "div",
             :ancestor => { :tag => "ul" },
             :parent => { :tag => "li",
                          :attributes => { :class => "enum" } },
             :descendant => { :tag => "span",
                              :child => /hello world/ }

# Sometimes you get an HTML::Text node in return.
# If it's a text node, these are the ways of using match on that node:
  node.match(conditions)
# * if +conditions+ is a string, it must be a substring of the node's content
# * if +conditions+ is a regular expression, it must match the node's content
# * if +conditions+ is a hash, it must contain a <tt>:content</tt> key that
#   is either a string or a regexp, and which is interpreted as described above.
(The examples above are shamelessly copied from html/node.rb in scrapi.)
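
The three rules for matching a text node are easy to try out in isolation. The sketch below is not scrapi's code: text_matches? is a hypothetical helper that applies the rules exactly as the comment above states them, to a plain string standing in for the node's content. scrapi's own implementation may differ in details, such as its return values or how it reacts to an invalid hash.

```ruby
# Hypothetical helper (not part of scrapi): applies the three documented
# rules for matching a text node's content against +conditions+.
def text_matches?(content, conditions)
  case conditions
  when String then content.include?(conditions)       # substring rule
  when Regexp then !(content =~ conditions).nil?       # regexp rule
  when Hash
    # Hash rule: needs a :content key holding a String or a Regexp.
    conditions.key?(:content) && text_matches?(content, conditions[:content])
  else
    false
  end
end

content = "hello world, again"
p text_matches?(content, "world")                 # => true
p text_matches?(content, /^hello\s+world/)        # => true
p text_matches?(content, :content => /again$/)    # => true
p text_matches?(content, :tag => "span")          # => false
```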

Code for the new methods text and inner_html:
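module HTML
   class Node
      # Text content of this node and all of its descendants.
      # Defined on Node so that HTML::Text children respond to it too.
      def text
         # HTML::Text nodes expose their string through #content.
         return content if respond_to? :content
         data = ""
         children.each do |child|
            # Skip <script> tags: their body is code, not page text.
            next if child.respond_to?(:name) and child.name == "script"
            data += child.text
         end
         data
      end
   end

   class Tag < Node
      # Like JavaScript's innerHTML: the markup of the children only.
      def inner_html
         data = ""
         children.each { |e| data += e.to_s }
         data
      end
   end
end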
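
To show what the two methods return, here is a minimal sketch that runs without scrapi. StubNode, StubText and StubTag are hypothetical stand-ins, not scrapi's classes; they carry just enough state (children, content, name, to_s) to exercise the same logic as the text and inner_html methods above.

```ruby
# Hypothetical stand-ins (not scrapi's classes), just enough to run the logic.
class StubNode
  attr_reader :children

  def initialize(children = [])
    @children = children
  end

  # Text nodes return their content; tags concatenate their children's
  # text and skip <script> tags.
  def text
    return content if respond_to?(:content)
    children.reject { |c| c.respond_to?(:name) && c.name == "script" }
            .map(&:text).join
  end
end

class StubText < StubNode
  attr_reader :content

  def initialize(content)
    super()
    @content = content
  end

  def to_s; content; end
end

class StubTag < StubNode
  attr_reader :name

  def initialize(name, children = [])
    super(children)
    @name = name
  end

  def to_s; "<#{name}>#{inner_html}</#{name}>"; end

  # Serialize the children only, without this tag's own markup.
  def inner_html
    children.map(&:to_s).join
  end
end

para = StubTag.new("p", [
  StubText.new("Hi "),
  StubTag.new("b", [StubText.new("there")]),
  StubTag.new("script", [StubText.new("alert(1)")])
])

puts para.text        # prints: Hi there
puts para.inner_html  # prints: Hi <b>there</b><script>alert(1)</script>
```

Note the difference: text drops both the markup and the script body, while inner_html keeps every child verbatim.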


Assaf Arkin has written a well-documented cheat sheet on scrapi. Surprisingly, I found that a few selectors were left unexplained:

  1. E, F: alternate selectors (matches either E or F)
  2. E:content('hello world'): content match
  3. E[foo=bar][foo2=bar2]: element E whose attributes foo and foo2 have the values bar and bar2 respectively

Filed in ruby
Tagged as html scraping scrapi 
Posted on 28 August
Comments

  1. Marcos Kuhns, September 05, 2007 @ 01:45 PM

    Wonderful documentation. I too have been searching high and low for some sort of reference for the HTML::Tag class & this is just the ticket. Thanks!
