While working with scrapi, I found that there is no external documentation for the HTML::Tag class. This article is meant to ensure no one has to say that again.
In scrapi, HTML::Node represents an HTML node, which can be one of two types: HTML::Text for a text node and HTML::Tag for an HTML tag node. Here is a code snippet in which scrapi returns the HTML node as an HTML::Tag object.
```ruby
require 'rubygems'
require 'scrapi'
require 'open-uri'

scraper = (Scraper.define do
  process_first "span.metaspan", :node => :element
  result :node
end).scrape(open("http://www.quarkruby.com") { |f| f.read })

# scraper.class => "HTML::Tag"
```
Methods of HTML::Tag object
| METHOD | DESCRIPTION |
|---|---|
| tag? | true if this is an HTML::Tag. HTML::Tag and HTML::Text have many different methods, so it's an important check. |
| name | returns name of the tag |
| attributes | returns a hash of attribute name/value pairs |
| children | returns array of children nodes, which may contain both HTML::Text and HTML::Tag class objects |
| next_sibling | returns next adjacent node, which can be HTML::Text or HTML::Tag |
| next_element | returns next HTML::Tag node |
| to_s | returns html content of this node and its children |
| find | finds a match within the subtree of this node (not from the root node) |
| position | byte index of matched node in html content |
| line | line number of matched node in html content |
| parent | returns parent HTML::Tag |
| match | matches the given node with given arguments (explained with examples below) |
| The next two methods are not part of the HTML::Tag class, but I frequently find them useful. Their code is available below. | |
| inner_html | JavaScript's innerHTML :) |
| text | returns the text content of this node and its children |
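The difference between next_sibling and next_element in the table above is easy to miss, so here is a minimal sketch of it. The `SketchNode`, `SketchText` and `SketchTag` classes are hypothetical stand-ins written only so the snippet runs without scrapi installed. They are not scrapi's classes and only mimic the behavior described in the table.

```ruby
# Hypothetical stand-ins for HTML::Node, HTML::Text and HTML::Tag.
# They only imitate the behavior described in the table above.
class SketchNode
  attr_accessor :parent

  # Next adjacent node under the same parent (text or tag), or nil.
  def next_sibling
    return nil unless parent
    siblings = parent.children
    siblings[siblings.index { |n| n.equal?(self) } + 1]
  end

  # Next sibling that is a tag, skipping text nodes, or nil.
  def next_element
    node = next_sibling
    node = node.next_sibling while node && !node.tag?
    node
  end
end

class SketchText < SketchNode
  attr_reader :content

  def initialize(content)
    @content = content
  end

  def tag?
    false
  end
end

class SketchTag < SketchNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each { |c| c.parent = self }
  end

  def tag?
    true
  end
end

# Models: <p><em>one</em> and <strong>two</strong></p>
em     = SketchTag.new("em", [SketchText.new("one")])
middle = SketchText.new(" and ")
strong = SketchTag.new("strong", [SketchText.new("two")])
para   = SketchTag.new("p", [em, middle, strong])

puts em.next_sibling.tag?               # false (the " and " text node)
puts em.next_element.name               # strong
puts strong.parent.name                 # p
puts para.children.map(&:tag?).inspect  # [true, false, true]
```

This is why the tag? check matters when walking siblings: next_sibling may hand back a text node that has no name or attributes.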
Some examples of how to use match:
```ruby
# test if the node is a "span" tag
node.match :tag => "span"

# test if the node's parent is a "div"
node.match :parent => { :tag => "div" }

# test if any of the node's ancestors are "table" tags
node.match :ancestor => { :tag => "table" }

# test if any of the node's immediate children are "em" tags
node.match :child => { :tag => "em" }

# test if any of the node's descendants are "strong" tags
node.match :descendant => { :tag => "strong" }

# test if the node has between 2 and 4 span tags as immediate children
node.match :children => { :count => 2..4, :only => { :tag => "span" } }

# get funky: test to see if the node is a "div", has a "ul" ancestor
# and an "li" parent (with "class" = "enum"), and whether or not it has
# a "span" descendant that contains text matching /hello world/:
node.match :tag => "div",
           :ancestor => { :tag => "ul" },
           :parent => { :tag => "li",
                        :attributes => { :class => "enum" } },
           :descendant => { :tag => "span",
                            :child => /hello world/ }

# Sometimes you get an HTML::Text node in return.
# If it is a text node, these are the ways of using match on it:
node.match(conditions)
# * if +conditions+ is a string, it must be a substring of the node's content
# * if +conditions+ is a regular expression, it must match the node's content
# * if +conditions+ is a hash, it must contain a <tt>:content</tt> key that
#   is either a string or a regexp, and which is interpreted as described above.
```
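To make the three text-node rules concrete, here is a small sketch that follows them as they are worded above. `text_matches?` is a hypothetical helper of my own, not part of scrapi, and the library's real implementation may differ in detail.

```ruby
# Hypothetical helper that mirrors the three rules stated above.
# It is NOT scrapi's code, only an illustration of the described behavior.
def text_matches?(content, conditions)
  case conditions
  when String
    # a string must be a substring of the content
    content.include?(conditions)
  when Regexp
    # a regular expression must match the content
    !!(content =~ conditions)
  when Hash
    # a hash must carry a :content key holding a string or a regexp
    return false unless conditions.key?(:content)
    text_matches?(content, conditions[:content])
  else
    false
  end
end

content = "hello world, again"
puts text_matches?(content, "hello world")          # true
puts text_matches?(content, /^hello/)               # true
puts text_matches?(content, :content => /goodbye/)  # false
puts text_matches?(content, :tag => "span")         # false (no :content key)
```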
Code for the new methods text and inner_html:
```ruby
module HTML
  class Tag < Node
    # Returns the text content of this node and its descendants,
    # skipping the contents of <script> tags.
    def text
      data = ""
      children.each do |child|
        if child.tag?
          next if child.name == "script"
          data += child.text
        else
          # HTML::Text nodes have no #text method, so read their content
          data += child.content
        end
      end
      data
    end

    # Returns the HTML of all child nodes, like JavaScript's innerHTML.
    def inner_html
      data = ""
      children.each { |e| data += e.to_s }
      data
    end
  end
end
```
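To show what these two methods return, here is a self-contained sketch. `DemoText` and `DemoTag` are hypothetical stand-ins, not scrapi's classes, with only enough behavior (tag?, name, children, content, to_s) to run without the gem. The to_s here ignores attributes, which the real classes would include.

```ruby
# Hypothetical stand-ins for HTML::Text and HTML::Tag (NOT scrapi's classes).
DemoText = Struct.new(:content) do
  def tag?
    false
  end

  def to_s
    content
  end
end

DemoTag = Struct.new(:name, :children) do
  def tag?
    true
  end

  # Simplified serialization: tag name only, no attributes.
  def to_s
    "<#{name}>#{children.map(&:to_s).join}</#{name}>"
  end

  # Text of this node and its descendants, skipping <script> contents.
  def text
    children.reduce("") do |data, child|
      if !child.tag?
        data + child.content
      elsif child.name == "script"
        data
      else
        data + child.text
      end
    end
  end

  # Serialized HTML of the children only, without this node's own tag.
  def inner_html
    children.map(&:to_s).join
  end
end

# Models: <div>Hello <b>world</b><script>track()</script></div>
div = DemoTag.new("div", [
  DemoText.new("Hello "),
  DemoTag.new("b", [DemoText.new("world")]),
  DemoTag.new("script", [DemoText.new("track()")])
])

puts div.text        # Hello world
puts div.inner_html  # Hello <b>world</b><script>track()</script>
```

Note that text drops both the markup and the script body, while inner_html keeps everything except the outer div tag itself.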
Assaf Arkin has written a well-documented cheat sheet on scrapi. Surprisingly, I found a few selectors left unexplained:
- E, F: alternate selectors (matches any element that matches E or F)
- E:content('hello world'): content match
- E[foo=bar][foo2=bar2]: element E with attributes foo and foo2 having the values bar and bar2 respectively

Wonderful documentation. I too have been searching high and low for some sort of reference for the HTML::Tag class & this is just the ticket. Thanks!