While working with scrapi, I found there is no external documentation for the HTML::Tag class. This article is here to ensure no one has to say that again.
In scrapi, HTML::Node represents an HTML node, which can be one of two types: HTML::Text for a text node and HTML::Tag for an HTML tag node. Here is a code snippet in which scrapi returns the HTML node as an HTML::Tag object.
    require 'rubygems'
    require 'scrapi'
    require 'open-uri'

    # Scrape the first span.metaspan on the page and return it as a node.
    # (On Ruby 3.0 and later, open-uri needs URI.open instead of open.)
    node = (Scraper.define do
      process_first "span.metaspan", :node=>:element
      result :node
    end).scrape(open("http://www.quarkruby.com"){|f| f.read})

    node.class
    # => HTML::Tag
| METHOD | DESCRIPTION |
| tag? | true if this is an HTML::Tag. HTML::Tag and HTML::Text have many different methods, so it's an important check. |
| name | returns name of the tag |
| attributes | returns hash of attributes name/value pairs |
| children | returns array of children nodes, which may contain both HTML::Text and HTML::Tag class objects |
| next_sibling | returns next adjacent node, which can be HTML::Text or HTML::Tag |
| next_element | returns next HTML::Tag node |
| to_s | returns html content of this node and its children |
| find | match on subtree of this node (not root node). |
| position | byte index of matched node in html content |
| line | line number of matched node in html content |
| parent | returns parent HTML::Tag |
| match | matches the given node with given arguments (explained with examples below) |
The next two methods are not part of the HTML::Tag class, but I frequently find them useful. Their code is available below.
| inner_html | JavaScript's innerHTML :) |
| text | returns the text content of this node and its children |
    # test if the node is a "span" tag
    node.match :tag => "span"

    # test if the node's parent is a "div"
    node.match :parent => { :tag => "div" }

    # test if any of the node's ancestors are "table" tags
    node.match :ancestor => { :tag => "table" }

    # test if any of the node's immediate children are "em" tags
    node.match :child => { :tag => "em" }

    # test if any of the node's descendants are "strong" tags
    node.match :descendant => { :tag => "strong" }

    # test if the node has between 2 and 4 span tags as immediate children
    node.match :children => { :count => 2..4, :only => { :tag => "span" } }

    # get funky: test to see if the node is a "div", has a "ul" ancestor
    # and an "li" parent (with "class" = "enum"), and whether or not it has
    # a "span" descendant that contains text matching /hello world/:
    node.match :tag => "div",
               :ancestor => { :tag => "ul" },
               :parent => { :tag => "li",
                            :attributes => { :class => "enum" } },
               :descendant => { :tag => "span",
                                :child => /hello world/ }

    # Sometimes you get an HTML::Text node in return.
    # If it's a text node, these are the ways of using match on that node:
    node.match(conditions)
    # * if +conditions+ is a string, it must be a substring of the node's content
    # * if +conditions+ is a regular expression, it must match the node's content
    # * if +conditions+ is a hash, it must contain a <tt>:content</tt> key that
    #   is either a string or a regexp, and which is interpreted as described above.
(The above examples are shamelessly copied from html/node.rb in scrapi.)
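The three text-node rules are easy to misread, so here is a minimal sketch of them in plain Ruby. The helper name text_node_match? is my own invention and is not part of scrapi; it only mirrors the rules as stated in the comments above, working on a plain string instead of a real HTML::Text node.

```ruby
# Hypothetical helper (not scrapi code): applies the documented
# text-node rules to a plain content string.
def text_node_match?(content, conditions)
  case conditions
  when String
    content.include?(conditions)            # must be a substring
  when Regexp
    !(content =~ conditions).nil?           # must match the content
  when Hash
    # Only the :content key is looked at; its value follows the rules above.
    conditions.key?(:content) &&
      text_node_match?(content, conditions[:content])
  else
    false
  end
end

puts text_node_match?("hello world", "world")
puts text_node_match?("hello world", /^hello/)
puts text_node_match?("hello world", { :content => /bye/ })
```

The first two calls satisfy the string and regexp rules; the last one fails because the :content regexp does not match.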
    module HTML
      class Tag < Node
        # Text content of this node and its descendants,
        # skipping the contents of <script> tags.
        def text
          data = ""
          children.each do |child|
            if child.respond_to?(:name)
              # HTML::Tag child: recurse, but ignore scripts
              next if child.name == "script"
              data += child.text
            else
              # HTML::Text child: to_s returns its content
              data += child.to_s
            end
          end
          data
        end

        # Markup of all children, without this node's own tag.
        def inner_html
          data = ""
          children.each {|e| data += e.to_s}
          data
        end
      end
    end
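To see the difference between the two helpers without installing scrapi, here is a rough sketch using stand-in classes. TextNode and TagNode are made up for this illustration and are much simpler than scrapi's real HTML::Text and HTML::Tag (no attributes, no positions); only the behaviour of text and inner_html is meant to carry over.

```ruby
# Stand-in node classes, assumed for illustration only.
TextNode = Struct.new(:content) do
  def to_s
    content
  end
end

TagNode = Struct.new(:name, :children) do
  # Serialise the tag and its subtree (attributes left out for brevity).
  def to_s
    "<#{name}>#{inner_html}</#{name}>"
  end

  # Markup of the children only, like JavaScript's innerHTML.
  def inner_html
    children.map { |child| child.to_s }.join
  end

  # Text of the subtree, skipping <script> elements.
  def text
    children.map do |child|
      if child.is_a?(TextNode)
        child.content
      elsif child.name == "script"
        ""
      else
        child.text
      end
    end.join
  end
end

para = TagNode.new("p", [
  TextNode.new("Hello "),
  TagNode.new("em", [TextNode.new("world")]),
  TagNode.new("script", [TextNode.new("alert(1)")])
])

puts para.inner_html  # markup of the children, script included
puts para.text        # text only, script skipped
```

inner_html keeps the nested markup, while text flattens it and drops the script body.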
Assaf Arkin has written a well-documented cheat sheet on scrapi. Surprisingly, I found a few selectors left unexplained:
- E, F: alternate selectors (matches any element matched by E or by F)
- E:content('hello world'): content match
- E[foo=bar][foo2=bar2]: element E with attributes foo and foo2 having values bar and bar2 respectively
QScraper is a wrapper over scrapi that provides an Hpricot-like interface.
Motivation: The Hpricot interface is simple and easy to use, while scrapi is more powerful because of bundle scraping and anonymous classes. I was using Hpricot for quick testing and checking, but scrapi for project implementation. To avoid working with two HTML scrapers, I wrote this wrapper over scrapi.
Bundle Scraping: This refers to extracting multiple attributes of an element from a web page in a single parse. Most screen-scraping tools can extract multiple elements in one call, but not multiple attributes of each element. Let's take blog scraping as an example: each blog post is an element, and I would like to extract several attributes of each post, such as the author, publication date, title and content. Rather than making individual calls like doc.search(author_selector), doc.search(published_selector) etc., I would like to do doc.find(author_selector, date_selector, title ...).
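To make the idea concrete, here is a toy sketch in plain Ruby that does not use scrapi or Hpricot at all. Element and ToyPage are invented for the illustration, and a "selector" is reduced to a bare tag name; the only point is that separate searches walk the document once per selector, while a bundle collects every field in a single walk.

```ruby
# Toy element type, assumed for illustration; not scrapi's HTML::Tag.
Element = Struct.new(:name, :text, :children)

class ToyPage
  attr_reader :walks   # number of full document walks so far

  def initialize(root)
    @root = root
    @walks = 0
  end

  # One walk per call: collects the text of every element called +name+.
  def search(name)
    @walks += 1
    found = []
    each_element(@root) { |el| found << el.text if el.name == name }
    found
  end

  # One walk in total: collects the text for all requested names at once.
  def bundle(*names)
    @walks += 1
    found = {}
    names.each { |name| found[name] = [] }
    each_element(@root) { |el| found[el.name] << el.text if found.key?(el.name) }
    found
  end

  private

  # Depth-first visit of +node+ and everything below it.
  def each_element(node, &block)
    block.call(node)
    node.children.each { |child| each_element(child, &block) }
  end
end

# Builds one fake blog post with an author and a title.
def post(author, title)
  Element.new("post", nil, [
    Element.new("author", author, []),
    Element.new("title", title, [])
  ])
end

blog = Element.new("blog", nil, [post("alice", "Hello"), post("bob", "Bye")])

one_by_one = ToyPage.new(blog)
authors = one_by_one.search("author")
titles = one_by_one.search("title")
puts "separate searches: #{one_by_one.walks} walks"

bundled = ToyPage.new(blog)
fields = bundled.bundle("author", "title")
puts "bundle: #{bundled.walks} walk"
```

Both approaches end up with the same authors and titles; the bundle simply gets there with one pass over the page.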
    require 'rubygems'
    require 'scrapi'
    require 'open-uri'

    class QScraper
      attr_accessor :document
      attr_accessor :rule_list

      # Constructor for this class.
      # Takes a source (URI, HTML string, file name or HTML::Node) as argument.
      # When local_file is true and the string names an existing file,
      # the contents of that file are used as the document.
      # Example: qScraper = QScraper.new("index.html")
      def initialize source, local_file=true
        case source
        when URI, HTML::Node
          @document = source
        when String
          @document = (local_file && File.exist?(source)) ? File.read(source) : source
        else
          raise ArgumentError, "Can only scrape URI, String or HTML::Node"
        end
      end

      # Searches the source for the given CSS selector and
      # returns all the matching elements as HTML::Tag objects.
      # Example:
      #   qScraper.search("div.highlight")
      def search selector
        (Scraper.define do
          array :elems
          process selector, :elems=>:element
          result :elems
        end).scrape(@document)
      end

      # Returns the first node matching this selector.
      def search_first selector
        (Scraper.define do
          process_first selector, :elem=>:element
          result :elem
        end).scrape(@document)
      end

      # Adds a rule for parsing. You can add many rules and
      # then parse them all with a call to parse_rules.
      # Example:
      #   q.add_rule("br#lgpd", {:name=>"@id"})
      def add_rule selector, extractors
        @rule_list ||= {}
        @rule_list.merge!({selector=>extractors})
      end

      # Adds several rules at once, as a selector => extractors hash.
      # Example:
      #   q.add_rules({"table.searchResults2>tbody>tr>td:nth-of-type(3)>a"=>
      #                  {:href=>"@href",:name=>:text},
      #                "table.searchResults2>tr>td:nth-of-type(3)>a.cat"=>
      #                  {:reviewcount=>:text}})
      def add_rules selector_extractors_hash
        @rule_list ||= {}
        @rule_list.merge!(selector_extractors_hash)
      end

      # Parses the document using the rule set built up by
      # add_rule and add_rules.
      def parse_rules parser=nil
        # "result :a, :b, ..." listing every extracted field
        result_string = "result "
        @rule_list.values.each do |j|
          j.keys.each {|i|
            result_string << ":" << i.to_s << ", "
          }
        end
        result_string = result_string[0..-3]   # drop the trailing ", "
        array_string = result_string.sub(/result/,"array")

        # one "process <selector>, <extractors>" line per rule
        create_process_string = ""
        @rule_list.keys.each {|i|
          create_process_string << "process \"" << i << "\", "
          create_process_string << @rule_list[i].inspect.gsub(/[{}]/,"") << "\n"
        }

        # evaluate the generated scraper definition
        res = String.module_eval %{Scraper.define do
          #{array_string}
          #{create_process_string}
          #{result_string}
        end}
        parser ||= :tidy
        res.scrape(@document,:parser=>parser)
      end
    end

    # Small extension to scrapi's HTML::Tag class
    # for the "inner_html" method.
    module HTML
      class Tag < Node
        def inner_html
          data = ""
          children.each {|e| data += e.to_s}
          data
        end
      end
    end
    # Example page to parse: http://www.google.co.in/search?q=scrapi
    doc = QScraper.new(open("http://www.google.co.in/search?q=scrapi"){|f| f.read})

    # Extract the title elements of the search results;
    # the returned objects are of type HTML::Tag
    result = doc.search("a.l")

    # Now extract the title and link of each search result
    doc.add_rules({"a.l"=>{:url=>"@href",:title=>:text}})
    result = doc.parse_rules
    # result.url   returns an array of 10 URLs
    # result.title returns the array of titles
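It can help to see the scraper definition that a rule list turns into before scrapi evaluates it. The sketch below only builds that text and never calls scrapi; scraper_source is a hypothetical helper of mine, not a QScraper method, and it formats the extractor pairs explicitly rather than relying on Hash#inspect, whose output differs between Ruby versions.

```ruby
# Hypothetical helper (not part of QScraper): returns the scrapi DSL text
# for a selector => extractors rule list, without evaluating it.
def scraper_source(rule_list)
  # Every field named in any rule, e.g. [:url, :title]
  fields = rule_list.values.map { |extractors| extractors.keys }.flatten
  field_list = fields.map { |name| ":#{name}" }.join(", ")

  # One "process" line per selector
  process_lines = rule_list.map do |selector, extractors|
    pairs = extractors.map { |name, source| ":#{name} => #{source.inspect}" }
    "  process #{selector.inspect}, #{pairs.join(', ')}"
  end

  (["Scraper.define do", "  array #{field_list}"] +
   process_lines +
   ["  result #{field_list}", "end"]).join("\n")
end

puts scraper_source({ "a.l" => { :url => "@href", :title => :text } })
```

For the Google example, this prints a definition with one array line, one process line for "a.l" and one result line, which is the shape of the definition that gets evaluated and then run against the document.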
| Hpricot | scrapi (with QScraper) |
| doc=Hpricot(string/file/uri) | doc=QScraper.new(string/file/uri) |
| doc.search("css selector") [or doc/"css selector"] | doc.search("css selector") |
| doc.at("css selector") | doc.search_first("css selector") |
| (doc/"css selector").inner_html | doc.search_first("css selector").inner_html |
| (doc/"css selector").to_html | doc.search_first("css selector").to_s |
| Allows HTML page editing | --N/A-- |
| --N/A-- | Allows bundled scraping (example above) |
TODO: Add XPath support in scrapi.