QScraper is a wrapper over scrapi that provides an Hpricot-like interface.
Motivation: The Hpricot interface is simple and easy to use, while scrapi is more powerful because of bundle scraping and anonymous classes. I was using Hpricot for quick testing and checking, but scrapi for project implementation. To avoid working with two HTML scrapers, I wrote this wrapper over scrapi.
Bundle Scraping: This refers to extracting multiple attributes of an element from a web page in a single parse. Most screen-scraping tools can extract multiple elements, but not multiple attributes of each element. Take blog scraping as an example: each blog post is an element, and I would like to extract several of its attributes, such as the author, publication date, title and content. Rather than making individual calls like doc.search(author_selector), doc.search(published_selector) etc., I would like to do doc.find(author_selector, date_selector, title ...).
Here comes QScraper :)
```ruby
require 'rubygems'
require 'scrapi'
require 'open-uri'

class QScraper
  attr_accessor :document
  attr_accessor :rule_list

  # Constructor for this class.
  # Takes a source (file name, HTML string, URI or HTML::Node) as argument.
  # Example: qScraper = QScraper.new("index.html")
  def initialize source, local_file=true
    case source
    when URI, HTML::Node
      @document = source
    when String
      # Read the file if the string names an existing local file,
      # otherwise treat the string itself as the HTML to scrape.
      @document = (local_file && File.exist?(source)) ? File.read(source) : source
    else
      raise ArgumentError, "Can only scrape URI, String or HTML::Node"
    end
  end

  # Searches for the selector in the given source and
  # returns all the matching elements as HTML::Tag objects.
  # The selector is a CSS selector.
  # Example:
  #   qScraper.search("div.highlight")
  def search selector
    output = (Scraper.define do
      array :elems
      process selector, :elems => :element
      result :elems
    end).scrape(@document)
    output
  end

  # Returns the first node matching this selector.
  def search_first selector
    output = (Scraper.define do
      process_first selector, :elem => :element
      result :elem
    end).scrape(@document)
    output
  end

  # Adds a rule for parsing. You can add many rules and
  # then parse them all with a call to parse_rules.
  # Example:
  #   q.add_rule("br#lgpd", {:name=>"@id"})
  def add_rule selector, extractors
    @rule_list ||= {}
    @rule_list.merge!({selector=>extractors})
  end

  # Example:
  #   q.add_rules({"table.searchResults2>tbody>tr>td:nth-of-type(3)>a"=>
  #                  {:href=>"@href",:name=>:text},
  #                "table.searchResults2>tr>td:nth-of-type(3)>a.cat"=>
  #                  {:reviewcount=>:text}})
  def add_rules selector_extractors_hash
    @rule_list ||= {}
    @rule_list.merge!(selector_extractors_hash)
  end

  # Parses the rule set specified through add_rule / add_rules.
  def parse_rules parser=nil
    # Build "result :a, :b" and "array :a, :b" from all extractor names.
    result_string = "result "
    @rule_list.values.each do |j|
      j.keys.each {|i| result_string << ":" << i.to_s << ", " }
    end
    result_string = result_string[0..-3]
    array_string = result_string.sub(/result/,"array")

    # Build one "process" line per selector.
    create_process_string = ""
    @rule_list.keys.each {|i|
      create_process_string << "process \"" << i << "\", "
      create_process_string << @rule_list[i].inspect.gsub(/[{}]/,"") << "\n"
    }

    # Evaluate the generated scraper definition.
    res = String.module_eval %{Scraper.define do
      #{array_string}
      #{create_process_string}
      #{result_string}
    end}
    parser ||= :tidy
    res.scrape(@document,:parser=>parser)
  end
end

# Small extension to scrapi's HTML::Tag class
# for an "inner_html" method.
module HTML
  class Tag < Node
    def inner_html
      data = ""
      children.each {|e| data += e.to_s}
      data
    end
  end
end
```
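To make the string building in parse_rules easier to follow, here is a minimal stand-alone sketch that only produces the scraper definition text for a given rule list, without evaluating it and without needing scrapi. The helper name build_definition is hypothetical and not part of QScraper; the output layout mirrors what parse_rules assembles but is formatted for readability.

```ruby
# Hypothetical helper (not part of QScraper): returns the source text of
# the scraper definition that parse_rules would build for rule_list.
def build_definition(rule_list)
  # Every extractor name across all selectors, e.g. ":url, :title".
  fields = rule_list.values.flat_map(&:keys).map { |k| ":#{k}" }.join(", ")

  # One "process" line per selector, with its extractor hash inlined.
  process_lines = rule_list.map do |selector, extractors|
    pairs = extractors.map { |name, source| "#{name.inspect} => #{source.inspect}" }.join(", ")
    "  process #{selector.inspect}, #{pairs}"
  end

  ["Scraper.define do",
   "  array #{fields}",
   *process_lines,
   "  result #{fields}",
   "end"].join("\n")
end

rules = { "a.l" => { :url => "@href", :title => :text } }
puts build_definition(rules)
```

Printing the generated text like this is a handy way to check a rule set before handing it to the real parser.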
```ruby
# Example page to parse: http://www.google.co.in/search?q=scrapi
# (on Ruby 3.0 and later, use URI.open instead of open for URLs)
doc = QScraper.new(open("http://www.google.co.in/search?q=scrapi"){|f| f.read})

# Extracting the title elements of the search results;
# returned objects are of HTML::Tag type
result = doc.search("a.l")

# Now extracting the title and link for each of the search results.
doc.add_rules({"a.l"=>{:url=>"@href",:title=>:text}})
result = doc.parse_rules
#=> result.url returns an array of 10 url links
#=> result.title returns the array of titles
```
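Since the bundled result comes back as one array per extractor, a natural next step is to pair the values up per search result. The sketch below uses a plain Struct as a stand-in for the object returned by parse_rules, so it runs without scrapi. The helper pair_results and the sample data are hypothetical, and it assumes both arrays have the same length and order.

```ruby
# Stand-in for the object returned by parse_rules; assumed to expose
# one array per extractor name (url, title), as in the example above.
Result = Struct.new(:url, :title)

# Hypothetical helper: turn the parallel arrays into one hash per result.
def pair_results(result)
  result.url.zip(result.title).map { |url, title| { :url => url, :title => title } }
end

sample = Result.new(["http://example.com/a", "http://example.com/b"],
                    ["First", "Second"])
pair_results(sample).each { |row| puts "#{row[:title]}: #{row[:url]}" }
```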
| Hpricot | scrapi (with QScraper) |
|---|---|
| doc=Hpricot(string/file/uri) | doc=QScraper.new(string/file/uri) |
| doc.search("css selector") [or doc/"css selector"] | doc.search("css selector") |
| doc.at("css selector") | doc.search_first("css selector") |
| (doc/"css selector").inner_html | doc.search_first("css selector").inner_html |
| (doc/"css selector").to_html | doc.search_first("css selector").to_s |
| Allows HTML page editing | --N/A-- |
| --N/A-- | Allows bundled scraping (example above) |
TODO: Add XPath support in scrapi.
