We have been using the Ruby library Scrapi quite a lot for HTML scraping in QuarkRank and other projects. Most of the time, I want to extract specific information from a page and dump it directly into the database. A few steps kept repeating in my code, so to make it more DRY, I have enhanced Scrapi so that manipulating the extracted information becomes easier.
Let's consider an example: for each of the top 250 movies at IMDB, I want to extract and store in the DB the following properties:
- imdb_id
- name
- release date
- rating
- tagline
- runtime
- director
The process involves the following steps:
- Get the list of the top 250 movies and their URLs.
- For each URL, scrape the corresponding properties.
- For each property, do any cleaning/formatting required (for example, parse the release date into a Date object).
- Build a hash out of the extracted information.
- Create a Movie object using the hash from the previous step.
With plain Scrapi, the code looks like this:

```ruby
require 'scrapi'
require 'open-uri'
require 'date'

# Step 1: Get list of top 250 movies and their URLs.
# - Go to http://www.imdb.com/chart/top.
# - Open up firebug, with firequark installed.
# - Right click on any movie from the list and select "Inspect This" option from menu. In firebug you will see one selected element.
# - Right click on this element in firebug, and select option "get css selector" (put 250 in popup input box).
# - This returns "font>a", which is css selector for getting all 250 movies from this page.
links = (Scraper.define do
  array :urls
  process "font>a", :urls=>"@href"
  result :urls
end).scrape(open("http://www.imdb.com/chart/top").read)

# Step 2: For each movie, get its properties
links.each do |url|
  nurl = "http://www.imdb.com" + url
  hash = (Scraper.define do
    process "h5:content('Release Date:')", :date=>:element
    process "div.general>b:content('User Rating:')", :rating=>:element
    process "h5:content('Tagline:')", :tagline=>:element
    process "h1", :title=>:text
    result :date, :rating, :title, :tagline
  end).scrape(open(nurl).read)

  # Step 3: Cleaning and hash building of properties.
  object = {}
  date_val = hash.date.next_sibling.text
  object[:date] = Date.parse(date_val[0..(date_val.rindex('(') || -1)]).to_s
  object[:rating] = hash.rating.next_element.text.gsub(/\/.*/,"")
  # Keep only the part of the title before the trailing "(year)".
  object[:title] = hash.title[0..(hash.title.rindex('(') || 0) - 1].strip
  object[:tagline] = hash.tagline.next_sibling.text.strip if hash.tagline
  object[:imdb_id] = url.gsub(/\/title\//,"").gsub(/\//,"")

  # save the hash to database
  Movie.create(object)
end
```
Two things bother me about this code:
- Each struct member/value pair has to be inserted into the hash explicitly.
- I would love to do manipulations inline, i.e. while extracting an element. This matters most for small modifications, like getting the text of the element next to the selected element, or converting a string into a Date object.
Fix1: Assign the value returned by the block to the associated variable.
So, I could write rules like:
```ruby
# get the text content of element next to "div#footer" into "footer_text" variable
process "div#footer", :footer_text do |x|
  x.next_element.text
end
```
```diff
Index: lib/scraper/base.rb
===================================================================
--- lib/scraper/base.rb (revision 268)
+++ lib/scraper/base.rb (working copy)
@@ -296,7 +296,8 @@
             extracts.each do |extract|
               extract.bind(self).call(element)
             end
-            true
+            #true
+            map.keys[0]
           end
         end

@@ -499,6 +500,9 @@
           extractor = selector.pop
         elsif selector.last.is_a?(Hash)
           extractor = extractor(selector.pop)
+        elsif selector.last.is_a?(Symbol)
+          block_key = selector.pop
+          attr_accessor block_key
         end
         if block && extractor
           # Ugly, but no other way to chain two calls bound to the
@@ -509,11 +513,24 @@
           extractor2 = instance_method(:__extractor)
           remove_method :__extractor
           extractor = lambda do |element|
-            extractor1.bind(self).call(element)
+            #extractor1.bind(self).call(element)
+            key = extractor1.bind(self).call(element)
             extractor2.bind(self).call(element)
           end
         elsif block
-          extractor = block
+          #extractor = block
+          operator = "="
+          if !@arrays.nil? and @arrays.include? block_key.to_sym
+            extractor = lambda do |element|
+              send(block_key.to_s+"=",[]) if send(block_key.to_s).nil?
+              send(block_key.to_s+"=", send(block_key.to_s) + [block.call(element)])
+            end
+          else
+            extractor = lambda do |element|
+              send(block_key.to_s+"=", block.call(element))
+            end
+          end
         end
         raise ArgumentError, "Missing extractor: the last argument tells us what to extract" unless extractor
         # And if we think the extractor is the last argument,
```
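To make the idea behind the patch easier to follow, here is a minimal sketch in plain Ruby that does not depend on Scrapi. `MiniScraper`, `WordScraper`, and `run` are hypothetical names invented for this illustration, and regular expressions stand in for CSS selectors. It only mirrors the two behaviours the patch adds: the block's return value is stored in the accessor named by the symbol, and names declared with `array` accumulate values instead of being overwritten.

```ruby
# Toy stand-in for the patched behaviour. MiniScraper is hypothetical and
# is not part of Scrapi; regexps play the role of CSS selectors.
class MiniScraper
  # Names declared here accumulate values instead of being overwritten.
  def self.array(*names)
    (@arrays ||= []).concat(names)
  end

  def self.arrays
    @arrays || []
  end

  def self.rules
    @rules ||= []
  end

  # Register a rule: the block is called for each matching item and its
  # return value is stored in the accessor named by `key`.
  def self.process(matcher, key, &block)
    attr_accessor key
    rules << [matcher, key, block]
  end

  def run(items)
    items.each do |item|
      self.class.rules.each do |matcher, key, block|
        next unless matcher === item
        value = block.call(item)
        if self.class.arrays.include?(key)
          # Array-valued key: append to whatever has been collected so far.
          send("#{key}=", (send(key) || []) + [value])
        else
          # Scalar key: the latest match wins.
          send("#{key}=", value)
        end
      end
    end
    self
  end
end

class WordScraper < MiniScraper
  array :shouted
  process(/\A[a-z]+\z/, :shouted) { |word| word.upcase }
  process(/\A\d+\z/, :last_number) { |digits| digits.to_i }
end

result = WordScraper.new.run(%w[foo 12 bar 7])
```

After the run, `result.shouted` holds one transformed value per matching item, while `result.last_number` holds only the value from the last match.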
Fix2: Struct to Hash conversion
```ruby
class Struct
  def to_hash
    hash = {}
    members.each do |key|
      hash[key.to_sym] = self.send(key)
    end
    hash
  end
end
```
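A short usage sketch of the conversion, runnable without Scrapi. `ScrapedMovie` is a made-up stand-in for the struct a scraper returns, and the values are placeholders rather than scraped data.

```ruby
# Same monkey-patch as above, repeated so this snippet runs on its own.
class Struct
  def to_hash
    hash = {}
    members.each { |key| hash[key.to_sym] = send(key) }
    hash
  end
end

# ScrapedMovie is a hypothetical stand-in for the struct Scrapi returns.
ScrapedMovie = Struct.new(:title, :rating)
movie = ScrapedMovie.new("Example Title", "8.5")

attrs = movie.to_hash
attrs[:imdb_id] = "tt0000000" # placeholder id, added after conversion
```

On Ruby 2.0 and later, the built-in `Struct#to_h` returns the same hash, so the patch is only needed on older versions. Note also that `to_hash` is Ruby's implicit-conversion hook, so defining it on `Struct` may change how structs behave when passed where a Hash is expected.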
Using the above two fixes, my refactored code becomes cleaner and shorter:
```ruby
links.each do |url|
  nurl = "http://www.imdb.com" + url
  hash = (Scraper.define do
    process "h5:content('Release Date:')", :date do |elem|
      date_val = elem.next_sibling.text
      Date.parse(date_val[0..date_val.rindex('(')||-1])
    end
    process "div.general>b:content('User Rating:')", :rating do |elem|
      elem.next_element.text.gsub(/\/.*/,"")
    end
    process "h5:content('Tagline:')", :tagline do |elem|
      elem.next_sibling.text
    end
    process "h1", :title do |elem|
      elem.text[0..(elem.text.rindex('(')||0)-1].strip
    end
    result :date, :rating, :title, :tagline
  end).scrape(open(nurl).read)

  object = hash.to_hash
  object[:imdb_id] = url.gsub(/\/title\//,"").gsub(/\//,"")
  Movie.create(object)
end
```
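The string clean-up inside those blocks can be tried on its own, without hitting IMDB. The sketch below uses illustrative input strings shaped like the text the selectors return (they are not scraped from a live page), and `before_last_paren` is a hypothetical helper name introduced here.

```ruby
require 'date'

# Illustrative inputs only; not taken from a live page.
raw_title  = "The Shawshank Redemption (1994)"
raw_date   = "14 October 1994 (USA)"
raw_rating = "9.2/10"
url        = "/title/tt0111161/"

# Drop everything from the last "(" onwards; keep the whole string if
# there is no parenthesis at all.
def before_last_paren(text)
  idx = text.rindex('(')
  (idx ? text[0...idx] : text).strip
end

title   = before_last_paren(raw_title)            # text before "(1994)"
date    = Date.parse(before_last_paren(raw_date)) # Date object
rating  = raw_rating.sub(/\/.*/, "")              # part before the slash
imdb_id = url.gsub(/\/title\//, "").gsub(/\//, "") # bare id from the path
```

The helper differs from the inline slice in one edge case: for a string that starts with "(", it returns an empty string, whereas the inline slice returns the whole string.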
NOTE: When you run the above code snippets, you will get an error saying that "text" is not defined for the HTML::Tag class. Copy the following code into your file; it defines this method for the HTML::Text and HTML::Tag classes.
```ruby
module HTML
  class Text
    def text
      @content
    end
  end

  class Tag
    def text
      return @content if respond_to? :content
      data = ""
      children.each do |child|
        next if child.respond_to?(:name) && child.name == "script"
        data += child.text
      end
      data
    end
  end
end
```
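To see what this recursive `text` method computes without loading Scrapi, here is a sketch on a tiny hand-built tree. `TextNode` and `TagNode` are hypothetical miniature node types made up for this example; the real HTML::Text and HTML::Tag classes have more to them.

```ruby
# Hypothetical miniature node types, not Scrapi's own classes.
TextNode = Struct.new(:content) do
  def text
    content
  end
end

TagNode = Struct.new(:name, :children) do
  # Concatenate the text of all descendants, skipping <script> subtrees.
  def text
    children.reject { |c| c.respond_to?(:name) && c.name == "script" }
            .map(&:text)
            .join
  end
end

tree = TagNode.new("div", [
  TextNode.new("Rating: "),
  TagNode.new("b", [TextNode.new("9.2")]),
  TagNode.new("script", [TextNode.new("track();")]),
  TextNode.new("/10")
])

flattened = tree.text
```

The nested `<b>` contributes its text, while the contents of the `<script>` node are left out of `flattened`.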
I hope this is helpful to the Scrapi community.

With these changes I can’t call process(selector, &block). I have to always specify a :dummy symbol for the return value of the block.
Yet another problem with these changes: I am not able to access instance methods of my class derived from Scraper::Base from these blocks. I think these blocks are not bound to the local instance properly. The following code is an example from base.rb of Scrapi, and it doesn't work.
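Independently of that Scrapi example, the binding problem itself can be reproduced in a few lines of plain Ruby. This is a sketch of one likely cause, not a confirmed diagnosis of the patch: a block invoked with `block.call` keeps the `self` it was defined with (the class body), whereas `instance_exec` evaluates it against the instance. `Base`, `Derived`, `rule`, and `clean` are hypothetical names.

```ruby
class Base
  # Store a block at class level, the way a rule definition would.
  def self.rule(&block)
    @block = block
  end

  def self.block
    @block
  end

  # The block keeps the `self` it was defined with, which is the class.
  def run_with_call(value)
    self.class.block.call(value)
  end

  # The block is evaluated with `self` set to this instance.
  def run_with_instance_exec(value)
    instance_exec(value, &self.class.block)
  end
end

class Derived < Base
  def clean(text)
    text.strip
  end

  rule { |text| clean(text) }
end

scraper = Derived.new
bound = scraper.run_with_instance_exec("  hi  ")

# With plain `call`, `clean` is looked up on the class and is not found.
unbound_error = begin
  scraper.run_with_call("  hi  ")
  nil
rescue NoMethodError => e
  e
end
```

If the patched extractor used something like `instance_exec(element, &block)` in place of `block.call(element)`, instance methods might become reachable again, but that has not been tested against Scrapi here.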