Scrapi enhancements


We have been using the Ruby library Scrapi quite a lot for HTML scraping in QuarkRank and other projects. Most of the time, I want to extract specific information from a page and dump it directly into the database. A few steps were repeated regularly in my code, so to make it more DRY I have enhanced Scrapi so that manipulating the extracted information becomes easier.

Let's consider an example: for each of the top 250 movies at IMDB, I want to extract and store in the DB the following properties:

  1. imdb_id
  2. name
  3. release date
  4. rating
  5. tagline
  6. runtime
  7. director
So, the process would be:
  1. Get the list of top 250 movies and their URLs.
  2. For each URL, scrape the corresponding properties.
  3. For each property, do cleaning/formatting (like parse Release Date into Date object) if required.
  4. Build a hash out of extracted information.
  5. Create Movie object using hash in above step.
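
The cleaning in step 3 can be sketched in plain Ruby, independent of Scrapi. The helper names and the sample input strings below are assumptions made for illustration (the real page text may differ); the slicing and regex logic follows the same approach as the scraping code in this post.

```ruby
require 'date'

# Hypothetical helpers; the names are illustrative and not part of Scrapi.

# Drops a trailing "(...)" note, then parses the rest as a date.
def clean_release_date(raw)
  cut = raw.rindex('(')
  Date.parse((cut ? raw[0...cut] : raw).strip)
end

# Keeps only the part before the first slash, e.g. the score in "score/10".
def clean_rating(raw)
  raw.strip.sub(%r{/.*}m, '')
end

# Drops a trailing "(year)" from the page heading.
def clean_title(raw)
  cut = raw.rindex('(')
  (cut ? raw[0...cut] : raw).strip
end

# Turns a relative link such as "/title/tt0000000/" into a bare id.
def imdb_id_from(path)
  path.gsub(%r{/title/}, '').gsub('/', '')
end

# Assumed sample inputs, not real scraped data.
movie = {
  :date    => clean_release_date("14 October 1994 (USA)"),
  :rating  => clean_rating("9.0/10"),
  :title   => clean_title("Example Movie (1994)"),
  :imdb_id => imdb_id_from("/title/tt0000000/")
}
```

Keeping each rule in its own small method makes it easy to test the cleaning separately from the network access.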

The code for the above process would be something like :

require 'scrapi'
require 'open-uri'
require 'date'  # needed for Date.parse below

# Step 1: Get list of top 250 movies and their URLs.
#  - Go to http://www.imdb.com/chart/top.
#  - Open up firebug, with firequark installed.
#  - Right click on any movie from the list and select "Inspect This" option from menu. In firebug you will see one selected element.
#  - Right click on this element in firebug, and select option "get css selector" (put 250 in popup input box).
#  - This returns "font>a", which is css selector for getting all 250 movies from this page.
links = (Scraper.define do
  array :urls
  process "font>a", :urls=>"@href"
  result :urls
end).scrape(open("http://www.imdb.com/chart/top").read)

# Step 2: For each movie, get its properties
links.each do |url|
  nurl = "http://www.imdb.com" + url
  hash = (Scraper.define do
    process "h5:content('Release Date:')", :date=>:element
    process "div.general>b:content('User Rating:')", :rating=>:element
    process "h5:content('Tagline:')", :tagline=>:element
    process "h1", :title=>:text
    result :date, :rating, :title, :tagline
  end).scrape(open(nurl).read)

  # Step 3: Cleaning and hash building of properties.
  object = {}
  date_val = hash.date.next_sibling.text
  object[:date] = Date.parse(date_val[0..date_val.rindex('(')||-1]).to_s
  object[:rating] = hash.rating.next_element.text.gsub(/\/.*/,"")
  object[:title] = hash.title[0..(hash.title.rindex('(')||0)-1].strip  # drop the trailing "(year)"
  object[:tagline] = hash.tagline.next_sibling.text.strip if hash.tagline
  object[:imdb_id] = url.gsub(/\/title\//,"").gsub(/\//,"")

  # save the hash to database
  Movie.create(object)
end

Problems in the above code:
  1. Each struct member/value pair has to be inserted into the hash explicitly.
  2. I would like to do manipulations inline, i.e. at the point where an element is extracted. This matters particularly when only a small modification is needed, such as getting the text of the element next to the selected one, or converting a string into a Date object.

The following two enhancements to Scrapi get rid of these problems and simplify my code and effort:

Fix1: Assign the value of a block manipulation to the associated variable.
So, I could write rules like:

# get the text content of element next to "div#footer" into "footer_text" variable
process "div#footer", :footer_text do |x|
  x.next_element.text
end
This is the patch which provides the above functionality:

Index: lib/scraper/base.rb
===================================================================
--- lib/scraper/base.rb (revision 268)
+++ lib/scraper/base.rb (working copy)
@@ -296,7 +296,8 @@
           extracts.each do |extract|
             extract.bind(self).call(element)
           end
-          true
+          #true
+          map.keys[0]
         end
       end
 
@@ -499,6 +500,9 @@
           extractor = selector.pop
         elsif selector.last.is_a?(Hash)
           extractor = extractor(selector.pop)
+        elsif selector.last.is_a?(Symbol)
+          block_key = selector.pop
+          attr_accessor block_key
         end
         if block && extractor
           # Ugly, but no other way to chain two calls bound to the
@@ -509,11 +513,24 @@
           extractor2 = instance_method(:__extractor)
           remove_method :__extractor
           extractor = lambda do |element|
-            extractor1.bind(self).call(element)
+            #extractor1.bind(self).call(element)
+            key = extractor1.bind(self).call(element)
             extractor2.bind(self).call(element)
           end
         elsif block
-          extractor = block
+          #extractor = block
+          operator = "="
+          if !@arrays.nil? and @arrays.include? block_key.to_sym
+            extractor = lambda do |element|
+              send(block_key.to_s+"=",[]) if send(block_key.to_s).nil?
+              send(block_key.to_s+"=", send(block_key.to_s) + [block.call(element)])
+            end
+          else
+            extractor = lambda do |element|
+              send(block_key.to_s+"=", block.call(element))
+            end
+          end
         end
         raise ArgumentError, "Missing extractor: the last argument tells us what to extract" unless extractor
         # And if we think the extractor is the last argument,
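
The idea behind the patch can be illustrated without Scrapi. The sketch below is a toy model, not the patched library: `ToyScraper`, its `array` and `process` methods, and the `Demo` class are stand-ins invented for this example. It shows the two behaviours the patch adds: a block's return value is assigned to the accessor named by the symbol, and values are appended instead when the symbol was declared as an array. Two details are assumptions of this sketch and are not in the patch: a `process` call without a symbol simply runs the block, and blocks are run with `instance_exec` so that they can call instance methods.

```ruby
# Toy stand-in for Scraper::Base; it only models "store the block's result".
class ToyScraper
  class << self
    def arrays
      @arrays ||= []
    end

    def rules
      @rules ||= []
    end

    # Mirrors the role of Scrapi's `array :name` declaration.
    def array(*names)
      arrays.concat(names)
    end

    # process(:key) { |element| ... } stores the block's value under :key.
    # process { |element| ... } without a key just runs the block
    # (assumption of this sketch, not behaviour of the patch).
    def process(key = nil, &block)
      attr_accessor key if key
      rules << [key, block]
    end
  end

  def scrape(elements)
    elements.each do |element|
      self.class.rules.each do |key, block|
        value = instance_exec(element, &block)
        next unless key
        if self.class.arrays.include?(key)
          # Array keys accumulate one value per matched element.
          send("#{key}=", (send(key) || []) + [value])
        else
          # Scalar keys are overwritten; the last match wins.
          send("#{key}=", value)
        end
      end
    end
    self
  end
end

class Demo < ToyScraper
  array :lengths
  process(:shout)   { |word| word.upcase }
  process(:lengths) { |word| word.length }
end

demo = Demo.new.scrape(%w[foo quux])
demo.shout    # => "QUUX"
demo.lengths  # => [3, 4]
```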

Fix2: Struct to Hash conversion

class Struct
  def to_hash
    hash = {}
    members.each do |key|
      hash[key.to_sym] = self.send(key)
    end
    hash
  end
end
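
As a usage sketch: on Ruby 2.0 and later, `Struct#to_h` is built in and returns a hash with the same symbol keys, so it can serve the same purpose as the `to_hash` method in this post. `ScrapedMovie` and the values below are placeholders invented for the example, not real scraped data.

```ruby
# Placeholder struct standing in for the result of a scrape.
ScrapedMovie = Struct.new(:title, :rating, :tagline)

scraped = ScrapedMovie.new("Example Movie", "9.0", nil)

# Convert all members to a hash in one call, then add fields
# that do not come from the struct itself.
attrs = scraped.to_h
attrs[:imdb_id] = "tt0000000"
```

The resulting hash can be passed straight to a constructor that accepts attributes, which removes the per-field assignments.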

Using the above two fixes, my refactored code becomes cleaner and shorter:

links.each do |url|
  nurl = "http://www.imdb.com" + url
  hash = (Scraper.define do
    process "h5:content('Release Date:')", :date do |elem|
      date_val = elem.next_sibling.text
      Date.parse(date_val[0..date_val.rindex('(')||-1])
    end
    process "div.general>b:content('User Rating:')", :rating do |elem|
      elem.next_element.text.gsub(/\/.*/,"")
    end
    process "h5:content('Tagline:')", :tagline do |elem|
      elem.next_sibling.text
    end
    process "h1", :title do |elem|
      elem.text[0..(elem.text.rindex('(')||0)-1].strip
    end
    result :date, :rating, :title, :tagline
  end).scrape(open(nurl).read)
  object = hash.to_hash
  object[:imdb_id] = url.gsub(/\/title\//,"").gsub(/\//,"")
  Movie.create(object)
end

NOTE: When you run the above code snippets, you will get an error saying that "text" is not defined for the HTML::Tag class. Copy the following code into your file; it defines this method for the HTML::Text and HTML::Tag classes.

module HTML
  class Text
    def text
      @content
    end
  end
  class Tag
    def text
      return @content if respond_to? :content
      data=""
      children.each do |child|
        next if child.respond_to?:name and child.name=="script"
        data += child.text
      end
      data
    end
  end
end
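
The recursion in `text` can be tried out without the HTML parser. In the sketch below, `TextNode` and `TagNode` are hypothetical stand-ins for `HTML::Text` and `HTML::Tag`; they model only the `content`, `name` and `children` attributes that the method relies on.

```ruby
# Hypothetical stand-in for HTML::Text: a leaf holding raw text.
TextNode = Struct.new(:content) do
  def text
    content
  end
end

# Hypothetical stand-in for HTML::Tag: a named node with children.
TagNode = Struct.new(:name, :children) do
  # Concatenates the text of all descendants, skipping <script> subtrees.
  def text
    children.reject { |child| child.respond_to?(:name) && child.name == "script" }
            .map(&:text)
            .join
  end
end

tree = TagNode.new("div", [
  TextNode.new("Hello "),
  TagNode.new("script", [TextNode.new("alert(1)")]),
  TagNode.new("b", [TextNode.new("world")])
])

tree.text  # => "Hello world"
```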

I hope this is helpful to the Scrapi community.
Filed in ruby on rails
Posted on 30 January
7 comments. Updated on 23 February
Comments

  1. Arvind, April 28, 2008 @ 11:53 AM

    With these changes I can’t call process(selector, &block). I have to always specify a :dummy symbol for the return value of the block.

  2. Arvind, May 03, 2008 @ 02:23 AM

    Yet another problem with these changes: I am not able to access instance methods of my class derived from Scraper::Base from within these blocks. I think the blocks are not bound to the local instance properly. The following code is an example from base.rb of scrapi, and it doesn’t work.


    class ScrapePosts < Scraper::Base
      # Select the title of a post
      selector :select_title, "h2"

      # Select the body of a post
      selector :select_body, ".body"

      # All elements with class name post.
      process ".post" do |element|
        title = select_title(element)
        body = select_body(element)
        @posts << Post.new(title, body)
        true
      end

      attr_reader :posts
    end

    posts = ScrapePosts.scrape(html).posts

  3. d, July 07, 2008 @ 12:41 AM

    With these changes I can’t call process(selector, &block). I have to always specify a :dummy symbol for the return value of the block.

  4. Pet Dog Blog, July 30, 2008 @ 08:03 PM

    D, you’re right, I believe, if I remember it correctly. You can’t call process(selector, &block).

  5. Driver download, September 20, 2008 @ 07:22 AM
  6. David, September 25, 2008 @ 02:09 PM

    This is a great tool for scraping and data for seocollections

    Keep it up

  7. Jaram Eiwitten, October 11, 2008 @ 10:47 AM

    Thanks for the code, turns out that this was exactly what I needed to do (but couldn’t on my own). The entire Scrapi library is brilliant :-)