We have been using the Ruby library Scrapi quite a lot for HTML scraping in QuarkRank and other projects. Most of the time, I want to extract specific information from a page and dump it directly into the database. A few steps kept repeating in my code, so to make it more DRY I have enhanced Scrapi so that manipulating the extracted information becomes easier.
Let's consider an example: for each of the top 250 movies at IMDB, I want to extract and store in the DB the following properties:
- imdb_id
- name
- release date
- rating
- tagline
- runtime
- director
So, the process would be:
- Get the list of top 250 movies and their URLs.
- For each URL, scrape the corresponding properties.
- For each property, do any cleaning/formatting required (like parsing the release date into a Date object).
- Build a hash out of the extracted information.
- Create a Movie object using the hash from the previous step.
The code for the above process would be something like:
|
require 'scrapi'
require 'open-uri'
require 'date'
# Step 1: Get the list of top 250 movies and their URLs.
# - Go to http://www.imdb.com/chart/top.
# - Open Firebug, with Firequark installed.
# - Right-click any movie in the list and select "Inspect This" from the menu. Firebug shows one selected element.
# - Right-click this element in Firebug and select "get css selector" (put 250 in the popup input box).
# - This returns "font>a", the CSS selector that matches all 250 movies on this page.
links = (Scraper.define do
  array :urls
  process "font>a", :urls=>"@href"
  result :urls
end).scrape(open("http://www.imdb.com/chart/top").read)
# Step 2: For each movie, get its properties.
links.each do |url|
  nurl = "http://www.imdb.com" + url
  hash = (Scraper.define do
    process "h5:content('Release Date:')", :date=>:element
    process "div.general>b:content('User Rating:')", :rating=>:element
    process "h5:content('Tagline:')", :tagline=>:element
    process "h1", :title=>:text
    result :date, :rating, :title, :tagline
  end).scrape(open(nurl).read)
  # Step 3: Clean the properties and build the hash.
  object = {}
  date_val = hash.date.next_sibling.text
  object[:date] = Date.parse(date_val[0..date_val.rindex('(')||-1]).to_s
  object[:rating] = hash.rating.next_element.text.gsub(/\/.*/,"")
  # Keep only the part of the title before the trailing "(year)".
  object[:title] = hash.title[0..(hash.title.rindex('(')||0)-1].strip
  object[:tagline] = hash.tagline.next_sibling.text.strip if hash.tagline
  object[:imdb_id] = url.gsub(/\/title\//,"").gsub(/\//,"")
  # Save the hash to the database.
  Movie.create(object)
end
|
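To make the cleaning expressions in Step 3 more concrete, here is a small standalone sketch that applies the same ideas to plain strings, without Scrapi or a network connection. The sample strings and the `before_last_paren` helper are assumptions made up for illustration; they are not part of Scrapi or of the code above.

```ruby
require 'date'

# Hypothetical sample strings, shaped like the text the scraper hands back.
raw_url    = "/title/tt0111161/"
raw_title  = "The Shawshank Redemption (1994)"
raw_rating = "9.2/10"
raw_date   = "14 October 1994 (USA)"

# Hypothetical helper: keep everything before the last "(",
# or the whole string when there is no "(" at all.
def before_last_paren(str)
  idx = str.rindex('(')
  (idx ? str[0...idx] : str).strip
end

imdb_id = raw_url.gsub(/\/title\//, "").gsub(/\//, "")  # drop "/title/" and slashes
title   = before_last_paren(raw_title)                  # drop the "(year)" suffix
rating  = raw_rating.gsub(/\/.*/, "")                   # drop "/10"
date    = Date.parse(before_last_paren(raw_date))       # drop "(country)", then parse

puts imdb_id, title, rating, date
```

Pulling the "cut at the last parenthesis" logic into one helper is just a convenience for this sketch; the scraper code in this post keeps the slicing inline.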
Problems in the above code:
- Each struct member/value pair has to be inserted into the hash explicitly.
- I would love to do manipulations inline, i.e. while extracting an element, particularly when I only need a small modification such as getting the text of the element next to the selected one, or converting a string into a Date object.
The following two enhancements to Scrapi get rid of these problems and simplify both my code and my effort:
Fix1: Assign the value returned by the block to the associated variable.
So, I could write rules like:
|
# get the text content of element next to "div#footer" into "footer_text" variable
process "div#footer", :footer_text do |x|
  x.next_element.text
end
|
This is the patch that provides the above functionality:
|
Index: lib/scraper/base.rb
===================================================================
--- lib/scraper/base.rb (revision 268)
+++ lib/scraper/base.rb (working copy)
@@ -296,7 +296,8 @@
extracts.each do |extract|
extract.bind(self).call(element)
end
- true
+ #true
+ map.keys[0]
end
end
@@ -499,6 +500,9 @@
extractor = selector.pop
elsif selector.last.is_a?(Hash)
extractor = extractor(selector.pop)
+ elsif selector.last.is_a?(Symbol)
+ block_key = selector.pop
+ attr_accessor block_key
end
if block && extractor
# Ugly, but no other way to chain two calls bound to the
@@ -509,11 +513,24 @@
extractor2 = instance_method(:__extractor)
remove_method :__extractor
extractor = lambda do |element|
- extractor1.bind(self).call(element)
+ #extractor1.bind(self).call(element)
+ key = extractor1.bind(self).call(element)
extractor2.bind(self).call(element)
end
elsif block
- extractor = block
+ #extractor = block
+ operator = "="
+ if !@arrays.nil? and @arrays.include? block_key.to_sym
+ extractor = lambda do |element|
+ send(block_key.to_s+"=",[]) if send(block_key.to_s).nil?
+ send(block_key.to_s+"=", send(block_key.to_s) + [block.call(element)])
+ end
+ else
+ extractor = lambda do |element|
+ send(block_key.to_s+"=", block.call(element))
+ end
+ end
end
raise ArgumentError, "Missing extractor: the last argument tells us what to extract" unless extractor
# And if we think the extractor is the last argument,
|
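The core idea of the patch, stripped of Scrapi's internals, can be sketched as follows: when a rule ends in a symbol plus a block, the block's return value is sent to the matching setter, and it is appended instead of assigned when the symbol was declared as an array. `MiniScraper`, `process_block`, `feed` and `Demo` are hypothetical names invented for this sketch; this is not Scrapi's real code, only the same dispatch in miniature.

```ruby
# Hypothetical stand-in for a scraper base class (not Scrapi itself).
class MiniScraper
  # Declare which keys should accumulate values instead of being overwritten.
  def self.array(*names)
    @arrays = (@arrays || []) + names
  end

  def self.rules
    @rules ||= []
  end

  # Register a rule: the block's return value is stored under `key`.
  def self.process_block(key, &block)
    attr_accessor key
    is_array = (@arrays || []).include?(key)
    rule = lambda do |scraper, element|
      value = block.call(element)
      if is_array
        # Array key: start from [] on first use, then append.
        scraper.send("#{key}=", (scraper.send(key) || []) + [value])
      else
        # Scalar key: plain assignment, last match wins.
        scraper.send("#{key}=", value)
      end
    end
    rules << rule
  end

  # Run every rule against every element (here: plain strings).
  def feed(elements)
    elements.each do |element|
      self.class.rules.each { |rule| rule.call(self, element) }
    end
    self
  end
end

class Demo < MiniScraper
  array :lengths
  process_block(:lengths)      { |s| s.length }
  process_block(:last_upcased) { |s| s.upcase }
end

demo = Demo.new.feed(%w[foo quux])
puts demo.lengths.inspect
puts demo.last_upcased
```

Here `lengths` collects one value per element because it was declared with `array`, while `last_upcased` simply keeps the value from the last element processed.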
Fix2: Struct to Hash conversion
|
class Struct
  def to_hash
    hash = {}
    members.each do |key|
      hash[key.to_sym] = self.send(key)
    end
    hash
  end
end
|
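As a quick usage sketch, this is how the conversion behaves on a plain Struct. `MovieResult` and its values are hypothetical stand-ins for the struct a scraper returns, and the patch is repeated only so that the snippet runs on its own.

```ruby
# Same monkey patch as above, repeated so the sketch is self-contained.
class Struct
  def to_hash
    hash = {}
    members.each { |key| hash[key.to_sym] = send(key) }
    hash
  end
end

# Hypothetical struct standing in for a scraper result.
MovieResult = Struct.new(:title, :rating)
result = MovieResult.new("The Shawshank Redemption", "9.2")

# Convert once, then add the keys that do not come from the page itself.
attrs = result.to_hash
attrs[:imdb_id] = "tt0111161"

puts attrs.inspect
```

On Ruby 2.0 and later, the built-in `Struct#to_h` returns the same symbol-keyed hash, so the patch may not be needed there.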
Using the above two fixes, my refactored code becomes cleaner and shorter:
|
links.each do |url|
  nurl = "http://www.imdb.com" + url
  hash = (Scraper.define do
    process "h5:content('Release Date:')", :date do |elem|
      date_val = elem.next_sibling.text
      Date.parse(date_val[0..date_val.rindex('(')||-1])
    end
    process "div.general>b:content('User Rating:')", :rating do |elem|
      elem.next_element.text.gsub(/\/.*/,"")
    end
    process "h5:content('Tagline:')", :tagline do |elem|
      elem.next_sibling.text
    end
    process "h1", :title do |elem|
      elem.text[0..(elem.text.rindex('(')||0)-1].strip
    end
    result :date, :rating, :title, :tagline
  end).scrape(open(nurl).read)
  object = hash.to_hash
  object[:imdb_id] = url.gsub(/\/title\//,"").gsub(/\//,"")
  Movie.create(object)
end
|
NOTE: When you run the above code snippets, you will get an error saying that "text" is not defined for the HTML::Tag class. Copy the following code into your file; it defines this method for the HTML::Text and HTML::Tag classes.
|
module HTML
  class Text
    def text
      @content
    end
  end
  class Tag
    def text
      return @content if respond_to? :content
      data = ""
      children.each do |child|
        next if child.respond_to?(:name) and child.name == "script"
        data += child.text
      end
      data
    end
  end
end
|
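The recursion behind that `text` method is easier to see on a tiny hand-built tree. The sketch below uses two hypothetical classes, `TextNode` and `TagNode`, as stand-ins for HTML::Text and HTML::Tag; they are invented for illustration and only mimic the parts the method relies on (`children`, `name` and a text payload).

```ruby
# Hypothetical stand-in for HTML::Text: a leaf that just holds a string.
class TextNode
  attr_reader :content

  def initialize(content)
    @content = content
  end

  def text
    content
  end
end

# Hypothetical stand-in for HTML::Tag: a named node with children.
class TagNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # Concatenate the text of all descendants, skipping <script> subtrees.
  def text
    visible = children.reject { |c| c.respond_to?(:name) && c.name == "script" }
    visible.map { |c| c.text }.join
  end
end

page = TagNode.new("div", [
  TextNode.new("Runtime: "),
  TagNode.new("script", [TextNode.new("track();")]),
  TagNode.new("b", [TextNode.new("142 min")])
])

puts page.text
```

Skipping `script` children matters because inline JavaScript would otherwise end up mixed into the extracted text.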
I hope this is helpful to the Scrapi community.