UTF-8 and HTML screen scraping in Ruby on Rails


Before I start: if you are unfamiliar with character sets (terms like UTF-8 and Unicode), I recommend reading Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The problem this article addresses is how to handle foreign or accented characters in HTML screen scraping. We encountered it while working on our website information project, Quarkbase.com.

The configuration and tools we will be using are Rails 2.1.2, scrapi 1.2.0, Hpricot 0.8.1 and curb 0.4.4.0. The example website we will try to scrape is http://196m.cn

At Quarkbase, when we aggregate information from a website, scraping text that contains non-English characters doesn't always work. For example, when we extract the description for http://196m.cn from its HTML page, the result is the first line after 'overview' in the screenshot below:



Filed in ruby on rails
Posted on 22 September
Updated on 25 June


Now, it's not that only I can't understand it; no one can :P, because these are not valid characters at all. So we need to fix this and fetch readable UTF-8 data while scraping.
The first step is to fetch the HTML document using any of Ruby's URL libraries (net/http, open-uri or curb). We will be using curb here.


require 'curb'
curl_object = Curl::Easy.perform("http://196m.cn")
Now, if I look at curl_object.content, I see a lot of "?" and other garbage characters. If I try to parse it using scrapi or Hpricot, I get some crap that makes no sense and is unreadable to all.
Before scraping curl_object with Hpricot or scrapi, I should tell you that Hpricot requires UTF-8 input. Scrapi has no such requirement, but it does not always work with other character encodings.
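
Since Hpricot wants UTF-8, it can help to check whether the fetched bytes are already valid UTF-8 before converting anything. The snippet below is only a sketch under stated assumptions: it needs Ruby 2.0 or later (the article's setup predates that), and utf8? is a hypothetical helper name, not part of curb or Hpricot.

```ruby
# Sketch: check whether raw bytes form valid UTF-8.
# `utf8?` is a hypothetical helper name used only for illustration.
def utf8?(raw)
  # dup so the caller's string keeps its original encoding tag
  raw.dup.force_encoding("UTF-8").valid_encoding?
end

gb_bytes   = "\u8336".encode("GB2312").b  # one Chinese character as GB2312 bytes
utf8_bytes = "\u8336".b                   # the same character as UTF-8 bytes

puts utf8?(gb_bytes)    # false
puts utf8?(utf8_bytes)  # true
```

A true result does not prove the page really is UTF-8 (some byte sequences are valid in several encodings), so treat it as a hint only.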

So we should convert the HTML data in curl_object to UTF-8, and for that we need the document's current encoding. One way to find it:


if meta = curl_object.content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
  if meta = meta[0].match(/charset=([\w-]*)/i)
    encoding = meta[1]
  end
end
=> 'gb2312'
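
The same lookup can be wrapped in a small helper that is easier to reuse and test. This is a sketch, not the code we run at Quarkbase: declared_charset is a hypothetical name, it assumes Ruby 1.9 or later, and it also accepts the shorter meta charset="..." form, which the regex above does not cover.

```ruby
# Sketch: pull the declared charset out of an HTML document's <meta> tags.
# `declared_charset` is a hypothetical helper, not part of any library.
def declared_charset(html)
  # Treat the page as raw bytes so the regex never trips on invalid sequences.
  bytes = html.dup.force_encoding("BINARY")
  bytes.scan(/<meta\b[^>]*>/i).each do |tag|
    m = tag.match(/charset\s*=\s*["']?\s*([\w-]+)/i)
    return m[1].downcase if m
  end
  nil # no charset declared in any meta tag
end

sample = '<html><head><meta http-equiv="Content-Type" ' \
         'content="text/html; charset=gb2312"></head></html>'

puts declared_charset(sample).inspect                 # "gb2312"
puts declared_charset("<p>no meta here</p>").inspect  # nil
```

Returning nil when nothing is declared lets the caller decide on a fallback instead of guessing silently.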
Next, we use Ruby's iconv library to convert the text from one encoding to the other:

require 'iconv'
Iconv.conv('utf-8', encoding, curl_object.content) 
The above conversion will fail for this site, since iconv is not able to convert every character from the source encoding to the target one. One solution is to ask Iconv to convert as much as it can by appending //IGNORE to the target encoding, which discards the characters it cannot convert:

Iconv.conv('utf-8//IGNORE', encoding, curl_object.content) 
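
On newer Rubies the iconv library is no longer bundled (it was removed from the standard library in Ruby 2.0), so the line above will not load there. A similar best-effort conversion can be sketched with String#encode. Note that this swaps in a different technique from the one used in this article, and to_utf8 is a hypothetical helper name.

```ruby
# Sketch: transcode raw page bytes to UTF-8, dropping whatever cannot be converted.
# Uses String#encode instead of Iconv; `to_utf8` is a hypothetical helper name.
def to_utf8(raw, source_encoding)
  raw.dup
     .force_encoding(source_encoding)  # label the bytes with their real encoding
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

# Build some GB2312 bytes with one stray invalid byte (0xFF) in the middle.
gb_bytes = "\u8336".encode("GB2312").b + "\xFF".b + "tea".b

puts to_utf8(gb_bytes, "gb2312")  # prints the character followed by "tea"
```

Replacing with an empty string mirrors what //IGNORE does: bad input disappears silently, so some characters may be missing from the result.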

Cool! Now we have the content in UTF-8. Let's scrape it!

Using Hpricot:


Hpricot(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).search("meta")
=> #<Hpricot::Elements[{emptyelem <meta content="text/html; charset=gb2312" http-equiv="Content-Type">}, {emptyelem <meta name="description" content="包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站">}, {emptyelem <meta name="keywords" content="茶,茶叶,茶网,茶文化,茶艺培训,茶道,tea,茶文化,茶业,红茶,绿茶,铁观音,茶饮,茶艺,茶餐厅,茶具,紫砂,龙井,碧螺春,乌龙茶,大红袍,普洱茶">}]>

Using Scrapi:

Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).to_s
=> "<meta name=\"description\" content=\"&#229;&#338;&#8230;&#230;&#8249;&#172;&#232;&#338;&#182;
&#229;&#182;&#229;&#352;&#232;&#338;&#182;&#233;&#165;&#174;&#230;&#8211;
&#8482;,&#232;&#338;&#182;&#233;&#163;&#376;&#229;&#8220;,&#230;&#339;&#232;
&#338;&#182;,&#232;&#338;&#182;&#230;&#176;&#8216;&.......
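
The entity numbers in that output hint at what went wrong: they look like the UTF-8 bytes of the description read one byte at a time as Windows-1252 characters and then written out as numeric entities. That is an inference from the output, not something confirmed from tidy's internals. The sketch below (Ruby 1.9 or later assumed) reproduces the first six entities from the first two characters of the description.

```ruby
# Sketch: reproduce the garbled entities by misreading UTF-8 bytes as Windows-1252.
utf8_text = "\u5305\u62EC"  # the first two characters of the description

misread = utf8_text.dup
                   .force_encoding("Windows-1252")  # pretend each byte is one character
                   .encode("UTF-8")

entities = misread.codepoints.map { |cp| format("&#%d;", cp) }.join
puts entities  # &#229;&#338;&#8230;&#230;&#8249;&#172;
```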

This is not the expected output. Scrapi comes with two parsers: tidy and html_parser. Scrapi uses tidy by default, so let's try html_parser:

t=Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content), :parser=>:html_parser).to_s
=> "<meta name=\"description\" content=\"包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站\">"

Yep, that works now! This is how the site information looks now: