Before I start: if you are unfamiliar with character sets (that is, terms like UTF-8 and Unicode mean little to you), I recommend reading Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
The problem this article addresses is how to handle foreign or accented characters in HTML screen scraping. We ran into it while working on our website information project, Quarkbase.com.
The configuration and tools we will be using are: rails 2.1.2, scrapi 1.2.0, hpricot 0.8.1 and curb 0.4.4.0. The example website we will try to scrape is http://196m.cn
At Quarkbase, when we aggregate information from a website, scraping text that contains non-English characters doesn't always work. For example, when we extract the description for http://196m.cn from its HTML page, the result is the first line after 'overview' in the screenshot below:
Now, it's not that only I can't understand it; no one can :P, because these are not valid characters at all. So we need to fix this and fetch readable UTF-8 data while scraping.
The first step is to fetch the HTML document using any of Ruby's URL libraries (net/http, open-uri or curb). We will use curb here.
require 'curb'
curl_object = Curl::Easy.perform("http://196m.cn")
Now, if I call curl_object.content, I see a lot of "?" and other garbage characters. If I try to parse it using scrapi or hpricot, I get output that makes no sense and is unreadable to everyone.
Before scraping curl_object with Hpricot or Scrapi, I should tell you that Hpricot requires UTF-8 input. Scrapi has no such requirement, but it does not always work with other character encodings.
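Since the parsers expect UTF-8, it can help to check whether the fetched bytes are already valid UTF-8 before converting anything. The sketch below assumes Ruby 1.9 or later, where strings carry an encoding; valid_utf8? is a hypothetical helper name, not part of curb, Hpricot or Scrapi.

```ruby
# Hypothetical helper (not from any of the libraries above): reports whether
# a byte string can be read as UTF-8 without any invalid sequences.
def valid_utf8?(bytes)
  # dup so the caller's string keeps its original encoding tag
  bytes.dup.force_encoding('UTF-8').valid_encoding?
end

# Simulate a page body served as GB2312, as raw bytes off the wire.
gb2312_bytes = "茶叶".encode('GB2312').b

puts valid_utf8?("茶叶")        # already UTF-8
puts valid_utf8?(gb2312_bytes)  # GB2312 bytes are not valid UTF-8 here
```

Note that a pure-ASCII page passes this check whatever charset it declares, so the check only tells you when conversion is definitely needed.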
So we should convert the HTML data in curl_object to UTF-8, and for that we need to know the document's current encoding. One way to find it is to read the charset from the content-type meta tag:
if meta = curl_object.content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
  if meta = meta[0].match(/charset=([\w-]*)/i)
    encoding = meta[1]
  end
end
=> 'gb2312'
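The same lookup can be wrapped in a small function so it can be exercised without fetching the page. This is only a sketch: detect_charset, its default argument and the sample HTML are assumptions made for illustration, and it matches on a binary copy of the string so that, on Ruby 1.9 or later, bytes that are not valid UTF-8 cannot raise during matching.

```ruby
# Hypothetical helper: finds the charset declared in a
# <meta http-equiv="Content-Type" ...> tag, or returns `default`.
META_CONTENT_TYPE = /<meta\s[^>]*http-equiv=['"]?content-type['"]?[^>]*/i
CHARSET_VALUE     = /charset=['"]?([\w-]+)/i

def detect_charset(html, default = nil)
  # .b gives a binary copy, so the regex only ever sees raw bytes
  tag = html.b[META_CONTENT_TYPE]
  return default unless tag

  match = tag.match(CHARSET_VALUE)
  match ? match[1].downcase : default
end

sample = '<html><head><meta http-equiv="Content-Type" ' \
         'content="text/html; charset=gb2312"></head></html>'
puts detect_charset(sample)
puts detect_charset('<html></html>', 'utf-8')
```

Passing a default covers pages that declare no charset in the markup at all.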
Next, we use Ruby's iconv library to convert the text from one encoding to another:
require 'iconv'
Iconv.conv('utf-8', encoding, curl_object.content)
The above conversion fails for this site because Iconv cannot convert every character from the source encoding to the target one. One solution is to ask Iconv to convert as much as it can and skip the rest, by appending //IGNORE to the target encoding:
Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)
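Iconv is no longer bundled with Ruby from 2.0 onwards, so on a newer runtime String#encode can play roughly the same role. The sketch below is a swapped-in technique, not what this article uses: to_utf8 and its lenient flag are hypothetical names, and the stray 0xFF byte is added only to simulate a page with a broken byte sequence.

```ruby
# Hypothetical helper: converts raw bytes from source_encoding to UTF-8.
# With lenient: true, bytes that cannot be converted are dropped, which is
# roughly what 'utf-8//IGNORE' asks Iconv to do.
def to_utf8(raw, source_encoding, lenient: true)
  options = lenient ? { invalid: :replace, undef: :replace, replace: '' } : {}
  raw.encode('UTF-8', source_encoding, **options)
end

# GB2312 bytes followed by one byte that is not valid GB2312.
broken_bytes = "茶叶".encode('GB2312').b + "\xFF".b

begin
  to_utf8(broken_bytes, 'GB2312', lenient: false)
rescue Encoding::InvalidByteSequenceError => e
  puts "strict conversion failed: #{e.message}"
end

puts to_utf8(broken_bytes, 'GB2312')
```

Dropping bytes silently loses data, so it is worth keeping the strict form around while debugging a new site.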
Cool! Now we have the content in UTF-8. Let's scrape it!
Using Hpricot:
require 'hpricot'
Hpricot(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).search("meta")
=> #<Hpricot::Elements[{emptyelem <meta content="text/html; charset=gb2312" http-equiv="Content-Type">}, {emptyelem <meta name="description" content="包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站">}, {emptyelem <meta name="keywords" content="茶,茶叶,茶网,茶文化,茶艺培训,茶道,tea,茶文化,茶业,红茶,绿茶,铁观音,茶饮,茶艺,茶餐厅,茶具,紫砂,龙井,碧螺春,乌龙茶,大红袍,普洱茶">}]>
Using Scrapi:
require 'scrapi'
Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).to_s
=> "<meta name=\"description\" content=\"包括茶
å¶åŠèŒ¶é¥®æ–
™,茶食å“,æœè
Œ¶,茶民&.......
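The garbled text in that output has a recognizable shape: sequences like "èŒ¶" are what the UTF-8 bytes of 茶 look like when they are re-read as a single-byte Western encoding. The sketch below reproduces that for one character, assuming Windows-1252 as the single-byte encoding. This is one plausible reading of the symptom, not a statement about what tidy does internally, and both helper names are hypothetical.

```ruby
# Hypothetical helpers illustrating the mojibake, not Scrapi or tidy code.

# Re-reads UTF-8 bytes as Windows-1252 and converts the result to UTF-8.
# Raises for bytes Windows-1252 leaves undefined (for example 0x8F).
def misread_as_cp1252(utf8_text)
  utf8_text.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Reverses the misreading when no bytes were lost along the way.
def undo_cp1252_misread(garbled)
  garbled.encode('Windows-1252').force_encoding('UTF-8')
end

garbled = misread_as_cp1252("茶")
puts garbled
puts undo_cp1252_misread(garbled)
```

Some characters cannot make the round trip because Windows-1252 leaves a few byte values undefined, so in general it is more reliable to avoid the misreading than to undo it afterwards.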
The output from the default parser is not what we expected. Scrapi comes with two parsers: tidy and html_parser. It uses tidy by default, so let's try html_parser instead:
t = Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content), :parser=>:html_parser).to_s
=> "<meta name=\"description\" content=\"包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站\">"
Yep, that works now! This is how the site information looks now: