Before I start: if you are unfamiliar with character sets (i.e. terms like UTF-8 and Unicode), I recommend reading Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
The problem statement of this article is how to handle foreign or accented characters in HTML screen scraping. We encountered it while working on our website information project, Quarkbase.com. The configuration and tools we will be using are Rails 2.1.2, scrapi 1.2.0, hpricot 0.8.1 and curb 0.4.4.0. The example website that we will try to scrape is http://196m.cn
At Quarkbase, when we aggregate information from a website, scraping information that contains non-English characters doesn't always work. For example, when we get the description for http://196m.cn from its HTML page, the extracted information is the first line after 'overview' in the screenshot below:
Now, it's not that only I can't understand it; no one can :P, because these are not valid characters at all. So we need to fix this issue and fetch readable UTF-8 data while scraping.
The first step is to fetch the HTML document using any of Ruby's URL libraries (net/http, open-uri or curb). We will be using curb here.
```ruby
require 'curb'
curl_object = Curl::Easy.perform("http://196m.cn")
```
When I look at curl_object.content, I see a lot of "?" and other garbage characters. If I try to parse it using Scrapi or Hpricot, I get output that makes no sense and is unreadable to everyone.
Before scraping curl_object using Hpricot or Scrapi, I should tell you that Hpricot requires its input to be UTF-8 encoded. Scrapi has no such requirement, but it does not always work with other character encodings.
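Since Hpricot expects UTF-8, it can help to check whether the fetched bytes are valid UTF-8 before parsing. The following is a minimal sketch, assuming Ruby 2.0 or later (which tracks string encodings). `valid_utf8?` is a hypothetical helper name, and the sample strings are made up for illustration rather than taken from http://196m.cn.

```ruby
# Returns true when the given bytes form a valid UTF-8 string.
# `valid_utf8?` is a hypothetical helper, not part of curb or Hpricot.
def valid_utf8?(bytes)
  # Work on a copy so the caller's string keeps its original encoding tag.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

ascii_html  = "<title>tea</title>"
utf8_html   = "<title>\u8336</title>"       # U+8336 written as an escape
legacy_html = "<title>\xB2\xE8</title>".b   # two bytes that are not valid UTF-8

puts valid_utf8?(ascii_html)
puts valid_utf8?(utf8_html)
puts valid_utf8?(legacy_html)
```

A check like this only says whether the bytes could be UTF-8. It does not say which encoding they actually are, which is why the page's declared charset is still needed.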
So, we should convert the HTML data in curl_object to UTF-8, for which we need the current encoding of the document. One way to find it is:
```ruby
if meta = curl_object.content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
  if meta = meta[0].match(/charset=([\w-]*)/i)
    encoding = meta[1]
  end
end
# => 'gb2312'
```
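The same idea can be wrapped in a small method that falls back to a default when no charset is declared. This is only a sketch: `detect_charset` and its `default` parameter are hypothetical names, and it only looks at content-type `<meta>` tags, as the snippet above does.

```ruby
# Hypothetical helper: returns the charset declared in a content-type <meta>
# tag (lowercased), or the given default when none is found.
def detect_charset(html, default = 'utf-8')
  # Scan the raw bytes, since the real encoding is not known yet and the
  # string may not be valid in whatever encoding it is currently tagged with.
  html.b.scan(/<meta\b[^>]*>/i).each do |tag|
    next unless tag =~ /http-equiv=['"]?content-type/i
    charset = tag[/charset=['"]?([\w-]+)/i, 1]
    return charset.downcase if charset
  end
  default
end

puts detect_charset('<meta http-equiv="Content-Type" content="text/html; charset=GB2312">')
puts detect_charset('<html><head><title>no meta</title></head></html>')
```

Returning a default instead of nil keeps the later conversion step simple, at the cost of guessing when the page declares nothing.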
With the encoding known, convert the content to UTF-8 using Iconv:
```ruby
require 'iconv'

Iconv.conv('utf-8', encoding, curl_object.content)
```
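If the content contains characters that cannot be converted, Iconv raises an error. Appending //IGNORE to the target encoding makes it skip those characters instead:
```ruby
Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)
```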
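Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer Rubies the same step is usually done with String#encode. The sketch below is a swapped-in alternative, not the approach used in this article. `to_utf8` is a hypothetical helper name, and the sample bytes are built in the script rather than fetched from the site.

```ruby
# Hypothetical helper: converts raw bytes in `source_encoding` to UTF-8,
# dropping anything that cannot be converted (similar in spirit to //IGNORE).
def to_utf8(raw, source_encoding)
  raw.encode(Encoding::UTF_8, source_encoding,
             invalid: :replace, undef: :replace, replace: '')
end

# Build sample GB2312 bytes from two escaped code points (U+8336, U+53F6).
gb_bytes = "\u8336\u53F6".encode('GB2312').b

clean = to_utf8(gb_bytes, 'gb2312')
dirty = to_utf8(gb_bytes + "\xFF".b, 'gb2312')   # one stray byte appended

puts clean == "\u8336\u53F6"
puts dirty.valid_encoding?
```

Passing the source encoding explicitly avoids relying on whatever encoding tag the fetched string happens to carry.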
Cool! Now we have the content in UTF-8. Let's scrape it!
Using Hpricot:
```ruby
Hpricot(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).search("meta")
# => #<Hpricot::Elements[{emptyelem <meta content="text/html; charset=gb2312" http-equiv="Content-Type">},
#      {emptyelem <meta name="description" content="包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站">},
#      {emptyelem <meta name="keywords" content="茶,茶叶,茶网,茶文化,茶艺培训,茶道,tea,茶文化,茶业,红茶,绿茶,铁观音,茶饮,茶艺,茶餐厅,茶具,紫砂,龙井,碧螺春,乌龙茶,大红袍,普洱茶">}]>
```
Using Scrapi:
```ruby
Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).to_s
# => "<meta name=\"description\" content=\"包括茶 å¶åŠèŒ¶é¥®æ– ™,茶食å“,æœè Œ¶,茶民&.......
```
This is not the expected output. Scrapi comes with two parsers: tidy and html_parser. By default Scrapi uses tidy, so let's try html_parser instead:
```ruby
t = Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content), :parser=>:html_parser).to_s
# => "<meta name=\"description\" content=\"包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站\">"
```
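The garbled output from the tidy parser looks like UTF-8 bytes being reinterpreted as a single-byte Windows-1252 encoding. That is an assumption based on the characters shown, not something verified against Scrapi or tidy. A minimal sketch that reproduces the same kind of garbling for a single character, using only Ruby's built-in string encodings:

```ruby
# U+8336 written as an escape, so the example does not depend on the
# encoding of the source file.
tea = "\u8336"

# Reinterpret the three UTF-8 bytes as Windows-1252, then transcode to UTF-8.
garbled = tea.b.force_encoding('Windows-1252').encode('UTF-8')

puts tea.bytes.map { |b| format('%02X', b) }.join(' ')             # E8 8C B6
puts garbled.codepoints.map { |c| format('U+%04X', c) }.join(' ')  # U+00E8 U+0152 U+00B6
```

One three-byte character turns into three unrelated characters, which matches the shape of the unreadable description returned by the default parser.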
