Before I start: if you are unfamiliar with character sets (that is, terms like UTF-8 and Unicode are new to you), I recommend reading Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
The problem this article addresses is how to handle foreign or accented characters in HTML screen scraping. We encountered it while working on our website information project, Quarkbase.com.
The configuration and tools we will be using are rails 2.1.2, scrapi 1.2.0, hpricot 0.8.1 and curb 0.4.4.0. The example website we will try to scrape is http://196m.cn
At Quarkbase, when we aggregate information from a website, scraping text that contains non-English characters doesn't always work. For example, when we extract the description for http://196m.cn from its HTML page, the result is the first line after 'overview' in the screenshot below:
Now, it's not just that I can't understand it; no one can :P, because these are not valid characters at all. So we need to fix this and fetch readable UTF-8 data while scraping.
The first step is to fetch the HTML document using any of Ruby's URL libraries (net/http, open-uri or curb). We will use curb here.
    require 'curb'
    curl_object = Curl::Easy.perform("http://196m.cn")
When I look at curl_object.content, I see a lot of "?" and other garbage characters. If I try to parse it using scrapi or hpricot, I get output that makes no sense and is unreadable to everyone.
Before scraping curl_object using Hpricot or Scrapi, I should tell you that Hpricot requires UTF-8 input. Scrapi has no such requirement, but it does not always work with other character encodings.
So, we should convert the HTML data in curl_object to UTF-8, for which we need the document's current encoding. One way to find it is:
    # Find the Content-Type <meta> tag, then pull the charset value out of it.
    if meta = curl_object.content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
      if meta = meta[0].match(/charset=([\w-]*)/i)
        encoding = meta[1]
      end
    end
    # => 'gb2312'
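The lookup above only inspects the meta tag, and the charset can also arrive in the HTTP Content-Type header. Here is a minimal sketch that checks both, assuming you can get the header value as a plain string from your HTTP library. The names detect_charset and CHARSET_PATTERN are hypothetical and not part of curb, Hpricot or scrapi, and falling back to "utf-8" is an assumption.

```ruby
# Hypothetical helper: not part of curb, Hpricot or scrapi.
# Matches charset=gb2312, charset="gb2312" and similar forms.
CHARSET_PATTERN = /charset\s*=\s*["']?([\w-]+)/i

# Returns the charset named in the Content-Type header if there is one,
# otherwise the first charset declared near the top of the HTML,
# otherwise the default (an assumed fallback).
def detect_charset(html, content_type_header = nil, default = "utf-8")
  # Work on raw bytes so undecodable characters cannot break the match.
  candidates = [content_type_header.to_s.b, html.to_s.b[0, 4096]]
  candidates.each do |text|
    match = text.match(CHARSET_PATTERN)
    return match[1].downcase if match
  end
  default
end

page = '<html><head><meta http-equiv="Content-Type" ' \
       'content="text/html; charset=gb2312"></head></html>'
puts detect_charset(page)                              # no header given
puts detect_charset(page, "text/html; charset=UTF-8")  # header wins
```

This is a rough heuristic: it takes the first charset= it finds in the first 4096 bytes, so a stray match inside a script or comment would fool it.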
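    require 'iconv'
    # Convert the page from the detected encoding to UTF-8.
    Iconv.conv('utf-8', encoding, curl_object.content)

    # Same conversion, but //IGNORE drops characters that cannot be converted instead of raising an error.
    Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)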
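A note for newer Rubies: iconv was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0. A rough equivalent of the conversion above can be sketched with String#encode. The helper name to_utf8 is hypothetical, and passing replace: "" is my approximation of //IGNORE, not an exact match for iconv's behaviour.

```ruby
# Hypothetical helper: converts raw page bytes from source_encoding to UTF-8.
# Bytes that are invalid or have no UTF-8 equivalent are dropped,
# roughly like iconv's //IGNORE.
def to_utf8(raw, source_encoding)
  raw.dup
     .force_encoding(source_encoding) # label the bytes, no conversion yet
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

# Simulate bytes fetched from a gb2312 page.
raw_bytes = "茶叶".encode("GB2312").force_encoding("BINARY")
utf8_text = to_utf8(raw_bytes, "gb2312")
puts utf8_text
puts utf8_text.encoding
```

force_encoding raises ArgumentError when the encoding name is unknown to Ruby, so it may be worth rescuing that if the charset comes from an untrusted page.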
Cool! Now we have the content in UTF-8. Let's scrape it!
Using Hpricot:
    Hpricot(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).search("meta")
    # => #<Hpricot::Elements[
    #      {emptyelem <meta content="text/html; charset=gb2312" http-equiv="Content-Type">},
    #      {emptyelem <meta name="description" content="包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站">},
    #      {emptyelem <meta name="keywords" content="茶,茶叶,茶网,茶文化,茶艺培训,茶道,tea,茶文化,茶业,红茶,绿茶,铁观音,茶饮,茶艺,茶餐厅,茶具,紫砂,龙井,碧螺春,乌龙茶,大红袍,普洱茶">}]>
Using Scrapi:
    Scraper.define do
      process "meta[name='description']", "r[]"=>:element
      result :r
    end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).to_s
    # => "<meta name=\"description\" content=\"包括茶 å¶åŠèŒ¶é¥®æ– ™,茶食å“,æœè Œ¶,茶民&.......
This is not the expected output. Scrapi comes with two parsers: tidy and html_parser. It uses tidy by default, so let's try html_parser instead:
    t = Scraper.define do
      process "meta[name='description']", "r[]"=>:element
      result :r
    end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content), :parser=>:html_parser).to_s
    # => "<meta name=\"description\" content=\"包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站\">"

I'm so happy I stumbled upon this blog! A lot of helpful info. I really needed to know this stuff, as I had a hard time with those foreign characters. Thank you so much!
Another very solid tutorial. Thank you. Not something I can use right now but I bookmarked for the future.
Matt Gold
This looks brittle. What if the encoding is defined in the HTTP header? With normal open() instead of curl you get the encoding for free: http://jasonseifer.com/2008/03/18/hpricot-and-utf-8
Hello, this type of code really helped me a lot. Thank you for the suggestion and code.
It is very useful, thanks.
Good Luck
A lot of helpful info. I really needed to know this stuff too.
Excellent information. Quite helpful. I'm going to share this with my communities.
I have been looking for a solution to the UTF-8 problem and am glad I searched and found you. Now I can save myself a lot of time by not having to figure it out myself.
Good article. Thank you so much for sharing!
Another great set, I appreciate all the work you put into this site, helping out others with your fun and creative works.
Thank you so much for sharing!
This is very good information!
These facts are amazing. I had been searching for the last 5 weeks and didn't get the perfect answer, but in the end I found it on your site. Thanks for posting such an interesting topic.
Yet another problem with these changes. I am not able to access instance methods of my derived class from Scraper::Base from these blocks. I think these blocks are not “bind” to the local instance properly. Following code is an example from base.rb of scrapi and this doesn’t work.
good tuto thanks !
Good website. I like the all pages and all comments. Thanks for all!!! Regards!
It's a very exciting article. I like it so much! I was very focused while reading it. Thanks
I am new to scraping, but I think it's quite an awesome technique. No doubt there are a lot of features available in different programming languages that we can use to scrape a website very easily.
Really great information and inspiration, both of which I like. Nice article with very strong content. I love reading it over again.
Cool information: difficult but interesting!
I like this article very much. Thanks for your share!
Thanks for this article, because integrating characters into HTML code is not easy.