UTF-8 and HTML screen scraping in Ruby on Rails


Before I start: if you are unfamiliar with character sets (that is, terms like UTF-8 and Unicode mean little to you), I recommend reading Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The problem this article addresses is how to handle foreign or accented characters in HTML screen scraping. We ran into it while working on our website information project, Quarkbase.com.

The configuration and tools we will be using are rails 2.1.2, scrapi 1.2.0, hpricot 0.8.1 and curb 0.4.4.0. The example website we will try to scrape is http://196m.cn

At Quarkbase, when we aggregate information from a website, scraping text that contains non-English characters doesn't always work. For example, when we extract the description for http://196m.cn from its HTML page, the result is the first line after 'overview' in the screenshot below:



Now, it's not just me who can't understand it, no one can :P, because these are not valid characters at all. So we need to fix this and fetch readable UTF-8 data while scraping.
The first step is to fetch the HTML document using any of Ruby's URL libraries (net/http, open-uri or curb). We will be using curb here.


require 'curb'
curl_object = Curl::Easy.perform("http://196m.cn")
Now, if I look at curl_object.content, I see a lot of "?" and other garbage characters. If I try to parse it using scrapi or hpricot, I get output that makes no sense and is unreadable to everyone.
Before scraping curl_object with Hpricot or Scrapi, I should tell you that Hpricot requires its input to be UTF-8 encoded. Scrapi has no such requirement, but it does not always work with other character encodings.
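
To make the UTF-8 requirement concrete, here is a minimal sketch that checks whether a chunk of bytes is already valid UTF-8 before it goes to a parser. It uses only Ruby's built-in String encoding methods (Ruby 2.0 or later, not the 1.8-era setup used in this article), and valid_utf8? is a hypothetical helper name, not part of curb, hpricot or scrapi.

```ruby
# Hypothetical guard: true when the given bytes already form valid UTF-8.
def valid_utf8?(raw)
  raw.dup.force_encoding('UTF-8').valid_encoding?
end

# Sample data standing in for a fetched page body (no network needed).
gb_bytes   = '茶'.encode('GB2312').b  # the character as a GB2312 page would send it
utf8_bytes = '茶'.b                   # the same character as UTF-8 bytes

puts valid_utf8?(gb_bytes)    # false
puts valid_utf8?(utf8_bytes)  # true
```

A failed check is a strong hint that the content needs converting. A passed check is weaker evidence, because some byte sequences from other encodings happen to be valid UTF-8 too, so the declared encoding is still worth looking up.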

So we should convert the HTML data in curl_object to UTF-8, and for that we need the document's current encoding. One way to find it is:


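# Find the <meta http-equiv="Content-Type" ...> tag, then pull the charset out of it.
encoding = nil
if meta = curl_object.content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
  if charset = meta[0].match(/charset=([\w-]*)/i)
    encoding = charset[1]
  end
end
encoding
=> 'gb2312'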
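
The meta-tag lookup above covers only one of the places a page can declare its encoding. A server may also announce it in the HTTP Content-Type header, and newer pages often use the shorter `<meta charset="...">` form. The following is a rough sketch of a lookup that tries the header first and then any meta tag. detect_charset, its arguments and the 'utf-8' fallback are assumptions made for illustration, not part of any library used here, and how you obtain the header value depends on your HTTP client.

```ruby
# Hypothetical helper: pick a charset from the HTTP Content-Type header if it
# has one, otherwise from a <meta> tag in the markup, otherwise use a default.
def detect_charset(content_type_header, html, default = 'utf-8')
  charset_re = /charset\s*=\s*['"]?([\w-]+)/i

  # 1. HTTP header value, e.g. "text/html; charset=gb2312"
  if content_type_header && (m = content_type_header.match(charset_re))
    return m[1].downcase
  end

  # 2. Any <meta ...> tag that carries a charset. The markup is scanned as raw
  #    bytes (String#b) so invalid byte sequences cannot break the regex match.
  html.b.scan(/<meta\b[^>]*>/i) do |tag|
    if (m = tag.match(charset_re))
      return m[1].downcase
    end
  end

  # 3. Nothing declared: fall back to the assumed default.
  default
end

page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head></html>'
puts detect_charset(nil, page)
puts detect_charset('text/html; charset=GBK', page)
```

Checking the header first mirrors how browsers usually prioritise the two sources, but either declaration can be wrong on a real site, so treat the result as a best guess.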
Next, we use Ruby's iconv library to convert the text from one encoding to the other:

require 'iconv'
Iconv.conv('utf-8', encoding, curl_object.content) 
The above conversion will fail for this site, since iconv is not able to convert every character from the source encoding to the target one. One solution is to ask Iconv to convert as much as it can:

Iconv.conv('utf-8//IGNORE', encoding, curl_object.content) 
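
A note for readers on newer Rubies: Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0. A comparable "convert what you can" step can be sketched with String#encode, as below. to_utf8 is a hypothetical helper name, and the sample bytes are built in place rather than fetched from the site.

```ruby
# Hypothetical helper: treat raw bytes as source_encoding and convert them to
# UTF-8, dropping anything that cannot be converted (similar in spirit to
# Iconv's //IGNORE).
def to_utf8(raw, source_encoding)
  raw.dup
     .force_encoding(source_encoding)
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Sample GB2312 bytes followed by a stray byte that is not valid GB2312.
gb_bytes = '茶文化'.encode('GB2312').b + "\xFF".b

converted = to_utf8(gb_bytes, 'GB2312')
puts converted            # 茶文化
puts converted.encoding   # UTF-8
```

Passing replace: '' silently drops the bad bytes. Using a visible marker such as '?' instead makes lost characters easier to spot later.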

Cool! Now we have the content in UTF-8. Let's scrape it!

Using Hpricot:


Hpricot(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).search("meta")
=> #<Hpricot::Elements[{emptyelem <meta content="text/html; charset=gb2312" http-equiv="Content-Type">}, {emptyelem <meta name="description" content="包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站">}, {emptyelem <meta name="keywords" content="茶,茶叶,茶网,茶文化,茶艺培训,茶道,tea,茶文化,茶业,红茶,绿茶,铁观音,茶饮,茶艺,茶餐厅,茶具,紫砂,龙井,碧螺春,乌龙茶,大红袍,普洱茶">}]>

Using Scrapi:

Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content)).to_s
=> "<meta name=\"description\" content=\"&#229;&#338;&#8230;&#230;&#8249;&#172;&#232;&#338;&#182;
&#229;&#182;&#229;&#352;&#232;&#338;&#182;&#233;&#165;&#174;&#230;&#8211;
&#8482;,&#232;&#338;&#182;&#233;&#163;&#376;&#229;&#8220;,&#230;&#339;&#232;
&#338;&#182;,&#232;&#338;&#182;&#230;&#176;&#8216;&.......
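
The numeric entities in this output follow a pattern that hints at what went wrong. The first three, 229, 338 and 8230, are the characters å, Œ and …, which is how the three UTF-8 bytes of 包 (E5 8C 85) look when each byte is read as a Windows-1252 character. So the UTF-8 input appears to have been treated as a single-byte Western encoding and then entity-escaped. The short sketch below reverses that for the first six entities. It is only a way to check the diagnosis, not a suggested fix.

```ruby
# First six numeric entities from the tidy-based output above.
entities = '&#229;&#338;&#8230;&#230;&#8249;&#172;'

# Turn each &#NNN; into the character with that code point.
misread = entities.scan(/&#(\d+);/).flatten.map(&:to_i).pack('U*')

# Write those characters back out as Windows-1252 bytes, then read the bytes
# as UTF-8, which undoes the misinterpretation.
recovered = misread.encode('Windows-1252').force_encoding('UTF-8')

puts recovered  # 包括
```

The recovered text matches the first two characters of the description that Hpricot returned above.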

This is not the expected output. Scrapi comes with two parsers, tidy and html_parser, and uses tidy by default, so let's try html_parser instead:

t=Scraper.define do
  process "meta[name='description']", "r[]"=>:element
  result :r
end.scrape(Iconv.conv('utf-8//IGNORE', encoding, curl_object.content), :parser=>:html_parser).to_s
=> "<meta name=\"description\" content=\"包括茶叶及茶饮料,茶食品,搜茶,茶民周刊,茶文化,音乐红茶馆,下午茶,茶道,茶叶选购指南及茶具陶瓷,茶业种植,加工,营销等频道及专题栏目,是以茶道茶艺茶文化为主题的综合性茶道网站\">"

Yep, that works now! This is how the site information now looks:

Filed in ruby on rails
Posted on 22 September
Comments


  3. Buy glitter, May 17, 2010 @ 01:23 AM

    This looks brittle. What if the encoding is defined in the HTTP header? With normal open() instead of curl you get the encoding for free: http://jasonseifer.com/2008/03/18/hpricot-and-utf-8
