Firequark: quick HTML screen scraping



Table of Contents
  1. Introduction
  2. Why Firequark?
    1. XPath vs. CSS Selector
    2. Find CSS Selector manually
    3. Bundle Scraping
  3. Usage - screencast
  4. Installation
  5. Documentation
  6. Todo

Firequark is an extension to Firebug that aids HTML screen scraping. It automatically extracts the CSS selector for one or more HTML nodes on a web page using Firebug (a web development plugin for Firefox). The generated CSS selector can be given as input to HTML screen scrapers like Scrapi to extract information. Firequark is built to unleash the power of CSS selectors for HTML screen scraping.

HTML screen scraping is a common technique for extracting specific, useful elements from a web page. Whatever the programming language, extracting an element from a web page requires knowing its exact location or a key that uniquely identifies it. There are two approaches to uniquely identifying an element: XPath and CSS selectors.

Firebug has built-in functionality for generating the XPath of an HTML element. Ilya Grigorik has written a good article on using XPath for HTML screen scraping. Firequark extends Firebug to generate CSS selectors for elements on a web page.

Example case: Let's take a practical example where you want to scrape Amazon.com. The goal is to get the product name, price and rating for all the products on the Amazon point-and-shoot camera catalog page. I will use this example in the screencast and the explanation below.

Why Firequark?
XPath vs. CSS Selector
If Firebug already provides XPath, why do I need a CSS selector? XPath is great for scraping XML documents, but for (X)HTML documents it runs into issues such as:
  • Different parsers generate different XPaths for the same element, depending on how they handle broken markup: badly nested tags, errors in HTML pages, custom tags, etc.
  • Firefox adds a <tbody> tag to table nodes whether or not the HTML page has one, which makes it hard to decide whether to keep the <tbody> step in the XPath.
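
To make the <tbody> problem concrete, here is a minimal Python sketch using only the standard library. The sample markup, the class names and the variable names are made up for illustration, and the XML parser merely stands in for whatever HTML parser your scraper uses. The page source has no <tbody>, so an XPath copied from the browser's view of the page finds nothing, while a lookup keyed on an attribute is unaffected:

```python
import xml.etree.ElementTree as ET

# Hypothetical page source: note that the table has no <tbody> element.
SOURCE = """<html><body>
<table id="products">
  <tr><td class="name">Camera A</td></tr>
  <tr><td class="name">Camera B</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SOURCE)  # root is the <html> element

# Path as a browser would report it, with the <tbody> it inserted itself.
browser_xpath = "./body/table/tbody/tr/td"
# Path that matches the raw source.
source_xpath = "./body/table/tr/td"

browser_hits = [td.text for td in root.findall(browser_xpath)]
source_hits = [td.text for td in root.findall(source_xpath)]

# A lookup keyed on an attribute does not care about the exact nesting.
attribute_hits = [td.text for td in root.findall(".//td[@class='name']")]

print(browser_hits)    # the copied path matches nothing in the raw source
print(source_hits)
print(attribute_hits)
```

The last lookup is the XPath spelling of what a CSS selector such as td.name expresses: it identifies the cells by their own properties rather than by their position in the tree.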

CSS selectors do not suffer from these problems because the CSS selector of an HTML node is based on properties of the node itself and of neighboring nodes. Attributes of the element and its neighbors, such as id and class, are used to build the selector.

Find CSS Selector manually

It's difficult to find the CSS selector for an HTML node by hand. It is a matter of trial and error until you find the right combination of rules. In the worst case the CSS selector is the XPath itself. In our example case, it is even harder to manually find one CSS selector that covers all 24 camera products, plus one for each of their attributes such as camera name, price and rating (4 in total).

Bundle Scraping

I am a big fan of Scrapi, an HTML scraping toolkit in Ruby, because it supports bundle scraping. Bundle scraping refers to extracting multiple attributes of an object from a web page in one parse. Bundle scraping is well defined at Qscraper: a hpricot interface to scrapi.

Continuing with the example case, the object is a camera product (there are 24 such objects on the page), and price, rating, product name, etc. are attributes of the camera object. One way of extracting the attributes is to get separate lists of product names, prices and ratings from the web page and then combine these lists. But how do you combine them? What if one of the camera products is not rated, or Amazon does not provide its price?

Firequark is really powerful in solving this problem. Using Firequark, first get one CSS selector that identifies all the camera objects on the page; each of these contains the attribute information (name, price and rating). Mark the camera object as parent and find the CSS selectors of the attributes relative to the parent. Give these CSS selectors as input to an HTML screen scraper that supports bundle scraping, like Scrapi, and bingo! Our screencast below explains this case in detail.
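
The difference between combining separate lists and scraping relative to a parent can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the markup and the class names (product, name, price, rating) are invented, and because the Python standard library has no CSS selector engine, the selectors are written in the XPath subset that xml.etree.ElementTree supports. A bundle scraper such as Scrapi would take the CSS selectors directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog page: Camera B has no price, Camera C has no rating.
PAGE = """<div id="results">
  <div class="product"><span class="name">Camera A</span><span class="price">$199</span><span class="rating">4.5</span></div>
  <div class="product"><span class="name">Camera B</span><span class="rating">3.0</span></div>
  <div class="product"><span class="name">Camera C</span><span class="price">$249</span></div>
</div>"""

root = ET.fromstring(PAGE)


def first_text(node, path):
    """Return the text of the first match below node, or None if absent."""
    found = node.find(path)
    return found.text if found is not None else None


# Naive approach: scrape each attribute as its own page-wide list, then zip.
names = [el.text for el in root.findall(".//span[@class='name']")]
prices = [el.text for el in root.findall(".//span[@class='price']")]
naive = list(zip(names, prices))  # Camera B is paired with Camera C's price

# Bundle approach: select each parent object first (CSS: div.product), then
# look up every attribute relative to that parent (CSS: span.name, ...).
cameras = [
    {
        "name": first_text(product, "./span[@class='name']"),
        "price": first_text(product, "./span[@class='price']"),
        "rating": first_text(product, "./span[@class='rating']"),
    }
    for product in root.findall("./div[@class='product']")
]

print(naive)
print(cameras)
```

Because every attribute is looked up inside its own parent, a missing price or rating shows up as None for that camera instead of shifting the values of the cameras that follow. Note that the [@class='product'] test matches the whole attribute value, whereas a real CSS class selector also matches elements that carry additional classes.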

Usage - screencast

Click here to view the screencast [format:avi, size:6.6MB]

Note: Avoid using the CSS selector functionality on the first 2-3 products on a page. The top 2-3 products are usually displayed differently (for example as top sellers or top rated) with the same id, which causes problems in getting good, simple CSS selectors.

Even in our screencast we analyze the 4th product on the page to get a simple CSS selector.

Installation

Click here to install. This will overwrite your current Firebug installation (built on Firebug v1.05).

(FF3 users) Click here to install. This will overwrite your current Firebug installation (built on Firebug 1.2).

(FF3.5 users) Click here to install. This will overwrite your current Firebug installation (built on Firebug 1.4.1 and Firefox 3.5).

Note: if you are doing Firebug development in your working Firefox, please back up your work before installing Firequark.

Documentation

Firequark adds four new functions to each node element in the HTML source tab of Firebug. When you click on an HTML node in that tab, the menu offers the following new entries:


  1. Get U CSS Selector: Get a CSS selector that uniquely identifies the selected element.
  2. Get CSS Selector: Get one CSS selector for a group of elements on a web page. When you select this option, it prompts you for how many such elements exist on the page. In our example case, right-click on a camera object node, enter 24 in the prompt box (as there are 24 camera objects on the page) and press Enter; you will get one CSS selector that extracts all 24 camera objects.
  3. Mark as parent: Used in bundle scraping to mark an object as parent. Follow it with Get U CSS Selector on the parent's attributes, which returns the CSS selector of each attribute relative to the parent object.
  4. Unmark parentnode: Unmark the object as parent.
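
For readers curious how a selector can be derived from a node at all, here is a deliberately naive Python sketch. It is not Firequark's actual algorithm, which this page does not describe; the function names (build_parent_map, describe, selector_for) and the sample markup are hypothetical. The idea is to walk up from the selected node, describing each step by tag plus id or class, and to stop at the first ancestor that has an id:

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """ElementTree nodes have no parent pointer, so build a child-to-parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def describe(node):
    """Describe one node as tag#id if it has an id, else tag.class1.class2."""
    node_id = node.get("id")
    if node_id:
        return f"{node.tag}#{node_id}"
    classes = node.get("class", "").split()
    return node.tag + "".join(f".{name}" for name in classes)


def selector_for(node, parents):
    """Walk upwards until an ancestor with an id anchors the selector."""
    parts = []
    while node is not None:
        parts.append(describe(node))
        if node.get("id"):
            break
        node = parents.get(node)
    return " > ".join(reversed(parts))


# Hypothetical markup for one camera product.
PAGE = """<html><body><div id="results">
  <div class="product"><span class="price">$199</span></div>
</div></body></html>"""

root = ET.fromstring(PAGE)
parents = build_parent_map(root)
price = root.find(".//span[@class='price']")
selector = selector_for(price, parents)
print(selector)
```

This sketch does not check how many elements the resulting selector matches. On a page with several products the same selector would match every price, so a real tool still has to verify the match count, which is presumably why Get CSS Selector asks how many elements you expect.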

Todo
  1. The current algorithm for finding the CSS selector is not sophisticated, but it serves my purpose very well.
  2. Firequark is based on CSS Selector 2. We want to move to CSS Selector 3 because it has more selectors.
Filed in our tools
Posted on 05 September
114 comments. Updated on 04 August
Comments


  1. engtech September 07, 2007 @ 06:49 PM

    I use Firebug a lot for screen scraping with Greasemonkey or Perl. Looking forward to trying this out.

  2. jamiew May 07, 2008 @ 07:23 PM

    This is a really excellent extension to Firebug and I’m sad to see development has dropped off—any plans to submit as a patch to Firebug core?

  3. Jimmy June 27, 2008 @ 09:37 PM

    Any plans to update this for Firefox 3 anytime soon?

  4. Jimmy July 03, 2008 @ 12:54 PM

    http://groups.google.com/group/firebug/browse_thread/thread/9fbc0002e62eddde?hl=en

  5. Rick July 16, 2008 @ 10:21 PM

    OMG!! Please update this. My company depends on it!

  6. Silverburn July 21, 2008 @ 08:58 PM

    indeed, please update for use of FF3 and PLEASE stop making this thing install the most recent version of Firebug first. What is that even good for? If people don’t bother to read the fact that they need the latest version, then why even help them?

  7. Anarşist July 26, 2008 @ 08:27 AM

    Looks very interesting, thanks for article.

  8. oyun indir July 29, 2008 @ 03:54 PM

    thanxxx

  9. sonet August 22, 2008 @ 11:01 PM

    Please do update this to FF3. We really could use this. What a great tool!

  10. Quarks August 23, 2008 @ 03:22 AM

    Please find the firequark for ff3 here: http://www.quarkruby.com/firequark-1.2.xpi

    Sorry for the delayed release and notification.

  11. sonet August 23, 2008 @ 03:47 AM

    Woops, yeah, thanks for the note re the google groups announcement. :D

  12. sonet August 24, 2008 @ 12:50 AM

    Curious Quarks, could you also post your sample scrAPI code as well which actually pulls out the multiple items into an output object? Would just like to see how you’re manipulating this as I’m having a couple of minor hiccups that may be resolved by seeing someone else’s methodology.

  13. Arno.Nym October 10, 2008 @ 09:10 PM

    i understand it right:

    a) it is a addon for firebug (firebug itself is untouched and i can update firebug if a new version comes out) b) it replaces the original firebug and i have no chance to update to a newer firebug version without removing firequark?`

  14. Bedava Sohbet January 13, 2009 @ 05:59 PM

    Thanks for article i like mozilla firefox and firefox addons

  15. Animated Logo Design March 31, 2009 @ 11:03 AM

    I like this website I feel that this site could be very useful Web designs.

  16. greyzlii April 19, 2009 @ 03:44 PM

    Wonderfull tool ! I gain a huge time ! Congratulations !

  17. Meeting rooms heathrow April 19, 2009 @ 07:47 PM

    This is one of my most favorite tools especially with Firefox..Thanks mate

  18. frasi May 02, 2009 @ 08:20 PM

    Curious Quarks, could you also post your sample scrAPI code as well which actually pulls out the multiple items into an output object? Would just like to see how you’re manipulating this as I’m having a couple of minor hiccups that may be resolved by seeing someone else’s methodology.

  19. interesting facts June 01, 2009 @ 06:01 PM

    a) it is a addon for firebug (firebug itself is untouched and i can update firebug if a new version comes out) b) it replaces the original firebug and i have no chance to update to a newer firebug version without removing firequark?`

  20. famous quotes June 01, 2009 @ 06:06 PM

    Woops, yeah, thanks for the note re the google groups announcement. :D

  21. frasi celebri June 10, 2009 @ 06:14 AM

    it replaces the original firebug and i have no chance to update to a newer firebug version without removing firequark?

  22. graphic design melbourne June 14, 2009 @ 11:50 AM

    awesome stuff thanks for sharing. I didn’t realise what firequark can do and you have certainly provided a great intro and explanation. Will definitely consider using it

  23. Custom Logo June 25, 2009 @ 05:04 AM

    if a new version comes out) b) it replaces the original firebug and i have no chance to update to a newer firebug version without removing firequark?`

  24. ceai July 24, 2009 @ 08:57 PM

    @Frasi – no, I;m afraid you have to remove firequark, there is no way arund it!

  25. mobila July 24, 2009 @ 08:59 PM

    I’m going to try this out right now!

  26. Chun August 04, 2009 @ 03:52 AM

    firequark-1.2.xpi still can not work with FireFox 3.5.1, Any updates??

  27. Quarks August 04, 2009 @ 09:35 AM

    yeah right now, there is no way to upgrade firebug without removing firequark :(

    The new firequark is here: http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi (works with ff3.5)

  28. Russian girls seeking men August 08, 2009 @ 10:49 AM

    Could you also post your sample scrAPI code as well which actually pulls out the multiple items into an output object?

  29. Ali August 14, 2009 @ 11:45 AM

    Why isn’t this handy addon posted on Mozilla Addons where it reaches a greater audience.

  30. Justin McClelland August 18, 2009 @ 10:40 PM

    Also, in regards to screen-scraping Mozenda is great software as well. I started out using the screen-scraper.com software and it worked fine, there was just a huge learning curve. I have limited experience with Firequark. I switched to http://www.getmozenda.com and the documentation for the software is much better, and it is far more intuitive.

  31. Master Key System September 13, 2009 @ 05:59 AM

    I have a asp:calendar in a div,it is displayed after clicking on a html image.but when i click on the next month the calendar disappears.i don’t want the calendar to go to server for this.Is it possible to have a asp calendar which does not go to server on changing the month??

  33. Marketing Jobs in Dubai September 13, 2009 @ 06:04 AM

    One way to achieve a popup calendar is to open the calendar in a new popup window. But this will be a server side calendar. For a client side calendar look for already created controls in the ASP.NET control gallery.

  34. graphic design companies Dubai September 13, 2009 @ 11:15 AM

    I tried to install it but didn’t worked, don’t know where i go wrong.

  35. social bookmarking September 17, 2009 @ 02:21 AM

    Firequark is a great extension to Firebug. Firebug alone is just brilliant, but with Firequark bundle, it’s like on the fly.

  36. Arabian ranches property September 30, 2009 @ 11:45 AM

    I’m using it on my firebug too.

  37. Web Design Dubai September 30, 2009 @ 11:47 AM

    Firebug is a freeware.. totally “Free” with no deadlines or time-limits.. and yes it increases productivity 300%- at least in my opinion…. it is mainly for firefox.. but there is a version for Safari, IE and Opera called Firebug Lite.. i was exploring the firebug light version on Safari and IE in the past 3 days.. they crashed several times and didn’t have all the features of the Firefox’s version.. so the conclusion, Firebug is only good for firefox..

  38. Intervention October 04, 2009 @ 02:08 AM

    Thanks for the very useful downloads. They are great tools specially this one : http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi (works with ff3.5)

  39. Auto Broker October 12, 2009 @ 05:26 PM

    Excellent problem solver and time saver. Much appreciated and thanks for the updates.

  40. Ricette Dolci October 16, 2009 @ 02:46 PM

    I have a asp:calendar in a div,it is displayed after clicking on a html image.but when i click on the next month the calendar disappears.i don’t want the calendar to go to server for this.Is it possible to have a asp calendar which does not go to server on changing the month??

  41. Future Pakistan October 18, 2009 @ 09:22 AM

    I also have a calendar and soon I will introduce.

  42. Term Paper October 23, 2009 @ 08:36 AM

    they crashed several times and didn’t have all the features of the Firefox’s version.. so the conclusion, Firebug is only good for firefox..

  43. Make Money For Kids October 30, 2009 @ 10:05 PM

    I tried to install it, but it would not work. What is wrong?

  44. Brians November 03, 2009 @ 05:22 PM

    This is a good tool, but i can not seem to get the bundle scraping working on my machine. a friend is using the same thing and it works fine for him.

  45. Eric November 04, 2009 @ 08:32 AM

    So far so god. But it wont work with Flock Browser based on Firefox engine. So I have to install a standalone FF, but for this great tool I will do it – now!

  46. Venue Hire November 04, 2009 @ 01:27 PM

    Good work! Your article is an excellent example of why I keep comming back to read your excellent quality content that is forever updated. Thank you!

  47. medical travel November 13, 2009 @ 06:26 AM

    Great stuff – but I wish there was an implementation for Google Chrome. It seems that plugins tend to lag behind for this browser.

  48. russian girl November 16, 2009 @ 12:47 PM

    I tried to install it but didn’t worked.

  49. chisinau November 18, 2009 @ 05:33 PM

    Bucharest, Romania’s biggest city and capital, is nowadays a bustling metropolis. Today when a vast flow of people are coming to “Cazare Bucuresti in regim hotelier” are successfully leased. We may provideyou with perfect servicing and refinement, soyou will appear to be just like at home in our bucharest apartments. You will spend your time perfectly in such rooms, as the design was organized for your wellbeing. You can select where to stay, it may center with it busy growth or not busy city area not far from the downtown. There is a broad multiplicity of rooms, from studio to 3 of 4 rooms, a big dominance of such accommodations comparing with hotels is kitchen, living room and charge. The payment and cut price is based on which interval you are going to stay, you may rent an apartmentfor 1 day or two months or even longer. The payment is accordingly form 45 – 80 Euro for one day. We will be glad to serve you to spend unforgettable holiday in Bucuresti, contact us by mail, phone or on our Internet site with chisinau apartment for rent.

  50. burning calories November 26, 2009 @ 06:11 PM

    Great article man. It was really interesting to read this post . Thanks a lot for this.

  51. Web design Melbourne November 26, 2009 @ 07:32 PM

    Russian girl, check if you are using correct version of FF. Also I am not sure is there a Russian version of it? Some of my friends had problems with non English versions.

  52. farmville cheats November 28, 2009 @ 08:53 AM

    I have installed it, works well.

  53. wedding December 02, 2009 @ 06:29 PM

    Good sharing. It works like a charm

  54. fotografia ślubna December 07, 2009 @ 03:58 PM

    Thanks a lot !!

  55. Florida December 08, 2009 @ 07:33 AM

    Been looking for such a tool for some time. It works great and I am also a Firebug user and use it to check site performance.

  56. Florida December 08, 2009 @ 07:38 AM

    This tool is extremely useful for my work. Thanks!

  57. Free Advertising December 08, 2009 @ 07:40 AM

    Awesome. It works well and glad I found this resource.

  58. Balaton December 08, 2009 @ 09:00 PM

    I can see, somebody had problems with some versions. I just tested in FF in 3 and 3.5 versions (and on 2 languages – Hu and En). It works perfectly, and it is very useful tool. Thx.

  59. cam chat software December 11, 2009 @ 01:07 PM

    I want to express my admiration of your writing skill and ability to make reader to read the while thing to the end. I would like to read more of your blogs and to share my thoughts with you. I will be your frequent visitor, that’s for sure.

  60. Karl December 14, 2009 @ 11:52 AM

    FireQuark is what we’ve since long expected as an improvement from FireBug developers, Using CSS selector when XPath proves insufficient to HTML scraping.

  61. Jan December 17, 2009 @ 04:23 PM

    Why is it not released as a real Firebug extension? Replacing existing Firebugs is really a showstopper.

  62. fotograf slubny December 19, 2009 @ 06:35 PM

    great news since the last time!!

  63. psycholog December 19, 2009 @ 06:47 PM

    i just really love it :)

  64. discount sunglasses December 22, 2009 @ 01:30 AM

    Post is nicely written and it contains many good things for me. I am glad to find your impressive way of writing the post. Now it become easy for me to understand and implement the concept. Thanks for sharing the post.

  65. free online advertising December 22, 2009 @ 08:33 AM

    Good job! THANKS! You guys do a great blog, and have some great contents. Keep up the good work.

    best regards,

  66. mcts December 24, 2009 @ 01:50 AM

    Post very nicely written, and it contains useful facts. I am happy to find your distinguished way of writing the post. Now you make it easy for me to understand and implement. Thanks for sharing with us.

  67. juliajin December 28, 2009 @ 10:52 AM

    USERID_18233 QuarkRuby: Scrapi enhancements Home Guides Tutorials Scrapi enhancements We have been using ruby library Scrapi quite a lot for HTML Scraping … Road Trip Planner towe Ski

  68. fotografia ślubna December 28, 2009 @ 08:49 PM

    i just really love it :)

  69. Nagrobki January 05, 2010 @ 01:36 PM

    Thats the cool’s themes i have see in a long time. Very nice

  71. casino fr January 12, 2010 @ 10:09 AM

    In fact very useful tool and works perfectly with different languages.

  72. Dating websites January 12, 2010 @ 11:23 AM

    We were waiting for that tool. CSS is much easier with a working selector Thanks, Patrick

  73. Best Buy Shop January 13, 2010 @ 08:49 AM

    I love this post. This is very informative blog.

  74. machine à sous January 13, 2010 @ 11:37 AM

    OMG!! thanxx 4 sharing this, exactly what I need

  75. Girls Bedding January 20, 2010 @ 09:42 PM

    I, too, wish that there was an implementation for Google Chrome. Hopefully in the near future.

  76. celebrity January 24, 2010 @ 01:53 PM

    Thank you for the great article!!

  77. nicolas January 24, 2010 @ 07:00 PM

    This really improves my work. Thx for the hints.

  78. Insurance January 25, 2010 @ 01:59 PM

    Vote for an implementation in Google Chrome too

  79. bluish January 28, 2010 @ 01:24 AM

    FF 3.5.7 German!—FireBug 1.4.1 & this :

    http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi

    did’nt work, cause of the german vs. i guess—will try english now

  80. Thierry B January 29, 2010 @ 05:06 PM

    Do you plan to support FF 3.6 ?

  81. Video Game Review January 30, 2010 @ 05:08 PM

    are there any updates for this?

  83. Dog Tags For Dogs February 02, 2010 @ 01:14 AM

    Anyway to use this on dreamweaver?

  84. Flug February 05, 2010 @ 01:10 PM

    Firequark is really a usefull plugin. But sometimes it slows down firefox. Anyone else with such a problem?

  85. baby diapers February 05, 2010 @ 01:36 PM

    I like mozilla firefox and firefox addons. Thanks for this article!

  86. Printing February 08, 2010 @ 09:57 AM

    Lot of thanks for this article. Its really a very good topic. Its so interesting and attractive. I like it so much.

  87. Laser Hair Removal New York City February 10, 2010 @ 05:02 AM

    I really impressed by your post. Thank you for this great information, you write very well which i like very much.

  88. resume writers February 12, 2010 @ 01:56 AM

    This is a great add on for fire fox. Please make FireFox 3 updated and please post it as soon as possible. Thanks

  89. flex development February 17, 2010 @ 03:10 PM

    I would like to come back to your blog tomorrow and get dome note down for my lab work.

  90. SEO February 19, 2010 @ 02:58 AM

    Thank you for this information and Again thanks for sharing your knowledge with us

  91. thesis February 23, 2010 @ 06:20 AM

    Thanks for sharing this tool. I hope it’ll be helpful to the project am currently working in. God Bless and Good luck!

  92. Blumenversand online February 23, 2010 @ 10:29 AM

    Thank you for the nice article

  93. Herrenunterwaesche Damenunterwaesche February 23, 2010 @ 07:02 PM

    I must admit that this post really interesting, thanks for the writing!

  94. russian girls February 23, 2010 @ 10:08 PM

    I have installed it, works well.

  96. Thai Restaurant February 24, 2010 @ 11:48 AM

    Thanks. I just install a Firequark. Highly recommended.

  97. cheaploptop February 24, 2010 @ 03:44 PM

    Thanks for good information.

  99. casino reviews February 24, 2010 @ 07:12 PM

    Now I don’t have to load up CSSedit (another awesome program, like firebug CSS inspection but you can save your changes) just to pull out a weird selector, I can just launch firequark, use the inspector and right click on an element to get a unique selector…how cool is that !!

  100. Surf Shop UK February 28, 2010 @ 01:17 AM

    Great information about HTML screen scraping. I am really interested in learning more about this. Thanks so much!

  101. Cheap Health Insurance February 28, 2010 @ 02:04 AM

    Thanks for the good information. I will definitely have to share this site with coworkers.

  102. touring caravan insurance February 28, 2010 @ 11:39 AM

    It is my great pleasure to visit your website and to enjoy your excellent post here. I like that very much. I can feel that you paid much attention for those articles, as all of them make sense and are very useful. Thank you for sharing with us….........

  103. ausfuhranmeldung February 28, 2010 @ 03:54 PM

    Good inrasting sharing.

  104. lace wedding dresses February 28, 2010 @ 09:14 PM

    Awesome job mate ! Keep us posted

  105. cheap laptop March 01, 2010 @ 03:03 AM

    Thanks for good information that comes out to read.

  106. Hugh March 01, 2010 @ 09:51 AM

    Huh… Interesting. I can’t stop myself reading from the beginning till the end, thanks for sharing.

  107. Thailand Hotel March 01, 2010 @ 04:35 PM

    I like mozilla firefox and firefox addons. Thanks for this article!

  108. Herrenunterwaesche Damenunterwaesche March 06, 2010 @ 10:10 PM

    I really enjoyed this post. I will be back and have bookmarked your site.

  109. shakir anjum March 08, 2010 @ 09:25 AM

    Get one css selector for group of elements on a web page. When you select this option, it will prompt you for how many such elements exist on this page. In our example case, right click on camera object node, in the prompt box put 24, as there are 24 camera objects on the page, press enter and you will get one unique css selector to extract 24 camera objects.

    granite countertops nj

  110. USB-Festplatte March 08, 2010 @ 11:16 PM

    Hey that was an interesting article for sure. Thank you ver much for putting so much effort in this blog. It really means a lot to me.

    Regards Festplatte from Germany

  111. online advertising March 09, 2010 @ 04:32 AM

    Great information about HTML screen scraping. I am really interested in learning more about this. Thanks so much!

  112. Web Design Perth March 10, 2010 @ 12:22 PM

    Thanks for recommending Firequark, I will check it out and give it a thorough test.

  113. Minneapolis Criminal Lawyer March 10, 2010 @ 08:29 PM

    I like the foundation of this blog has a great variety of comments I really like it, several points of view helps in the appreciation of the subject,is very interesting and I would like learn more.

  114. Sonicare Toothbrush March 11, 2010 @ 02:25 AM

    Firequack is very interested, thanks.
