Firequark: quick HTML screen scraping



Table of Contents
  1. Introduction
  2. Why Firequark?
    1. XPath vs. CSS Selector
    2. Find CSS Selector manually
    3. Bundle Scraping
  3. Usage - screencast
  4. Installation
  5. Documentation
  6. Todo

Firequark is an extension to Firebug (a web development plugin for Firefox) that aids HTML screen scraping. It automatically extracts a CSS selector for one or more HTML nodes on a web page. The generated selector can be given as input to HTML screen scrapers such as Scrapi to extract information. Firequark is built to unleash the power of CSS selectors for HTML screen scraping.

HTML screen scraping is a common technique for extracting information about specific, useful elements from a web page. Whatever the programming language, to extract an element from a web page one needs to know its exact location or a key that uniquely identifies it. There are two approaches to uniquely identifying an element: XPath and CSS selectors.

Firebug has built-in functionality for generating the XPath of an HTML element, and Ilya Grigorik has written a good article on using XPath for HTML screen scraping. Firequark extends Firebug to generate CSS selectors for elements on a web page.

Example case: Let's take a practical example in which you want to scrape Amazon.com. The goal is to get the product name, price and rating for all the products on the Amazon point-and-shoot camera catalog page. I will use this example in the screencast and the explanation below.

Why Firequark?
XPath vs. CSS Selector
When Firebug already provides XPath, why do I need a CSS selector? XPath is great for scraping XML documents, but for (X)HTML documents it runs into several issues:
  • Different parsers generate different XPaths for the same element, depending on how they handle broken markup such as badly nested tags, errors in HTML pages, custom tags, etc.
  • Firefox adds a <tbody> tag to table nodes whether or not the HTML page has one, which makes it difficult to decide whether to keep the <tbody> step in the path.

A CSS selector does not suffer from these problems because the selector for an HTML node is based on properties of the node itself and of its neighboring nodes. Attributes of the element and its neighbors, such as id and class, are used to build the selector.
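To make the <tbody> problem concrete, here is a minimal Python sketch. It is not part of Firequark: the page snippet and its id and class names are made up for illustration. Because the Python standard library has no CSS selector engine, ElementTree's attribute lookup stands in for a selector such as td.price.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page source. Note that it contains no <tbody>.
PAGE = """
<html><body>
  <table id="results">
    <tr class="product"><td class="name">Camera A</td><td class="price">$199</td></tr>
    <tr class="product"><td class="name">Camera B</td><td class="price">$249</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(PAGE)

# A positional path as reported by a browser that inserts <tbody>.
# It does not match the raw source, which has no <tbody> element.
positional = root.findall("./body/table/tbody/tr/td")

# Attribute-based lookup, the idea behind a CSS selector like "td.price".
# (Simplification: this compares the whole class attribute, whereas a real
# CSS class selector matches one class token among several.)
by_class = root.findall(".//td[@class='price']")

print(len(positional))               # 0
print([td.text for td in by_class])  # ['$199', '$249']
```

The positional path breaks as soon as the parser's view of the tree differs from the browser's, while the attribute-based lookup does not care how many wrapper elements sit in between.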

Find CSS Selector manually

It is difficult to find a CSS selector for an HTML node by hand. It is a trial-and-error process until you find the right combination of rules. In the worst case the CSS selector is as long as the XPath itself. In our example case it is even more difficult: you must manually find one CSS selector that matches all 24 camera products, plus one for each of their attributes such as camera name, price and rating (4 in total).

Bundle Scraping

I am a big fan of Scrapi, an HTML scraping toolkit in Ruby, because it supports bundle scraping. Bundle scraping refers to extracting multiple attributes of an object from a web page in one parse. Bundle scraping is well defined at Qscraper: a hpricot interface to scrapi.

Continuing with the example case, the object is a camera product (there are 24 such objects on the page), and price, rating, product name, etc. are attributes of the camera object. One way of extracting the attributes is to get separate lists of product names, prices and ratings from the web page and then combine these lists. But how do you combine them? What if one of the camera products is not rated, or Amazon does not provide its price?

Firequark is really powerful in solving this problem. Using Firequark, first get one CSS selector that identifies all the camera objects on the page; these objects contain the attribute information (name, price and rating). Mark a camera object as the parent and find the CSS selectors of the attributes relative to the parent. Give these CSS selectors as input to an HTML screen scraper that supports bundle scraping, such as Scrapi, and bingo! Our screencast below explains this case in detail.
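The difference between combining separate lists and bundle scraping can be sketched in a few lines of Python. This is only an illustration under assumptions: the markup, the class names and the helper functions texts and first_text are hypothetical, and ElementTree's attribute lookup stands in for the CSS selectors that Firequark would give you (for example div.product for the parent and span.price relative to it).

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: Camera B has no price, Camera C has no rating.
PAGE = """
<div id="catalog">
  <div class="product"><span class="name">Camera A</span><span class="price">$199</span><span class="rating">4.5</span></div>
  <div class="product"><span class="name">Camera B</span><span class="rating">3.0</span></div>
  <div class="product"><span class="name">Camera C</span><span class="price">$149</span></div>
</div>
"""

root = ET.fromstring(PAGE)


def texts(node, cls):
    """Return the text of every <span class=cls> below node."""
    return [el.text for el in node.findall(f".//span[@class='{cls}']")]


def first_text(node, cls):
    """Return the text of the first <span class=cls> below node, or None."""
    el = node.find(f".//span[@class='{cls}']")
    return el.text if el is not None else None


# Approach 1: separate lists, combined by position.
names = texts(root, "name")    # 3 entries
prices = texts(root, "price")  # only 2 entries
naive = list(zip(names, prices))
print(naive)  # Camera B is wrongly paired with Camera C's price

# Approach 2: bundle scraping. Select each parent object first, then
# look up every attribute relative to that parent.
bundle = [
    {field: first_text(product, field) for field in ("name", "price", "rating")}
    for product in root.findall(".//div[@class='product']")
]
for camera in bundle:
    print(camera)  # missing attributes show up as None
```

Because each attribute is looked up inside its own parent, a missing price or rating stays attached to the right camera instead of shifting the rest of the list.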

Usage - screencast

Click here to view the screencast [format:avi, size:6.6MB]

Note: Avoid using the CSS selector functionality on the first 2-3 products on a page. The top 2-3 products are usually displayed differently (for example as top sellers or top rated) and share the same id, which makes it hard to get good, simple CSS selectors. Even in our screencast we analyze the 4th product on the page to get a simple CSS selector.

Installation

Click here to install. This will overwrite your current Firebug installation (built on Firebug v1.05).

(FF3 users) Click here to install. This will overwrite your current Firebug installation (built on Firebug 1.2).

(FF3.5 users) Click here to install. This will overwrite your current Firebug installation (built on Firebug 1.4.1 and Firefox 3.5).

Note: if you are doing Firebug development in your working Firefox, please back up your work before installing Firequark.

Documentation

Firequark adds four new functions to each node element in the HTML source tab of Firebug. They appear in the context menu when you right-click an HTML node in that tab:

[Screenshots: current menu, new menu]

  1. Get U CSS Selector: Gets a CSS selector that uniquely identifies the selected element.
  2. Get CSS Selector: Gets one CSS selector for a group of elements on a web page. When you select this option, it prompts you for the number of such elements on the page. In our example case, right-click a camera object node, enter 24 in the prompt box (there are 24 camera objects on the page) and press Enter; you will get one CSS selector that extracts all 24 camera objects.
  3. Mark as parent: Used in bundle scraping to mark an object as the parent. Follow it with Get U CSS Selector on the parent's attributes, which returns the CSS selector of each attribute relative to the parent object.
  4. Unmark parentnode: Unmarks the object as parent.
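The element count that Get CSS Selector asks for can be read as a check: a candidate selector is accepted only if it matches exactly that many elements. The Python sketch below illustrates that idea only. It is not Firequark's actual algorithm, and the function names candidate_selectors and selector_for_group, the markup and the candidate order are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one banner and three products.
PAGE = """
<div id="catalog">
  <div class="banner"><span class="name">Top seller</span></div>
  <div class="product"><span class="name">Camera A</span></div>
  <div class="product"><span class="name">Camera B</span></div>
  <div class="product"><span class="name">Camera C</span></div>
</div>
"""


def candidate_selectors(element):
    """Yield (css, path) pairs for one element, most specific first.

    css is the CSS-style selector; path is the ElementTree expression
    used here to count matches (a stand-in for a real CSS engine, and
    only correct for elements with a single class).
    """
    tag, cls, el_id = element.tag, element.get("class"), element.get("id")
    if el_id:
        yield f"{tag}#{el_id}", f".//{tag}[@id='{el_id}']"
    if cls:
        yield f"{tag}.{cls}", f".//{tag}[@class='{cls}']"
    yield tag, f".//{tag}"


def selector_for_group(root, element, expected_count):
    """Return the first candidate matching exactly expected_count elements."""
    for css, path in candidate_selectors(element):
        if len(root.findall(path)) == expected_count:
            return css
    return None  # no simple candidate fits; a longer selector would be needed


root = ET.fromstring(PAGE)
clicked = root.findall(".//div[@class='product']")[0]

print(selector_for_group(root, clicked, 3))   # div.product
print(selector_for_group(root, clicked, 24))  # None
```

With the Amazon example, the expected count would be 24 and the clicked node would be one camera object.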

Todo
  1. The current algorithm for finding a CSS selector is not sophisticated, but it serves my purpose very well.
  2. Firequark is based on CSS2 selectors. We want to move to CSS3 selectors because CSS3 has more selectors.
Filed in our tools
Posted on 05 September
Updated on 04 August
Comments

  1. engtech, September 07, 2007 @ 06:49 PM

    I use Firebug a lot for screen scraping with Greasemonkey or Perl. Looking forward to trying this out.

  2. jamiew, May 07, 2008 @ 07:23 PM

    This is a really excellent extension to Firebug and I’m sad to see development has dropped off. Any plans to submit as a patch to Firebug core?

  3. Jimmy, June 27, 2008 @ 09:37 PM

    Any plans to update this for Firefox 3 anytime soon?

  4. Jimmy, July 03, 2008 @ 12:54 PM

    http://groups.google.com/group/firebug/browse_thread/thread/9fbc0002e62eddde?hl=en

  5. Rick, July 16, 2008 @ 10:21 PM

    OMG!! Please update this. My company depends on it!

  6. Silverburn, July 21, 2008 @ 08:58 PM

    indeed, please update for use of FF3 and PLEASE stop making this thing install the most recent version of Firebug first. What is that even good for? If people don’t bother to read the fact that they need the latest version, then why even help them?

  7. Anarşist, July 26, 2008 @ 08:27 AM

    Looks very interesting, thanks for article.

  8. oyun indir, July 29, 2008 @ 03:54 PM

    thanxxx

  9. sonet, August 22, 2008 @ 11:01 PM

    Please do update this to FF3. We really could use this. What a great tool!

  10. Quarks, August 23, 2008 @ 03:22 AM

    Please find the firequark for ff3 here: http://www.quarkruby.com/firequark-1.2.xpi

    Sorry for the delayed release and notification.

  11. sonet, August 23, 2008 @ 03:47 AM

    Woops, yeah, thanks for the note re the google groups announcement. :D

  12. sonet, August 24, 2008 @ 12:50 AM

    Curious Quarks, could you also post your sample scrAPI code as well which actually pulls out the multiple items into an output object? Would just like to see how you’re manipulating this as I’m having a couple of minor hiccups that may be resolved by seeing someone else’s methodology.

  13. Arno.Nym, October 10, 2008 @ 09:10 PM

    i understand it right:

    a) it is a addon for firebug (firebug itself is untouched and i can update firebug if a new version comes out)
    b) it replaces the original firebug and i have no chance to update to a newer firebug version without removing firequark?

  14. Bedava Sohbet, January 13, 2009 @ 05:59 PM

    Thanks for article i like mozilla firefox and firefox addons

  15. Animated Logo Design, March 31, 2009 @ 11:03 AM

    I like this website I feel that this site could be very useful Web designs.

  16. greyzlii, April 19, 2009 @ 03:44 PM

    Wonderfull tool ! I gain a huge time ! Congratulations !

  17. Meeting rooms heathrow, April 19, 2009 @ 07:47 PM

    This is one of my most favorite tools especially with Firefox..Thanks mate

  21. frasi celebri, June 10, 2009 @ 06:14 AM

    it replaces the original firebug and i have no chance to update to a newer firebug version without removing firequark?

  22. graphic design melbourne, June 14, 2009 @ 11:50 AM

    awesome stuff thanks for sharing. I didn’t realise what firequark can do and you have certainly provided a great intro and explanation. Will definitely consider using it

  24. ceai, July 24, 2009 @ 08:57 PM

    @Frasi – no, I’m afraid you have to remove firequark, there is no way around it!

  25. mobila, July 24, 2009 @ 08:59 PM

    I’m going to try this out right now!

  26. Chun, August 04, 2009 @ 03:52 AM

    firequark-1.2.xpi still can not work with FireFox 3.5.1, Any updates??

  27. Quarks, August 04, 2009 @ 09:35 AM

    yeah right now, there is no way to upgrade firebug without removing firequark :(

    The new firequark is here: http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi (works with ff3.5)

  29. Ali, August 14, 2009 @ 11:45 AM

    Why isn’t this handy addon posted on Mozilla Addons where it reaches a greater audience.

  30. Justin McClelland, August 18, 2009 @ 10:40 PM

    Also, in regards to screen-scraping Mozenda is great software as well. I started out using the screen-scraper.com software and it worked fine, there was just a huge learning curve. I have limited experience with Firequark. I switched to http://www.getmozenda.com and the documentation for the software is much better, and it is far more intuitive.

  31. Master Key System, September 13, 2009 @ 05:59 AM

    I have a asp:calendar in a div,it is displayed after clicking on a html image.but when i click on the next month the calendar disappears.i don’t want the calendar to go to server for this.Is it possible to have a asp calendar which does not go to server on changing the month??

  33. Marketing Jobs in Dubai, September 13, 2009 @ 06:04 AM

    One way to achieve a popup calendar is to open the calendar in a new popup window. But this will be a server side calendar. For a client side calendar look for already created controls in the ASP.NET control gallery.

  34. graphic design companies Dubai, September 13, 2009 @ 11:15 AM

    I tried to install it but didn’t worked, don’t know where i go wrong.

  35. social bookmarking, September 17, 2009 @ 02:21 AM

    Firequark is a great extension to Firebug. Firebug alone is just brilliant, but with Firequark bundle, it’s like on the fly.

  36. Arabian ranches property, September 30, 2009 @ 11:45 AM

    I’m using it on my firebug too.

  37. Web Design Dubai, September 30, 2009 @ 11:47 AM

    Firebug is a freeware.. totally “Free” with no deadlines or time-limits.. and yes it increases productivity 300%- at least in my opinion…. it is mainly for firefox.. but there is a version for Safari, IE and Opera called Firebug Lite.. i was exploring the firebug light version on Safari and IE in the past 3 days.. they crashed several times and didn’t have all the features of the Firefox’s version.. so the conclusion, Firebug is only good for firefox..

  38. Intervention, October 04, 2009 @ 02:08 AM

    Thanks for the very useful downloads. They are great tools specially this one : http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi (works with ff3.5)

  39. Auto Broker, October 12, 2009 @ 05:26 PM

    Excellent problem solver and time saver. Much appreciated and thanks for the updates.

  41. Future Pakistan, October 18, 2009 @ 09:22 AM

    I also have a calendar and soon I will introduce.

  43. Make Money For Kids, October 30, 2009 @ 10:05 PM

    I tried to install it, but it would not work. What is wrong?

  44. Brians, November 03, 2009 @ 05:22 PM

    This is a good tool, but i can not seem to get the bundle scraping working on my machine. a friend is using the same thing and it works fine for him.

  45. Eric, November 04, 2009 @ 08:32 AM

    So far so god. But it wont work with Flock Browser based on Firefox engine. So I have to install a standalone FF, but for this great tool I will do it – now!

  46. Venue Hire, November 04, 2009 @ 01:27 PM

    Good work! Your article is an excellent example of why I keep comming back to read your excellent quality content that is forever updated. Thank you!

  47. medical travel, November 13, 2009 @ 06:26 AM

    Great stuff – but I wish there was an implementation for Google Chrome. It seems that plugins tend to lag behind for this browser.

  48. russian girl, November 16, 2009 @ 12:47 PM

    I tried to install it but didn’t worked.

  50. burning calories, November 26, 2009 @ 06:11 PM

    Great article man. It was really interesting to read this post . Thanks a lot for this.

  51. Web design Melbourne, November 26, 2009 @ 07:32 PM

    Russian girl, check if you are using correct version of FF. Also I am not sure is there a Russian version of it? Some of my friends had problems with non English versions.

  52. farmville cheats, November 28, 2009 @ 08:53 AM

    I have installed it, works well.

  53. wedding, December 02, 2009 @ 06:29 PM

    Good sharing. It works like a charm

  54. fotografia ślubna, December 07, 2009 @ 03:58 PM

    Thanks a lot !!

  55. Florida, December 08, 2009 @ 07:38 AM

    This tool is extremely useful for my work. Thanks!

  56. Free Advertising, December 08, 2009 @ 07:40 AM

    Awesome. It works well and glad I found this resource.

  57. Balaton, December 08, 2009 @ 09:00 PM

    I can see, somebody had problems with some versions. I just tested in FF in 3 and 3.5 versions (and on 2 languages – Hu and En). It works perfectly, and it is very useful tool. Thx.

  58. Karl, December 14, 2009 @ 11:52 AM

    FireQuark is what we’ve since long expected as an improvement from FireBug developers, Using CSS selector when XPath proves insufficient to HTML scraping.
