Firequark is an extension to Firebug that aids HTML screen scraping. It automatically extracts a CSS selector for one or more HTML nodes on a web page using Firebug (a web development plugin for Firefox). The generated CSS selector can be given as input to HTML screen scrapers such as Scrapi to extract information. Firequark is built to unleash the power of CSS selectors for HTML screen scraping.
HTML screen scraping is a common technique for extracting information about specific, useful elements from a web page. Whatever the programming language, to extract an element from a web page you need to know its exact location or a key that uniquely identifies it. There are two approaches for uniquely identifying an element: XPath and CSS selectors.
Firebug has built-in functionality for generating the XPath of an HTML element, and Ilya Grigorik has written a good article on using XPath for HTML screen scraping. Firequark, in contrast, extends Firebug to generate CSS selectors for elements on a web page.
Example case: Let's take a practical example where you want to scrape Amazon.com. My goal is to get the product name, price and rating for all the products on the Amazon point-and-shoot camera catalog page. I will use this example in the screencast and the explanation below.
Why Firequark?
XPath vs. CSS Selector
If Firebug already provides XPath, why do I need CSS selectors? XPath is great for scraping XML documents, but for (X)HTML documents it runs into issues such as:
- Different parsers generate different XPaths for the same element, depending on how they handle broken markup: badly nested tags, errors in HTML pages, custom tags, etc.
- Firefox adds a <tbody> tag to table nodes whether or not the HTML page has one, which makes it hard to decide whether to keep the <tbody> tag in the path.
CSS selectors do not suffer from these problems because the CSS selector of an HTML node is based on properties of the node itself and of its neighboring nodes. Attributes such as id and class, on the element and its neighbors, are used to build the selector.
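To make the <tbody> problem concrete, here is a minimal sketch in Python (standard library only; it is not part of Firequark or Scrapi). The two markup snippets, the `prices` helper and both paths are made up for illustration. ElementTree has no CSS selector support, so the class-based lookup `.//td[@class='price']` stands in for what the CSS selector `td.price` would express.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as the server sends it: no <tbody>.
SOURCE_MARKUP = """
<body>
  <table id="prices">
    <tr><td class="price">$199.99</td></tr>
  </table>
</body>
"""

# The same table as a browser DOM would show it: <tbody> inserted.
BROWSER_DOM = """
<body>
  <table id="prices">
    <tbody>
      <tr><td class="price">$199.99</td></tr>
    </tbody>
  </table>
</body>
"""

# Positional path, as one might copy it from the browser's view of the page.
ABSOLUTE_PATH = "./table/tbody/tr/td"
# Attribute-based lookup, standing in for the CSS selector `td.price`.
BY_CLASS = ".//td[@class='price']"


def prices(markup, path):
    """Return the text of every node that `path` matches under <body>."""
    root = ET.fromstring(markup)
    return [td.text for td in root.findall(path)]


for label, markup in (("browser DOM", BROWSER_DOM), ("source markup", SOURCE_MARKUP)):
    print(label, "| absolute path:", prices(markup, ABSOLUTE_PATH),
          "| by class:", prices(markup, BY_CLASS))
```

The positional path only matches the document that happens to contain <tbody>, while the class-based lookup matches both, because it does not depend on the exact chain of ancestors.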
Find CSS Selector manually
It's difficult to find a CSS selector for an HTML node by hand. It's trial and error until you find the right combination of rules (the CSS selector). In the worst case the CSS selector would be the XPath itself. In our example case it is even harder: you have to manually find one CSS selector for the 24 camera products, plus one for each of their attributes such as camera name, price and rating (4 in total).
Bundle Scraping
I am a big fan of Scrapi, an HTML scraping toolkit in Ruby, because it supports bundle scraping. Bundle scraping refers to extracting multiple attributes of an object from a web page in one parse. Bundle scraping is well defined at Qscraper: a hpricot interface to scrapi.
Continuing with the example case, the object is a camera product (there are 24 such objects on the page), and price, rating, product name, etc. are attributes of the camera object. One way of extracting the attributes is to get separate lists of product names, prices and ratings from the web page and then combine these lists. But how do you combine them? What if one of the camera products is not rated, or Amazon does not provide its price?
Firequark is really powerful in solving this problem. Using Firequark, first get one CSS selector that identifies all the camera objects on the page, each of which contains the attribute information (name, price and rating). Mark a camera object as parent and find the CSS selectors of the attributes relative to the parent. Give these CSS selectors as input to an HTML screen scraper that supports bundle scraping, like Scrapi, and bingo! Our screencast below explains this case in detail.
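The parent-plus-relative-selectors idea can be sketched in a few lines of Python. This is only an illustration using the standard library, not Scrapi's API: the sample markup, the class names and the helpers `select`, `text_of` and `scrape_bundle` are all assumptions, and `select` understands nothing beyond a simple `tag.class` selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragment: Camera B has no price, Camera C has no rating.
PAGE = """
<div id="results">
  <div class="product">
    <span class="name">Camera A</span>
    <span class="price">$199.99</span>
    <span class="rating">4.5</span>
  </div>
  <div class="product">
    <span class="name">Camera B</span>
    <span class="rating">3.0</span>
  </div>
  <div class="product">
    <span class="name">Camera C</span>
    <span class="price">$89.00</span>
  </div>
</div>
"""


def select(node, selector):
    """Match a simple 'tag.class' selector at or below `node`.

    A stand-in for a real CSS selector engine; nothing else is supported.
    """
    tag, _, cls = selector.partition(".")
    return [e for e in node.iter(tag) if cls in e.get("class", "").split()]


def text_of(parent, selector):
    """Text of the first match relative to `parent`, or None if it is missing."""
    found = select(parent, selector)
    return (found[0].text or "").strip() if found else None


def scrape_bundle(root, parent_selector, attribute_selectors):
    """One record per parent node; each attribute is looked up inside its parent."""
    return [
        {name: text_of(parent, sel) for name, sel in attribute_selectors.items()}
        for parent in select(root, parent_selector)
    ]


root = ET.fromstring(PAGE)
cameras = scrape_bundle(
    root,
    "div.product",
    {"name": "span.name", "price": "span.price", "rating": "span.rating"},
)

# The naive alternative: scrape each attribute as its own list, then zip them.
names = [e.text for e in select(root, "span.name")]
price_list = [e.text for e in select(root, "span.price")]
naive = list(zip(names, price_list))

print(cameras)
print(naive)
```

In the bundled result, Camera B simply has `None` for its price. In the zipped lists the missing price shifts everything after it, so Camera B is paired with Camera C's price and Camera C drops out entirely.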
Usage - screencast
Click here to view the screencast [format: avi, size: 6.6MB]
Note: Avoid using the CSS selector functionality on the first 2-3 products on a page. The top 2-3 products are usually displayed differently (for example as top sellers or top rated) with the same id, which makes it hard to get good, simple CSS selectors.
Even in our screencast we analyze the 4th product on the page to get a simple CSS selector.
Installation
Click here to install. This will overwrite your current Firebug installation (built on Firebug v1.05).
(FF3 users) Click here to install. This will overwrite your current Firebug installation (built on Firebug 1.2).
(FF3.5 users) Click here to install. This will overwrite your current Firebug installation (built on Firebug 1.4.1 and Firefox 3.5).
Note: if you are doing Firebug development in your working Firefox, please back up your work before installing Firequark.
Documentation
Firequark adds four new functions to each node element in the HTML source tab of Firebug. They appear in the menu when you click on an HTML node in that tab:
[Figure: current menu vs. new menu]
- Get U CSS Selector: Get a CSS selector that uniquely identifies the selected element.
- Get CSS Selector: Get one CSS selector for a group of elements on a web page. When you select this option, it prompts you for how many such elements exist on the page. In our example case, right-click on a camera object node, enter 24 in the prompt box (there are 24 camera objects on the page), press enter, and you will get one CSS selector that extracts all 24 camera objects.
- Mark as parent: Used in bundle scraping to mark an object as parent. Follow it with Get U CSS Selector on the parent's attributes, which returns the CSS selector of each attribute relative to the parent object.
- Unmark parentnode: Unmark the object as parent.
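The element count you type into the prompt suggests a simple way to check a selector: it is good if it matches exactly that many elements, including the one you clicked. The Python sketch below illustrates that check only. It is not Firequark's actual algorithm, and the markup and the helpers `select`, `candidates` and `pick_selector` are hypothetical. It only knows the selector forms `tag`, `tag.class` and `tag#id`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the first product is styled differently and has an id.
PAGE = """
<div id="results">
  <div class="product featured" id="top"><span class="name">Camera A</span></div>
  <div class="product"><span class="name">Camera B</span></div>
  <div class="product"><span class="name">Camera C</span></div>
  <div class="ad"><span class="name">Sponsored</span></div>
</div>
"""


def select(root, selector):
    """Match 'tag', 'tag.class' or 'tag#id' anywhere in the tree (root included)."""
    if "#" in selector:
        tag, _, ident = selector.partition("#")
        return [e for e in root.iter(tag) if e.get("id") == ident]
    if "." in selector:
        tag, _, cls = selector.partition(".")
        return [e for e in root.iter(tag) if cls in e.get("class", "").split()]
    return list(root.iter(selector))


def candidates(element):
    """Yield candidate selectors for `element`, most specific first."""
    if element.get("id"):
        yield f"{element.tag}#{element.get('id')}"
    for cls in element.get("class", "").split():
        yield f"{element.tag}.{cls}"
    yield element.tag


def pick_selector(root, element, expected_count):
    """First candidate matching exactly `expected_count` nodes, `element` among them."""
    for selector in candidates(element):
        matches = select(root, selector)
        if len(matches) == expected_count and any(m is element for m in matches):
            return selector
    return None  # no simple selector fits; a real tool would try richer forms


root = ET.fromstring(PAGE)
products = select(root, "div.product")

print(pick_selector(root, products[1], 3))  # group selector for the 3 products
print(pick_selector(root, products[0], 1))  # unique selector for the first product
print(pick_selector(root, products[1], 1))  # no simple unique selector exists
```

With this sample page the second product yields `div.product` when three matches are expected, and the first product yields `div#top` as its unique selector. The second product has no unique selector among these simple forms, so a real tool would have to go further, for example by adding ancestors or positions.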
Todo
- The current algorithm for finding a CSS selector is not sophisticated, but it serves my purpose very well.
- Firequark is based on CSS2 selectors. We want to move to CSS3 selectors because CSS3 has more selectors.

I use Firebug a lot for screen scraping with Greasemonkey or Perl. Looking forward to trying this out.
This is a really excellent extension to Firebug and I’m sad to see development has dropped off—any plans to submit as a patch to Firebug core?
Any plans to update this for Firefox 3 anytime soon?
http://groups.google.com/group/firebug/browse_thread/thread/9fbc0002e62eddde?hl=en
OMG!! Please update this. My company depends on it!
indeed, please update for use of FF3 and PLEASE stop making this thing install the most recent version of Firebug first. What is that even good for? If people don’t bother to read the fact that they need the latest version, then why even help them?
Looks very interesting, thanks for article.
thanxxx
Please do update this to FF3. We really could use this. What a great tool!
Please find the firequark for ff3 here: http://www.quarkruby.com/firequark-1.2.xpi
Sorry for the delayed release and notification.
Woops, yeah, thanks for the note re the google groups announcement. :D
Curious Quarks, could you also post your sample scrAPI code as well which actually pulls out the multiple items into an output object? Would just like to see how you’re manipulating this as I’m having a couple of minor hiccups that may be resolved by seeing someone else’s methodology.
Do I understand it right:
a) it is an add-on for Firebug (Firebug itself is untouched and I can update Firebug if a new version comes out), or b) it replaces the original Firebug and I have no chance to update to a newer Firebug version without removing Firequark?
Thanks for article i like mozilla firefox and firefox addons
I like this website I feel that this site could be very useful Web designs.
Wonderfull tool ! I gain a huge time ! Congratulations !
This is one of my most favorite tools especially with Firefox..Thanks mate
awesome stuff thanks for sharing. I didn’t realise what firequark can do and you have certainly provided a great intro and explanation. Will definitely consider using it
@Frasi – no, I'm afraid you have to remove Firequark, there is no way around it!
I’m going to try this out right now!
firequark-1.2.xpi still can not work with FireFox 3.5.1, Any updates??
yeah right now, there is no way to upgrade firebug without removing firequark :(
The new firequark is here: http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi (works with ff3.5)
Why isn’t this handy addon posted on Mozilla Addons where it reaches a greater audience.
Also, in regards to screen-scraping Mozenda is great software as well. I started out using the screen-scraper.com software and it worked fine, there was just a huge learning curve. I have limited experience with Firequark. I switched to http://www.getmozenda.com and the documentation for the software is much better, and it is far more intuitive.
I have a asp:calendar in a div,it is displayed after clicking on a html image.but when i click on the next month the calendar disappears.i don’t want the calendar to go to server for this.Is it possible to have a asp calendar which does not go to server on changing the month??
One way to achieve a popup calendar is to open the calendar in a new popup window. But this will be a server side calendar. For a client side calendar look for already created controls in the ASP.NET control gallery.
I tried to install it but didn’t worked, don’t know where i go wrong.
Firequark is a great extension to Firebug. Firebug alone is just brilliant, but with Firequark bundle, it’s like on the fly.
I’m using it on my firebug too.
Firebug is a freeware.. totally “Free” with no deadlines or time-limits.. and yes it increases productivity 300%- at least in my opinion…. it is mainly for firefox.. but there is a version for Safari, IE and Opera called Firebug Lite.. i was exploring the firebug light version on Safari and IE in the past 3 days.. they crashed several times and didn’t have all the features of the Firefox’s version.. so the conclusion, Firebug is only good for firefox..
Thanks for the very useful downloads. They are great tools specially this one : http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi (works with ff3.5)
Excellent problem solver and time saver. Much appreciated and thanks for the updates.
I also have a calendar and soon I will introduce.
I tried to install it, but it would not work. What is wrong?
This is a good tool, but i can not seem to get the bundle scraping working on my machine. a friend is using the same thing and it works fine for him.
So far so good. But it won't work with the Flock browser, which is based on the Firefox engine. So I have to install a standalone FF, but for this great tool I will do it – now!
Good work! Your article is an excellent example of why I keep comming back to read your excellent quality content that is forever updated. Thank you!
Great stuff – but I wish there was an implementation for Google Chrome. It seems that plugins tend to lag behind for this browser.
Great article man. It was really interesting to read this post . Thanks a lot for this.
Russian girl, check if you are using correct version of FF. Also I am not sure is there a Russian version of it? Some of my friends had problems with non English versions.
I have installed it, works well.
Good sharing. It works like a charm
Thanks a lot !!
Been looking for such a tool for some time. It works great and I am also a Firebug user and use it to check site performance.
This tool is extremely useful for my work. Thanks!
Awesome. It works well and glad I found this resource.
I can see, somebody had problems with some versions. I just tested in FF in 3 and 3.5 versions (and on 2 languages – Hu and En). It works perfectly, and it is very useful tool. Thx.
I want to express my admiration of your writing skill and ability to make reader to read the while thing to the end. I would like to read more of your blogs and to share my thoughts with you. I will be your frequent visitor, that’s for sure.
FireQuark is what we’ve since long expected as an improvement from FireBug developers, Using CSS selector when XPath proves insufficient to HTML scraping.
Why is it not released as a real Firebug extension? Replacing existing Firebugs is really a showstopper.
great news since the last time!!
i just really love it :)
Post is nicely written and it contains many good things for me. I am glad to find your impressive way of writing the post. Now it become easy for me to understand and implement the concept. Thanks for sharing the post.
Good job! THANKS! You guys do a great blog, and have some great contents. Keep up the good work.
best regards,
Post very nicely written, and it contains useful facts. I am happy to find your distinguished way of writing the post. Now you make it easy for me to understand and implement. Thanks for sharing with us.
Thats the cool’s themes i have see in a long time. Very nice
In fact very useful tool and works perfectly with different languages.
We were waiting for that tool. CSS is much easier with a working selector Thanks, Patrick
I love this post. This is very informative blog.
OMG!! thanxx 4 sharing this, exactly what I need
I, too, wish that there was an implementation for Google Chrome. Hopefully in the near future.
Thank you for the great article!!
This really improves my work. Thx for the hints.
Vote for an implementation in Google Chrome too
FF 3.5.7 German!—FireBug 1.4.1 & this :
http://www.quarkruby.com/assets/2009/8/4/firequark-3.5.2.xpi
did’nt work, cause of the german vs. i guess—will try english now
Do you plan to support FF 3.6 ?
are there any updates for this?
Anyway to use this on dreamweaver?
Firequark is really a usefull plugin. But sometimes it slows down firefox. Anyone else with such a problem?
I like mozilla firefox and firefox addons. Thanks for this article!
Lot of thanks for this article. Its really a very good topic. Its so interesting and attractive. I like it so much.
I really impressed by your post. Thank you for this great information, you write very well which i like very much.
This is a great add on for fire fox. Please make FireFox 3 updated and please post it as soon as possible. Thanks
I would like to come back to your blog tomorrow and get dome note down for my lab work.
Thank you for this information and Again thanks for sharing your knowledge with us
Thanks for sharing this tool. I hope it’ll be helpful to the project am currently working in. God Bless and Good luck!
Thank you for the nice article
I must admit that this post really interesting, thanks for the writing!
Thanks. I just install a Firequark. Highly recommended.
Thanks for good information.
Now I don’t have to load up CSSedit (another awesome program, like firebug CSS inspection but you can save your changes) just to pull out a weird selector, I can just launch firequark, use the inspector and right click on an element to get a unique selector…how cool is that !!
Great information about HTML screen scraping. I am really interested in learning more about this. Thanks so much!
Thanks for the good information. I will definitely have to share this site with coworkers.
It is my great pleasure to visit your website and to enjoy your excellent post here. I like that very much. I can feel that you paid much attention for those articles, as all of them make sense and are very useful. Thank you for sharing with us….........
Good inrasting sharing.
Awesome job mate ! Keep us posted
Thanks for good information that comes out to read.
Huh… Interesting. I can’t stop myself reading from the beginning till the end, thanks for sharing.
I really enjoyed this post. I will be back and have bookmarked your site.
Hey that was an interesting article for sure. Thank you ver much for putting so much effort in this blog. It really means a lot to me.
Regards Festplatte from Germany
Thanks for recommending Firequark, I will check it out and give it a thorough test.
I like the foundation of this blog has a great variety of comments I really like it, several points of view helps in the appreciation of the subject,is very interesting and I would like learn more.
Firequark is very interesting, thanks.