Not exactly. Why?
- Setting up an MS SQL connection from a Rails application is a serious pain in the a**, and it took me days of research to get it right. I have shared my findings in the following two sections.
Before I start: if you have any doubts about character sets, or are unfamiliar with terms like UTF-8 and Unicode, I recommend reading Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The problem this article addresses is how to handle foreign or accented characters in HTML screen scraping. We encountered it while working on our website information project, Quarkbase.com.
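As a minimal sketch of the kind of problem involved, assume a scraped page arrives as ISO-8859-1 bytes while the rest of the application expects UTF-8. The helper name `to_utf8` and the sample bytes are illustrative assumptions, not code from the original article:

```ruby
# Hypothetical helper: tag the raw scraped bytes with their real charset,
# then transcode them to UTF-8.
def to_utf8(raw_bytes, source_charset = "ISO-8859-1")
  raw_bytes.dup.force_encoding(source_charset).encode(
    "UTF-8",
    invalid: :replace, # replace bytes that are invalid in the source charset
    undef: :replace    # replace characters that have no UTF-8 mapping
  )
end

# "café" as a Latin-1 page would serve it: é is the single byte 0xE9.
scraped = "caf\xE9".b
text = to_utf8(scraped)
puts text
puts text.encoding
puts text.valid_encoding?
```

The key assumption is that the source charset is known (for example from the HTTP headers or a meta tag); if it is guessed wrongly, the transcoding succeeds but produces the wrong characters.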
ActiveResource is a great concept for consuming Rails-style REST APIs, but unfortunately most REST APIs are not Rails-style. This means that you will frequently end up modifying ActiveResource to consume non-Rails-style REST APIs. This article is about understanding ActiveResource and how to tweak/extend it to consume such APIs. We will mainly concentrate on reading data, i.e. the GET method.
sudo gem install active_youtube
#### Video
## search for videos on 'ruby'
search = Youtube::Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
puts search.entry.length

## video information of id = ZTUVgYoeN_o
vid = Youtube::Video.find("ZTUVgYoeN_o")
puts vid.group.content[0].url

## video comments
comments = Youtube::Video.find_custom("ZTUVgYoeN_o").get(:comments)
puts comments.entry[0].link[2].href

## searching with category/tags
results = Youtube::Video.search_by_tags("Comedy")
puts results[0].entry[0].title

#### STANDARDFEED
## retrieving standard feeds
most_viewed = Youtube::Standardfeed.find(:most_viewed, :params => {:time => 'today'})
puts most_viewed.entry[0].group.content[0].url

#### USER
## user's profile - guthrie
user_profile = Youtube::User.find("guthrie")
puts user_profile.link[1].href

#### PLAYLIST
## get playlist - multiple elements in playlist
playlist = Youtube::Playlist.find("EBF5D6DC4589D7B7")
puts playlist.entry[0].group.content[0].url
We have been using the Ruby library Scrapi quite a lot for HTML scraping in QuarkRank and other projects. Most of the time, I want to extract/scrape specific information from a page and dump it directly into the database. A few processes were regularly repeated in my code, so to make my code more DRY, I have enhanced Scrapi so that manipulating the extracted information becomes easier.
Let's consider an example: for each of the top 250 movies at IMDB, I want to extract and store in the DB the following properties :
This article is about consuming the YouTube API in your Ruby/Rails project using ActiveResource. It also serves as an example of how to extend ActiveResource to consume a non-REST-style API.
Benefits of using our extension to ActiveResource :
Web widgets are widely used across the Internet but still lack good documentation. From online advertisements to videos to blogs, widgets are everywhere. Some popular examples are Google Adsense, Youtube, MyBlogLog widgets and Twitter badges.
Note: This page will be slow to load because of many widget examples.
To define it formally: a web widget is a portable chunk of code that can be installed and executed within any separate HTML-based web page by an end user without requiring additional compilation. A widget adds some content to that page that is (mostly) not static. Widgets generally originate from third parties, though they can be home-made. Widgets are also known as modules, snippets, and plug-ins.
This article is my journey of understanding and making a widget myself. I have tried to make things look simple and insightful by taking a lot of examples.
QuarkShop is a next-generation shopping experience for finding the product of your choice based on opinions from across the Internet. You give your preferences and QuarkShop will fetch the best matched products for YOU.
At launch, we have the following products :
You can give your preference by selecting features you LIKE and we will find the best matched products. Selection of preference can be used along with other navigation parameters like brand, price, etc. These features are automatically extracted from reviews.
Then click submit, and you will see the three best matched products with their most relevant information. There is also an option to see more products, but we think that the top three products are what consumers care about once they have given their preferences. This makes searching for products simple and fast.
You can compare the top three products on the spot! When you click on the Compare Top Three button, you will see QuarkGraph, a graphical comparison showing the feature scores for each product. The graph is not just an image: you can play with it.
Features are keywords that are discussed in a review. When you read a review, you are always looking for such keywords and people's comments about them. QuarkShop gives you the option to choose the features/keywords that you like, and it will give you the best matched products. You can also read sentences from actual reviews that give an opinion on a keyword (Snippets).

We're back to blogging after a break of more than a month. We have been very busy developing QuarkRank, a repository of summarized reviews. It is the result of more than 18 months of dedicated research on natural language processing, HTML scraping and user interfaces. Finally, we are happy to make this product live!!!
Currently, the repository is accessible through a RESTful API or a widget. Moreover, it's absolutely free!
"From product reviews, restaurant reviews, hotel reviews, to others, QuarkRank provides the information for making decisions at the point of purchase. Proprietary technology lies at the core of QuarkRank's ability to automatically summarize the opinions of millions of consumer reviews on the internet."
QuarkRank is an intelligent engine which crawls the web for opinions on various products/services and automatically summarizes them feature-by-feature using its natural language processing technique.

QuarkRank will help consumers quickly educate themselves, based on the most unbiased information possible, without spending hours reading reviews online.
If you use QuarkRank data, your customers will feel confident making purchase decisions at your site without going to competitors, and the return rate of impulse purchases will drop at the same time.
QuarkRank provides a RESTful API to access our huge repository of summarized reviews. Send it simple HTTP requests and it will send back basic XML responses, which means you can interact with our API from any language.
It provides data in XML and JSON formats. There is no limit on using the API. For detailed information, visit :
ActiveResource can be used to access QuarkRank's RESTful API in Ruby on Rails. Note : You need to apply this tiny patch to ActiveResource.
class Product < ActiveResource::Base
  self.site = "http://username:password@quarkrank.com"

  def self.list options={:category=>"camera"}
    find(:all, :params=>options)
  end

  def self.show sku, site=nil, all_features=false
    params = {}
    params[:site] = site unless site.nil?
    params[:all_features]="true" if all_features
    find(sku, :params=>params)
  end

  def self.search query
    find(:all,:params=>{:search=>query})
  end
end

class Snippet < ActiveResource::Base
  self.site = "http://username:password@www.quarkrank.com/products/:product_id"

  def self.snippets product, name
    find name.gsub(" ","%20"),:params=>{:product_id=>product}
  end
end
More information at:
jQuery is a JavaScript library which follows the unobtrusive paradigm for application development in JavaScript. jQuery inherently supports behavior-driven development and is based on traversing HTML documents using CSS selectors. Prototype, on the other hand, is a JavaScript library for class-driven development which makes working with JavaScript easier. Prototype is well supported in Ruby on Rails through helper functions.
I have always used Prototype library for most of my projects until I was introduced to jQuery three months back ... and it enchanted me.
HTTP is a stateless protocol, which makes it hard to uniquely track a visitor to a web application. State between browser and server is managed through session IDs, each of which uniquely identifies a client browser.
Session IDs can be stored and communicated in one of the following ways :
Information stored across multiple requests from a client browser is called session data. Session data for each visitor can be stored on the server or in cookies. When the client sends a request to the server, the session data is retrieved from session storage using the session ID sent by the client browser. A common example of session data is the user information used for authentication.
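The lookup described above can be sketched with a server-side store keyed by session ID. This is only an illustration under stated assumptions: the `SessionStore` class and its in-memory Hash are hypothetical and are not how Rails implements its session stores.

```ruby
require "securerandom"

# Hypothetical in-memory session storage, keyed by session ID.
class SessionStore
  def initialize
    @sessions = {}
  end

  # Create a new session and return its hard-to-guess ID.
  def create(data = {})
    session_id = SecureRandom.hex(16)
    @sessions[session_id] = data
    session_id
  end

  # Look up session data with the ID sent by the browser; nil if unknown.
  def fetch(session_id)
    @sessions[session_id]
  end
end

store = SessionStore.new
# First request: the server creates a session and hands the ID to the browser.
session_id = store.create({ user_id: 42 })
# Later request: the browser sends the ID back and the server restores the state.
session_data = store.fetch(session_id)
puts session_data[:user_id]
```

Only the ID travels between browser and server here; the data itself stays in server-side storage.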
These days, it's hard to imagine a good web application that does not use sessions.
A wonderful article on implementation techniques of Session ID.
Ruby on Rails does a decent job of handling security concerns in the background. You will have to configure your application to avoid a few security attacks, while plugins are required for the many security concerns that Rails handles poorly or not at all.
In this article I have described the security issues related to a ruby on rails web application. I have followed DRY by linking to articles with good explanation and solutions to security concerns wherever required. This guide can also be used as a quick security check for your current web application.
This article extends our acts_as_solr : search and faceting tutorial and talks about how to manage rails associations, solr indexes and more with acts_as_solr.
rebuild_solr_index is a class method that rebuilds your model indexes after an import of external data. For large tables, rebuilding the Solr index is a time-consuming process. See the fifth line in the pseudo code below (the index optimization call): it is what makes rebuild_solr_index slow. For large tables, you do not want optimization to take place for every object added to the index. On the other hand, removing the optimization calls slows down the process of updating the Solr index.
## pseudo code
def rebuild_solr_index
  for_each_row_in_table do |doc|
    doc.save_to_solr_index
    index.optimize
  end
end
The solution to the problem is to use batch_size in #rebuild_solr_index. With a batch size of, say, 100, the index optimization call is executed once after every 100 rows indexed.
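The effect of batching can be sketched with a stand-in index that only counts calls. `FakeIndex` and `rebuild_index` are hypothetical names used for illustration; this is not the acts_as_solr implementation, and it assumes the final partial batch is optimized as well.

```ruby
# Stand-in for a Solr index; it only counts calls so the effect is visible.
class FakeIndex
  attr_reader :added, :optimize_calls

  def initialize
    @added = 0
    @optimize_calls = 0
  end

  def add(_doc)
    @added += 1
  end

  def optimize
    @optimize_calls += 1
  end
end

# Hypothetical rebuild: optimize once per batch instead of once per row.
def rebuild_index(rows, index, batch_size)
  rows.each_slice(batch_size) do |batch|
    batch.each { |doc| index.add(doc) }
    index.optimize
  end
  index
end

index = rebuild_index((1..250).to_a, FakeIndex.new, 100)
puts index.added           # 250 rows indexed
puts index.optimize_calls  # 3 optimize calls (batches of 100, 100 and 50)
```

With a batch size of 1 this degenerates to the pseudo code above (one optimize call per row), so the batch size is the knob that trades rebuild speed against how often the index gets optimized.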
Firequark is an extension to Firebug that aids the process of HTML screen scraping. Firequark automatically extracts the CSS selector for one or more HTML nodes on a web page using Firebug (a web development plugin for Firefox). The generated CSS selector can be given as input to HTML screen scrapers like Scrapi to extract information. Firequark is built to unleash the power of CSS selectors for HTML screen scraping.
HTML screen scraping is a common technique for extracting information about specific, useful elements from a web page. Regardless of programming language, to extract an element from a web page one needs to know its exact location or a key that uniquely identifies it. There are two approaches to uniquely identifying an element: XPath and CSS selectors.
Firebug has built-in functionality for generating the XPath of an HTML element. Ilya Grigorik has written a good article on using XPath for HTML screen scraping. Firequark, in contrast, extends Firebug to generate CSS selectors for elements on a web page.
Example case : Let's take a practical example in which you want to scrape Amazon.com. My goal is to get the product name, price and rating for all the products on the Amazon point-and-shoot camera catalog page. I will use this example in the screencast and explanation below.
While working with scrapi, I found that there is no external documentation for the HTML::Tag class. This article is here to ensure no one can say that again.
In scrapi, HTML::Node represents an HTML node, which can be of two types: HTML::Text for a text node and HTML::Tag for an HTML tag node. Here is a code snippet in which scrapi returns the HTML node as an HTML::Tag object.