UTF-8 and HTML screen scraping in Ruby on Rails


Before I start: if you are unfamiliar with character sets (terms like UTF-8 and Unicode), I recommend reading Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The problem this article addresses is how to handle foreign or accented characters in HTML screen scraping. We ran into it while working on our website information project, Quarkbase.com.

The configuration and tools we will be using are rails 2.1.2, scrapi 1.2.0, hpricot 0.8.1 and curb 0.4.4.0. The example website that we will try to scrape is http://196m.cn

At Quarkbase, when we aggregate information from a website, scraping text that contains non-English characters doesn't always work. For example, when we extract the description for http://196m.cn from its HTML page, the extracted information is the first line after 'overview' in the screenshot below:
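The kind of failure described here can be reproduced in a few lines. This is a minimal sketch, assuming Ruby 1.9 or later (for `String#encode`) and assuming the page is served in GBK; the charset is a guess for illustration and is not stated anywhere above.

```ruby
# Bytes as they might arrive over the wire from a page served in GBK.
# GBK is an assumption for illustration; check the page's real charset.
raw = "\u4E2D\u6587".encode("GBK").force_encoding("BINARY")

# Treated as UTF-8, these bytes are not even valid, so the scraped
# text would come out garbled.
looks_valid = raw.dup.force_encoding("UTF-8").valid_encoding?

# Label the bytes with their real charset, then transcode to UTF-8.
utf8 = raw.dup.force_encoding("GBK").encode("UTF-8")
```

The point of the sketch is that the bytes are never "wrong"; they only need to be tagged with the charset they were written in before being converted.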



Filed in ruby on rails
Posted on 22 September
6 comments

Scrapi enhancements


We have been using the Ruby library Scrapi quite a lot for HTML scraping in QuarkRank and other projects. Most of the time, I want to extract specific information from a page and dump it directly into the database. A few steps were repeated regularly in my code, so to make it more DRY I have enhanced Scrapi so that manipulating the extracted information becomes easier.

Let's consider an example: for each of the top 250 movies at IMDB, I want to extract and store in the DB the following properties:

  1. imdb_id
  2. name
  3. release date
  4. rating
  5. tagline
  6. runtime
  7. director
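To make the goal concrete, here is a rough sketch in plain Ruby of turning raw scraped strings into a typed record ready for the database. It does not use Scrapi or show the enhancements themselves; `Movie`, `CONVERTERS`, `build_movie` and all sample values are hypothetical names and made-up data.

```ruby
require "date"

# Hypothetical record for one movie; field names follow the list above.
Movie = Struct.new(:imdb_id, :name, :release_date, :rating,
                   :tagline, :runtime, :director)

# Hypothetical converters from raw scraped strings to typed values.
CONVERTERS = {
  :release_date => lambda { |s| Date.parse(s) },
  :rating       => lambda { |s| s.to_f },
  :runtime      => lambda { |s| s[/\d+/].to_i } # "120 min" -> 120
}

# Build a Movie from a hash of raw strings, stripping whitespace and
# applying a converter where one is defined for the field.
def build_movie(raw)
  values = Movie.members.map do |field|
    text = raw[field].to_s.strip
    converter = CONVERTERS[field]
    converter ? converter.call(text) : text
  end
  Movie.new(*values)
end

# Made-up sample values, not real IMDB data.
movie = build_movie(
  :imdb_id => "tt0000000", :name => " Example Movie ",
  :release_date => "14 October 1994", :rating => "8.5",
  :tagline => "An example tagline.", :runtime => "120 min",
  :director => "Jane Doe"
)
```

Keeping the conversions in one table is the DRY part: each field's cleanup is declared once instead of being repeated wherever a value is scraped.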
Filed in ruby on rails
Posted on 30 January
7 comments. Updated on 23 February

Why I moved from Prototype to jQuery



jQuery is a JavaScript library that follows the unobtrusive paradigm for JavaScript application development. jQuery inherently supports behavior-driven development and is based on traversing HTML documents using CSS selectors. Prototype, on the other hand, is a JavaScript library for class-driven development that makes working with JavaScript easier. Prototype is well supported in Ruby on Rails via helper functions.

I used the Prototype library for most of my projects until I was introduced to jQuery three months ago ... and it enchanted me.

Filed in ruby on rails
Posted on 06 November
33 comments. Updated on 23 February

Sessions and cookies in Ruby on Rails


An important issue that is rarely talked about, with little documentation on the Internet. So, here we go ... a guide to sessions and cookies in Rails. Sessions and cookies are an integral part of any good web application, and Rails has good support for them. Continuing with our DRY approach, this guide links to good articles with detailed descriptions wherever necessary.

Table of Contents

  1. Introduction
  2. Sessions
    1. Session in rails
    2. Configure your sessions
    3. Storage options
    4. Session storage limitations
    5. Session and Security
    6. HowTo
      1. Implement session expiration
      2. Delete stale sessions
      3. Find out active users
      4. Access session data using session_id
    7. Miscellaneous
  3. Cookies
    1. Cookie on rails
    2. cookies vs. request.cookies
    3. CookieJar
    4. Miscellaneous

Introduction

HTTP is a stateless protocol, which makes it hard to uniquely track a visitor to a web application. State between browser and server is managed through session IDs, each of which uniquely identifies a client browser.

Session IDs can be stored and communicated in one of the following ways :
  1. Embedded in URL
  2. In form field
  3. Using cookies.

Information stored across multiple requests from a client browser is called session data. Session data for each visitor can be stored on the server or in cookies. When the client makes a request, the session data is retrieved from session storage using the session ID sent by the client browser. A common example of session data is the user information used for authentication.
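The lookup described above can be sketched in plain Ruby. This is not how Rails implements sessions; `SessionStore` is a hypothetical in-memory class used only to show the relationship between a session ID and its session data.

```ruby
require "securerandom"

# Hypothetical in-memory session storage: session ID => session data.
class SessionStore
  def initialize
    @sessions = {}
  end

  # Create a session and return its random ID, which the browser would
  # send back on later requests (for example in a cookie).
  def create
    id = SecureRandom.hex(16)
    @sessions[id] = {}
    id
  end

  # Look up the session data for an ID; nil if the ID is unknown.
  def fetch(id)
    @sessions[id]
  end
end

store = SessionStore.new
sid = store.create                           # first request: issue an ID
store.fetch(sid)[:user_id] = 42              # e.g. after a successful login
current_user_id = store.fetch(sid)[:user_id] # later request, same ID
```

Only the ID travels between browser and server in this sketch; the data itself stays in the store, which corresponds to the server-side storage option.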

These days, it's hard to imagine a good web application that does not use sessions.

A wonderful article on implementation techniques of Session ID.
Posted on 21 October
15 comments. Updated on 23 February

Ruby on Rails Security Guide


Ruby on Rails does a decent job of handling security concerns in the background. You will have to configure your application to avoid a few security attacks, while plugins are required for many security concerns that Rails handles poorly or not at all.

In this article I describe the security issues relevant to a Ruby on Rails web application. I have followed DRY by linking to articles with good explanations and solutions to security concerns wherever required. This guide can also be used as a quick security check for your current web application.

Table of Contents

  1. Authentication
  2. Model
    1. SQL Injection
    2. Activerecord Validation
    3. Creating records directly from parameters
  3. Controller
    1. Exposing methods
    2. Authorize parameters
    3. Filter sensitive logs
    4. Cross Site Reference (or Request) Forgery (CSRF)
    5. Minimize session attacks
    6. Stop spam on your website from DNS Blacklist
    7. Caching authenticated pages
  4. View
    1. Cross site scripting (XSS) attack
    2. Anti-spam form protection
    3. Hide mailto links
    4. Use password strength evaluators
  5. Miscellaneous
    1. Transmission of Sensitive information
    2. File upload
    3. Secure your setup / environment
    4. Mysql configuration
    5. Use good passwords
  6. Security plugins directory
Posted on 20 September
85 comments. Updated on 23 February

Advanced acts_as_solr


This article extends our acts_as_solr : search and faceting tutorial and talks about how to manage rails associations, solr indexes and more with acts_as_solr.

Table Of Contents

  1. Rebuild Solr index
  2. Import existing Solr index or your custom Solr schema.xml
  3. Highlight search term
  4. Rails associations and acts_as_solr
  5. Tips

Rebuild Solr Index

rebuild_solr_index is a class method that rebuilds your model's index, for example after importing external data.

For large tables, rebuilding the Solr index is time consuming. See the fifth line of the pseudo code below (the index.optimize call): it is what makes rebuild_solr_index slow. For large tables, you do not want optimization to run for every object added. On the other hand, removing the optimization calls entirely slows down updating the Solr index.

## pseudo code
def rebuild_solr_index
  for_each_row_in_table do |doc|
    doc.save_to_solr_index
    index.optimize
  end
end

The solution is to pass a batch_size to #rebuild_solr_index. With a batch size of, say, 100, the index optimization call is executed once after each batch of 100 rows is indexed.
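The effect of batching can be sketched in plain Ruby. This is not the plugin's implementation: `FakeIndex` and `rebuild_index` are hypothetical stand-ins that only count calls, so the difference between per-row and per-batch optimization is visible.

```ruby
# Stand-in for a Solr index that just counts calls (hypothetical class).
class FakeIndex
  attr_reader :added, :optimize_calls

  def initialize
    @added = 0
    @optimize_calls = 0
  end

  def add(doc)
    @added += 1
  end

  def optimize
    @optimize_calls += 1
  end
end

# Index rows in batches, optimizing once per batch instead of once per row.
def rebuild_index(rows, index, batch_size)
  rows.each_slice(batch_size) do |batch|
    batch.each { |doc| index.add(doc) }
    index.optimize
  end
end

index = FakeIndex.new
rebuild_index((1..250).to_a, index, 100)
```

With 250 rows and a batch size of 100, the rows fall into three batches, so `optimize` runs three times rather than 250.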

Posted on 14 September
6 comments. Updated on 23 February

Domain Forwarding or URL Redirection


Also known as URL forwarding or domain redirection, it is a technique for making a web page available under many URLs.

Check out the Wikipedia article on URL redirection for the uses of redirection.

In short:
  • Client Side Forwarding : URL in client browser changes.
  • Server Side Redirection : URL in client browser does NOT change. User remains on same website/domain.
  • Server Side Forwarding or DNS Forwarding : URL in client browser does NOT change. User is moved to NEW website.

All the above methods are explained below in detail. I will be using Ruby on Rails for illustration.
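As a rough illustration of the first case, where the URL in the browser changes, here is a minimal sketch in plain Ruby. It assumes that this kind of forwarding is done with an HTTP 301 response carrying a Location header, which may differ from the method the article goes on to use; `NEW_HOST` and `redirect_app` are hypothetical names, and the host is a placeholder.

```ruby
# Rack-style app (a callable taking the request env and returning
# [status, headers, body]) that answers every request with a redirect.
# NEW_HOST is a placeholder, not a URL from the article.
NEW_HOST = "http://www.example.com"

redirect_app = lambda do |env|
  target = NEW_HOST + env["PATH_INFO"].to_s
  [301, { "Location" => target }, []]
end

# Simulate a request for /about without starting a server.
status, headers, _body = redirect_app.call("PATH_INFO" => "/about")
```

On receiving such a response, the browser requests the Location URL itself, which is why the address bar ends up showing the new URL.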

Filed in ruby on rails
Posted on 25 August
4 comments. Updated on 09 September

acts_as_solr : search and faceting



Solr and acts_as_solr

Solr is a search server based on the Lucene Java search library, with an HTTP/XML interface. Using Solr, large collections of documents can be indexed based on strongly typed field definitions, thereby taking advantage of Lucene's powerful full-text search features.

acts_as_solr is a Ruby on Rails plugin that adds Solr capabilities to ActiveRecord models. It hides the configuration and manual setup that Solr needs and provides you with simple find_by... methods. acts_as_solr can be used as a replacement for acts_as_ferret because of its built-in full-text search capabilities ;-) . The purpose of this article is to explain acts_as_solr with examples.


Getting Started

Installation: Installation is well explained on the acts_as_solr homepage and in getting started with acts_as_solr.

Note: acts_as_solr requires JRE 1.5 on the system. Before running any of the Solr methods, make sure you start the Solr server with the rake solr:start command.

Our example model for this tutorial will be DigitalCamera [classname: Camera] with the following fields:

  • name (type:string)
  • brand (type:string) [we want faceted browsing on this field]
  • resolution (type:float) [we want faceted browsing on this field]
  • other fields which we do not want to index
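To show what faceted browsing on brand and resolution means, here is a small sketch in plain Ruby. It does not use Solr or acts_as_solr; the `Camera` struct, the `facet_counts` helper and the sample cameras are hypothetical and exist only to illustrate the idea of counting records per field value.

```ruby
# Hypothetical sample data; only name, brand and resolution are modelled.
Camera = Struct.new(:name, :brand, :resolution)

cameras = [
  Camera.new("Alpha One", "BrandA", 8.0),
  Camera.new("Alpha Two", "BrandA", 10.0),
  Camera.new("Beta Zoom", "BrandB", 8.0)
]

# Count how many records fall under each distinct value of a field.
def facet_counts(records, field)
  counts = Hash.new(0)
  records.each { |record| counts[record[field]] += 1 }
  counts
end

brand_facets      = facet_counts(cameras, :brand)
resolution_facets = facet_counts(cameras, :resolution)
```

A search interface would display these counts next to each brand or resolution so that a visitor can narrow the result set with one click.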

Posted on 12 August
17 comments. Updated on 09 September