Monthly Archives: October 2009

Nutch – features and configuration details

Nutch is a framework for building web-scale crawlers and search applications. It is free and Open Source and uses Lucene for the search and index component. Nutch is built on top of Lucene adding functionality to efficiently crawl the web or intranet. Now the most obvious question is “Why Nutch when there is Google for all our needs?”. Some of the reasons could be:

  • Highly modular architecture allowing developers to create plug-ins for media-type parsing, data retrieval and other features like clustering the search results.

  • Transparency of ranking algorithms.

  • The Google Mini appliance to index about 300,000 documents costs ~$10,000. Ready to invest?

  • Learn how a search engine works and customize it!

  • Add functionalities that Google hasn’t come up with.

  • Document level authentication.

Nutch Architecture:


Data structures for a Nutch Crawl:

  • Crawl Database or Sequence file: <URL,metadata> – set of all URLs to be fetched.

  • Fetch List – subset of CrawlDB – URLs to be fetched in one batch.

  • Segment – Folder containing all data related to one fetching batch. (Fetch list + fetched content + plain text version of content + anchor text + URLs of outlinks + protocol and document level metadata etc.)

  • LinkDB – <URL, inlinks> – contains inverted links.

Setting up Nutch to crawl and search:

  • A shell script is used for creating and maintaining indexes.

  • A search web application is used to perform search using keywords.

Step 1:

Download Nutch and extract to disk, say /home/ABC/nutch directory (NUTCH_HOME).

Step 2:

Download and set up a servlet container, e.g.:

  • Apache Tomcat

  • GlassFish server

Step 3:

Get a copy of the Nutch code

Step 4: Creating the index

The nutch ‘crawl’ command expects to be given a directory containing files that list all the root level urls to be crawled. So:

  • Step 4.1: Create a ‘urls’ directory in $NUTCH_HOME.

  • Step 4.2: Create files inside the urls folder with a set of seed urls from which all resources which need to be crawled can be reached.

  • Step 4.3: Crawling Samba shared folders requires giving the URL of the shared folder, e.g. smb://dnsname/nutch/PDFS/, and editing the configuration file in $NUTCH_HOME/conf to reflect the properties of the shared folder.

  • Step 4.4: Restrict which URLs are and are not crawled by writing regular expressions in $NUTCH_HOME/conf/crawl-urlfilter.txt and $NUTCH_HOME/conf/regex-urlfilter.txt. Example:
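A minimal sketch of such a filter file (the host name is a placeholder, not part of this setup):

```
# skip URLs containing characters that usually mark CGI query fragments
-[?*!@=]
# accept everything under one (hypothetical) intranet host
+^http://intranet.example.com/
# reject everything else
-.
```

Patterns are tested top to bottom, and the first matching pattern decides whether a URL is kept.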







In the above regular expressions, + indicates crawl and – indicates do not crawl the URLs matching that pattern.

  • Step 4.5: Including Plug-ins: Go to $NUTCH_HOME/conf/nutch-site.xml and name all the plug-ins required for your crawl as follows:




<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-smb|urlfilter-regex|parse-(pdf|html)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>



  • Step 4.6: Running the crawl:
    $NUTCH_HOME/bin/nutch crawl urls -dir crawl.sai -depth 10
      -dir dir names the directory to put the crawl in.
      -depth depth indicates the link depth from the root page that should be crawled.

      -delay delay determines the number of seconds between accesses to each host.
      -threads threads determines the number of threads that will fetch in parallel.

A typical output after the Crawl command will be:

  • A new directory named ‘crawl.sai’ in $NUTCH_HOME.

  • The ‘crawl.sai’ directory will contain the crawl data and the search index for the URLs.

  • The index will cover pages up to a link depth of 10 from the seed URLs.

A search index in Nutch is represented in the file system as a directory. However, it is much more than that and is similar in functionality to a database. The Nutch API interacts with this index, making the internal mechanisms transparent to both developers and end-users.

Nutch Crawling internals:

Just in case you would like to do the Nutch crawl using the internals of Nutch instead of using the crawl command, here are the steps:

  1. Inject – The loop in step 2 is bootstrapped by injecting the seed URLs into the CrawlDB.

  2. Loop:

Generate – Generate URLs to fetch from CrawlDB.

Fetch – Fetches the URL content and writes to disk.

Parse – Reads raw fetched content, parses and stores result.

UpdateDB – Update CrawlDB with links from fetched pages.

3. Update and merge segments – update segments with content, scores and links from the CrawlDB.

4. Invert links – convert <URL, outlinks[]> into <URL, inlinks[]> using the segments and update the LinkDB.

5. Indexing – one index for each segment.

6. Deduplication – pages at different URLs with same content removed.
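The steps above can be sketched as a sequence of Nutch commands (directory names follow the crawl.sai example used earlier; exact flags vary between Nutch versions, so treat this as an outline rather than a tested script):

```shell
#!/bin/sh
# bootstrap: inject the seed URLs into the CrawlDB (done once)
bin/nutch inject crawl.sai/crawldb urls

# one generate/fetch/parse/update round; repeat once per level of depth
bin/nutch generate crawl.sai/crawldb crawl.sai/segments
segment=`ls -d crawl.sai/segments/* | tail -1`   # newest segment
bin/nutch fetch $segment
bin/nutch parse $segment
bin/nutch updatedb crawl.sai/crawldb $segment

# after the loop: invert links, build the indexes, remove duplicates
bin/nutch invertlinks crawl.sai/linkdb -dir crawl.sai/segments
bin/nutch index crawl.sai/indexes crawl.sai/crawldb crawl.sai/linkdb crawl.sai/segments/*
bin/nutch dedup crawl.sai/indexes
```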

Details and syntax of these commands, along with the options they take, can be found in the Nutch command-line documentation.

Step 5: Configuring the Nutch Web Application

The search web application is included in your downloaded Nutch archive. It needs to know where to find the indexes.

  • Deploy the Nutch web application as the ROOT context.

  • In the deployment directory, edit the WEB-INF/classes/nutch-site.xml file and point the searcher.dir property at the crawl directory, e.g.:

            <property>
              <name>searcher.dir</name>
              <value>/home/ABC/nutch/crawl.sai</value>
            </property>

  • Restart the application server.

Step 6: Running a Test Search:

Open a browser, go to the Tomcat/GlassFish URL, and you should get a welcome screen where you can type in a keyword and start searching!



Step 7: Maintaining the index

Create a new shell script with Nutch’s internal commands.

  1. Commands that set the environment (e.g., JAVA_HOME).
  2. Commands to do the index updating:
    1. Generate
    2. Fetch
    3. Parse
    4. UpdateDB
    5. Invert links
    6. Index
    7. Deduplication
  3. Commands to optimize the updated index (merge segments).

Note that during the maintenance phase, we “DO NOT” inject urls.

NOTE: Index optimization is necessary to prevent the index from accumulating too many files, which will eventually result in a ‘too many open files’ exception in Lucene.

The exact command syntax can be viewed in the documentation for the latest version of Nutch.

Step 8: Scheduling Index updates.

The shell script created from the above step can be scheduled to be run periodically using a ‘cron’ job.
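For example, a crontab entry (the script path is hypothetical) that rebuilds the index every night at 2 a.m.:

```shell
# min hour day month weekday  command
0 2 * * * /home/ABC/nutch/update-index.sh >> /home/ABC/nutch/logs/update.log 2>&1
```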

Step 9: Improving Nutch with plug-ins

Now that we have Nutch set up, the basic functionality provided by Nutch can be extended by writing plug-ins to perform many specific functions like

  • Custom search – (say, Search by e-mail id)

  • Document level authentication – One idea is to index an “authentication level” meta field for every document; when a user submits a search query, return only the results that user has permission to access. This can be done using a query-filter plugin in Nutch. It is a feature the Google appliance does not provide: there, a user can view a “cached copy” of files he may not have access to.

Sounds interesting? Read more in the Nutch plugin documentation.

Musings of a SpringOne 2009 Attendee – Day 2

Running a day late on my posts. Here’s day two (yesterday).

Grails Quick Start – David Klien

David walked through the creation of a Grails web application to track a JUG’s meeting schedule. I liked his presentation style, or maybe, because the room wasn’t very crowded, things just registered better. I picked up a few tips, such as the BootStrap class. Grails still has a ways to go in Eclipse tooling; it would’ve been nice to be able to do File –> New Project and follow along. Too bad IntelliJ IDEA CE doesn’t support Grails, though there has been plenty of buzz around the latest STS. Downloading that right now. Only 3 more hours for the download to complete!

I think I’m beginning to dig duck typing. All in all, the presentation encouraged me to put my head down and hammer out a sample app to start building some Grails knowledge. More homework! Continue reading

Musings of a SpringOne 2009 Attendee – Day 1

It has finally arrived. SpringOne, which I have been anticipating for over a month, is finally here, and it couldn’t have come soon enough. I need one more blast of warm sunny weather before the hibernation months of winter. My day started at 3:30 am, well actually 4:00 am, as I managed to roll out of bed. But before long I was sitting on the plane to St. Louis, going over the conference schedule and doing my first round of eliminations. This is the easy one: I knock out all the sessions that I have absolutely no interest in attending. The ones that even the temptation of free beer cannot get me to go to. You get the idea. Even with an 8-track schedule there wasn’t a lot I could eliminate. But I had started the process. Once in St. Louis I had a 4-hour stopover. Continue reading

Hands-on OSGi and Modular Web Applications – Part I – Toes First

A Brief Introduction

This is the first in a series of blog posts that will attempt to demystify OSGi and demonstrate how it enables the creation of modular web applications. We will explore various aspects of the technology along with the challenges of using this technology. I encourage you to join in the discussion by posting any comments about your own experiences or challenges you have faced developing OSGi applications. We start with the assumption that we understand what OSGi is and the specific modularity problem it tries to solve. Here are some resources you can visit to read up on this.

  1. – this one talks about the problem space
  2. – this one brings Spring and OSGi together

Turn on the ignition

Let’s get started. This first post will show you how to launch an OSGi framework and how you can interact with it. You will first need a JDK installed; I recommend the Sun JDK. You then need an OSGi implementation. Continue reading

Hidden Dependencies Causing Failures

I saw a really great post on the Freakonomics blog talking about how hidden connections almost sank Chicago. I’ve seen situations like this all too often during my career. At least in the software world, we can try to prevent this by developing modules that are more loosely coupled. Of course, in the real world you are often inheriting years upon years of past decisions made by anonymous developers. In the absence of a thorough end-to-end testing suite, is there anything you can do to prevent this?

Hibernate Criteria trick

So here’s the situation.

Let’s say I have this query here:

SELECT * FROM employees
WHERE employee_id NOT IN ( 1234 , 3456 , 5678 );

How do we do that with the Hibernate Criteria object and a Restriction? You would think that the Restrictions API would have a “not in” method, since it does have a not-equals method (ne), but alas, there is nothing…

Well, here’s the solution:

//Create the criteria
Criteria crit = factory.getCurrentSession().createCriteria(Employee.class);
//Add the restriction, where idList is the list of employee ids to exclude;
//Restrictions has no "not in", so combine Restrictions.not() with Restrictions.in()
crit.add(Restrictions.not(Restrictions.in("employeeId", idList)));
//Get the results
List employees = crit.list();

There you go!  Now you know this neat little trick and you can use it in your own app… Be forewarned though, it can be slow…

Windows 7 Review

Recently I installed Windows 7 RC on the Dell Latitude D830 I use at work and I have been slightly impressed.

One of the most useful features so far has been the ability to right-click on a disc image and select “Burn Image”; I no longer need third-party burning software.

A big problem for a Windows tech is that there is no “Tool Kit” available like there was for XP and Vista. Telnet seems to have disappeared, which makes life difficult. Another missing tool I have come to depend on is the one that let me create a “snap-on” MMC to save multiple RDC sessions in a user-friendly split window. Saved passwords and a tree-like list that makes toggling from server to server a breeze are a must for the professional Windows admin.

The overall speed and response times have been remarkable. The longest wait is experienced when resuming from a hibernated state.

Program compatibility has caused me no headaches. I have the latest versions of OpenOffice, Safari, and all of the usual garb like Adobe Reader and Flash installed, and everything works as well as expected.

It seems whenever Micro$oft creates a new OS, the big problem is usually printer drivers. Most of the networks I work with use some form of printer that is no longer supported and is not going to work with Windows 7 in any way, shape, or form.

Indexing is probably my second favorite feature. The “Search for Programs or Files” box found in the Start menu can find all similar references to what I am typing faster than I can type (not a big surprise).

Overall I am pleased with the post Vista changes and will most likely install the full version when I have the opportunity.

One last note: Windows 7 and Safari are making it impossible to insert this beautiful Windows 7 logo, so you might have to Google for it.

Reactions to “My nine biggest professional blunders”

In Confessions of an IT pro, Becky Roberts talks about her nine biggest professional blunders. It really brought back some old memories. Maybe I’ll write a later post discussing some of my less-than-finest moments. A few minor points in the article really struck me, and I wanted to point them out:

Mistake #1 – Okay, you’ve royally screwed up. Do you try to get out of it? Or do you admit your mistake, try to fix it (if you can), and deal with the consequences? I’d also like to give bonus points to the boss in this case for handling the news so graciously.

Mistake #2 – I assume this story is a bit dated since the program was in BASIC, but I am always amazed when there are no test systems. It happened back then, and it still happens today. I’ll be the first to admit that I make mistakes. Lots of them. It was also refreshing to hear that she struggled with how hard she should dig in her heels and fight when she found herself in a bad situation. Do you tell the boss it’s a bad idea and do it anyway when he tells you to? Do you refuse? I’m not sure I have a good answer for that one.

Mistake #6 – I’ll admit that I’m bad at this one.  I have the utmost respect for people who are good at this.  If anyone has any pointers on how they developed this habit, I’m all ears.

Mistake #7 – I think this situation is all-too-common, especially in organizations that are matrix-managed and have individuals split across multiple projects. Her last sentence struck me, though. I’ve never had it work for me. I don’t know if that is a product of the particular environment I was in at the time, or if I wasn’t effective in presenting my case.

Not a life-changing post, but a refreshing reminder of the bumps and bruises everyone seems to experience along the way.

Image Processing Using ImageMagick and JMagick

Introduction to ImageMagick

ImageMagick® is a software suite to create, edit, and compose bitmap images. It can read, convert and write images in a variety of formats (over 100) including DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, Postscript, SVG, and TIFF. Use ImageMagick to translate, flip, mirror, rotate, scale, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.

The functionality of ImageMagick is typically utilized from the command line. In this blog I am focusing on how to use ImageMagick from Java. There are two options:

1. JMagick, a Java interface to ImageMagick, which I am going to show in this blog.

2. Calling ImageMagick directly from the command line using Runtime.getRuntime().exec(command).
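As a sketch of the second option, the equivalent PDF-to-TIFF conversion from the shell (file names are placeholders):

```shell
# rasterize the PDF at 300 DPI and write a Group4-compressed multi-page TIFF
convert -density 300 input.pdf -compress Group4 output.tif
```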


JMagick is an open source Java interface to ImageMagick, implemented using the Java Native Interface (JNI) to call into the ImageMagick API.

JMagick does not attempt to make the ImageMagick API object-oriented. It is merely a thin interface layer into the ImageMagick API.

Image Conversion using Jmagick

This function shows how to convert a file from one format to another; mainly I am focusing on PDF-to-TIFF conversion, producing either one multi-page TIFF or one single-page TIFF per page. The function can also be extended to accept a compression format such as GROUP4, FAX or JPEG.

public void convert(File inputFile, File outputDirectory, ImageType outputType, boolean multiple) {
    if (inputFile != null && inputFile.exists() && ImageUtil.isValidMime(inputFile)) {
        try {
            ImageInfo info = getImageInfo(inputFile);
            // base name without the extension
            String fileName = inputFile.getName().split("\\.")[0];
            if (multiple) {
                // write each page of the source as its own single-page file
                MagickImage image = new MagickImage(info);
                MagickImage[] imArray = image.breakFrames();
                for (int i = 0; i < imArray.length; i++) {
                    // outputType.getExtension() is assumed to return e.g. "tif"
                    File file = new File(outputDirectory.getAbsolutePath(),
                            fileName + "_" + i + "." + outputType.getExtension());
                    imArray[i].setFileName(file.getAbsolutePath());
                    imArray[i].writeImage(info);
                }
            } else {
                // write all pages into one multi-page file
                File file = new File(outputDirectory.getAbsolutePath(),
                        fileName + "." + outputType.getExtension());
                MagickImage image = new MagickImage(info);
                image.setFileName(file.getAbsolutePath());
                image.writeImage(info);
            }
        } catch (Exception e) {
            throw new RuntimeException("Image conversion failed: " + inputFile, e);
        }
    }
}

private ImageInfo getImageInfo(File inputName) throws MagickException {
    // density (DPI) controls the quality of PDF rasterization
    String density = this.getProperties().getProperty(IMAGEMAGIC_DENSITY, "300");
    ImageInfo info = new ImageInfo(inputName.getAbsolutePath());
    info.setDensity(density);
    return info;
}