Nutch

Lucene Nutch
Developed by: Apache Software Foundation
Latest release: 1.0.0 (2009-03-23)
Written in: Java
Operating system: Cross-platform
Type: Search engine
License: Apache License 2.0
Website: http://lucene.apache.org/nutch/

Nutch is an effort to build an open source web search engine. It uses Lucene Java for its search and index component.

Features

Nutch is written entirely in the Java programming language, but it stores its data in language-independent formats.

Nutch has a highly modular architecture that lets developers create plugins for media-type parsing, data retrieval, querying, and clustering.
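
As a rough illustration of what a plugin extension point for media-type parsing can look like, the following Java sketch keeps a registry of parsers keyed by media type. Every name in it (ParserRegistrySketch, ContentParser, register, parse) is hypothetical and chosen for this example; none of them is taken from Nutch's actual plugin API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a plugin registry; these names are NOT Nutch's real API.
public class ParserRegistrySketch {

    /** Extension point: one implementation per media type. */
    public interface ContentParser {
        String extractText(String rawContent);
    }

    private final Map<String, ContentParser> parsersByMediaType = new HashMap<>();

    /** Registers a plugin for a media type such as "text/html". */
    public void register(String mediaType, ContentParser parser) {
        parsersByMediaType.put(mediaType, parser);
    }

    /** Runs the plugin registered for the media type, or returns empty if there is none. */
    public Optional<String> parse(String mediaType, String rawContent) {
        ContentParser parser = parsersByMediaType.get(mediaType);
        if (parser == null) {
            return Optional.empty();
        }
        return Optional.of(parser.extractText(rawContent));
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        // Two toy plugins: plain text is trimmed, HTML has its tags stripped naively.
        registry.register("text/plain", raw -> raw.trim());
        registry.register("text/html", raw -> raw.replaceAll("<[^>]*>", "").trim());

        System.out.println(registry.parse("text/html", "<p>Hello, Nutch</p>"));
        System.out.println(registry.parse("application/pdf", "%PDF-1.4"));
    }
}
```

The point of the design is that the core only depends on the small interface, so support for a new media type can be added by registering another implementation without changing the core.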

The fetcher ("robot" or "web crawler") was written from scratch specifically for this project.

History

Nutch originated with Doug Cutting (creator of both Lucene and Hadoop) and Mike Cafarella.

In June 2003, the project successfully demonstrated a system indexing 100 million pages. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities have since been spun out into their own subproject, called Hadoop.
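
To give a feel for the MapReduce pattern mentioned above, here is a minimal single-process Java sketch that builds a tiny inverted index from pages. It is an illustration only: the class and method names (InvertedIndexSketch, map, reduce) are assumptions made for this example, it runs on one machine, and it does not use Nutch or Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Single-process illustration of the MapReduce pattern; not Nutch or Hadoop code.
public class InvertedIndexSketch {

    /** Map phase: emit one (term, pageId) pair for each word on a page. */
    public static List<String[]> map(String pageId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[] {term, pageId});
            }
        }
        return pairs;
    }

    /** Shuffle and reduce phases: group pairs by term, then collapse each group into a sorted set of page ids. */
    public static Map<String, TreeSet<String>> reduce(List<String[]> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        // In a real cluster each page could be mapped on a different machine.
        pairs.addAll(map("page1", "open source search"));
        pairs.addAll(map("page2", "search engine"));
        System.out.println(reduce(pairs));
    }
}
```

Because each call to map depends only on its own page, and each term's group can be reduced independently, both phases can be spread across many machines, which is what makes the pattern a fit for crawl and index workloads.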

In June 2005, Nutch graduated from the Apache Incubator and became a subproject of Lucene.

Scalability

IBM Research studied the performance[1] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project[2]. It found that a scale-out system such as Nutch/Lucene could achieve a level of performance on a cluster of blades that was not achievable on any scale-up computer such as the Power5.

Related projects

  • Hadoop - Java framework that supports distributed applications running on large clusters
  • nutchWAX - Uses Nutch to search a web archive
  • Sixearch - An unstructured peer network application that gives users a complementary way to actively and collaboratively share their own document collections

References

  1. Scalability of the Nutch search engine
  2. Base Operating System Provisioning and Bringup for a Commercial Supercomputer
