Nutch
From Wikipedia, the free encyclopedia
Developed by | Apache Software Foundation |
---|---|
Latest release | 1.0.0 / 2009-03-23 |
Written in | Java |
Operating system | Cross-platform |
Type | Search Engine |
License | Apache License 2.0 |
Website | http://lucene.apache.org/nutch/ |
Nutch is an open source web search engine project that uses Lucene Java for its search and indexing components.
Features
Nutch is written entirely in the Java programming language, but its data is stored in language-independent formats.
Nutch has a highly modular architecture that lets developers create plugins for media-type parsing, data retrieval, querying and clustering.
The fetcher ("robot" or "web crawler") was written from scratch specifically for this project.
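As a rough illustration of the plugin idea described in this section, the sketch below shows a registry that dispatches content to a parser chosen by media type. It is not Nutch's actual API (Nutch itself is written in Java); the names `ParserRegistry`, `register`, `parse` and `parse_plain_text` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict

# Hypothetical plugin signature: raw bytes in, extracted text out.
Parser = Callable[[bytes], str]


class ParserRegistry:
    """Maps a media type (e.g. "text/plain") to a parser plugin."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Parser] = {}

    def register(self, media_type: str, parser: Parser) -> None:
        # Media types are case-insensitive, so normalise the key.
        self._parsers[media_type.lower()] = parser

    def parse(self, media_type: str, content: bytes) -> str:
        parser = self._parsers.get(media_type.lower())
        if parser is None:
            raise LookupError(f"no parser plugin for {media_type!r}")
        return parser(content)


def parse_plain_text(content: bytes) -> str:
    """A trivial plugin: decode the bytes as UTF-8 text."""
    return content.decode("utf-8", errors="replace")


# Illustrative setup: one plugin registered for one media type.
registry = ParserRegistry()
registry.register("text/plain", parse_plain_text)
```

Under this design, supporting a new media type means registering another function; the code that calls `registry.parse` does not change.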
History
Nutch originated with Doug Cutting (creator of both Lucene and Hadoop) and Mike Cafarella.
In June 2003, a 100-million-page demonstration system was run successfully. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, Hadoop.
In June 2005, Nutch graduated from the Apache Incubator and became a subproject of Lucene.
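To give a sense of the MapReduce model mentioned above, here is a minimal single-process sketch that builds an inverted index (term to list of URLs) from a handful of pages. It is an assumption-laden toy, not the Nutch or Hadoop implementation: the function names `map_page`, `reduce_term` and `build_index` are hypothetical, and a real system would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_page(url: str, text: str) -> Iterable[Tuple[str, str]]:
    """Map step: emit one (term, url) pair per distinct term on a page."""
    for term in set(text.lower().split()):
        yield term, url


def reduce_term(term: str, urls: Iterable[str]) -> Tuple[str, List[str]]:
    """Reduce step: merge all URLs seen for a term into a sorted list."""
    return term, sorted(set(urls))


def build_index(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Run map, group by key, then reduce, all in a single process."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for url, text in pages.items():
        for term, page_url in map_page(url, text):
            grouped[term].append(page_url)
    return dict(reduce_term(term, urls) for term, urls in grouped.items())
```

Because each page is mapped independently and each term is reduced independently, both steps can in principle be spread across a cluster, which is the property that suits crawl and index workloads.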
Scalability
IBM Research studied the performance[1] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project[2]. It found that a scale-out system such as Nutch/Lucene could achieve, on a cluster of blades, a level of performance that was not achievable on any scale-up computer such as the Power5.
Related projects
- Hadoop - Java framework that supports distributed applications running on large clusters
- nutchWAX - Uses Nutch to search a web archive
- Sixearch - An unstructured peer network application that gives users a complementary way to share their own document collections actively and collaboratively
Search engines built with Nutch
References
- [1] Scalability of the Nutch search engine
- [2] Base Operating System Provisioning and Bringup for a Commercial Supercomputer
External links
- Official website
- Official wiki
- Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2
- An article about Nutch (2003) - Search Engine Watch
- Another article about Nutch (2003) - Tech News World
- Official page of the Hadoop project