What is the architecture of Apache Nutch?
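Apache Nutch is an open-source, distributed web crawler and search system that grew out of the Apache Lucene project. In the 1.x line, indexing and searching are delegated to a Lucene-based backend such as Apache Solr.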
It is designed to index large and growing amounts of unstructured data (blogs, news, ebooks, social media, etc.). With Nutch, large numbers of documents can be indexed in parallel, and new or changed documents can be added without re-indexing the entire collection.
At the time of this writing, the latest version of Apache Nutch is 1.10.
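The idea of indexing without re-indexing everything can be illustrated with a small sketch. This is not Nutch's own code: the `IncrementalIndex` class, its `update` method, and the example URLs are hypothetical names chosen for illustration, and the approach shown (comparing a content digest against the one stored at the previous run) is just one common way to detect unchanged documents.

```python
import hashlib


class IncrementalIndex:
    """Toy in-memory index that (re)indexes a page only when its content changes."""

    def __init__(self):
        self.signatures = {}  # url -> content digest recorded at last indexing
        self.documents = {}   # url -> indexed text

    def update(self, pages):
        """Index new or changed pages and return the URLs that were (re)indexed."""
        reindexed = []
        for url, text in pages.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.signatures.get(url) == digest:
                continue  # content unchanged since the last run, so skip it
            self.signatures[url] = digest
            self.documents[url] = text
            reindexed.append(url)
        return reindexed


index = IncrementalIndex()
first = index.update({
    "http://a.example/": "hello",
    "http://b.example/": "world",
})
second = index.update({
    "http://a.example/": "hello",      # unchanged
    "http://b.example/": "world!",     # changed
    "http://c.example/": "new page",   # new
})
print(first)
print(second)
```

On the second run only the changed page and the new page are processed; the unchanged page is skipped, which is the saving the paragraph above describes.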
This article is a collection of architectural diagrams of Nutch 1. The overall architecture of Nutch is shown in the diagram below. Nutch is a distributed system. Each of its components is divided into modules that work in concert with each other.
The components are as follows:
Nutch: The core engine that gathers and searches for information. It includes the indexing module and the full-text search module.
Solr: A distributed search engine built on top of Lucene. Solr provides the functionality that Nutch uses to index and search through the web.
Commons-email: A library for composing and sending email from Java.
Commons-lang: A collection of helper utilities for the core java.lang API, such as string manipulation and object helpers.
Commons-io: A collection of I/O utilities for reading and writing files and streams.
Commons-math: A collection of mathematics and statistics libraries.
Commons-collections: Additional collection types and utilities that extend the Java Collections Framework.
Commons-lang3: The 3.x generation of Commons-lang, packaged under org.apache.commons.lang3 so that it can coexist with the older 2.x library.
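To make the division of labour between Nutch and Solr more concrete, here is a minimal sketch of how a client might build a search request against a Solr index that Nutch has populated. The host and port, the core name `nutch`, and the field names `title`, `url` and `content` are assumptions for illustration and may differ in a real installation; the helper `build_search_url` is a hypothetical name. The sketch only constructs the URL for Solr's `/select` handler and does not contact a server.

```python
from urllib.parse import urlencode

# Assumptions (not from the article): Solr listens on localhost:8983,
# the core is named "nutch", and documents have "title", "url" and "content" fields.
SOLR_BASE = "http://localhost:8983/solr/nutch"


def build_search_url(terms, rows=10, base=SOLR_BASE):
    """Return a Solr /select URL that searches the page content for `terms`."""
    params = {
        "q": f"content:({terms})",  # full-text query against the page body
        "fl": "title,url",          # fields to return for each hit
        "rows": rows,               # maximum number of hits to return
        "wt": "json",               # ask Solr for a JSON response
    }
    return f"{base}/select?{urlencode(params)}"


url = build_search_url("apache nutch", rows=5)
print(url)
```

Fetching the resulting URL with any HTTP client would return the matching documents, assuming a Solr instance with that core is actually running.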