Web crawlers
Two major Java-based projects offer a web crawler implementation: Nutch and Heritrix. Nutch started as an Apache Lucene subproject and is now a top-level Apache project. Heritrix is the Internet Archive’s open source web crawler, used to archive large portions of the Web.
A related open source library comes from the Apache Tika project. Tika is not a crawler itself; it is a toolkit for detecting document types and extracting metadata and structured text content from various document formats using existing parser libraries.
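To make the idea of content detection concrete, here is a minimal sketch of how a crawler might guess a document's type from its first few bytes before choosing a parser. It uses only the Java standard library and is not Tika's API or implementation; the class name TypeSniffer, the method detect, and the small set of signatures it checks are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Tika: guesses a media type from leading bytes.
public class TypeSniffer {

    // Returns a best-guess media type, or application/octet-stream if unknown.
    public static String detect(byte[] head) {
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, "GIF8".getBytes(StandardCharsets.US_ASCII))) {
            return "image/gif";
        }
        // Fall back to a textual check for HTML, ignoring leading whitespace and case.
        String text = new String(head, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (text.startsWith("<!doctype html") || text.startsWith("<html")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    // True if data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

A production toolkit checks far more signatures and also weighs file names and declared content types; the sketch only shows the basic step of mapping fetched bytes to a type so that the right parser can be selected.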
