February 6, 2010 / Merrin

Web crawlers

There are two major Java-based projects that offer a web crawler implementation: Nutch and Heritrix. Nutch is an Apache Lucene subproject. Heritrix is the Internet Archive’s open source web crawler, used to archive large portions of the Web.
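To make concrete what a crawler does at its core, here is a minimal sketch of the basic loop: take a URL from a frontier queue, extract its links, and queue the ones not yet seen. It is not how Nutch or Heritrix are implemented. The class `CrawlSketch`, its methods, and the in-memory map standing in for the Web are all hypothetical, and the regex link extraction is a simplification (a real crawler would fetch over HTTP, honor robots.txt, and use a proper HTML parser).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {
    // Naive matcher for double-quoted href attributes (illustration only).
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Returns every href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl over an in-memory "web" (URL -> HTML).
    // Stops when the frontier is empty or maxPages URLs have been visited.
    static Set<String> crawl(String seed, Map<String, String> web, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already processed
            }
            String html = web.get(url);
            if (html == null) {
                continue; // no such page; it still counts as visited
            }
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical three-page site used only for this demonstration.
        Map<String, String> web = Map.of(
            "http://a.example/", "<a href=\"http://a.example/b\">b</a> <a href=\"http://a.example/c\">c</a>",
            "http://a.example/b", "<a href=\"http://a.example/\">home</a>",
            "http://a.example/c", "no links here");
        System.out.println(crawl("http://a.example/", web, 10));
    }
}
```

The visited set is what keeps the crawl from looping when pages link back to each other, and the `maxPages` cap is a stand-in for the scope limits that production crawlers need.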

Another open source library comes from the Apache Tika project. Tika is not a crawler itself but a toolkit for detecting document types and extracting metadata and structured text content from a variety of document formats, using existing parser libraries.
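As a rough illustration of what "detecting" a document type can mean, the sketch below guesses a media type from the first few bytes of a file (its "magic bytes"). This is only the general idea, not Tika's API or its actual detection logic. The class `DetectSketch` and its methods are hypothetical, and only three well-known signatures are checked.

```java
import java.nio.charset.StandardCharsets;

public class DetectSketch {
    // True if data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a media type from leading bytes; falls back to a generic type.
    static String detect(byte[] data) {
        if (startsWith(data, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(data, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

In a crawling pipeline, a detection step like this decides which parser should handle each fetched document before any text or metadata is extracted.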
