Apache Nutch is a highly extensible and scalable open source web crawler that grew out of Apache Lucene and builds on Apache Hadoop.
Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.
Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.