Abstract

StormCrawler is a collection of resources for building low-latency, customisable and scalable web crawlers on Apache Storm.

Proposal

The aim of StormCrawler is to help build web crawlers that are:

  • scalable
  • resilient
  • low latency
  • easy to extend
  • polite yet efficient
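
To make "polite yet efficient" concrete, here is a minimal sketch of one common politeness strategy: spacing out requests to the same host while letting other hosts proceed without delay. It is an illustration only and is not taken from the StormCrawler code base; the class name PolitenessSketch, the method reserveSlot and the one-second delay are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of per-host politeness; not StormCrawler's implementation.
class PolitenessSketch {

    // Minimum gap between the start of two fetches on the same host (assumed value).
    private final long minDelayMs;

    // Earliest time at which each host may be contacted again.
    private final Map<String, Long> nextAllowedByHost = new HashMap<>();

    PolitenessSketch(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    // Returns the earliest time (in ms) at which a fetch to this host may start,
    // and reserves that slot so the next request to the same host is pushed back.
    long reserveSlot(String host, long nowMs) {
        long start = Math.max(nowMs, nextAllowedByHost.getOrDefault(host, 0L));
        nextAllowedByHost.put(host, start + minDelayMs);
        return start;
    }

    public static void main(String[] args) {
        PolitenessSketch politeness = new PolitenessSketch(1000L);
        // Two requests to the same host are spaced out; a different host is not delayed.
        System.out.println(politeness.reserveSlot("a.example", 0L));
        System.out.println(politeness.reserveSlot("a.example", 10L));
        System.out.println(politeness.reserveSlot("b.example", 10L));
    }
}
```

Keeping the delay per host rather than global is what lets a crawler stay polite towards each site while still using its full capacity across many sites.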

StormCrawler achieves this in part by building on Apache Storm. To use an analogy, Apache Storm is to StormCrawler what Apache Hadoop is to Apache Nutch.

StormCrawler is mature and is used by many organisations worldwide.

Background

StormCrawler was created by DigitalPebble Ltd in early 2013 and open sourced on GitHub under the Apache License v2. It is mature software, with 26 releases over its 10+ years of existence. It is used by many organisations around the world, sometimes operating on a scale of billions of documents.

StormCrawler allows the creation of scalable, distributed, resilient and customisable web crawlers. As defined by Wikipedia:

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

StormCrawler was designed to address some of the limitations of existing open source web crawlers, such as Apache Nutch. Instead of being batch-driven and running on Apache Hadoop, it leverages Apache Storm and processes URLs continuously as streams. This allows it to cater for low-latency scenarios and guarantees a better use of the hardware and infrastructure.
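
The difference between batch-driven and continuous processing can be illustrated with a toy example. The sketch below is single-threaded, works on an in-memory link graph and uses no Apache Storm or StormCrawler APIs; the class name StreamingCrawlSketch, the method crawl and the example.com URLs are all hypothetical. It only shows the core idea: each URL is handled as soon as it is available, and newly discovered links join the queue immediately instead of waiting for the next batch cycle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration of stream-style URL processing; not StormCrawler code.
class StreamingCrawlSketch {

    // Processes URLs one at a time. Outlinks are queued the moment they are
    // discovered, rather than being collected for a later batch run.
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> processed = new ArrayList<>();

        frontier.add(seed);
        seen.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            processed.add(url); // stands in for fetching, parsing and indexing
            for (String outlink : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(outlink)) { // add() returns false for known URLs
                    frontier.add(outlink);
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // Tiny in-memory "web": each URL maps to the links found on that page.
        Map<String, List<String>> graph = Map.of(
                "https://example.com/", List.of("https://example.com/a", "https://example.com/b"),
                "https://example.com/a", List.of("https://example.com/b", "https://example.com/c"));
        System.out.println(crawl("https://example.com/", graph));
    }
}
```

In a real deployment the queue, the fetching and the parsing would be distributed across many workers; the sketch leaves all of that out on purpose.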

Moreover, StormCrawler is based on a modern stack with Apache Maven dependencies and archetypes, meaning that users can build and deploy a scalable web crawler with a minimal number of files. This also makes it easier to separate custom resources written by users from the common core ones. Finally, StormCrawler has been designed with extensibility and modularity in mind. It comes with a set of plugins for external resources, allowing users to choose the storage backend that best suits their needs or infrastructure.

Rationale

StormCrawler is mature (26 releases to date). We believe it should be adopted by the ASF, due to its strong dependency on several Apache projects, notably Apache Storm, and the fact that most of its core contributors (see below) are already members or committers at the ASF. The project has followed the Apache principles of openness and meritocracy since its very beginning and would be a good cultural fit.

Having StormCrawler at the ASF would benefit the whole ecosystem of projects it leverages, such as Apache Storm, Apache Tika, Apache Solr and their various dependencies.

Current Status

Meritocracy:

StormCrawler has operated as a meritocracy from its very beginning. We have strived to build an environment where users’ contributions are valued and people are promoted to committership based on a consensus among the core contributors. We believe that our way of operating is fully aligned with the ASF principle of meritocracy.

Community:

StormCrawler has a steady community of users and contributors. Over its 10+ years of existence, more than 40 people have contributed to its code. As a project, we have tried to reach as many people as possible over the years via different channels: mailing lists, blogs, talks at conferences, workshops and, more recently, a Discord channel. We expect the community to grow if StormCrawler enters Incubation at the ASF.

Core Developers:

StormCrawler was created by Julien Nioche, an emeritus member of the ASF and former PMC chair of Apache Nutch. Julien was also a committer on Apache Tika and Apache Gora. He has recently joined Apache Storm.

Sebastian Nagel is a long-time contributor to the project and would be on the initial committers list. Sebastian is the current PMC Chair for Apache Nutch.

Richard Zowalla is another long-term committer on StormCrawler. Richard is a member of the ASF, a committer/PMC member on various Apache projects, such as Apache TomEE and Apache OpenNLP, and PMC Chair for Apache Storm.

Tim Allison has not yet contributed to StormCrawler but is involved in related projects at the ASF (Apache Lucene, Apache Nutch, Apache OpenNLP, Apache Tika, Apache POI, Apache PDFBox) and has expressed an interest in being part of the initial committers list. Tim is a member of the ASF.

Finally, the last member of the initial committers list is Michael Dinzinger. Michael has not yet been involved directly in a project at the ASF but has contributed to StormCrawler. He is part of a major research project (https://openwebsearch.eu/) that uses StormCrawler on a large scale to produce open data.

Alignment:

We believe that the ASF is the right home for StormCrawler. As a project, it has been operating under the same core values as the ones held at the Foundation (meritocracy, openness, community first). The close dependency on other Apache projects forms a healthy ecosystem, which would be strengthened by its adoption in the Incubator.

Known Risks

Project Name

The name StormCrawler reflects the technical heritage of the project as well as its function. The name is now well known as one of the leading open-source resources for web crawling and, as a result, changing it could be damaging. We will approach the Apache Storm community to check that no one objects to the name of our project.

Orphaned Products

Web crawling is a relatively niche activity with a quiet community, as seen with other established projects such as Apache Nutch. However, StormCrawler is used in production by several organisations and is the backbone of an ambitious European project (OpenWebSearch.eu), so the need for it will not disappear. It might even increase with the recent activities and hype around AI/LLMs worldwide. We expect that its incubation at the ASF will increase its appeal to potential users, and the initial committers will endeavour to foster a healthy and growing community.

Inexperience with Open Source

The initial committer team has extensive experience with open source, particularly within the ASF.

Length of Incubation

Given the maturity of StormCrawler and the experience of its committers, we expect to graduate to a top-level project (TLP) relatively quickly (6 to 12 months).

Homogenous Developers

All the initial committers are male and white. Three out of five live in Germany, one in the UK and one in the US. None are employed by the same organisation. Two work in academia.

We will try to increase the racial, gender, geographical and professional diversity of the people involved in the project, following the Apache Way.

Reliance on Salaried Developers

The creator of StormCrawler, Julien Nioche, runs a company providing consultancy, training and support for StormCrawler. Given his deep attachment to the project, Julien would remain committed to it, regardless of his professional activities. None of the remaining committers are paid to work on StormCrawler.

Relationships with Other Apache Products

StormCrawler is related to several Apache products:

  • Apache Storm is the platform on which StormCrawler is based. Both Richard and Julien are also Apache Storm PMC members. Richard is the current PMC chair.
  • Apache Hadoop: StormCrawler has a WARC module which extends the Apache Hadoop resources in Apache Storm.
  • Apache Solr: a specific module in StormCrawler allows crawled documents to be indexed into Apache Solr, and also uses it to store metrics and the crawl frontier.
  • Apache Tika is used in StormCrawler for parsing non-HTML documents and for identifying MIME types.
  • The Apache HttpClient library is used for one of the protocol implementations.

Many more Apache projects are used through transitive dependencies.

An Excessive Fascination with the Apache Brand

Although incubation at the ASF would help increase the appeal of StormCrawler and make it even more sustainable in the long run, StormCrawler has already been a successful project outside of the ASF for more than a decade. Moreover, most of the initial committers are already heavily involved in various projects at the ASF and would not necessarily gain much in terms of profile from this incubation.

Documentation

The documentation is mostly found on the wiki on GitHub. There is also a website at https://stormcrawler.net/ which is generated from https://github.com/DigitalPebble/storm-crawler/tree/gh-pages.

Initial Source

StormCrawler was created by DigitalPebble Ltd in early 2013 and open sourced on GitHub under the Apache License v2.

Source and Intellectual Property Submission Plan

External Dependencies

StormCrawler is a large project with multiple modules and, as a result, has around 500 dependencies (direct or transitive). The full list is available at https://github.com/DigitalPebble/storm-crawler/blob/master/THIRD-PARTY.txt

Most dependencies are licensed under the Apache License v2. There are possible concerns around the Elasticsearch dependencies (Elastic License 2.0, which is not an open source licence). We will address these in one of three ways: by downgrading Elasticsearch to version 7.10.2, which is under the Apache License v2; by removing the Elastic module altogether; or by making the dependency optional and instructing users on how to activate it themselves, so that the distributions do not include this licence, which is not compliant with the Apache License v2.

See https://www.apache.org/legal/resolved.html for a list of acceptable licences for dependencies.


Cryptography:

Bouncy Castle (https://www.bouncycastle.org/) is a transitive dependency of the project. It is inherited from org.apache.tika:tika-parser-crypto-module.

Required Resources

Mailing lists

Subversion Directory

N/A

Git Repositories


We are planning to build the https://stormcrawler.apache.org website with Jekyll, possibly based on https://github.com/apache/apache-website-template

Issue Tracking

The community would like to continue using GitHub Issues.

Other Resources

The community has already chosen GitHub Actions as its continuous integration tool.

Initial Committers

Sponsors

Champion:

PJ Fanning [fanningpj]

Nominated Mentors:

  • PJ Fanning [fanningpj]
  • Dave Fisher [wave]
  • Lewis John McGibbney [lewismc]
  • Ayush Saxena [ayushsaxena]

We would welcome additional mentors.

Sponsoring Entity:

The Incubator
