Applications and organizations using Hadoop include (alphabetically):
A9.com - Amazon
We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
We process millions of sessions daily for analytics, using both the Java and streaming APIs.
Our clusters vary from 1 to 100 nodes.
Able Grape - Vertical search engine for trustworthy wine information
We have one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node)
Hadoop and Nutch used to analyze and index textual information
backdocsearch.com - search engine for chiropractic information, local chiropractors, products and schools
Cascading - Cascading is a dataset processing API and MapReduce "planner" for Hadoop. It includes a Groovy language scripting interface for rapid assembly of complex Hadoop job workflows.
-
Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
-
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
We currently have around a hundred machines: low-end commodity boxes with about 1.5TB of storage each. Our data sets are currently on the order of 10s of TB, and we routinely process multiple TBs of data every day.
In the process of adding a 320 machine cluster with 2,560 cores and about 1.3 PB raw storage. Each (commodity) node will have 8 cores and 4 TB of storage.
We are heavy users of both the streaming and the Java APIs. We have built a higher-level data warehousing framework using these features (that we will open source at some point). We have also written a read-only FUSE implementation over HDFS.
Hadoop Korean User Group, a Korean Local Community Team Page.
50 node cluster in the Korea university network environment.
Pentium 4 PC, HDFS 4TB Storage
Used for development projects
Retrieving and Analyzing Biomedical Knowledge
Latent Semantic Analysis, Collaborative Filtering
-
12 node cluster (8 cores/node, 1TB/node storage)
10 node cluster (8 cores/node, 1TB/node storage)
In process of creating a third 50-node cluster with more storage/node
Use for log analysis, data mining and machine learning
-
We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.
Information Sciences Institute (ISI)
Used Hadoop and 18 nodes/52 cores to plot the entire internet.
-
Session analysis and report generation
Koubei.com - Large local community and local search in China.
Using Hadoop to process Apache logs, analyzing users' actions and click flow, the links clicked on any specified page of the site, and more.
Using Hadoop with map/reduce to process all of the price data that users input.
-
Source code search engine uses Hadoop and Nutch.
-
25 node cluster (dual xeon LV 2GHz, 4GB RAM, 1TB/node storage)
10 node cluster (dual xeon L5320 1.86GHz, 8GB RAM, 3TB/node storage)
Used for charts calculation and web log analysis
-
We use Hadoop to process clickstream and demographic data in order to create web analytic reports.
Our cluster runs on Amazon's EC2 web service and makes use of the streaming module to use Python for most operations.
-
12 node cluster (Dual-Core AMD Opteron 1212, 4-8GB RAM, 1.5TB/node storage)
Parses and indexes mail logs for search
-
Another Apache project using Hadoop to build scalable machine learning algorithms like canopy clustering, k-means and many more to come (naive bayes classifiers, others)
NetSeer -
Up to 1000 instances on Amazon EC2
Data storage in Amazon S3
50 node cluster in Coloc
Used for crawling, processing, serving and log analysis
-
Used EC2 to run Hadoop on a large virtual cluster
Nutch - flexible web search engine software
Powerset - Natural Language Search
Up to 400 instances on Amazon EC2
Data storage in Amazon S3
-
A project to help develop open source social search tools. We run a 125 node Hadoop cluster.
SEDNS - Security Enhanced DNS Group
We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing Hadoop DFS and MapRed.
-
We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
-
We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.
Visible Measures Corporation uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
-
>10000 computer nodes running Hadoop, each with many CPUs
Our biggest cluster: 2000 nodes (2*4cpu boxes with 3TB disk each)
Used to support research for Ad Systems and Web Search
Also used to do scaling tests to support development of Hadoop on larger clusters
Our Blog - Learn more about how we use Hadoop.
-
10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
Run Naive Bayes classifiers in parallel over crawl data to discover event information
When applicable, please include details about your cluster hardware and size.
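
Several entries above mention using the streaming API with Python, for example to process sessions or clickstream logs. As a rough illustration only (not any listed organization's actual code), a streaming-style job that counts log lines per session might look like the sketch below. The input format (a tab-separated line whose first field is a session id) and the names `mapper`, `reducer`, and `run` are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'session_id<TAB>1' for every non-empty log line.

    Assumption: the first tab-separated field of each line is a session id.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        session_id = line.split("\t", 1)[0]
        yield f"{session_id}\t1"


def reducer(lines):
    """Sum the counts for each key.

    Relies on the input being sorted by key, which is how Hadoop streaming
    delivers mapper output to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: run one stage over stdin, writing to stdout."""
    step = mapper if stage == "map" else reducer
    for record in step(stdin):
        stdout.write(record + "\n")
```

A small wrapper script would call `run("map")` or `run("reduce")`, and the two stages would then be passed to the Hadoop streaming jar as the mapper and reducer commands. The same logic can be tried locally by sorting the mapper output before feeding it to the reducer.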