Applications and organizations using Hadoop include (alphabetically):
A9.com - Amazon
We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
We process millions of sessions daily for analytics, using both the Java and streaming APIs.
Our clusters vary from 1 to 100 nodes.
Able Grape - Vertical search engine for trustworthy wine information
We have one of the world's smaller Hadoop clusters (2 nodes @ 8 CPUs/node)
Hadoop and Nutch used to analyze and index textual information
Adknowledge - Ad network
Hadoop used to build the recommender system for behavioral targeting, plus other clickstream analytics
We handle 500 million clickstream events per day
Our clusters vary from 50 to 200 nodes, mostly on EC2.
Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.
backdocsearch.com - search engine for chiropractic information, local chiropractors, products and schools
Cascading - Cascading is a dataset processing API and MapReduce "planner" for Hadoop. It includes a Groovy language scripting interface for rapid assembly of complex Hadoop job workflows.
-
Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
-
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
We currently have a 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage. Each (commodity) node has 8 cores and 4 TB of storage.
We are heavy users of both the streaming and the Java APIs. Using these features, we have built a higher-level data warehousing framework called Hive (see the JIRA ticket). We have also written a read-only FUSE implementation over HDFS.
-
20 machine cluster (8 cores/machine, 1TB/machine storage)
10 machine cluster (8 cores/machine, 1TB/machine storage)
In the process of creating a third 50-node cluster
Use for log analysis, data mining and machine learning
Hadoop Korean User Group, a Korean Local Community Team Page.
50 node cluster in the Korea university network environment.
Pentium 4 PC, HDFS 4TB Storage
Used for development projects
Retrieving and Analyzing Biomedical Knowledge
Latent Semantic Analysis, Collaborative Filtering
-
We are using Hadoop and Nutch to crawl blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.
-
From TechCrunch: Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targetting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.
Information Sciences Institute (ISI)
Used Hadoop and 18 nodes/52 cores to plot the entire internet.
-
Session analysis and report generation
Katta - Katta serves large Lucene indexes in a grid environment. Uses Hadoop FileSystem, RPC and IO
Koubei.com - Large local community and local search site in China.
Using Hadoop to process Apache logs, analyzing users' actions and click flow, the links clicked from any specified page on the site, and more.
Using Hadoop to process all user-entered price data with map/reduce.
-
Source code search engine uses Hadoop and Nutch.
-
25 node cluster (dual Xeon LV 2GHz, 4GB RAM, 1TB/node storage)
10 node cluster (dual Xeon L5320 1.86GHz, 8GB RAM, 3TB/node storage)
Used for charts calculation and web log analysis
-
We use Hadoop to process clickstream and demographic data in order to create web analytic reports.
Our cluster runs on Amazon's EC2 web service and uses the streaming module so that most operations can be written in Python.
-
Using Hadoop and HBase for storage, log analysis, and pattern discovery/analysis.
-
12 node cluster (Dual-Core AMD Opteron 1212, 4-8GB RAM, 1.5TB/node storage)
Parses and indexes mail logs for search
-
Another Apache project using Hadoop to build scalable machine learning algorithms such as canopy clustering and k-means, with many more to come (naive Bayes classifiers, others)
NetSeer -
Up to 1000 instances on Amazon EC2
Data storage in Amazon S3
50 node cluster in Coloc
Used for crawling, processing, serving and log analysis
-
Used EC2 to run Hadoop on a large virtual cluster
Nutch - flexible web search engine software
Powerset - Natural Language Search
Up to 400 instances on Amazon EC2
Data storage in Amazon S3
-
4 node cluster (32 cores, 1TB).
We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
We plan to use Pig very shortly to produce statistics.
-
A project to help develop open source social search tools. We run a 125 node Hadoop cluster.
SEDNS - Security Enhanced DNS Group
We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing Hadoop DFS and MapReduce.
-
We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
-
We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.
Visible Measures Corporation uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
-
>10000 computer nodes running Hadoop, each with many CPUs
Our biggest cluster: 2000 nodes (2*4cpu boxes with 3TB disk each)
Used to support research for Ad Systems and Web Search
Also used to do scaling tests to support development of Hadoop on larger clusters
Our Blog - Learn more about how we use Hadoop.
-
10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
Run Naive Bayes classifiers in parallel over crawl data to discover event information
When applicable, please include details about your cluster hardware and size.
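Several entries above mention using the streaming API with Python. For readers unfamiliar with that style, here is a minimal sketch of a streaming-style mapper and reducer. It is not taken from any of the listed organizations. The input layout (tab-separated `session_id` and `url`) and the names `mapper`, `reducer`, and `run_local` are assumptions made for illustration. The only behaviour relied on is the streaming convention that mappers and reducers exchange tab-separated key/value lines and that the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'session_id<TAB>1' for each well-formed record.

    Assumed (hypothetical) input layout: 'session_id<TAB>url'.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip malformed or empty records
        yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process, for local testing only."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    # When used as a streaming script, read stdin and write stdout.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Calling `run_local` on a handful of sample lines is a quick way to check the logic before submitting the same script to a cluster as the map and reduce commands.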