Powered by Apache Hadoop

This page documents an alphabetical list of institutions that are using Apache Hadoop for educational or production uses. Companies that offer services on or based around Hadoop are listed in Commercial Support. Please include details about your cluster hardware and size. Entries without these details may be mistaken for spam and deleted.

To add entries you need write permission to the wiki, which you can get by subscribing to the common-dev@hadoop.apache.org mailing list and asking for write access for the wiki username you have registered. If you are using Apache Hadoop in production, consider getting involved in the development process anyway: file bugs, test beta releases, review code, and turn your notes into shared documentation. Participating in this process is the best way to ensure your needs are met.


A

B

C

D

E

F

G

H

I

J

K

L

M

*_ Markt24 _
**_ We use Apache Hadoop to extract user behaviour, recommendations and trends from external sites _
**_ Using zkpython to connect with Apache ZooKeeper (see the connection sketch at the end of this section) _
**_ Formerly used EC2, now using many small machines (8 GB RAM, 4 cores, 1 TB) _
*_ MicroCode _
**_ 18 node cluster (Quad-Core Intel Xeon, 1TB/node storage) _
**_ Financial data for search and aggregation _
**_ Customer Relation Management data for search and aggregation _
*_ Media 6 Degrees _
**_ 20 node cluster (dual quad cores, 16GB, 6TB) _
**_ Used for log processing, data analysis and machine learning. _
**_ Focus is on social graph analysis and ad optimization. _
**_ Use a mix of Java, Pig and Hive. _
*_ Medical Side Fx _
**_ Use Apache Hadoop to analyze FDA AERS (Adverse Event Reporting System) data and provide an easy way to search and query the side effects of medicines _
**_ Apache Lucene is used for indexing and searching. _
*_ MeMo News - Online and Social Media Monitoring _
**_ We use Apache Hadoop _
***_ as a platform for distributed crawling _
***_ to store and process unstructured data, such as news and social media (Apache Hadoop, Apache Pig, MapReduce and Apache HBase) _
***_ log file aggregation and processing (Apache Flume) _
*_ Mercadolibre.com _
**_ 20-node cluster (12 * 20 cores, 32GB, 53.3TB) _
**_ Customer logs from on-line apps _
**_ Operations log processing _
**_ Use Java, Apache Pig, Apache Hive, Apache Oozie _
*_ MobileAnalytic.TV _
**_ We use Hadoop to develop MapReduce algorithms: _
***_ Information retrieval and analytics _
***_ Machine generated content - documents, text, audio, & video _
***_ Natural Language Processing _
**_ Project portfolio includes: _
***_ Natural Language Processing _
***_ Mobile Social Network Hacking _
***_ Web Crawlers/Page scraping _
***_ Text to Speech _
***_ Machine generated Audio & Video with remuxing _
***_ Automatic PDF creation & IR _
**_ 2-node cluster (Windows Vista/Cygwin & CentOS) for developing MapReduce programs. _
*_ Moesif API Insights _
**_ We use Hadoop for ETL and processing of time-series event data for alerts/notifications, along with visualizations for the frontend. _
**_ 2 master nodes and 6 data nodes running on Azure using HDInsight _
*_ MyLife _
**_ 18 node cluster (Quad-Core AMD Opteron 2347, 1TB/node storage) _
**_ Powers data for search and aggregation _
*_ Mail.gr - We use HDFS for hosting our users' mailboxes. _
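
The zkpython binding mentioned in the Markt24 entry above is ZooKeeper's legacy C-based Python module. Below is a minimal, hedged sketch of connecting to an ensemble and registering an ephemeral worker znode; the ensemble address and znode names are placeholders, not details taken from any entry on this page.

{{{#!python
# Minimal zkpython sketch (assumes a ZooKeeper ensemble reachable at localhost:2181).
import zookeeper

ZK_HOSTS = "localhost:2181"   # placeholder ensemble address
WORLD_ACL = [{"perms": zookeeper.PERM_ALL, "scheme": "world", "id": "anyone"}]

# init() returns immediately; production code should wait for the session to
# reach the connected state (via a watcher) before issuing requests.
handle = zookeeper.init(ZK_HOSTS)

# Register this process under /workers as an ephemeral, sequenced znode.
if not zookeeper.exists(handle, "/workers"):
    zookeeper.create(handle, "/workers", "", WORLD_ACL, 0)
zookeeper.create(handle, "/workers/worker-", "idle",
                 WORLD_ACL, zookeeper.EPHEMERAL | zookeeper.SEQUENCE)

# List the currently registered workers, then close the session.
print(zookeeper.get_children(handle, "/workers"))
zookeeper.close(handle)
}}}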

N

*_ NAVTEQ Media Solutions _
**_ We use Apache Hadoop/Apache Mahout to process user interactions with advertisements to optimize ad selection. _
*_ Neptune _
**_ Another Bigtable-like project using Hadoop to store large structured data sets. _
**_ 200 nodes(each node has: 2 dual core CPUs, 2TB storage, 4GB RAM) _
*_ NetSeer _
**_ Up to 1000 instances on Amazon EC2 _
**_ Data storage in Amazon S3 _
**_ 50-node cluster at a colocation facility _
**_ Used for crawling, processing, serving and log analysis _
*_ The New York Times _
**_ Large scale image conversions _
**_ Used EC2 to run Hadoop on a large virtual cluster _
*_ Ning _
**_ We use Hadoop to store and process our log files _
**_ We rely on Apache Pig for reporting, analytics, Cascading for machine learning, and on a proprietary JavaScript API for ad-hoc queries _
**_ We use commodity hardware, with 8 cores and 16 GB of RAM per machine _

O

*_ Openstat _
**_ Hadoop is used to run customizable web analytics log analysis and reporting _
**_ 50-node production workflow cluster (dual quad-core Xeons, 16GB of RAM, 4-6 HDDs) and a couple of smaller clusters for individual analytics purposes _
**_ About 500 million events processed daily, 15 billion monthly _
**_ Cluster generates about 25 GB of reports daily _
**_ Technologies used: Cascading, Janino _
*_ optivo - Email marketing software _
**_ We use Apache Hadoop to aggregate and analyse email campaigns and user interactions. _
**_ Development is based on the GitHub repository. _

P

*_ Papertrail - Hosted syslog and app log management _
**_ Hosted syslog and app log service can feed customer logs into Apache Hadoop for their analysis (usually with Hive) _
**_ Most customers load gzipped TSVs from S3 (uploaded nightly) into Amazon Elastic MapReduce; a streaming sketch of this pattern appears at the end of this section _
*_ PARC - Used Hadoop to analyze Wikipedia conflicts (see paper). _
*_ PCPhase - A Japanese mobile integration company _
**_ Using Apache Hadoop/Apache HBase in conjunction with Apache Cassandra to analyze logs and generate reports for a large mobile web site. _
**_ 4 nodes in a private cloud with 4 cores, 4 GB RAM & 500 GB storage each. _
*_ Performable - Web Analytics Software _
**_ We use Apache Hadoop to process web clickstream, marketing, CRM, & email data in order to create multi-channel analytic reports. _
**_ Our cluster runs on Amazon's EC2 webservice and makes use of Python for most of our codebase. _
*_ Pharm2Phork Project - Agricultural Traceability _
**_ Using Hadoop on EC2 to process observation messages generated by RFID/Barcode readers as items move through supply chain. _
**_ Analysis of BPEL-generated log files for monitoring and tuning of workflow processes. _
*_ Powerset / Microsoft - Natural Language Search _
**_ Up to 400 instances on Amazon EC2 _
**_ Data storage in Amazon S3 _
**_ Microsoft is now contributing to Apache HBase (see announcement). _
*_ Pressflip - Personalized Persistent Search _
**_ Using Apache Hadoop on EC2 to process documents from a continuous web crawl and distributed training of support vector machines _
**_ Using HDFS for large archival data storage _
*_ Pronux _
**_ 4-node cluster (32 cores, 1TB). _
**_ We use Apache Hadoop for searching and analysis of millions of bookkeeping postings _
**_ Also used as a proof of concept cluster for a cloud based ERP system _
*_ PokerTableStats _
**_ 2-node cluster (16 cores, 500GB). _
**_ We use Apache Hadoop for analyzing poker players' game histories and generating gameplay-related player statistics _
*_ Portabilité _
**_ 50 node cluster in a colocated site. _
**_ Also used as a proof of concept cluster for a cloud based ERP system. _
*_ PSG Tech, Coimbatore, India _
**_ Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm coupled with data and compute parallelism of Hadoop data grids improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. The scalable nature of Hadoop makes it apt to solve large scale alignment problems. _
**_ Our cluster size varies from 5 to 10 nodes. Cluster nodes range from 2950 Quad Core Rack Servers with 2x6 MB cache and 4x500 GB SATA hard drives, to E7200/E7400 processors with 4 GB RAM and 160 GB HDD. _
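
As a rough illustration of the nightly batch pattern in the Papertrail entry above (gzipped TSV logs from S3 processed on Elastic MapReduce, usually with Hive), here is a hedged Hadoop Streaming sketch in Python rather than Hive; the column layout, bucket names and paths are assumptions, not Papertrail's actual schema.

{{{#!python
#!/usr/bin/env python
# mapper.py -- emit (program_name, 1) for each tab-separated syslog line.
# Assumption: the third TSV column holds the program name; adjust to the real schema.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2:
        print("%s\t1" % fields[2])
}}}

{{{#!python
#!/usr/bin/env python
# reducer.py -- sum the per-program counts (streaming input arrives sorted by key).
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
}}}

Such a job would typically be submitted with the Hadoop Streaming jar (for example: hadoop jar hadoop-streaming.jar -input s3n://example-bucket/logs/*.tsv.gz -output s3n://example-bucket/reports/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py); gzipped text inputs are decompressed transparently by the default text input format.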

Q

*_ Quantcast _
**_ 3000 cores, 3500 TB. 1 PB+ processed each day. _
**_ Apache Hadoop scheduler with fully custom data path / sorter _
**_ Significant contributions to KFS filesystem _

R

*_ Rackspace _
**_ 30 node cluster (Dual-Core, 4-8GB RAM, 1.5TB/node storage) _
**_ Parses and indexes logs from email hosting system for search: http://blog.racklabs.com/?p=66 _
*_ Rakuten - Japan's online shopping mall _
**_ 69 node cluster _
**_ We use Apache Hadoop to analyze logs and mine data for recommender system and so on. _
*_ Rapleaf _
**_ 80 node cluster (each node has: 2 quad core CPUs, 4TB storage, 16GB RAM) _
**_ We use Hadoop to process data relating to people on the web _
**_ We are also using Cascading to help simplify how our data flows through various processing stages _
*_ Recruit _
**_ Hardware: 50 nodes (2 x 4-core CPUs, 4 x 2 TB disks, 16 GB RAM each) _
**_ We use Apache Hive to analyze logs and mine data for recommendation. _
*_ reisevision _
**_ We use Apache Hadoop for our internal search _
*_ Redpoll _
**_ Hardware: 35 nodes (2 x 4-core CPUs, 10 TB disk, 16 GB RAM each) _
**_ We intend to parallelize some traditional classification and clustering algorithms like Naive Bayes, K-Means and EM so that they can deal with large-scale data sets. _
*_ Resu.me _
**_ Hardware: 5 nodes _
**_ We use Apache Hadoop to process user resume data and run algorithms for our recommendation engine. _
*_ RightNow Technologies - Powering Great Experiences _
**_ 16 node cluster (each node has: 2 quad core CPUs, 6TB storage, 24GB RAM) _
**_ We use Apache Hadoop for log and usage analysis _
**_ We predominantly leverage Hive and HUE for data access _
*_ Rodacino _
**_ We use Apache Hadoop for crawling news sites and log analysis. _
**_ We also use Apache Cassandra as our back end and Apache Lucene for searching capabilities _
*_ Rovi Corporation _
**_ We use Apache Hadoop, Apache Pig and map/reduce to process extracted SQL data to generate JSON objects that are stored in MongoDB and served through our web services _
**_ We have two clusters with a total of 40 nodes with 24 cores at 2.4GHz and 128GB RAM _
**_ Each night we process over 160 Pig scripts and 50 map/reduce jobs that process over 600GB of data _
*_ Rubbellose _

S

*_ SARA, Netherlands _
**_ SARA has initiated a Proof-of-Concept project to evaluate the Hadoop software stack for scientific use. _
*_ Search Wikia _
**_ A project to help develop open source social search tools. We run a 125 node Hadoop cluster. _
*_ SEDNS - Security Enhanced DNS Group _
**_ We are gathering worldwide DNS data in order to discover content distribution networks and configuration issues, utilizing HDFS and MapReduce. _
*_ Sematext International _
**_ We use Hadoop to store and analyze large amounts of search and performance data for our Search Analytics and Scalable Performance Monitoring services. _
*_ SLC Security Services LLC _
**_ 18 node cluster (each node has: 4 dual core CPUs, 1TB storage, 4GB RAM, RedHat OS) _
**_ We use Hadoop for our high speed data mining applications _
*_ Sling Media _
**_ We have a core analytics group that is using a 10-Node cluster running RedHat OS _
**_ Hadoop is used as an infrastructure to run MapReduce (MR) algorithms on large amounts of raw data _
**_ Raw data ingest happens hourly. Raw data comes from hardware and software systems out in the field _
**_ Ingested and processed data is stored into a relational DB and rolled up using Hive/Pig _
**_ Plan to implement Mahout to build recommendation engine _
*_ Socialmedia.com _
**_ 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM) _
**_ We use Hadoop to process log data and perform on-demand analytics _
*_ Spadac.com _
**_ We are developing the MrGeo (Map/Reduce Geospatial) application to allow our users to bring cloud computing to geospatial processing. _
**_ We use Apache HDFS and MapReduce to store, process, and index geospatial imagery and vector data. _
**_ MrGeo is soon to be open sourced as well. _
*_ Specific Media _
**_ We use Apache Hadoop for log aggregation, reporting and analysis _
**_ Two Apache Hadoop clusters, all nodes 16 cores, 32 GB RAM _
**_ Cluster 1: 27 nodes (total 432 cores, 544GB RAM, 280TB storage) _
**_ Cluster 2: 111 nodes (total 1776 cores, 3552GB RAM, 1.1PB storage) _
**_ We contribute to Hadoop and related projects where possible, see http://code.google.com/p/bigstreams/ and http://code.google.com/p/hadoop-gpl-packing/ _
*_ Spotify _
**_ We use Apache Hadoop for content generation, data aggregation, reporting and analysis (see more: The Evolution of Hadoop at Spotify - Through Failures and Pain) and even for generating music recommendations (How Apache Drives Music Recommendations At Spotify) _
**_ 1650-node cluster: 43,000 virtualized cores, ~70 TB RAM, ~65 PB storage (read more about our Hadoop issues while growing fast: Hadoop Adventures At Spotify) _
**_ 20,000+ daily Hadoop jobs (scheduled by Luigi, our open-sourced job orchestrator - code and video; see the sketch at the end of this section) _
*_ Stampede Data Solutions (Stampedehost.com) _
**_ Hosted Apache Hadoop data warehouse solution provider _
*_ Sthenica _
**_ We use Apache Hadoop for sentiment analysis/social media monitoring and personalized marketing _
**_ Using a 3-node cluster in a virtualized environment with a 4th node for SQL reporting _
*_ StumbleUpon (StumbleUpon.com) _
**_ We use Apache HBase to store our recommendation information and to run other operations. We have HBase committers on staff. _
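
Luigi, mentioned in the Spotify entry above, is their open-sourced Python job orchestrator. The following is a minimal, hedged sketch of how one Luigi task can depend on another; the task names, file paths and log format are illustrative only and not Spotify's actual pipeline.

{{{#!python
# Minimal Luigi dependency sketch (task names, paths and log format are illustrative).
import luigi


class FetchListeningLogs(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("logs/%s.tsv" % self.date)

    def run(self):
        # Stand-in for the real log fetch (e.g. copying a day of logs from HDFS).
        with self.output().open("w") as out:
            out.write("user1\ttrackA\nuser2\ttrackA\nuser1\ttrackB\n")


class CountPlaysPerTrack(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Luigi runs (or verifies) the dependency before this task.
        return FetchListeningLogs(self.date)

    def output(self):
        return luigi.LocalTarget("reports/plays-%s.tsv" % self.date)

    def run(self):
        counts = {}
        with self.input().open("r") as logs:
            for line in logs:
                track = line.rstrip("\n").split("\t")[1]
                counts[track] = counts.get(track, 0) + 1
        with self.output().open("w") as out:
            for track, plays in sorted(counts.items()):
                out.write("%s\t%d\n" % (track, plays))


if __name__ == "__main__":
    # e.g. python plays.py CountPlaysPerTrack --date 2015-01-01 --local-scheduler
    luigi.run()
}}}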

T

*_ Taragana - Web 2.0 Product development and outsourcing services _
**_ We are using 16 consumer-grade computers to create the cluster, connected by a 100 Mbps network. _
**_ Used for testing ideas for blog and other data mining. _
*_ The Lydia News Analysis Project - Stony Brook University _
**_ We are using Apache Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources. _
*_ Tailsweep - Ad network for blogs and social media _
**_ 8 node cluster (Xeon Quad Core 2.4GHz, 8GB RAM, 500GB/node Raid 1 storage) _
**_ Used as a proof of concept cluster _
**_ Handles e.g. data mining and blog crawling _
*_ Technical analysis and Stock Research _
**_ Generating stock analysis on 23 nodes (dual 2.4GHz Xeon, 2 GB RAM, 36GB Hard Drive) _
*_ Tegatai _
**_ Collection and analysis of Log, Threat, Risk Data and other Security Information on 32 nodes (8-Core Opteron 6128 CPU, 32 GB RAM, 12 TB Storage per node) _
*_ Telefonica Research _
**_ We use Apache Hadoop in our data mining and user modeling, multimedia, and internet research groups. _
**_ 6 node cluster with 96 total cores, 8GB RAM and 2 TB storage per machine. _
*_ Telenav _
**_ 60-Node cluster for our Location-Based Content Processing including machine learning algorithms for Statistical Categorization, Deduping, Aggregation & Curation (Hardware: 2.5 GHz Quad-core Xeon, 4GB RAM, 13TB HDFS storage). _
**_ Private cloud for rapid server-farm setup for staging and test environments. (Using Elastic N-Node cluster) _
**_ Public cloud for exploratory projects that require rapid servers for scalability and computing surges (Using Elastic N-Node cluster) _
*_ Tepgo - E-Commerce Data Analysis _
**_ We use Apache Hadoop, Apache Pig and Apache HBase to analyze search logs, product view data, and usage logs _
**_ 3 node cluster with 48 cores in total, 4GB RAM and 1 TB storage each. _
*_ Tianya _
**_ We use Apache Hadoop for log analysis. _
*_ TubeMogul _
**_ We use Apache Hadoop HDFS, Map/Reduce, Apache Hive and Apache HBase _
**_ We manage over 300 TB of HDFS data across four Amazon EC2 Availability Zones _
*_ tufee _
**_ We use Apache Hadoop for searching and indexing _
*_ Twitter _
**_ We use Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We store all data as compressed LZO files. _
**_ We use both Scala and Java to access Hadoop's MapReduce APIs _
**_ We use Apache Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements. _
**_ We employ committers on Apache Pig, Apache Avro, Apache Hive, and Apache Cassandra, and contribute much of our internal Hadoop work to open source (see hadoop-lzo) _
**_ For more on our use of Apache Hadoop, see the following presentations: Hadoop and Pig at Twitter and Protocol Buffers and Hadoop at Twitter _
*_ Tynt _
**_ We use Apache Hadoop to assemble web publishers' summaries of what users are copying from their websites, and to analyze user engagement on the web. _
**_ We use Apache Pig and custom Java map-reduce code, as well as Apache Chukwa. _
**_ We have 94 nodes (752 cores) in our clusters, as of July 2010, but the number grows regularly. _

U

*_ Universidad Distrital Francisco Jose de Caldas (Grupo GICOGE/Grupo Linux UD GLUD/Grupo GIGA) _
**_ 5 node low-profile cluster. We use Hadoop to support the research project: Territorial Intelligence System of Bogota City. _
*_ University of Freiburg - Databases and Information Systems _
**_ 10-node cluster (Xeon Dual Core 3.16GHz, 4GB RAM, 3TB/node storage). _
**_ Our goal is to develop techniques for the Semantic Web that take advantage of MapReduce (Hadoop) and its scaling-behavior to keep up with the growing proliferation of semantic data. _
**_ RDFPath is an expressive RDF path language for querying large RDF graphs with MapReduce. _
**_ PigSPARQL is a translation from SPARQL to Pig Latin, allowing SPARQL queries to be executed on large RDF graphs with MapReduce. _
*_ University of Glasgow - Terrier Team _
**_ 30-node cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage). We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of Terrier includes large-scale distributed indexing using Hadoop MapReduce. _
*_ University of Maryland _
**_ We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing. _
*_ University of Nebraska Lincoln, Holland Computing Center _
**_ We currently run one medium-sized Hadoop cluster (1.6PB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Apache Hadoop. _
*_ University of Twente, Database Group _
**_ We run a 16 node cluster (dual-core Xeon E3110 64 bit processors with 6MB cache, 8GB main memory, 1TB disk) as of December 2008. We teach MapReduce and use Apache Hadoop in our computer science master's program, and for information retrieval research. For more information, see: http://mirex.sourceforge.net/ _

V

*_ Veoh _
**_ We use a small Apache Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data. _
*_ Bygga hus _
**_ We use an Apache Hadoop cluster for search and indexing for our projects. _
*_ Visible Measures Corporation _

W

*_ Web Alliance _
**_ We use Apache Hadoop for our internal search engine optimization (SEO) tools. It allows us to store, index and search data much faster. _
**_ We also use it for log analysis and trend prediction. _

X

Y

Z