Hadoop Map/Reduce Data Processing Benchmarks

Group/Sort

SQL > select ipaddress, count(*) from access_log group by ipaddress order by count(*) desc limit 0,100;
σ count. ipaddresscountcount(ipaddress). ipaddress (access_log)))

MapReduce Flow

Benchmarks

1.5 GB access_log on 10 node cluster

This test should include the data load time for the MySql column, not just the SQL time.

http://wiki.apache.org/hadoop-data/attachments/DataProcessingBenchmarks/attachments/C__Users_udanax_Desktop_test-10.png

MySql 5.0.27

Hadoop-0.15.2

Hadoop-0.15.2

Hadoop-0.15.2

Hadoop-0.15.2

Hadoop-0.15.2

Data

B-tree disk table (MyISAM)

Text files (access_log)

Text files (access_log)

Text files (access_log)

Text files (access_log)

Text files (access_log)

Machine

1

2

4

6

8

10

Rows

5,914,669

5,914,669

5,914,669

5,914,669

5,914,669

5,914,669

Results

100

100

100

100

100

100

Time

4.43 sec

172.30 sec

108.01 sec

77.41 sec

66.30 sec

60.78 sec


I also investigate a lot of traditional methods of parallel processing and experiment some high level processing (e.g. matrix algebra, graph algorithm) using Hadoop/Hbase/MapReduce. The only way to increase speed linearly was locality (Do write all data even if there are duplicated efforts). Increased node numbers, there is a linear increase of IO channel.

DataProcessingBenchmarks (last edited 2009-09-20 23:54:19 by localhost)