Reporter : Edward Yoon
Hadoop Map/Reduce Data Processing Benchmarks
Group/Sort
Finds the most connected networks: the 100 client IP addresses that appear most often in the access log.
SQL > select ipaddress, count(*) from access_log group by ipaddress order by count(*) desc limit 0,100;
σ_{count, ipaddress}(τ_{count}(γ_{count(ipaddress), ipaddress}(access_log)))
MapReduce Flow
Map extracts the IP address of the client requesting the web page.
Reduce sums the number of requests for each IP address.
One more Map/Reduce job sorts the results by count.
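The flow above can be sketched in plain Python, without Hadoop, to show how the two jobs fit together. This is only an illustration, not the benchmark code: the function names (`map_extract_ip`, `reduce_sum`, `run_job`), the sample log lines, and the assumption that the client IP is the first whitespace-separated field of each line are all hypothetical.

```python
from collections import defaultdict

def map_extract_ip(line):
    """Map: emit (ip, 1) for one log line.
    Assumption: the client IP is the first whitespace-separated field."""
    fields = line.split()
    if fields:
        yield fields[0], 1

def reduce_sum(ip, counts):
    """Reduce: sum the counts emitted for one IP address."""
    return ip, sum(counts)

def run_job(lines, top_n=100):
    # Job 1: map, group by key (stands in for the shuffle), then reduce.
    grouped = defaultdict(list)
    for line in lines:
        for ip, one in map_extract_ip(line):
            grouped[ip].append(one)
    counted = [reduce_sum(ip, counts) for ip, counts in grouped.items()]
    # Job 2: sort by count, descending, and keep the first top_n rows.
    # Ties are broken by IP string so the output is deterministic.
    counted.sort(key=lambda pair: (-pair[1], pair[0]))
    return counted[:top_n]

# Made-up sample lines, for illustration only.
sample_log = [
    '10.0.0.1 - - [01/Jan/2008:00:00:01] "GET /a HTTP/1.0" 200 10',
    '10.0.0.2 - - [01/Jan/2008:00:00:02] "GET /b HTTP/1.0" 200 10',
    '10.0.0.1 - - [01/Jan/2008:00:00:03] "GET /c HTTP/1.0" 200 10',
    '10.0.0.3 - - [01/Jan/2008:00:00:04] "GET /a HTTP/1.0" 200 10',
    '10.0.0.1 - - [01/Jan/2008:00:00:05] "GET /b HTTP/1.0" 200 10',
    '10.0.0.2 - - [01/Jan/2008:00:00:06] "GET /c HTTP/1.0" 200 10',
]

top = run_job(sample_log)
for ip, count in top:
    print(ip, count)
```

In a real cluster the grouping step is done by the framework between map and reduce, and the sort runs as a separate job because the counts are only known after the first job finishes.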
Benchmarks
1.5 GB access_log on a 10-node cluster
This test should include the data load time for the MySql column, not just the SQL time.
| | MySql 5.0.27 | Hadoop-0.15.2 | Hadoop-0.15.2 | Hadoop-0.15.2 | Hadoop-0.15.2 | Hadoop-0.15.2 |
| Data | B-tree disk table (MyISAM) | Text files (access_log) | Text files (access_log) | Text files (access_log) | Text files (access_log) | Text files (access_log) |
| Machines | 1 | 2 | 4 | 6 | 8 | 10 |
| Rows | 5,914,669 | 5,914,669 | 5,914,669 | 5,914,669 | 5,914,669 | 5,914,669 |
| Results | 100 | 100 | 100 | 100 | 100 | 100 |
| Time | 4.43 sec | 172.30 sec | 108.01 sec | 77.41 sec | 66.30 sec | 60.78 sec |
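One way to read the Hadoop timings is as speedup relative to the smallest Hadoop run. The sketch below does that arithmetic on the Time values from the table above. Choosing the 2-machine run as the baseline, and the `speedup` and `efficiency` helpers themselves, are assumptions made for illustration and are not part of the original benchmark.

```python
# Hadoop timings copied from the Time row of the table: machines -> seconds.
hadoop_seconds = {2: 172.30, 4: 108.01, 6: 77.41, 8: 66.30, 10: 60.78}

# Assumption: use the 2-machine run as the baseline.
baseline_machines = 2
baseline_seconds = hadoop_seconds[baseline_machines]

def speedup(machines):
    """How many times faster than the baseline run."""
    return baseline_seconds / hadoop_seconds[machines]

def efficiency(machines):
    """Fraction of ideal linear scaling achieved relative to the baseline."""
    return speedup(machines) / (machines / baseline_machines)

for machines in sorted(hadoop_seconds):
    print(f"{machines:2d} machines: "
          f"speedup {speedup(machines):.2f}x, "
          f"efficiency {efficiency(machines):.0%}")
```

On these numbers, going from 2 to 10 machines gives roughly a 2.8x speedup rather than the ideal 5x, so the returns from adding machines diminish for a 1.5 GB input.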
Hbase Matrix computations Benchmarks
You can download the Hbase Matrix Package for Map/Reduce-based Parallel Matrix Computations (still under development)
MapReduce Flow
The multiplication requires (n + 1) full table scans, irrespective of the number of mappers.
Each map processor requires O(n^2) for communication and O(n^3/mappers) for computation.
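The O(n^3/mappers) computation term can be illustrated with a small pure-Python sketch. It assumes a row-block partitioning in which each mapper receives a contiguous block of rows of A together with all of B. The source does not say how the Hbase Matrix Package partitions the work, so the scheme and the names `split_rows`, `map_multiply`, and `multiply` are hypothetical.

```python
def split_rows(n, mappers):
    """Assign row indices 0..n-1 to mappers in contiguous blocks."""
    base, extra = divmod(n, mappers)
    blocks, start = [], 0
    for m in range(mappers):
        size = base + (1 if m < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def map_multiply(a_rows, b):
    """One mapper: multiply its rows of A by all of B, counting multiply-adds."""
    n = len(b)
    ops = 0
    out = []
    for row in a_rows:
        out_row = [0] * n
        for j in range(n):
            for k in range(n):
                out_row[j] += row[k] * b[k][j]
                ops += 1
        out.append(out_row)
    return out, ops

def multiply(a, b, mappers):
    """Run every mapper in turn and collect the product and per-mapper op counts."""
    product, ops_per_mapper = [], []
    for block in split_rows(len(a), mappers):
        rows, ops = map_multiply([a[i] for i in block], b)
        product.extend(rows)
        ops_per_mapper.append(ops)
    return product, ops_per_mapper

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, ops_per_mapper = multiply(a, b, mappers=2)
print(product)         # [[19, 22], [43, 50]]
print(ops_per_mapper)  # [4, 4], i.e. n^3 / mappers = 8 / 2 per mapper
```

Under this assumed partitioning each mapper needs all n^2 entries of B, which matches the O(n^2) communication term, and performs about n^3/mappers multiply-adds.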
