Reporter : Edward Yoon
Hadoop Map/Reduce Data Processing Benchmarks
Group/Sort
- Finds the most connected networks.
SQL > select ipaddress, count(*) from access_log group by ipaddress order by count(*) desc limit 0,100;
σ count. ipaddress (τ count (γ count(ipaddress). ipaddress (access_log)))
MapReduce Flow
- Map was used to extract the IP address of the client requesting the web page.
- Reduce was used to sum the requests per IP address.
- One more Map/Reduce job was used to sort the results by count.
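The two-pass flow above can be sketched in plain Python as an in-memory simulation. This is not the Hadoop job used for the benchmark: `run_pass`, `map_extract_ip`, `reduce_sum`, `map_swap`, `reduce_identity` and `top_ips` are hypothetical names, and the sketch assumes the client IP address is the first whitespace-separated field of each access_log line.

```python
from collections import defaultdict


def run_pass(records, mapper, reducer):
    """Simulate one Map/Reduce pass: map, group by key, reduce in key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are sorted before reducing, mimicking the shuffle/sort phase.
    return [reducer(key, values) for key, values in sorted(groups.items())]


def map_extract_ip(line):
    """Pass 1 map: emit (ip, 1). Assumes the IP is the first field of the line."""
    fields = line.split()
    if fields:
        yield fields[0], 1


def reduce_sum(ip, counts):
    """Pass 1 reduce: total number of requests per IP address."""
    return ip, sum(counts)


def map_swap(pair):
    """Pass 2 map: re-key each (ip, count) record by its count."""
    ip, count = pair
    yield count, ip


def reduce_identity(count, ips):
    """Pass 2 reduce: keep all IPs that share the same count."""
    return count, sorted(ips)


def top_ips(lines, limit=100):
    """Return up to `limit` (ip, count) pairs, highest count first."""
    counted = run_pass(lines, map_extract_ip, reduce_sum)
    by_count = run_pass(counted, map_swap, reduce_identity)
    ranked = [(ip, count) for count, ips in reversed(by_count) for ip in ips]
    return ranked[:limit]


sample = [
    '10.0.0.1 - - [01/Jan/2008] "GET / HTTP/1.0" 200 10',
    '10.0.0.2 - - [01/Jan/2008] "GET /a HTTP/1.0" 200 20',
    '10.0.0.1 - - [01/Jan/2008] "GET /b HTTP/1.0" 200 30',
]
print(top_ips(sample))  # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```

The second pass relies on the sort that happens between map and reduce: re-keying by count lets the framework do the ordering, which is presumably why a separate job was needed for the `order by` clause.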
Benchmarks
1.5 GB access_log on 10 node cluster
Note: to be comparable, the MySQL column should include the data load time, not just the SQL query time.
|          | MySQL 5.0.27 | Hadoop-0.15.2 | Hadoop-0.15.2 | Hadoop-0.15.2 | Hadoop-0.15.2 | Hadoop-0.15.2 |
| Data     | B-tree disk table (MyISAM) | Text files (access_log) | Text files (access_log) | Text files (access_log) | Text files (access_log) | Text files (access_log) |
| Machines | 1 | 2 | 4 | 6 | 8 | 10 |
| Rows     | 5,914,669 | 5,914,669 | 5,914,669 | 5,914,669 | 5,914,669 | 5,914,669 |
| Results  | 100 | 100 | 100 | 100 | 100 | 100 |
| Time     | 4.43 sec | 172.30 sec | 108.01 sec | 77.41 sec | 66.30 sec | 60.78 sec |
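To make the scaling in the Hadoop columns easier to read, the following sketch computes speedup and parallel efficiency relative to the 2-node run. The timings are copied from the table; the function name `scaling` and the choice of the smallest cluster as the baseline are assumptions made for illustration.

```python
# Node count -> elapsed seconds, copied from the Hadoop columns of the table.
hadoop_times = {2: 172.30, 4: 108.01, 6: 77.41, 8: 66.30, 10: 60.78}


def scaling(times):
    """Return {nodes: (speedup, efficiency)} relative to the smallest cluster."""
    base_nodes = min(times)
    base_time = times[base_nodes]
    rows = {}
    for nodes in sorted(times):
        speedup = base_time / times[nodes]   # how much faster than the baseline
        ideal = nodes / base_nodes           # speedup under perfect linear scaling
        rows[nodes] = (speedup, speedup / ideal)
    return rows


for nodes, (speedup, efficiency) in scaling(hadoop_times).items():
    print(f"{nodes:>2} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Going from 2 to 10 nodes gives a speedup of roughly 2.8x against an ideal of 5x, so on this workload the scaling is clearly sublinear.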
I also investigated many traditional parallel-processing methods and experimented with some higher-level processing (e.g. matrix algebra, graph algorithms) using Hadoop/HBase/MapReduce. The only way to increase speed linearly was data locality (write all data locally, even if that duplicates effort). As the number of nodes increases, the number of I/O channels increases linearly as well.