Pig Performance
This document publishes current (as of 11/07/07) performance numbers for Pig. The objective is to have baseline numbers to compare to before we start making major changes to the system.
In addition, the document publishes numbers for Hadoop programs that perform identical computation. One of the objectives of Pig is to provide performance as close as possible (with 10-20%) to native hadoop code. These numbers allow to establish how wide the current gap is.
Test Setup
Both Pig and Hadoop tests ran on an 11 machine cluster with 1 machine dedicated to Name Node and Job Tracker and 10 compute nodes. Each machine had the following HW configuration:
2 dual-core Intel(R) Xeon(R) CPU @2.13GHz
4 GB of memory
The cluster had Hadoop 0.14.1 installed and had the following configuration:
1024MB memory
2 map + 2 reduce jobs per node
Test Data
Two data sets were used. Both contained tab delimited auto-generated data with identical schema:
name - string
age - integer
gpa - float
The first dataset (studenttab200M) contained 200 million rows (4384624709 bytes) and was used for all tests. The second set (studenttab10k) contained 10 thousand rows (219190 bytes) and was used as the second set in cogroup/join.
The data can be generated using [ADD TOOL]
Test Cases
Load and Store
A = load 'studenttab200M' using PigStorage('\t');
store A into my_studenttab200M using PigStorage();
Filter that removes 10% of data
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = filter A by gpa < '3.6';
store B into my_studenttab200M_filter_10 using PigStorage();
Filter that removes 90% of data
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = filter A by age < '25';
store B into my_studenttab200M_filter_90 using PigStorage();
Generate with basic arithmetic
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = foreach A generate age * gpa + 3, age/gpa - 1.5;
store B into my_studenttab200M_projection using PigStorage();
Grouping
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = group A by name;
C = foreach B generate flatten(group), COUNT(A.age);
store C into my_studenttab200M_group using PigStorage();
Cogrouping/Join
A = load 'studenttab200M' using PigStorage('\t') as (name, age, gpa);
B = load 'studenttab10k' using PigStorage('\t') as (name, age, gpa);
C = cogroup A by name inner, B by name inner;
D = foreach C generate flatten(A), flatten(B);
store D into my_studenttab200M_cogroup_small using PigStorage();
Performance Numbers
The wall-clock time was measured for each pig script and hadoop job by using time command on the client. In addition, average CPU and memory utilization for both map and reduce stages were measured by observing top. All processes were run 3 times and the performance numbers averaged.
|
Test |
Output Size(bytes) |
Pig Time(s) |
Map CPU(%) |
Map Memory(MB) |
Reduce CPU(%) |
Reduce Memory(MB) |
Hadoop Time(s) |
Map CPU(%) |
Map Memory(MB) |
Reduce CPU(%) |
Reduce Memory(MB) |
|
LoadStore |
4384624709 |
185 |
99 |
40 |
N/A |
N/A |
103 |
95 |
40 |
N/A |
N/A |
|
Filter 10% |
3940641765 |
207 |
99 |
40 |
N/A |
N/A |
137 |
95 |
40 |
N/A |
N/A |
|
Filter 90% |
511609335 |
170 |
99 |
40 |
N/A |
N/A |
100 |
95 |
40 |
N/A |
N/A |
|
Project |
7083620758 |
427 |
99 |
150 MB |
N/A |
N/A |
127 |
95 |
40 |
N/A |
N/A |
|
Group |
14144 |
1072 |
99 |
400 |
99 |
750 |
180 |
99 |
300 |
15 |
20 |
|
Cogroup |
129701186044 |
2480 |
99 |
450 |
50-90 |
500 |
1425 |
30-99* |
140 |
N/A |
N/A |
*In case of cogroup data node is taking up to 70% of CPU taking away CPU from one of the tasks.
Improvements
|
Chnage |
Test |
Old time(s) |
New time(s) |
Improvement(%) |
|
binary comparator |
Group |
1072 |
655 |
39 |
|
binary comparator |
Cogroup |
2480 |
2040 |
18 |
Additional Work
We need to do also do the following:
Add more tests for:
ORDER BY where data fits in memory
ORDER BY where data does not fit into memory
DISTINCT where data fits in memory
DISTINCT where data does not fit into memory
JOIN between 2 large tables
UDF overhead
Perform code profiling
by using tools
through code inspection