Pig Mix

PigMix is a set of queries used test pig performance from release to release. There are queries that test latency (how long does it take to run this query?), and queries that test scalability (how many fields or records can pig handle before it fails?). In addition it includes a set of map reduce java programs to run equivalent map reduce jobs directly. These will be used to test the performance gap between direct use of map reduce and using pig.

Runs

The following table includes runs done of the pig mix. All of these runs have been done on a cluster with 26 slaves plus one machine acting as the name node and job tracker. The cluster was running hadoop version 0.18.1. (TODO: Need to get specific hardware info on those machines).

The tests were run against two versions of pig: top of trunk, and top of types branch both as of Nov 21 2008.

The tests were run three times for each version and the results averaged.

tot = top of trunk totb = top of types branch

Version

Map Reduce Java Code

tot 11/21/08

totb 11/21/08

totb 1/20/09

tot 2/23/09

Date Run

11/22/08

11/21/08

11/21/08

1/20/09

2/23/09

L1 explode

116

261

283

218

205

L2 fr join

41

1665

253

168

89

L3 join

97

1912

320

258

254

L4 distinct agg

68

254

193

110

116

L5 anti-join

90

1535

281

209

112

L6 large group by key

61

294

226

126

120

L7 nested split

72

243

204

107

102

L8 group all

56

462

194

104

103

L9 order by 1 field

286

5294

867

851

444

L10 order by multiple fields

634

1403

565

469

447

L11 distinct + union

120

316

255

164

154

L12 multi-store

150

fails

781

499

804

Total time

1791

13638

4420

3284

2950

Compared to hadoop

1.0

7.6

2.5

1.8

1.6

Weighted Average

1.0

11.2

3.26

2.20

1.97

The totb run of 1/20/09 includes the change to make BufferedPositionedInputStream use a buffer instead of relying on hadoop to buffer.

tot run of 2/23/09, top of trunk is now what was on the types branch (that is proto 0.2.0). This run includes fragment replicate join and rework of partitioning for order by.

Run of 5/28/09, placed in a separate table because there were underlying cluster changes, thus the map reduce tests needed to be rerun. This is the same code base that became 0.3.0.

|| Version || Map Reduce Java code || tot 5/27/09 ||

Date Run

5/28/09

5/28/09

L1 explode

119

205

L2 fr join

44

110

L3 join

113

314

L4 distinct agg

76

153

L5 anti-join

96

128

L6 large group by key

67

148

L7 nested split

67

133

L8 group all

64

115

L9 order by 1 field

329

563

L10 order by multiple fields

607

532

L11 distinct + union

106

203

L12 multi-store

139

159

Total time

1826

2764

Compared to hadoop

N/A

1.5

Weighted average

N/A

1.83

Run date: June 28, 2009, run against top of trunk as of that day. Note that the columns got reversed in this one (Pig then MR)

|| Test || Pig run time || Java run time || Multiplier ||

PigMix_1

204

117.33

1.74

PigMix_2

110.33

50.67

2.18

PigMix_3

292.33

125

2.34

PigMix_4

149.67

85.33

1.75

PigMix_5

131.33

105

1.25

PigMix_6

146.33

65.33

2.24

PigMix_7

128.33

82

1.57

PigMix_8

126.33

63.67

1.98

PigMix_9

506.67

312.67

1.62

PigMix_10

555

643

0.86

PigMix_11

206.33

136.67

1.51

PigMix_12

173

161.67

1.07

Total

2729.67

1948.33

1.40

Weighted avg

1.68

Run date: August 27, 2009, run against top of trunk as of that day.

Test

Pig run time

Java run time

Multiplier

PigMix_1

218

133.33

1.635

PigMix_2

99.333

48

2.07

PigMix_3

272

127.67

2.13

PigMix_4

142.33

76.333

1.87

PigMix_5

127.33

107.33

1.19

PigMix_6

135.67

73

1.86

PigMix_7

124.67

78.333

1.59

PigMix_8

117.33

68

1.73

PigMix_9

356.33

323.67

1.10

PigMix_10

511.67

684.33

0.75

PigMix_11

180

121

1.49

PigMix_12

156

160.67

0.97

Total

2440.67

2001.67

1.22

Weighted avg

1.53

Run date: October 18, 2009, run against top of trunk as of that day. With this run we included a new measure, weighted average. Our previous multiplier that we have been publishing takes the total time of running all 12 Pig Latin scripts and compares it to the total time of running all 12 Java Map Reduce programs. This is a valid way to measure, as it shows the total amount of time to do all these operations on both platforms. But it has the drawback that it gives more weight to long running operations (such as joins and order bys) while masking the performance in faster operations such as group bys. The new "weighted average" adds up the multiplier for each Pig Latin script vs. Java program separately and then divides by 12, thus weighting each test equally. In past runs the weighted average had significantly lagged the overall average (for example, in the run above for August 27 it was 1.5 even though the total difference was 1.2). With this latest run it still lags some, but the gap has shrunk noticably.

Test

Pig run time

Java run time

Multiplier

PigMix_1

135.0

133.0

1.02

PigMix_2

46.67

39.33

1.19

PigMix_3

184.0

98.0

1.88

PigMix_4

71.67

77.67

0.92

PigMix_5

70.0

83.0

0.84

PigMix_6

76.67

61.0

1.26

PigMix_7

71.67

61.0

1.17

PigMix_8

43.33

47.67

0.91

PigMix_9

184.0

209.33

0.88

PigMix_10

268.67

283.0

0.95

PigMix_11

145.33

168.67

0.86

PigMix_12

55.33

95.33

0.58

Total

1352.33

1357

1.00

Weighted avg

1.04

Features Tested

Based on a sample of user queries, PigMix includes tests for the following features.

  1. Data with many fields, but only a few are used.
  2. Reading data from maps.
  3. Use of bincond and arithmetic operators.
  4. Exploding nested data.
  5. Load bzip2 data
  6. Load uncompressed data
  7. join with one table small enough to fit into a fragment and replicate algorithm.
  8. join where tables are sorted and partitioned on the same key
  9. Do a cogroup that is not immediately followed by a flatten (that is, use cogroup for something other than a straight forward join).
  10. group by with only algebraic udfs that has nested plan (distinct aggs basically).
  11. foreachs with nested plans including filter and implicit splits.
  12. group by where the key accounts for a large portion of the record.
  13. group all
  14. union plus distinct
  15. order by
  16. multi-store query (that is, a query where data is scanned once, then split and grouped different ways).

The data is generated so that it has a zipf type distribution for the group by and join keys, as this models most human generated data. Some other fields are generated using a uniform data distribution.

Scalability tests test the following:

  1. Join of very large data sets.
  2. Grouping of very large data set.
  3. Query with a very wide (500+ fields) row.
  4. Loading many data sets together in one load

Proposed Data

Initially, four data sets have been created. The first, page_views, is 10 million rows in size, with a schema of:

Name

Type

Average Length

Cardinality

Distribution

Percent Null

user

string

20

1.6M

zipf

7

action

int

X

2

uniform

0

timespent

int

X

20

zipf

0

query_term

string

10

1.8M

zipf

20

ip_addr

long

X

1M

zipf

0

timestamp

long

X

86400

uniform

0

estimated_revenue

double

X

100k

zipf

5

page_info

map

15

X

zipf

0

page_links

bag of maps

50

X

zipf

20

The second, users, was created by taking the unique user keys from page_views and adding additional columns.

Name

Type

Average Length

Cardinality

Distribution

Percent Null

name

string

20

1.6M

unique

7

phone

string

10

1.6M

zipf

20

address

string

20

1.6M

zipf

20

city

string

10

1.6M

zipf

20

state

string

2

1.6M

zipf

20

zip

int

X

1.6M

zipf

20

The third, power_users, has 500 rows, and has the same schema as users. It was generated by skimming 500 unique names from users. This will produce a table that can be used to test fragment replicate type joins.

The fourth, widerow, has a very wide row (500 fields), consisting of one string and 499 integers.

users, power_users, and widerow are written in ASCII format, using Ctrl-A as the field delimiter. They can be read using PigStorage.

page_views is written in as text data, with Ctrl-A as the field delimiter. Maps in the file are delimited by Ctrl-C between key value pairs and Ctrl-D between keys and values. Bags in the file are delimited by Ctrl-B between tuples in the bag. A special loader, PigPerformance loader has been written to read this format.

Proposed Scripts

Scalability

Script S1

This script tests grouping, projecting, udf envocation, and filtering with a very wide row. Covers scalability feature 3.

A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, ..., c500: int);
B = group A by name parallel $parrallelfactor;
C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, ... SUM(A.c500) as c500;
D = filter C by c0 > 100 and c1 > 100 and c2 > 100 ... and c500 > 100;
store D into '$out';

Script S2 This script tests joining two inputs where a given value of the join key appears many times in both inputs. This will test pig's ability to handle large joins. It covers scalability features 1 and 2.

TBD

Features not yet tested: 4.

Latency

Script L1

This script tests reading from a map, flattening a bag of maps, and use of bincond (features 2, 3, and 4).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)action as action, (map[])page_info as page_info,
    flatten((bag{tuple(map[])})page_links) as page_links;
C = foreach B generate user,
    (action == 1 ? page_info#'a' : page_links#'b') as header;
D = group C by user $parallelfactor;
E = foreach D generate group, COUNT(C) as cnt;
store E into '$out';

Script L2

This script tests using a join small enough to do in fragment and replicate (feature 7).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, estimated_revenue;
alpha = load '$power_users' using PigStorage('\u0001') as (name, phone,
        address, city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, A by user $parallelfactor;
store C into '$out';

Script L3

This script tests a join too large for fragment and replicate. It also contains a join followed by a group by on the same key, something that pig could potentially optimize by not regrouping.

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
alpha = load '$users' using PigStorage('\u0001') as (name, phone, address,
        city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, A by user $parallelfactor;
D = group C by $0 $parallelfactor;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into '$out';

Script L4

This script covers foreach generate with a nested distinct (feature 10).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action;
C = group B by user $parallelfactor;
D = foreach C {
    aleph = B.action;
    beth = distinct aleph;
    generate group, COUNT(beth);
}
store D into '$out';

Script L5

This script does an anti-join. This is useful because it is a use of cogroup that is not a regular join (feature 9).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user;
alpha = load '$users' using PigStorage('\u0001') as (name, phone, address,
        city, state, zip);
beta = foreach alpha generate name;
C = cogroup beta by name, A by user $parallelfactor;
D = filter C by COUNT(beta) == 0;
E = foreach D generate group;
store E into '$out';

Script L6

This script covers the case where the group by key is a significant percentage of the row (feature 12).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action, (int)timespent as timespent, query_term, ip_addr, timestamp;
C = group B by (user, query_term, ip_addr, timestamp) $parallelfactor;
D = foreach C generate flatten(group), SUM(B.timespent);
store D into '$out';

Script L7

This script covers having a nested plan with splits (feature 11).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (user, action, timespent, query_term,
            ip_addr, timestamp, estimated_revenue, page_info, page_links);
B = foreach A generate user, timestamp;
C = group B by user $parallelfactor;
D = foreach C {
    morning = filter B by timestamp < 43200;
    afternoon = filter B by timestamp >= 43200;
    generate group, COUNT(morning), COUNT(afternoon);
}
store D into '$out';

Script L8

This script covers group all (feature 13).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
C = group B all;
D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into '$out';

Script L9

This script covers order by of a single value (feature 15).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = order A by query_term $parallelfactor;
store B into '$out';

Script L10

This script covers order by of multiple values (feature 15).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent:int, query_term, ip_addr, timestamp,
        estimated_revenue:double, page_info, page_links);
B = order A by query_term, estimated_revenue desc, timespent $parallelfactor;
store B into '$out';

Script L11

This script covers distinct and union and reading from a wide row but using only one field (features: 1, 14).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user;
C = distinct B $parallelfactor;
alpha = load '$widerow' using PigStorage('\u0001');
beta = foreach alpha generate $0 as name;
gamma = distinct beta $parallelfactor;
D = union C, gamma;
E = distinct D $parallelfactor;
store E into '$out';

Script L12

This script covers multi-store queries (feature 16).

register pigperf.jar;
A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action, (int)timespent as timespent, query_term,
    (double)estimated_revenue as estimated_revenue;
split B into C if user is not null, alpha if user is null;
split C into D if query_term is not null, aleph if query_term is null;
E = group D by user $parallelfactor;
F = foreach E generate group, MAX(D.estimated_revenue);
store F into 'highest_value_page_per_user';
beta = group alpha by query_term $parallelfactor;
gamma = foreach beta generate group, SUM(alpha.timespent);
store gamma into 'total_timespent_per_term';
beth = group aleph by action $parallelfactor;
gimel = foreach beth generate group, COUNT(aleph);
store gimel into 'queries_per_action';

Features not yet covered: 5 (bzip data), 8 (sorted join)

Data Generation

If you want to run this queires yourself, please, see https://issues.apache.org/jira/browse/PIG-200 on how to generate the data. See DataGeneratorHadoop for information on how to run data generator in hadoop mode.

PigMix (last edited 2009-11-04 13:12:05 by DmitriyRyaboy)