...
PigMix is a set of queries used to test Pig performance from release to release. There are queries that test latency (how long does it take to run this query?) and queries that test scalability (how many fields or records can Pig handle before it fails?). In addition, it includes a set of map reduce Java programs that run equivalent map reduce jobs directly. These are used to test the performance gap between direct use of map reduce and using Pig. In June 2010, we released PigMix2, which adds 5 more queries to the original 12 to measure the performance of new Pig features. We will publish the results of both PigMix and PigMix2.
...
Usage
To run PigMix, run the following commands from PIG_HOME:
Code Block |
---|
ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy   # generate the test dataset
ant -Dharness.hadoop.home=$HADOOP_HOME pigmix          # run the PigMix benchmark
|
You can optionally set HADOOP_CONF_DIR before running.
If you want to change the default size of the test dataset, edit test/perf/pigmix/conf/config.sh.
Note that PigMix is checked in to Pig 0.12 and later. If you want to run it with an earlier version of Pig, go to https://issues.apache.org/jira/browse/PIG-200 and use PIG-200-0.12.patch.
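The two steps above can also be scripted. The following is a minimal sketch, assuming `ant` is on the PATH and is launched from PIG_HOME. The helper names `pigmix_commands` and `pigmix_env` and the placeholder path are hypothetical; only the two ant targets, the `harness.hadoop.home` property, and `HADOOP_CONF_DIR` come from the instructions above.

```python
import os


def pigmix_commands(hadoop_home):
    """Return the two ant invocations (deploy, then run) as argument lists."""
    prop = "-Dharness.hadoop.home=" + hadoop_home
    return [
        ["ant", prop, "pigmix-deploy"],  # generate the test dataset
        ["ant", prop, "pigmix"],         # run the PigMix benchmark
    ]


def pigmix_env(hadoop_conf_dir=None):
    """Copy the current environment, optionally setting HADOOP_CONF_DIR."""
    env = dict(os.environ)
    if hadoop_conf_dir is not None:
        env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    return env


# Print the commands rather than running them. "/path/to/hadoop" is only a
# placeholder used when HADOOP_HOME is not set.
for cmd in pigmix_commands(os.environ.get("HADOOP_HOME", "/path/to/hadoop")):
    print(" ".join(cmd))
```

To actually execute the steps, each argument list could be passed to `subprocess.run(cmd, cwd=pig_home, env=pigmix_env(...), check=True)`, where `pig_home` is your PIG_HOME directory.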
Runs
PigMix
The following table lists PigMix runs. All of these runs were done on a cluster with 26 slaves plus one machine acting as the name node and job tracker. The cluster was running hadoop version 0.18.1. (TODO: Need to get specific hardware info on those machines).
The tests were run against two versions of pig: top of trunk and top of types branch, both as of Nov 21 2008.
The tests were run three times for each version and the results averaged.
tot = top of trunk
totb = top of types branch
Version | Map Reduce Java Code | tot 11/21/08 | totb 11/21/08 | totb 1/20/09 | tot 2/23/09 |
---|---|---|---|---|---|
Date Run | 11/22/08 | 11/21/08 | 11/21/08 | 1/20/09 | 2/23/09 |
L1 explode | 116 | 261 | 283 | 218 | 205 |
L2 fr join | 41 | 1665 | 253 | 168 | 89 |
L3 join | 97 | 1912 | 320 | 258 | 254 |
L4 distinct agg | 68 | 254 | 193 | 110 | 116 |
L5 anti-join | 90 | 1535 | 281 | 209 | 112 |
L6 large group by key | 61 | 294 | 226 | 126 | 120 |
L7 nested split | 72 | 243 | 204 | 107 | 102 |
L8 group all | 56 | 462 | 194 | 104 | 103 |
L9 order by 1 field | 286 | 5294 | 867 | 851 | 444 |
L10 order by multiple fields | 634 | 1403 | 565 | 469 | 447 |
L11 distinct + union | 120 | 316 | 255 | 164 | 154 |
L12 multi-store | 150 | fails | 781 | 499 | 804 |
Total time | 1791 | 13638 | 4420 | 3284 | 2950 |
Compared to hadoop | 1.0 | 7.6 | 2.5 | 1.8 | 1.6 |
Weighted Average | 1.0 | 11.2 | 3.26 | 2.20 | 1.97 |
The totb run of 1/20/09 includes the change to make BufferedPositionedInputStream use a buffer instead of relying on hadoop to buffer.
For the tot run of 2/23/09, top of trunk is now what was on the types branch (that is, proto 0.2.0). This run includes fragment replicate join and the rework of partitioning for order by.
The run of 5/28/09 is placed in a separate table because there were underlying cluster changes, so the map reduce tests needed to be rerun. This is the same code base that became 0.3.0.
Version | Map Reduce Java code | tot 5/27/09 |
---|---|---|
Date Run | 5/28/09 | 5/28/09 |
L1 explode | 119 | 205 |
L2 fr join | 44 | 110 |
L3 join | 113 | 314 |
L4 distinct agg | 76 | 153 |
L5 anti-join | 96 | 128 |
L6 large group by key | 67 | 148 |
L7 nested split | 67 | 133 |
L8 group all | 64 | 115 |
L9 order by 1 field | 329 | 563 |
L10 order by multiple fields | 607 | 532 |
L11 distinct + union | 106 | 203 |
L12 multi-store | 139 | 159 |
Total time | 1826 | 2764 |
Compared to hadoop | N/A | 1.5 |
Weighted average | N/A | 1.83 |
Run date: June 28, 2009, run against top of trunk as of that day.
Note that the columns got reversed in this one (Pig then MR)
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 204 | 117.33 | 1.74 |
PigMix_2 | 110.33 | 50.67 | 2.18 |
PigMix_3 | 292.33 | 125 | 2.34 |
PigMix_4 | 149.67 | 85.33 | 1.75 |
PigMix_5 | 131.33 | 105 | 1.25 |
PigMix_6 | 146.33 | 65.33 | 2.24 |
PigMix_7 | 128.33 | 82 | 1.57 |
PigMix_8 | 126.33 | 63.67 | 1.98 |
PigMix_9 | 506.67 | 312.67 | 1.62 |
PigMix_10 | 555 | 643 | 0.86 |
PigMix_11 | 206.33 | 136.67 | 1.51 |
PigMix_12 | 173 | 161.67 | 1.07 |
Total | 2729.67 | 1948.33 | 1.40 |
Weighted avg | | | 1.68 |
Run date: August 27, 2009, run against top of trunk as of that day.
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 218 | 133.33 | 1.635 |
PigMix_2 | 99.333 | 48 | 2.07 |
PigMix_3 | 272 | 127.67 | 2.13 |
PigMix_4 | 142.33 | 76.333 | 1.87 |
PigMix_5 | 127.33 | 107.33 | 1.19 |
PigMix_6 | 135.67 | 73 | 1.86 |
PigMix_7 | 124.67 | 78.333 | 1.59 |
PigMix_8 | 117.33 | 68 | 1.73 |
PigMix_9 | 356.33 | 323.67 | 1.10 |
PigMix_10 | 511.67 | 684.33 | 0.75 |
PigMix_11 | 180 | 121 | 1.49 |
PigMix_12 | 156 | 160.67 | 0.97 |
Total | 2440.67 | 2001.67 | 1.22 |
Weighted avg | | | 1.53 |
Run date: October 18, 2009, run against top of trunk as of that day.
With this run we included a new measure, weighted average. The multiplier that we had been publishing takes the total time of running all 12 Pig Latin scripts and compares it to the total time of running all 12 Java Map Reduce programs. This is a valid way to measure, as it shows the total amount of time to do all these operations on both platforms. But it has the drawback that it gives more weight to long-running operations (such as joins and order bys) while masking the performance of faster operations such as group bys. The new "weighted average" adds up the multiplier for each Pig Latin script vs. Java program separately and then divides by 12, thus weighting each test equally. In past runs the weighted average had significantly lagged the overall average (for example, in the run above for August 27 it was 1.5 even though the total difference was 1.2). With this latest run it still lags some, but the gap has shrunk noticeably.
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 135.0 | 133.0 | 1.02 |
PigMix_2 | 46.67 | 39.33 | 1.19 |
PigMix_3 | 184.0 | 98.0 | 1.88 |
PigMix_4 | 71.67 | 77.67 | 0.92 |
PigMix_5 | 70.0 | 83.0 | 0.84 |
PigMix_6 | 76.67 | 61.0 | 1.26 |
PigMix_7 | 71.67 | 61.0 | 1.17 |
PigMix_8 | 43.33 | 47.67 | 0.91 |
PigMix_9 | 184.0 | 209.33 | 0.88 |
PigMix_10 | 268.67 | 283.0 | 0.95 |
PigMix_11 | 145.33 | 168.67 | 0.86 |
PigMix_12 | 55.33 | 95.33 | 0.58 |
Total | 1352.33 | 1357 | 1.00 |
Weighted avg | | | 1.04 |
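The difference between the two measures can be checked directly from the run times. The sketch below is illustrative only: it recomputes both numbers from the Pig and Java run times in the October 18, 2009 table above, and the function names are hypothetical.

```python
# (Pig seconds, Java map reduce seconds) for PigMix_1..PigMix_12,
# copied from the October 18, 2009 table.
RUNS = [
    (135.0, 133.0), (46.67, 39.33), (184.0, 98.0), (71.67, 77.67),
    (70.0, 83.0), (76.67, 61.0), (71.67, 61.0), (43.33, 47.67),
    (184.0, 209.33), (268.67, 283.0), (145.33, 168.67), (55.33, 95.33),
]


def total_multiplier(runs):
    """Total Pig time divided by total Java time (long queries dominate)."""
    return sum(pig for pig, _ in runs) / sum(java for _, java in runs)


def weighted_average(runs):
    """Mean of the per-query multipliers, so every query counts equally."""
    return sum(pig / java for pig, java in runs) / len(runs)


print(f"total multiplier: {total_multiplier(RUNS):.2f}")
print(f"weighted average: {weighted_average(RUNS):.2f}")
```

Rounded to two decimals, this gives 1.00 and 1.04, matching the Total and Weighted avg rows of that table.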
Run date: January 4, 2010, run against 0.6 branch as of that day
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 138.33 | 112.67 | 1.23 |
PigMix_2 | 66.33 | 39.33 | 1.69 |
PigMix_3 | 199 | 83.33 | 2.39 |
PigMix_4 | 59 | 60.67 | 0.97 |
PigMix_5 | 80.33 | 113.67 | 0.71 |
PigMix_6 | 65 | 77.67 | 0.84 |
PigMix_7 | 63.33 | 61 | 1.04 |
PigMix_8 | 40 | 47.67 | 0.84 |
PigMix_9 | 214 | 215.67 | 0.99 |
PigMix_10 | 284.67 | 284.33 | 1.00 |
PigMix_11 | 141.33 | 151.33 | 0.93 |
PigMix_12 | 55.67 | 115 | 0.48 |
Total | 1407 | 1362.33 | 1.03 |
Weighted Avg | | | 1.09 |
PigMix2
Run date: May 29, 2010, run against top of trunk as of that day.
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 122.33 | 117 | 1.05 |
PigMix_2 | 50.33 | 42.67 | 1.18 |
PigMix_3 | 189 | 100.33 | 1.88 |
PigMix_4 | 75.67 | 61 | 1.24 |
PigMix_5 | 64 | 138.67 | 0.46 |
PigMix_6 | 65.67 | 69.33 | 0.95 |
PigMix_7 | 88.33 | 84.33 | 1.05 |
PigMix_8 | 39 | 47.67 | 0.82 |
PigMix_9 | 274.33 | 215.33 | 1.27 |
PigMix_10 | 333.33 | 311.33 | 1.07 |
PigMix_11 | 151.33 | 157 | 0.96 |
PigMix_12 | 70.67 | 97.67 | 0.72 |
PigMix_13 | 80 | 33 | 2.42 |
PigMix_14 | 69 | 86.33 | 0.80 |
PigMix_15 | 80.33 | 69.33 | 1.16 |
PigMix_16 | 82.33 | 69.33 | 1.19 |
PigMix_17 | 286 | 229.33 | 1.25 |
Total | 2121.67 | 1929.67 | 1.10 |
Weighted Avg | | | 1.15 |
Run date: Jun 11, 2011, run against top of trunk as of that day.
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 130 | 139 | 0.94 |
PigMix_2 | 66 | 48.67 | 1.36 |
PigMix_3 | 138 | 107.33 | 1.29 |
PigMix_4 | 106 | 78.33 | 1.35 |
PigMix_5 | 135.67 | 114 | 1.19 |
PigMix_6 | 103.67 | 74.33 | 1.39 |
PigMix_7 | 77.67 | 77.33 | 1.00 |
PigMix_8 | 56.33 | 57 | 0.99 |
PigMix_9 | 384.67 | 280.33 | 1.37 |
PigMix_10 | 380 | 354.67 | 1.07 |
PigMix_11 | 164 | 141 | 1.16 |
PigMix_12 | 109.67 | 187.33 | 0.59 |
PigMix_13 | 78 | 44.33 | 1.76 |
PigMix_14 | 105.33 | 111.67 | 0.94 |
PigMix_15 | 89.67 | 87 | 1.03 |
PigMix_16 | 87.67 | 75.33 | 1.16 |
PigMix_17 | 171.33 | 152.33 | 1.12 |
Total | 2383.67 | 2130 | 1.12 |
Weighted Avg | | | 1.16 |
Pig 0.9.2
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 146 | 147 | 0.993197278911565 |
PigMix_2 | 73 | 61 | 1.19672131147541 |
PigMix_3 | 134 | 158 | 0.848101265822785 |
PigMix_4 | 91 | 87 | 1.04597701149425 |
PigMix_5 | 81 | 153 | 0.529411764705882 |
PigMix_6 | 91 | 81 | 1.12345679012346 |
PigMix_7 | 71 | 86 | 0.825581395348837 |
PigMix_8 | 56 | 61 | 0.918032786885246 |
PigMix_9 | 302 | 192 | 1.57291666666667 |
PigMix_10 | 312 | 226 | 1.38053097345133 |
PigMix_11 | 207 | 222 | 0.932432432432432 |
PigMix_12 | 96 | 163 | 0.588957055214724 |
PigMix_13 | 76 | 127 | 0.598425196850394 |
PigMix_14 | 94 | 157 | 0.598726114649682 |
PigMix_15 | 86 | 92 | 0.934782608695652 |
PigMix_16 | 80 | 82 | 0.975609756097561 |
PigMix_17 | 196 | 176 | 1.11363636363636 |
Total | 2192 | 2271 | 0.965213562 |
Weighted Avg | | | 0.951558634 |
Pig 0.10.1
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 147 | 146 | 1.00684931506849 |
PigMix_2 | 74 | 62 | 1.19354838709677 |
PigMix_3 | 140 | 158 | 0.886075949367089 |
PigMix_4 | 87 | 86 | 1.01162790697674 |
PigMix_5 | 81 | 153 | 0.529411764705882 |
PigMix_6 | 92 | 262 | 0.351145038167939 |
PigMix_7 | 76 | 86 | 0.883720930232558 |
PigMix_8 | 62 | 61 | 1.01639344262295 |
PigMix_9 | 303 | 187 | 1.62032085561497 |
PigMix_10 | 303 | 232 | 1.30603448275862 |
PigMix_11 | 188 | 218 | 0.862385321100917 |
PigMix_12 | 101 | 157 | 0.643312101910828 |
PigMix_13 | 82 | 132 | 0.621212121212121 |
PigMix_14 | 99 | 158 | 0.626582278481013 |
PigMix_15 | 82 | 91 | 0.901098901098901 |
PigMix_16 | 82 | 82 | 1 |
PigMix_17 | 206 | 177 | 1.1638418079096 |
Total | 2205 | 2448 | 0.900735294117647 |
Weighted Avg | | | 0.919032977 |
Pig 0.11.1
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 163 | 141 | 1.15602836879433 |
PigMix_2 | 66 | 61 | 1.08196721311475 |
PigMix_3 | 141 | 158 | 0.892405063291139 |
PigMix_4 | 87 | 86 | 1.01162790697674 |
PigMix_5 | 82 | 158 | 0.518987341772152 |
PigMix_6 | 92 | 81 | 1.1358024691358 |
PigMix_7 | 82 | 87 | 0.942528735632184 |
PigMix_8 | 63 | 62 | 1.01612903225806 |
PigMix_9 | 320 | 207 | 1.54589371980676 |
PigMix_10 | 311 | 226 | 1.37610619469027 |
PigMix_11 | 184 | 218 | 0.844036697247706 |
PigMix_12 | 97 | 158 | 0.613924050632911 |
PigMix_13 | 78 | 127 | 0.614173228346457 |
PigMix_14 | 101 | 158 | 0.639240506329114 |
PigMix_15 | 87 | 91 | 0.956043956043956 |
PigMix_16 | 82 | 87 | 0.942528735632184 |
PigMix_17 | 203 | 176 | 1.15340909090909 |
Total | 2239 | 2282 | 0.981156879929886 |
Weighted Avg | | | 0.967107783 |
Pig 0.12 (4/4/2013)
Test | Pig run time | Java run time | Multiplier |
---|---|---|---|
PigMix_1 | 168 | 142 | 1.1830985915493 |
PigMix_2 | 71 | 62 | 1.14516129032258 |
PigMix_3 | 141 | 158 | 0.892405063291139 |
PigMix_4 | 93 | 87 | 1.06896551724138 |
PigMix_5 | 87 | 158 | 0.550632911392405 |
PigMix_6 | 93 | 81 | 1.14814814814815 |
PigMix_7 | 77 | 87 | 0.885057471264368 |
PigMix_8 | 62 | 57 | 1.08771929824561 |
PigMix_9 | 310 | 192 | 1.61458333333333 |
PigMix_10 | 311 | 221 | 1.40723981900452 |
PigMix_11 | 190 | 217 | 0.875576036866359 |
PigMix_12 | 102 | 158 | 0.645569620253165 |
PigMix_13 | 77 | 133 | 0.578947368421053 |
PigMix_14 | 101 | 343 | 0.294460641399417 |
PigMix_15 | 87 | 86 | 1.01162790697674 |
PigMix_16 | 82 | 82 | 1 |
PigMix_17 | 207 | 177 | 1.16949152542373 |
Total | 2259 | 2441 | 0.925440393281442 |
Weighted Avg | | | 0.974040267 |
Features Tested
Based on a sample of user queries, PigMix includes tests for the following features.
...
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)action as action, (map[])page_info as page_info,
    flatten((bag{tuple(map[])})page_links) as page_links;
C = foreach B generate user,
    (action == 1 ? page_info#'a' : page_links#'b') as header;
D = group C by user parallel 40;
E = foreach D generate group, COUNT(C) as cnt;
store E into 'L1out';
|
Script L2
This script tests using a join small enough to do in fragment and replicate (feature 7).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, estimated_revenue;
alpha = load '/user/pig/tests/data/pigmix/power_users' using PigStorage('\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = join B by user, beta by name using 'replicated' parallel 40;
store C into 'L2out';
|
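As a rough illustration of the idea behind a fragment-replicate join: the small input is held in memory, and each fragment of the large input is streamed past it in a single pass. The sketch below is plain Python, not Pig's implementation; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def replicated_join(large_fragment, small_table, large_key, small_key):
    """Join one fragment of the large input against an in-memory small table."""
    # Build a hash index over the small (replicated) input once.
    index = defaultdict(list)
    for row in small_table:
        index[row[small_key]].append(row)
    # Stream the large fragment and emit one output row per match.
    joined = []
    for row in large_fragment:
        for match in index.get(row[large_key], []):
            joined.append(row + match)
    return joined


# Hypothetical sample rows: (user, estimated_revenue) and (name,)
page_views = [("alice", 1.5), ("bob", 2.0), ("alice", 0.5)]
power_users = [("alice",)]
print(replicated_join(page_views, power_users, 0, 0))
```

Only rows whose user appears in the small table survive, which is why this strategy only pays off when the second input is small enough to fit in memory.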
Script L3
This script tests a join too large for fragment and replicate. It also contains a join followed by a group by on the same key, something that pig could potentially optimize by not regrouping.
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B by user parallel 40;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'L3out';
|
Script L4
This script covers foreach generate with a nested distinct (feature 10).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action;
C = group B by user parallel 40;
D = foreach C {
    aleph = B.action;
    beth = distinct aleph;
    generate group, COUNT(beth);
}
store D into 'L4out';
|
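For readers less familiar with nested plans, the result of script L4 is the number of distinct actions per user. A minimal Python sketch of that computation follows, assuming in-memory `(user, action)` rows; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def distinct_actions_per_user(rows):
    """Group (user, action) rows by user and count the distinct actions."""
    actions = defaultdict(set)
    for user, action in rows:
        actions[user].add(action)  # the set plays the role of the nested distinct
    return {user: len(seen) for user, seen in actions.items()}


rows = [("alice", 1), ("alice", 1), ("alice", 2), ("bob", 3)]
print(distinct_actions_per_user(rows))
```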
Script L5
This script does an anti-join. This is useful because it is a use of cogroup that is not a regular join (feature 9).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = cogroup beta by name, B by user parallel 40;
D = filter C by COUNT(beta) == 0;
E = foreach D generate group;
store E into 'L5out';
|
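The anti-join in script L5 keeps the keys that occur in the page views but have no matching row in the users data. The following Python sketch mirrors the cogroup-then-filter shape of the script under simplified assumptions (keys only, everything in memory); the function names and sample data are hypothetical.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of keys by key, returning {key: (left_bag, right_bag)}."""
    groups = defaultdict(lambda: ([], []))
    for key in left:
        groups[key][0].append(key)
    for key in right:
        groups[key][1].append(key)
    return groups


def anti_join(user_names, page_view_users):
    """Keys seen in page views whose bag of matching user rows is empty."""
    grouped = cogroup(user_names, page_view_users)
    return sorted(key for key, (user_bag, _) in grouped.items() if not user_bag)


print(anti_join(["alice", "bob"], ["alice", "carol", "carol"]))
```

A regular join would drop "carol" because it has no match; the anti-join reports exactly those unmatched keys.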
Script L6
This script covers the case where the group by key is a significant percentage of the row (feature 12).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action, (int)timespent as timespent, query_term, ip_addr, timestamp;
C = group B by (user, query_term, ip_addr, timestamp) parallel 40;
D = foreach C generate flatten(group), SUM(B.timespent);
store D into 'L6out';
|
Script L7
This script covers having a nested plan with splits (feature 11).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, timestamp;
C = group B by user parallel 40;
D = foreach C {
    morning = filter B by timestamp < 43200;
    afternoon = filter B by timestamp >= 43200;
    generate group, COUNT(morning), COUNT(afternoon);
}
store D into 'L7out';
|
Script L8
This script covers group all (feature 13).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
C = group B all;
D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into 'L8out';
|
Script L9
This script covers order by of a single value (feature 15).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = order A by query_term parallel 40;
store B into 'L9out';
|
Script L10
This script covers order by of multiple values (feature 15).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent:int, query_term, ip_addr, timestamp,
        estimated_revenue:double, page_info, page_links);
B = order A by query_term, estimated_revenue desc, timespent parallel 40;
store B into 'L10out';
|
Script L11
This script covers distinct and union and reading from a wide row but using only one field (features: 1, 14).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user;
C = distinct B parallel 40;
alpha = load '/user/pig/tests/data/pigmix/widerow' using PigStorage('\u0001');
beta = foreach alpha generate $0 as name;
gamma = distinct beta parallel 40;
D = union C, gamma;
E = distinct D parallel 40;
store E into 'L11out';
|
Script L12
This script covers multi-store queries (feature 16).
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action, (int)timespent as timespent, query_term,
    (double)estimated_revenue as estimated_revenue;
split B into C if user is not null, alpha if user is null;
split C into D if query_term is not null, aleph if query_term is null;
E = group D by user parallel 40;
F = foreach E generate group, MAX(D.estimated_revenue);
store F into 'highest_value_page_per_user';
beta = group alpha by query_term parallel 40;
gamma = foreach beta generate group, SUM(alpha.timespent);
store gamma into 'total_timespent_per_term';
beth = group aleph by action parallel 40;
gimel = foreach beth generate group, COUNT(aleph);
store gimel into 'queries_per_action';
|
...
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, estimated_revenue;
alpha = load '/user/pig/tests/data/pigmix/power_users_samples' using PigStorage('\\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name, phone;
C = join B by user left outer, beta by name parallel 40;
store C into 'L13out';
|
Script L14 (PigMix2 only)
...
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views_sorted' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, estimated_revenue;
alpha = load '/user/pig/tests/data/pigmix/users_sorted' using PigStorage('\\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = join B by user, beta by name using 'merge';
store C into 'L14out';
|
Script L15 (PigMix2 only)
...
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, action, estimated_revenue, timespent;
C = group B by user parallel 40;
D = foreach C {
    beth = distinct B.action;
    rev = distinct B.estimated_revenue;
    ts = distinct B.timespent;
    generate group, COUNT(beth), SUM(rev), (int)AVG(ts);
}
store D into 'L15out';
|
Script L16 (PigMix2 only)
...
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, estimated_revenue;
C = group B by user parallel 40;
D = foreach C {
    E = order B by estimated_revenue;
    F = E.estimated_revenue;
    generate group, SUM(F);
}
store D into 'L16out';
|
Script L17 (PigMix2 only)
...
Code Block |
---|
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/widegroupbydata' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links,
        user_1, action_1, timespent_1, query_term_1, ip_addr_1, timestamp_1,
        estimated_revenue_1, page_info_1, page_links_1,
        user_2, action_2, timespent_2, query_term_2, ip_addr_2, timestamp_2,
        estimated_revenue_2, page_info_2, page_links_2);
B = group A by (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue,
    user_1, action_1, timespent_1, query_term_1, ip_addr_1, timestamp_1, estimated_revenue_1,
    user_2, action_2, timespent_2, query_term_2, ip_addr_2, timestamp_2, estimated_revenue_2) parallel 40;
C = foreach B generate SUM(A.timespent), SUM(A.timespent_1), SUM(A.timespent_2),
    AVG(A.estimated_revenue), AVG(A.estimated_revenue_1), AVG(A.estimated_revenue_2);
store C into 'L17out';
|
Features not yet covered: 5 (bzip data)
Data Generation
If you want to know the details of data generation, see https://issues.apache.org/jira/browse/PIG-200.
See DataGeneratorHadoop for information on how to run data generator in hadoop mode.