...
For more information on Pig and Spark projects, see hereReferences.
Motivation
The main motivation for enabling Pig to run on Spark is to:
...
Test | Comment | Owner | ETA | Status |
TestJoin | Joining empty tuples fails | Kelly | 1 day | ✔ |
TestPigContext, TestGrunt | PIG-4295. Deserializing UDF classpath config fails in Spark because it’s thread local. | Kelly | 3 days | ✔ |
TestProjectRange | PIG-4297 Range expressions fail with groupby | Kelly | 2 days | ✔ |
TestAssert | FilterConverter issue? | 1 day | ||
TestLocationInPhysicalPlan | 1 day |
Other Tests
ETA: 1 week
Several tests are failing due to either:
...
Investigate
And fix if needed
Feature | Comment | Owner | ETA | Status |
Streaming | Test are passing, but we need confirmation that this feature works. | 1 day |
Spark Unit Tests
Few Spark engine specific unit tests have been written so far (for features that have Spark specific implementations). Following is a partial list of what we need to add. Need to update this list as we add more Spark specific code. We should also add tests for POConverter implementations.
Test | Comment | Owner | ETA | Status |
TestSparkLauncher | ||||
TestSparkPigStats | ||||
TestSecondarySortSpark | Kelly | ✔ |
Enhance Test Infrastructure
...
Performance Optimizations
Feature | Comment | Owner | ETA | Status |
Split / MultiQuery using RDD.cache() | ||||
In Group + Foreach aggregations, use aggregateByKey or reduceByKey for much better performance | For example: COUNT or DISTINCT aggregation inside nested foreach is handled by Pig code. We should use Spark to do in more efficiently | |||
Compute optimal Shuffle Parallelism | Currently we let Spark pick the default | |||
Combiner support for “Group+ForEach” | ||||
Multiple GROUP BYs on the same data set can avoid multiple shuffles. | See MultiQueryPackager | |||
Switch to Kryo for Spark data serialization | Are all Pig serializable classes compatible with Kryo ? | |||
FR Join | ||||
Merge Join (including sparse merge join) | ||||
Skew Join | ||||
Merge CoGroup | ||||
Collected CoGroup |
Note that there is ongoing work in Spark SQL to support specialized joins: See SPARK-2211. As an example, support for merge join is in Spark SQL in Spark 1.4 (SPARK-2213 and SPARK-7165). This implies that Spark community will not be adding support for these joins in Spark Core library.
...