Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

For more information on Pig and Spark projects, see hereReferences.

Motivation

The main motivation for enabling Pig to run on Spark is to:

...

 

Test

Comment

Owner

ETA

Status

TestJoin

Joining empty tuples fails

 Kelly

1 day

 

TestPigContext, TestGrunt

PIG-4295. Deserializing UDF classpath config fails in Spark because it’s thread local.

Kelly

3 days

TestProjectRange

PIG-4297 Range expressions fail with groupby

Kelly

2 days

 

TestAssert

FilterConverter issue?

 

1 day

 

TestLocationInPhysicalPlan

  

1 day

 

 

Other Tests

ETA: 1 week

Several tests are failing due to either:

...

Investigate

And fix if needed

 

Feature

Comment

Owner

ETA

Status

Streaming

Test are passing, but we need confirmation that this feature works.

 

1 day

 

 

Spark Unit Tests

Few Spark engine specific unit tests have been written so far (for features that have Spark specific implementations). Following is a partial list of what we need to add. Need to update this list as we add more Spark specific code. We should also add tests for POConverter implementations.

 

Test

Comment

Owner

ETA

Status

TestSparkLauncher

    

TestSparkPigStats

    

TestSecondarySortSpark

 

Kelly

 

 

Enhance Test Infrastructure

...

Performance Optimizations

 

Feature

Comment

Owner

ETA

Status

Split / MultiQuery using RDD.cache()

    

In Group + Foreach aggregations, use aggregateByKey or reduceByKey for much better performance

For example: COUNT or DISTINCT aggregation inside nested foreach is handled by Pig code. We should use Spark to do in more efficiently

   

Compute optimal Shuffle Parallelism

Currently we let Spark pick the default

   

Combiner support for “Group+ForEach”

    

Multiple GROUP BYs on the same data set can avoid multiple shuffles.

See MultiQueryPackager

   

Switch to Kryo for Spark data serialization

Are all Pig serializable classes compatible with Kryo ?

   

FR Join

    

Merge Join (including sparse merge join)

    

Skew Join

    

Merge CoGroup

    

Collected CoGroup

    

 

Note that there is ongoing work in Spark SQL to support specialized joins: See SPARK-2211. As an example, support for merge join is in Spark SQL in Spark 1.4 (SPARK-2213 and SPARK-7165). This implies that Spark community will not be adding support for these joins in Spark Core library.

...