Page History

...

For more information on Pig and Spark projects, see hereReferences.

Motivation

The main motivation for enabling Pig to run on Spark is to:

...

Test	Comment	Owner	ETA	Status
TestJoin	Joining empty tuples fails	Kelly	1 day	✔
TestPigContext, TestGrunt	PIG-4295. Deserializing UDF classpath config fails in Spark because it’s thread local.	Kelly	3 days	✔
TestProjectRange	PIG-4297 Range expressions fail with groupby	Kelly	2 days	✔
TestAssert	FilterConverter issue?		1 day
TestLocationInPhysicalPlan			1 day

Other Tests

ETA: 1 week

Several tests are failing due to either:

...

Investigate

And fix if needed

Feature	Comment	Owner	ETA	Status
Streaming	Test are passing, but we need confirmation that this feature works.		1 day

Spark Unit Tests

Few Spark engine specific unit tests have been written so far (for features that have Spark specific implementations). Following is a partial list of what we need to add. Need to update this list as we add more Spark specific code. We should also add tests for POConverter implementations.

Test	Comment	Owner	ETA	Status
TestSparkLauncher
TestSparkPigStats
TestSecondarySortSpark		Kelly		✔

Enhance Test Infrastructure

...

Performance Optimizations

Feature	Comment	Owner	ETA	Status
Split / MultiQuery using RDD.cache()
In Group + Foreach aggregations, use aggregateByKey or reduceByKey for much better performance	For example: COUNT or DISTINCT aggregation inside nested foreach is handled by Pig code. We should use Spark to do in more efficiently
Compute optimal Shuffle Parallelism	Currently we let Spark pick the default
Combiner support for “Group+ForEach”
Multiple GROUP BYs on the same data set can avoid multiple shuffles.	See MultiQueryPackager
Switch to Kryo for Spark data serialization	Are all Pig serializable classes compatible with Kryo ?
FR Join
Merge Join (including sparse merge join)
Skew Join
Merge CoGroup
Collected CoGroup

Note that there is ongoing work in Spark SQL to support specialized joins: See SPARK-2211. As an example, support for merge join is in Spark SQL in Spark 1.4 (SPARK-2213 and SPARK-7165). This implies that Spark community will not be adding support for these joins in Spark Core library.

...

Child pages

Versions Compared

Old Version 2

New Version Current

Key