...

Feature | JIRA | Comments
Move GruntParser to Antlr | PIG-2597 |
LIMIT inside nested foreach should have combiner optimization | PIG-4536 |
Optimize the case of Order by + Limit in nested foreach | PIG-4449 |
Support for Tez speculative execution | PIG-4411 |
Jython algebraic UDFs | PIG-1804 |
Local scope set statement | PIG-4424 |
Error handling | |
Pig on Spark | PIG-4059 |

Proposed Future Work

Work that the Pig project proposes to do in the future is further broken into three categories:

...

Estimated Development Effort: medium

Agreed Work, Unknown Approach

Mavenization

Switch the Pig build system from Ant to Maven. We would like to modularize Pig into modules such as pig-core, mr, and tez. The build system for the e2e tests also needs to be switched.

Category: Build

Dependency:

References: PIG-1804

Estimated Development Effort: medium

Summary query

For some file formats, such as ORC, statistics are built into the data file, so a data summary such as min/max/sum/avg can be obtained quickly without scanning the data.
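
As an illustration only, the kind of summary that could be answered from file statistics might be written as below (the file name and the amount field are made up; today this still scans the data):

    -- load an ORC file with Pig's built-in OrcStorage (schema comes from the file)
    sales = LOAD 'sales.orc' USING OrcStorage();

    -- a whole-relation summary; with ORC statistics this could be answered
    -- from file metadata instead of a full scan
    g = GROUP sales ALL;
    summary = FOREACH g GENERATE MIN(sales.amount), MAX(sales.amount),
                                 SUM(sales.amount), AVG(sales.amount);
    DUMP summary;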

Category: Performance

Dependency:

References: 

Estimated Development Effort: small

Replicated cross

Pig should be able to do a map-side cross. Currently, users can emulate it with a replicated join, but it would be better to add native support.
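
A minimal sketch of that emulation, assuming two arbitrary relations A and B: tag every row of both with the same constant key and do a replicated join on it.

    -- add the same constant key to every row of both inputs
    A1 = FOREACH A GENERATE *, 1 AS k;
    B1 = FOREACH B GENERATE *, 1 AS k;

    -- a replicated join on the constant key behaves like a map-side cross;
    -- B1 must be small enough to fit in memory
    C = JOIN A1 BY k, B1 BY k USING 'replicated';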

Category: Performance

Dependency:

References: 

Estimated Development Effort: small

PMML support

Pig should be able to consume a PMML model and use it to score input data.
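
One possible shape for this, loosely modeled on the Surus project referenced below, is a scoring UDF wired in with DEFINE; the jar name, class name, and fields below are hypothetical and only illustrate the intended usage:

    -- hypothetical UDF that loads a PMML model and scores each input tuple
    REGISTER surus.jar;
    DEFINE Score org.surus.pig.ScorePMML('model.pmml');

    data   = LOAD 'input' AS (f1:double, f2:double, f3:double);
    scored = FOREACH data GENERATE f1, f2, f3, Score(*) AS prediction;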

Category: New functionality

Dependency:

References: https://github.com/Netflix/Surus

Estimated Development Effort: small

Performance benchmark

Add TPC-DI and TPC-DS queries to Pig as a performance benchmark.

Category: Performance

Dependency:

References:

Estimated Development Effort: medium

Make Use of HBase

Pig can do bulk reads from and writes to HBase, but it cannot use HBase in operators such as a hash join. We need operators that make use of HBase where it makes sense. We may also need to provide support so that UDFs can efficiently access HBase themselves.
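
For reference, the bulk access that exists today goes through HBaseStorage; a minimal sketch with a made-up table and column family:

    -- bulk read: row key plus two columns from an HBase table
    users = LOAD 'hbase://users'
            USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                'info:name info:age', '-loadKey true')
            AS (id:chararray, name:chararray, age:int);

    -- bulk write: the first field of the relation becomes the row key
    STORE users INTO 'hbase://users_copy'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age');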

Category: Integration, Performance

Dependency:

References:

Estimated Development Effort: medium

Runtime Optimizations

Currently Pig does all of its optimizations up front, before beginning any execution. In a multi-job pipeline, information learned in the initial jobs could be used in later jobs to make optimization decisions. For example, a join later in the pipeline may turn out to have inputs of a size such that fragment-replicate makes sense as a join strategy. Being able to rewrite the plan midway through the execution would make it possible to optimize for these kinds of situations.
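
As an illustration of the kind of decision involved (relation and field names are made up): only after the filter below has run is it known whether its output is small enough for a fragment-replicate join, yet the strategy is fixed, or hard-coded by the user, before any job starts.

    logs  = LOAD 'logs'  AS (user:chararray, url:chararray);
    users = LOAD 'users' AS (user:chararray, country:chararray);

    -- the size of this output is only known at runtime
    de_users = FILTER users BY country == 'DE';

    -- today the user must hard-code 'replicated'; an adaptive optimizer could
    -- switch to it once the actual size of de_users is known
    joined = JOIN logs BY user, de_users BY user USING 'replicated';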

Category: Performance

Dependency:

References:

Estimated Development Effort: large

Support Append in Pig

Appending to HDFS files is supported in Hadoop 0.21. None of Pig's standard load functions support append. We need to decide if append is added to the language itself (is there an APPEND modifier to the STORE command?) or if each store function needs to decide how to indicate or allow appending on its own. PigStorage should support append as users are likely to want it.
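
Purely as a sketch of the language-level option raised above, and not an agreed design, a hypothetical APPEND modifier might look like:

    -- hypothetical syntax; no APPEND modifier exists in Pig Latin today
    new_events = LOAD 'new_events' AS (ts:long, msg:chararray);
    STORE new_events INTO '/data/events' USING PigStorage(',') APPEND;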

Category: New Functionality

Dependency: Hadoop 0.21 or later

References:

Estimated Development Effort: small

IDE for Pig

PigPen was developed and released for Pig with 0.2. However, it has not been kept up to date. Users have consistently expressed interest in an IDE for Pig. Ideally this would also include tools for writing UDFs, not just Pig Latin scripts. One option is to bring PigPen up to date and maintain it. Another option is to build a browser-based IDE. Some have suggested that this would be better than an Eclipse-based one.

Category: New Functionality

Dependency:

References:

Estimated Development Effort: large and ongoing

Vectorization

Pig should process operators in a vectorized (batch) manner. One possibility is to use the Hive vectorization library.

Category: Performance

Dependency:

References:

Estimated Development Effort: large

Staged replicated join

Currently, for a replicated join, the right table must fit in memory. We can borrow the idea of Hive's staged map join: spill the right table to disk if it does not fit, and process the overflow during map cleanup.
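
For context, a minimal sketch of the operation whose memory constraint this would relax (relation names are made up):

    big   = LOAD 'big_table'   AS (k:chararray, v:chararray);
    small = LOAD 'small_table' AS (k:chararray, w:chararray);

    -- the right-hand input is loaded fully into memory on each map task today;
    -- staging would spill the part that does not fit to disk
    J = JOIN big BY k, small BY k USING 'replicated';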

Category: Performance

Dependency:

References:

Estimated Development Effort: medium

Experimental

Add List Datatype

...