...
Feature | JIRA | Comments |
---|---|---|
Move GruntParser to Antlr | PIG-2597 | |
LIMIT inside nested foreach should have combiner optimization | PIG-4536 | |
Optimize the case of Order by + Limit in nested foreach | PIG-4449 | |
Support for Tez speculative execution | PIG-4411 | |
Jython algebraic UDFs | PIG-1804 | |
local scope set statement | PIG-4424 | |
Error handling | | |
Pig on Spark | PIG-4059 | |
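As an illustration of the nested-foreach rows above (PIG-4536, PIG-4449), a top-N query of the following shape is the target of those optimizations. This is a sketch only; the relation and field names are hypothetical:

```pig
-- hypothetical data: one row per user, with a country and a score
users = LOAD 'users' AS (name:chararray, country:chararray, score:int);
by_country = GROUP users BY country;
-- ORDER followed by LIMIT inside a nested foreach: the pattern PIG-4449 targets
top10 = FOREACH by_country {
    sorted = ORDER users BY score DESC;
    top    = LIMIT sorted 10;
    GENERATE group AS country, top;
};
```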
Proposed Future Work
Work that the Pig project proposes to do in the future is further broken into three categories:
...
Estimated Development Effort: medium
Mavenization
Switch the Pig build system from Ant to Maven. We would like to modularize Pig into modules such as pig-core, mr, and tez. The build system for the e2e tests also needs to be switched.
Category: Build
Dependency:
References: PIG-1804
Estimated Development Effort: medium
Summary query
Some file formats, such as ORC, have statistics built into the data file, so data summaries such as min/max/sum/avg can be obtained quickly without scanning the data.
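For example, a whole-relation summary of the following shape could in principle be answered from ORC column statistics alone, without a full scan. A sketch only: the file path and the `price` column are hypothetical, and OrcStorage is Pig's built-in ORC loader:

```pig
a = LOAD 'data.orc' USING OrcStorage();
g = GROUP a ALL;
-- min/max/sum/count over the whole input: answerable from file-level ORC stats
summary = FOREACH g GENERATE MIN(a.price), MAX(a.price), SUM(a.price), COUNT(a);
```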
Category: Performance
Dependency:
References:
Estimated Development Effort: small
Replicated cross
Pig should be able to do a map-side cross. Currently, users can emulate it with a replicated join, but it would be better to add native support.
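The current emulation tags both relations with the same constant key and joins on it with a replicated join. A minimal sketch, with hypothetical input paths and schemas:

```pig
A = LOAD 'big'   AS (x:int);
B = LOAD 'small' AS (y:int);
-- tag every row of each relation with the same constant key...
A1 = FOREACH A GENERATE *, 1 AS k;
B1 = FOREACH B GENERATE *, 1 AS k;
-- ...then a replicated join on that key behaves like a map-side CROSS
C = JOIN A1 BY k, B1 BY k USING 'replicated';
```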
Category: Performance
Dependency:
References:
Estimated Development Effort: small
PMML support
Pig should be able to consume a PMML model and score input data with it.
Category: New functionality
Dependency:
References: https://github.com/Netflix/Surus
Estimated Development Effort: small
Performance benchmark
Add TPC-DI and TPC-DS queries to Pig.
Category: Performance
Dependency:
References:
Estimated Development Effort: medium
Agreed Work, Unknown Approach
Make Use of HBase
Pig can do bulk reads and writes from HBase. But it cannot use HBase in operators like a hash join. We need operators that make use of HBase where it makes sense. Also, we may need to provide support so that UDFs can efficiently access HBase themselves.
Category: Integration, Performance
Dependency:
References:
Estimated Development Effort: medium
Runtime Optimizations
Currently Pig does all of its optimizations up front, before beginning any execution. In a multi-job pipeline, information learned in the initial jobs could be used to make optimization decisions in later jobs. For example, a join late in the pipeline may turn out to have an input small enough that fragment-replicate makes sense as a join strategy. Being able to rewrite the plan midway through execution would make it possible to optimize for these situations.
Category: Performance
Dependency:
References:
Estimated Development Effort: large
Support Append in Pig
Appending to HDFS files is supported in Hadoop 0.21. None of Pig's standard load functions support append. We need to decide whether append should be added to the language itself (is there an APPEND modifier to the STORE command?) or whether each store function should decide how to indicate or allow appending on its own. !PigStorage should support append, as users are likely to want it.
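If the language-level route were taken, usage might look like the sketch below. The APPEND modifier is purely hypothetical, one of the options raised above, and is not valid Pig Latin today:

```pig
results = LOAD 'new_events' AS (id:long, msg:chararray);
-- hypothetical APPEND modifier on STORE: one possible design, not current syntax
STORE results INTO 'events_out' USING PigStorage() APPEND;
```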
Category: New Functionality
Dependency: Hadoop 0.21 or later
References:
Estimated Development Effort: small
IDE for Pig
!PigPen was developed and released for Pig with 0.2. However, it has not been kept up to date. Users have consistently expressed interest in an IDE for Pig. Ideally this would also include tools for writing UDFs, not just Pig Latin scripts. One option is to bring !PigPen up to date and maintain it. Another option is to build a browser-based IDE; some have suggested that this would be better than an Eclipse-based one.
Estimated Development Effort: large and ongoing
Vectorization
Pig should process operators in a batched manner. One possibility is to use the Hive vectorization library.
Category: Performance
Dependency:
References:
Estimated Development Effort: large
Staged replicated join
Currently, for a replicated join the right table must fit in memory. We can borrow the idea of Hive's staged map join: spill the right table to disk if it does not fit, and process the overflow in map cleanup.
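For reference, today's replicated join looks like the sketch below (hypothetical paths and schemas). The second relation is loaded fully into memory on each map task; the staged variant would spill it to disk instead of failing when it does not fit:

```pig
big   = LOAD 'big'   AS (k:int, v:chararray);
small = LOAD 'small' AS (k:int, w:chararray);
-- 'small' must currently fit in memory on every map task
j = JOIN big BY k, small BY k USING 'replicated';
```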
Category: Performance
Dependency:
References:
Estimated Development Effort: medium
Experimental
Add List Datatype
...