Using Metadata in Pig
One of the guiding philosophies of Pig is that "Pigs eat anything". This is true and will remain true, but it does not preclude Pig from selecting better food when it is available. In this vein, Pig should make use of metadata when it is available, and continue to work well when it is not.
This wiki page assumes the functionality of the pipeline rework (http://issues.apache.org/jira/browse/PIG-157). That work has not yet been committed to the trunk, but should be some time in July 2008.
Definition of Metadata
For the purpose of this discussion, metadata is divided into two categories: global and file specific. Global metadata records information about the system as a whole. File specific metadata records information about a particular file, or possibly a set of files in a directory, and can include schema information, histograms, etc.
Pig Interface to File Specific Metadata
Pig should support four options with regard to reading file specific metadata:
1. No file specific metadata available. Pig uses the file as input with no knowledge of its content. All data is assumed to be ByteArrays.
2. User provides schema in the script. For example, A = load 'myfile' as (a: chararray, b: int);.
3. Self describing data. Data may be in a format that describes the schema, such as JSON. Users may also have other proprietary ways to store information about the data in a file either in the file itself or in an associated file. Changes to the LoadFunc interface made as part of the pipeline rework support this for data type and column layout only. It will need to be expanded to support other types of information about the file.
4. Input from a data catalog. Pig needs to be able to query an external data catalog to acquire information about a file. All the same information available in option 3 should be available via this interface. This interface does not yet exist and needs to be designed.
Pig should support options 3 and 4 for writing file specific metadata as well.
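The four read options can be pictured as a lookup that falls back from one source to the next. The following is a minimal sketch only: the class and method names are hypothetical, schemas are represented as plain strings for brevity, and the order of preference (script, then self describing data, then catalog) is an assumption, since this page does not define which source wins when more than one is available.

```java
import java.util.Optional;

/** Hypothetical sketch: choosing a schema source in an assumed order of preference. */
class SchemaSourceSketch {

    /**
     * Returns the first schema that is present, checking the script,
     * then the load function (self describing data), then the catalog.
     * An empty result corresponds to option 1: no metadata is available.
     */
    static Optional<String> resolve(Optional<String> fromScript,
                                    Optional<String> fromLoadFunc,
                                    Optional<String> fromCatalog) {
        if (fromScript.isPresent()) {
            return fromScript;
        }
        if (fromLoadFunc.isPresent()) {
            return fromLoadFunc;
        }
        return fromCatalog; // may itself be empty
    }

    public static void main(String[] args) {
        // No schema in the script, but the loader supplies one.
        System.out.println(resolve(Optional.empty(),
                                   Optional.of("a: chararray, b: int"),
                                   Optional.empty()));
        // Nothing available: the caller would treat all fields as bytearray.
        System.out.println(resolve(Optional.empty(), Optional.empty(), Optional.empty()));
    }
}
```

Returning an empty result rather than failing keeps the "Pigs eat anything" behavior: a missing schema is a normal case, not an error.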
Pig Interface to Global Metadata
An interface will need to be designed for Pig to read from and write to an external data catalog.
Architecture of Pig Interface to External Data Catalog
Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.). To facilitate this, Pig will define a generic interface that allows it to query and update a data catalog. Drivers will then need to be written that implement that interface and connect to a specific type of data catalog.
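As a rough illustration of the interface-plus-driver arrangement, the sketch below defines a generic catalog interface and one trivial driver. Nothing here is an existing Pig API: the names DataCatalogSketch, InMemoryCatalogDriver, query and update are hypothetical, and metadata is modeled as a simple string map purely to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical generic interface Pig would code against; not an existing Pig API. */
interface DataCatalogSketch {
    /** Returns the metadata recorded for the given path, if any. */
    Optional<Map<String, String>> query(String path);

    /** Records or replaces the metadata for the given path. */
    void update(String path, Map<String, String> metadata);
}

/** Hypothetical driver backed by a map; a real driver might wrap a database or web service. */
class InMemoryCatalogDriver implements DataCatalogSketch {
    private final Map<String, Map<String, String>> entries = new HashMap<>();

    @Override
    public Optional<Map<String, String>> query(String path) {
        return Optional.ofNullable(entries.get(path));
    }

    @Override
    public void update(String path, Map<String, String> metadata) {
        // Copy the map so later changes by the caller do not alter the catalog.
        entries.put(path, new HashMap<>(metadata));
    }
}

class CatalogDemo {
    public static void main(String[] args) {
        DataCatalogSketch catalog = new InMemoryCatalogDriver();
        Map<String, String> metadata = new HashMap<>();
        metadata.put("schema", "a: chararray, b: int");
        catalog.update("myfile", metadata);

        System.out.println(catalog.query("myfile").isPresent());
        System.out.println(catalog.query("otherfile").isPresent());
    }
}
```

Because callers depend only on the interface, swapping the in-memory driver for one that talks to a different kind of catalog would not require changes elsewhere.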
Types of File Specific Metadata Pig Will Use
Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog. Not every self describing file or external data catalog must support every one of these items; this is a list of items Pig may find useful and should be able to query for and create. If the metadata source cannot provide or store an item, Pig will simply not use or record it.
- Field layout (already supported)
- Field types (already supported)
- Sortedness of the data, both key and direction (ascending/descending)
- How the file is partitioned, both partition field and hashing function
- Number of records
- File size
- Cardinality of a given field
- Histogram of values in a given field
- Whether a field allows NULLs
- Default values for a field
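Since any of these items may be missing, one way to model them is a holder in which every item is optional. The sketch below covers only a few items from the list (sort key and direction, record count, file size); the class FileMetadataSketch and its members are hypothetical names chosen for illustration, not part of Pig.

```java
import java.util.Optional;
import java.util.OptionalLong;

/** Hypothetical holder for file specific metadata; every item may be absent. */
final class FileMetadataSketch {
    final Optional<String> sortKey;     // field the data is sorted on, if known
    final Optional<Boolean> ascending;  // sort direction, if known
    final OptionalLong recordCount;     // number of records, if known
    final OptionalLong fileSizeBytes;   // file size in bytes, if known

    FileMetadataSketch(Optional<String> sortKey, Optional<Boolean> ascending,
                       OptionalLong recordCount, OptionalLong fileSizeBytes) {
        this.sortKey = sortKey;
        this.ascending = ascending;
        this.recordCount = recordCount;
        this.fileSizeBytes = fileSizeBytes;
    }

    /** Metadata from a source that can provide none of the items. */
    static FileMetadataSketch unknown() {
        return new FileMetadataSketch(Optional.empty(), Optional.empty(),
                                      OptionalLong.empty(), OptionalLong.empty());
    }

    /** True only when the source reports the data as sorted on the given field. */
    boolean isSortedOn(String field) {
        return sortKey.isPresent() && sortKey.get().equals(field);
    }

    public static void main(String[] args) {
        FileMetadataSketch partial = new FileMetadataSketch(
                Optional.of("a"), Optional.of(true),
                OptionalLong.of(1000L), OptionalLong.empty());
        System.out.println(partial.isSortedOn("a"));
        System.out.println(unknown().isSortedOn("a"));
    }
}
```

A consumer of this holder treats an absent item as "unknown" and simply skips whatever use it would have made of that item.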
Types of Global Metadata Pig Will Use
Pig should be able to acquire the following types of global information from an external data catalog. Not every external data catalog must support every one of these items; this is a list of items Pig may find useful and should be able to query for. If the metadata source cannot provide an item, Pig will simply not make use of it.
- System resources available (not clear, we may be wandering too close to scheduler functionality here)
Because the use of global metadata is unclear, priority will be placed on supporting file specific metadata. The first step should be to define the interface changes in LoadFunc and StoreFunc, and the interface to external data catalogs.