Owl, Metadata for Hadoop
A metadata system for the grid will allow users and applications to register and search for data on the grid. Metadata registration will not be required, but where metadata is available it can make locating and managing data on the grid much easier. Certain metadata, such as schema and statistics, can improve the robustness and performance of applications such as Pig.
The metadata system will provide an API for use by Map Reduce jobs, Pig, and grid data maintenance tools (such as cleaning tools, archive tools, etc.). It will provide a GUI for end users and grid administrators.
Browsing Available Data
Actor: User Action:
- User clicks browse button on GUI
- User is able to navigate through different Hadoop grids, down the hierarchy to individual files and directories that represent a data set.
- At any point user is able to click on an individual data set to get more details.
Search for Data
Actor: User Action:
- User clicks on search button on GUI.
- User enters the attribute data they wish to search on, such as feed name, date of creation, or other attributes the data has been tagged with.
- System returns a list of files and directories that have the specified tags and for which the user has permissions.
Actor: Data Reader, such as Pig script or Map Reduce job. Action:
- Data reader searches for data via API. Search may be done via pathname, or attributes of the data.
- System returns a list of files and directories that match the search criteria and for which the user has permissions.
Note that searches can be restricted to a portion of the data hierarchy, such as a given set of data or a given administrative domain.
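The search behaviour described above can be sketched as a filter over registered entries. This is a minimal illustration, not the Owl API: the `Entry` record, the `search` function, and its `tags` and `under` parameters are hypothetical names, and permissions are reduced to a set of user names allowed to read each entry.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One registered file or directory (hypothetical record shape)."""
    path: str
    tags: dict = field(default_factory=dict)
    readers: set = field(default_factory=set)  # users allowed to read the metadata


def search(entries, user, tags=None, under=None):
    """Return sorted paths that `user` may read, that carry all of `tags`,
    and that lie at or below the `under` directory (if given)."""
    tags = tags or {}
    results = []
    for entry in entries:
        if user not in entry.readers:
            continue  # permission check: hide entries the user cannot read
        if under is not None:
            prefix = under.rstrip("/") + "/"
            if entry.path != under and not entry.path.startswith(prefix):
                continue  # outside the requested part of the hierarchy
        if all(entry.tags.get(k) == v for k, v in tags.items()):
            results.append(entry.path)
    return sorted(results)


entries = [
    Entry("/data/clicks/2009-07-01", {"feed": "clicks"}, {"alice", "bob"}),
    Entry("/data/clicks/2009-07-02", {"feed": "clicks"}, {"alice"}),
    Entry("/data/ads/2009-07-01", {"feed": "ads"}, {"alice"}),
]
```

Here `search(entries, "bob", tags={"feed": "clicks"})` would return only the first path, because the second clicks entry is not readable by `bob`.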
Actor: Data Creator, may be a Pig script, Map Reduce job, or external system loading files onto HDFS. Action:
- Data creator creates data, possibly all in one file or directory or possibly in several directories under a common directory.
- Once data creation is completed the data creator notifies the metadata system that the data is available.
- Metadata system makes data available to other users via notification (see below), browsing, and search.
- In the case where several files or directories all under one directory are being created, the data creator can choose to register each individual file or directory as it becomes available. It can also choose to register only the top level directory when finished, at which point all data under that directory will be available.
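The two registration choices above (register each piece as it arrives, or register only the top level directory when finished) can be sketched as follows. The `Registry` class and its method names are hypothetical and stand in for whatever the metadata system would actually expose; the only rule taken from the text is that registering a directory makes everything under it available.

```python
import posixpath


class Registry:
    """Tracks which paths have been announced as available (illustrative only)."""

    def __init__(self):
        self._registered = set()

    def register(self, path):
        """Record that `path` (a file or directory) is complete and available."""
        self._registered.add(posixpath.normpath(path))

    def is_available(self, path):
        """A path is available if it, or any directory above it, was registered."""
        path = posixpath.normpath(path)
        while True:
            if path in self._registered:
                return True
            parent = posixpath.dirname(path)
            if parent == path:  # reached the root (or an empty relative path)
                return False
            path = parent


registry = Registry()
# Incremental style: one sub-directory is announced as soon as it is written.
registry.register("/data/feed/2009-07-01/part1")
part2_before = registry.is_available("/data/feed/2009-07-01/part2")
# Bulk style: the top level directory is announced once everything is done.
registry.register("/data/feed/2009-07-01")
part2_after = registry.is_available("/data/feed/2009-07-01/part2")
```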
Actor: Metadata system Action:
- Data made available by data creator.
- Metadata system logs data availability to publicly available feed (such as RSS).
- Interested users can subscribe to feed and discover new data sets.
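As an illustration of the availability feed, the sketch below renders a list of availability events as an RSS 2.0 document using Python's standard library. The text only says "such as RSS", so the choice of RSS, the `build_feed` function, the channel title, and the placeholder link are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_feed(events):
    """Render (path, pub_date) pairs as an RSS 2.0 string.

    `pub_date` is expected to be an already formatted date string.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "New data sets"
    # Placeholder address; a real deployment would supply its own.
    ET.SubElement(channel, "link").text = "http://example.invalid/owl/feed"
    ET.SubElement(channel, "description").text = "Data made available on the grid"
    for path, pub_date in events:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = path
        ET.SubElement(item, "pubDate").text = pub_date
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed([
    ("/data/clicks/2009-07-01", "Wed, 01 Jul 2009 06:00:00 GMT"),
    ("/data/ads/2009-07-01", "Wed, 01 Jul 2009 07:30:00 GMT"),
])
```

A subscriber would poll the feed and compare item titles against the data sets it has already seen.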
Processing data via Pig
Actor: Pig Action:
- As part of load statement, Pig contacts the metadata system to find schema, loader, and statistics associated with the data. These can be used to do compile time checking (e.g. type checking), find the correct load function for a file, and perform optimizations.
- When Pig stores data, the user can choose to have it record metadata associated with the stored data.
- It will be possible to browse, search, and create metadata from Map Reduce programs.
- It will be possible to browse, search, and create metadata from Pig Latin programs.
- It will be possible to browse, search, and administer metadata from a GUI.
- It will be possible for data maintenance tools (such as cleaning tools, archive tools, replication tools, etc.) to use the metadata system to track files and directories that they need to maintain.
- The metadata system will support multiple administrative domains. These domains will allow groups of users to tag their data using a common set of attributes and store their data in a common set of directories.
- The metadata system will allow administrators to control which users can read and write metadata. This control may be done at the administrative domain level rather than at the individual data set level. For example, users would be able to read all metadata included in an administrative domain they have access to.
- The metadata system will support notifying users when new data is available. Note that this notification is not intended to replace or supplant a workflow system, but rather to provide the information a workflow system needs in order to offer much more sophisticated features.
- The metadata system will remain optional; Pig, Map Reduce, and HDFS will continue to work with or without it.
- Users will be able to tag their data with key value pairs they define.
Overview of Architecture
A persistent storage mechanism will be needed to store the metadata. Storing the metadata in HDFS was considered, but this would require an indexing system to support fast search and a locking mechanism to avoid read and write conflicts. For these reasons, an RDBMS will be used to store the persistent data. The system will be designed in such a way that any SQL-92 compliant RDBMS can be used. Some large metadata items (e.g. a histogram of the keys in a file) may be stored in HDFS to avoid overloading the RDBMS.
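To make the storage choice concrete, here is a minimal sketch of how attribute-based search might map onto relational tables. SQLite (via Python's built-in `sqlite3` module) is used only because it needs no setup; the table names `dataset` and `attribute`, the column layout, and the helper functions are hypothetical and are not the schema the system defines.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a two-table layout (illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dataset (
            id   INTEGER PRIMARY KEY,
            path VARCHAR(1024) NOT NULL UNIQUE
        );
        CREATE TABLE attribute (
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            name       VARCHAR(255) NOT NULL,
            value      VARCHAR(1024) NOT NULL,
            PRIMARY KEY (dataset_id, name)
        );
        -- The index is what provides the fast search that plain HDFS files lack.
        CREATE INDEX attribute_lookup ON attribute (name, value);
    """)
    return conn


def register(conn, path, attributes):
    """Insert one data set and its key value attributes in one transaction."""
    cur = conn.execute("INSERT INTO dataset (path) VALUES (?)", (path,))
    conn.executemany(
        "INSERT INTO attribute (dataset_id, name, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, k, v) for k, v in attributes.items()],
    )
    conn.commit()


def find(conn, name, value):
    """Return paths of data sets tagged with name=value, sorted by path."""
    rows = conn.execute(
        "SELECT d.path FROM dataset d "
        "JOIN attribute a ON a.dataset_id = d.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY d.path",
        (name, value),
    )
    return [row[0] for row in rows]
```

The database's own transactions take the place of the locking mechanism that an HDFS-based store would have needed.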
A REST-based web services API will be used for communication between clients and the metadata service. Web services were chosen because they free the metadata system from needing to provide bindings for the various languages that users will want to use to communicate with the system. REST was chosen as the web services protocol because of its ease of use and ubiquitous support.
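Because the API is plain HTTP, a client in any language only needs to be able to build a URL and issue a request. The sketch below shows the URL-building half in Python. The `/search` resource, the query parameter names, and the host are invented for illustration; the text does not define the actual resource layout.

```python
from urllib.parse import urlencode


def search_url(base, **attributes):
    """Build a GET URL for a hypothetical attribute search resource.

    Attributes are sorted so the same query always yields the same URL.
    """
    query = urlencode(sorted(attributes.items()))
    return f"{base.rstrip('/')}/search?{query}"


url = search_url("http://owl.example.invalid/api", feed="clicks", date="2009-07-01")
```

The resulting string could be passed to any HTTP client, which is the language independence the paragraph above refers to.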
Metadata will be modeled using the following concepts:
Facets: A Facet is a key value pair that can be associated with data. It can be used by users to tag their data, by tools to record information about the data, etc. For example, a user could assign a Facet of priority:high and a cleaning tool could assign a Facet of expirationdate:20090701. Certain Facets can be required (see below). Users can also define their own Facets and associate them with Data Collections or Data Units.
Catalog: A Catalog is an administrative domain. A group of users working with similar data can create a Catalog. Within a Catalog it will be possible to mandate the use of certain Facets. For example, it could be required that all data in a given Catalog must have a priority Facet or a datestamp Facet. A Catalog is associated with one or more directories in HDFS. It will be possible to define which users can read metadata in a Catalog, and which users can write metadata in a Catalog.
Data Collection: A Data Collection is a logical collection of data. It is contained within a Catalog. It is associated with a directory in HDFS. It defines which Facets are used to partition the data in it into separate Data Units (see below). Facets can be attached to a Data Collection.
Data Unit: A unit of data (either a file or a directory of part or map files) that Pig or Map Reduce can operate on. Data Units are contained within a Data Collection or other Data Units. Users can associate Facets with Data Units. Schemas and statistics can also be associated with Data Units. Facets attached to a Data Collection or Data Unit that contains a given Data Unit are inherited by that Data Unit. Data Units will not usually be individual map or part files, but a collection of map or part files making up a unit of data to be processed by Pig or Map Reduce.
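The containment and inheritance rules above can be sketched with two small classes. This is an illustrative model only: the class and method names are hypothetical, a single `Node` class stands in for both Data Collections and Data Units, and the rule that a Facet set directly on a node overrides an inherited Facet with the same key is an assumption the text does not state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Catalog:
    """An administrative domain that may mandate certain Facet keys."""
    name: str
    required_facets: set = field(default_factory=set)


@dataclass
class Node:
    """A Data Collection (parent is None) or a Data Unit (parent is set)."""
    name: str
    catalog: Catalog
    facets: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def effective_facets(self):
        """Own Facets plus those inherited from every containing node.

        Assumption: a locally set Facet wins over an inherited one.
        """
        inherited = self.parent.effective_facets() if self.parent else {}
        return {**inherited, **self.facets}

    def missing_required(self):
        """Facet keys the Catalog mandates that this node does not carry."""
        return self.catalog.required_facets - set(self.effective_facets())


web = Catalog("web", required_facets={"priority", "datestamp"})
clicks = Node("clicks", web, facets={"priority": "high"})
day = Node("2009-07-01", web, facets={"datestamp": "20090701"}, parent=clicks)
```

Here `day` satisfies both required Facets even though it sets only `datestamp` itself, because `priority` is inherited from the containing Data Collection.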
Figure 1: Data model.
Figure 2: An example of storing data in Owl.
Hive, a subproject of Hadoop, currently has a metadata management system. It presents a relational model to users as part of Hive's SQL interface. This is a good fit for Hive, but it does not fit well with Map Reduce, Pig, and grid data maintenance tools, which view the grid as a large file system.