This document specifies the use cases, requirements, and design for the Howl import and export format. The purpose of this feature is to allow users to move data into, out of, and between Howl instances while still using Howl tools.
Definition of Terms
- HIEOutputFormat - Howl Import Export OutputFormat, used to create a data collection when Howl's metastore is not present.
- HIEInputFormat - Howl Import Export InputFormat, used to read a data collection when Howl's metastore is not present.
- HIELoader - Howl Import Export Loader, used to read a data collection when Howl's metastore is not present.
- HIEStorage - Howl Import Export Store Function, used to create a data collection when Howl's metastore is not present.
- data collection - All of the data and metadata created by HIEOutputFormat, HIEStorage, or Howl's export command. Note that unlike many software archives (e.g. tar) this collection is all of the files contained under one directory, not one file itself.
A user creates data on a grid or in a location where access to the Howl metastore is not available, but wishes to store the data in such a way that it can easily be imported into Howl at a later time. In Map Reduce, the user uses HIEOutputFormat to store the data, providing the file system location to store the data in, the storage format to use (e.g. text or RCFile), the schema of the data, how the data is sorted, and optionally the name of the intended Howl table and values of partition keys for the data. (That is, if the destination table for this data is partitioned on datestamp, and this data will eventually be the partition for 20101109, then a key of datestamp with a value of 20101109 could be included as information with this data.) In Pig, the user uses HIEStorage, specifying the same information except for the schema and sort order, which Pig will already know.
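To make the shape of a data collection concrete, here is an illustrative Python sketch of what writing one might look like. The `_metadata` file name, the JSON encoding, and the dictionary fields are assumptions for illustration only; the design requires only that all files live under one directory and that the metadata file's name begin with an underscore.

```python
import json
import os

def write_data_collection(location, storage_format, schema, sort_keys,
                          table=None, partition_values=None):
    """Illustrative sketch: lay out a data collection as one directory
    holding the data files plus an underscore-prefixed metadata file.
    The '_metadata' name and JSON encoding are assumptions, not part of
    the Howl design."""
    os.makedirs(location, exist_ok=True)
    metadata = {
        "storage_format": storage_format,      # e.g. "text" or "RCFile"
        "schema": schema,                      # column names and types
        "sort_keys": sort_keys,                # how the data is sorted
        "table": table,                        # optional intended Howl table
        "partition_values": partition_values,  # optional, e.g. {"datestamp": "20101109"}
    }
    # Hadoop jobs ignore files whose names start with '_', so the
    # metadata travels with the data but is never read as data.
    with open(os.path.join(location, "_metadata"), "w") as f:
        json.dump(metadata, f)
```

Because everything, metadata included, lands under one directory, the whole collection can later be moved with a single distcp or copyFromLocal/copyToLocal invocation.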
At a later time the user (or another user) can then import the data into Howl using the Howl CLI IMPORT command, providing the name of the table to import the data into and values for all partition keys of that table if they are not specified in the data. If the data needs to be transported before import, Hadoop's distcp utility can be used for grid-to-grid moves, or hadoop fs -copyFromLocal for local-to-grid copies. If the specified table does not yet exist, Howl will create it, using the information in the imported metadata to set defaults for the table such as schema, storage format, etc.
A user wishes to take data from Howl and use it on a grid or in a location where access to the Howl metastore is not available. Using the Howl CLI EXPORT command, the user can export the data, which will result in a data collection that includes metadata on the data's storage format, the schema, how the data is sorted, what table the data came from, and values of any partition keys from that table. If the user needs to move the data to another grid, Hadoop's distcp utility can be used. If the user needs to copy it from the grid to a local file system, hadoop fs -copyToLocal can be used.
At a later time the user (or another user) can read the data using HIEInputFormat or HIELoader directly. These will require the file system location where the data collection resides.
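Since the reader is given only a directory, it must discover the metadata file itself. A sketch of that discovery, under the same illustrative assumptions as above (underscore-prefixed file, JSON encoding; neither the function name nor the encoding is specified by the design):

```python
import json
import os

def read_collection_metadata(location):
    """Illustrative sketch: given only the directory holding a data
    collection, find the underscore-prefixed metadata file and load it.
    The JSON encoding is an assumption; the design fixes only the
    directory layout and the leading underscore."""
    for name in sorted(os.listdir(location)):
        if name.startswith("_"):
            with open(os.path.join(location, name)) as f:
                return json.load(f)
    raise FileNotFoundError(
        "no underscore-prefixed metadata file in " + location)
```

This is all HIEInputFormat or HIELoader would need in order to recover the schema, sort order, and storage format without a metastore.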
Transfer Between Howl Instances
A user wishes to move data between grids with Howl, preserving the table and partition structure. The user can use the Howl CLI to export Howl data, distcp to move the data between grids, and then the Howl CLI to import the data on the new grid.
A user wishes to store metadata with data while not having a Howl metastore. In this case data can be created by HIEOutputFormat or HIEStorage and read by HIEInputFormat or HIELoader.
- Using HIEOutputFormat or HIEStorage, the user will be required to provide a file system location to store the data collection, a schema, whether the data is sorted and by what key, and the storage format to use.
- Using HIEOutputFormat or HIEStorage, the user will be able, but not required, to provide an intended import table and values for partition keys.
- Initially, data collections created by HIEOutputFormat, HIEStorage, and the Howl CLI export command will be limited to one partition of a table. However, there should not be anything in the design that prevents us from extending these to allow multiple partitions once Howl supports dynamic partition creation.
- These tools will not know the locations of the jars containing the required storage formats to read or write data. It will be the responsibility of the user to provide these jars in the classpath of their MR job or Pig invocation. For example, if the data is to be stored in RCFile, the user must make sure the appropriate jars are in the classpath of the job creating the data collection.
- If a data collection is imported into Howl with no table name, the user doing the import will be required to provide the name of the table to import the data into.
- If a data collection is imported into a Howl table that is partitioned but the data collection provides no partition key values, the user doing the import will be required to provide values for each of the partition keys. In this case import will be limited to a single partition, even once multi-partition import is supported.
- If the data collection specifies a table name but that table does not exist, upon import Howl will create that table using the storage format, schema, and sort information from the data collection.
- If the data collection specifies a table name and partition key values that duplicate existing key values (for example the value for the datestamp key is 20101109 but a partition for the table already exists with that datestamp) then the import utility will return an error and not do the import.
- If the data collection specifies a table name and partition keys that are incorrect or incomplete (for example the table requires datestamp and region keys and only datestamp is provided, or datestamp and language are provided) then an error will be returned and the import will not be done.
- If the data collection has a schema that does not match the schema of the table being imported to, an error will be returned and the import will not be done.
- If the data collection has a sort order that does not match the sort order of the table being imported to, an error will be returned and the import will not be done.
- Users must have write permission on a table to import data into a table.
- Users must have read permission on a table to export data from a table.
- Data to be exported will be copied to a new directory so that the user can operate on that directory without affecting the data in Howl.
- All data placed into a collection by HIEOutputFormat, HIEStorage, or export will be located under one directory for easy use with distcp and copyFromLocal, copyToLocal.
- All metadata stored by HIEOutputFormat, HIEStorage, and the CLI export command must be in a file whose name begins with an underscore (so that Hadoop jobs will ignore it), located in the same directory as the data (so that it will be moved by distcp, copyToLocal, and copyFromLocal along with the data).
- Exporting the data will require a valid HDFS location to export the data to.
- Exported metadata will contain full table name and partition information.
- HIEInputFormat and HIELoader will require the file system location of the data.
- HIEInputFormat and HIELoader will not know the classpath of the underlying InputFormat and storage handler. It will be the job of the user to provide the jars for these in the classpath.
- HIEInputFormat, HIELoader, HIEOutputFormat, and HIEStorage will be included in the Howl client jar.
- As Hive by definition only works on tables that are known to the metastore, there is no HIESerDe nor any requirement for these data collections to work with Hive outside of Howl.
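The import-time error conditions above can be collected into a single validation pass. The following Python sketch is illustrative only: the function name, the dictionary shapes for the collection metadata and the target table, and the error strings are all assumptions, not part of the design; only the rules themselves come from the requirements above.

```python
def check_import(collection_meta, table, user_table_name=None,
                 user_partition_values=None):
    """Illustrative sketch of the import-time checks in the requirements.
    'table' describes an existing Howl table, or is None if the table
    does not exist yet. Returns a list of error strings; an empty list
    means the import may proceed."""
    errors = []
    # A table name must come from the collection or from the user.
    if not collection_meta.get("table") and not user_table_name:
        errors.append("no table name in collection; one must be supplied")
    if table is None:
        # The table will be created from the collection's metadata, so
        # the schema/sort/partition checks below do not apply.
        return errors
    part_values = collection_meta.get("partition_values") or user_partition_values
    if table["partition_keys"]:
        if not part_values:
            errors.append("values required for partition keys: "
                          + ", ".join(table["partition_keys"]))
        elif set(part_values) != set(table["partition_keys"]):
            errors.append("partition keys incorrect or incomplete")
        elif part_values in table["existing_partitions"]:
            errors.append("partition already exists with these key values")
    if collection_meta["schema"] != table["schema"]:
        errors.append("schema does not match target table")
    if collection_meta["sort_keys"] != table["sort_keys"]:
        errors.append("sort order does not match target table")
    return errors
```

Note that the checks mirror the requirements one-for-one: missing table name, missing or mismatched partition keys, duplicate partition values, and schema or sort-order mismatch all block the import.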
HIEInputFormat - TBD; it should look as much like HowlInputFormat as possible.
HIEOutputFormat - TBD; it should look as much like HowlOutputFormat as possible.
HIELoader - TBD; it should look as much like HowlLoader as possible.
HIEStorage - TBD; it should look as much like HowlStorage as possible.
Changes to Howl CLI
New IMPORT syntax - TBD
New EXPORT syntax - TBD
Re-use as much of HowlInputFormat, HowlOutputFormat, HowlLoader, and HowlStorage as possible, including storage drivers. Hopefully this can be done by creating a common superclass for each Howl class and its HIE counterpart, with only the metadata access differing between the two.
All metadata should be stored in an underscore file in the same directory as the data.
- Originally we discussed converting formats on import (if they didn't match the table's default) and on export (if the user requested it). I have not included that here. At the very least I'd like to put it off until a later version. It's not 100% clear to me that we should support it at all, as it requires Howl to start independent MR jobs, which we may not want.
- I think we should consider compressing the data we are exporting unless it is already compressed as the case might be with RCFile data.
I would imagine that the most common case of import/export (other than moving to a different grid) would be in the text (PigStorage) format. Since Howl v1 would support this format, making this work would not be an issue. However, I wonder whether storing it as text within Howl is the right approach, given that text may not be the most efficient format. I think we need to understand this use case a bit better. This is related to Open Question #1.