Make DataGenerator A Hadoop Job
A data generator is provided to generate tuples with a specified number of fields for testing purposes. You can configure the data type, length, cardinality, distribution, and percentage of NULL values for each field. The data generator produces random values that match the configuration. Two types of distribution are supported: uniform and Zipf.
The current implementation runs on a single box and is single-threaded. Its execution time is linear in the amount of data to be generated; when the data reaches hundreds of gigabytes, the time required becomes unacceptable. In other words, this application does not scale to large amounts of data.
The newer implementation can also generate data in Hadoop mode. You specify the number of mappers, and each mapper generates only a fraction of the data, which greatly reduces the execution time.
Tuples generated by the data generator can contain fields that are uniformly or Zipf distributed. Both types of fields can be split across multiple processors, with each processor generating a fraction of the total rows. If M rows are to be generated by N processors, then each processor generates M/N rows. When the data from all processors are combined, the result should still be uniformly or Zipf distributed.
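The M/N split above leaves a remainder when M is not divisible by N. A minimal sketch of one way to distribute the rows so the per-processor counts sum exactly to M (the class and method names here are illustrative, not taken from the actual implementation):

```java
import java.util.Arrays;

public class RowSplit {
    // Split M total rows across N processors; the first (M % N)
    // processors each take one extra row so the counts sum to M.
    static long[] split(long totalRows, int processors) {
        long[] counts = new long[processors];
        long base = totalRows / processors;
        long remainder = totalRows % processors;
        for (int i = 0; i < processors; i++) {
            counts[i] = base + (i < remainder ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 rows over 3 processors -> [4, 3, 3]
        System.out.println(Arrays.toString(split(10, 3)));
    }
}
```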
- Uniformly distributed fields: a random number with the specified cardinality is generated using the Java Random class. These random numbers are uniformly distributed. For integer and long types, the random number is returned directly as the value of the tuple field; if N processors use the same cardinality, the combined result is still uniformly distributed. For float, double, and string types, the random number is used as a seed to generate the corresponding float/double/string value. We need to make sure that the same seed always yields the same float/double/string. When running across N processors, this is achieved by generating a mapping from random numbers to actual float/double/string values in advance. Each processor loads this mapping during startup and uses it to generate its float/double/string fields.
- Zipf distributed fields: a third-party library is used to generate random numbers between 1 and the cardinality following a Zipf distribution. Given a cardinality, the library generates numbers with a fixed density, so for integer and long types this number is returned directly as the field value, and combining data from multiple processors preserves the density distribution. For float, double, and string types, the number is used as a seed to derive a float/double/string value. We need to make sure that the same number maps to the same float/double/string across all processors. As with uniform fields, this is achieved by generating a mapping from random numbers to actual float/double/string values in advance; each processor loads this mapping during startup and uses it to generate those fields.
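The key property both cases rely on is that the number-to-string mapping is identical on every processor. A minimal sketch of how such a mapping can be built deterministically from a fixed master seed (the class name, method signature, and string-generation scheme are illustrative assumptions, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class StringMapping {
    // Pre-generate a number -> string mapping from a fixed master seed.
    // Every processor that builds (or loads) the mapping with the same
    // seed, cardinality, and length sees identical strings for
    // identical numbers, so combined output stays consistent.
    static Map<Integer, String> build(long masterSeed, int cardinality, int avgLen) {
        Random rand = new Random(masterSeed);
        Map<Integer, String> mapping = new HashMap<>();
        for (int key = 0; key < cardinality; key++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < avgLen; j++) {
                sb.append((char) ('a' + rand.nextInt(26)));
            }
            mapping.put(key, sb.toString());
        }
        return mapping;
    }
}
```

In the actual job the mapping is written to a file once, up front, and loaded by each mapper; rebuilding it in memory as above illustrates the same determinism guarantee.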
The data generator is modified to be a Hadoop job.
- Command-line changes
- An option -m is added to specify the number of mappers for the job. No reducer is required.
- Option -f is required to specify the output directory.
When DataGenerator runs in Hadoop mode, -e (the seed option) is disabled: if multiple mappers shared the same seed, they would generate duplicate data.
- The sequence of execution
- The data generator pre-generates a mapping file for each string/double/float field with a Zipf or uniform distribution.
- Create a config file that contains the data type, length, cardinality, distribution type, and percentage of NULLs for each field, with the name of the mapping file from the previous step appended where one was created. The name of this config file is passed to each mapper through JobConf.
- If no input file is configured, then
- Create N input files for N mappers. Each input file has only one row: the number of rows to be generated by that mapper.
- Mark the job as having no input file
- If an input file is configured, then
- Set the input file as the input path.
- Mark the job as having an input file
- Start the map-reduce job and load the field config. For fields that have a mapping file associated with them, build an internal hash for lookups. When the mapper gets an input tuple, depending on the input type:
- If there is no input file, the tuple the mapper receives is the number of rows to be generated, so it generates that many rows.
- If there is an input file, the tuple the mapper receives is a tuple from the input file, and the mapper appends the generated fields to it.
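The mapper's per-tuple behavior in the two modes can be sketched as plain Java, without the Hadoop wiring (all names here are illustrative, and the field generation is a trivial stand-in for the real per-field logic):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class MapperSketch {
    // Stand-in for the real per-field generation: emits numFields
    // random values joined by the ^A separator.
    static String generateFields(Random rand, int numFields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numFields; i++) {
            if (i > 0) sb.append('\u0001');
            sb.append(rand.nextInt(1000));
        }
        return sb.toString();
    }

    // Mode 1 -- no input file: the single input line holds a row
    // count, and the mapper emits that many fully generated rows.
    static List<String> mapRowCount(String inputLine, int numFields, Random rand) {
        long rows = Long.parseLong(inputLine.trim());
        List<String> out = new ArrayList<>();
        for (long i = 0; i < rows; i++) {
            out.add(generateFields(rand, numFields));
        }
        return out;
    }

    // Mode 2 -- input file configured: each input line is an existing
    // tuple, and the mapper appends the generated fields to it.
    static String mapExistingTuple(String inputTuple, int numFields, Random rand) {
        return inputTuple + '\u0001' + generateFields(rand, numFields);
    }
}
```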
Define the following environment variables:
- $pigjar: pig.jar
- $zipfjar: sdsuLibJKD12.jar
- $datagenjar: jar file that contains the DataGenerator class
- $conf_file: hadoop-site.xml for your cluster
For Hadoop 0.18
hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
For Hadoop 0.20 or Greater
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file [options] colspec...
- -m: number of mappers to run concurrently to generate data. If not set or set to 0, the generator runs in local mode.
- -e: seed value for random numbers; cannot be used if -m is greater than 0. Optional in local mode.
- -f: in local mode, the output file (optional; defaults to stdout). In Hadoop mode, the output directory (required).
- -i: optional; input file from which lines will be read.
- -r: number of rows to output, not required if -i is configured
- -s: optional, separator character, default is ^A
- colspec Format: columntype:average_size:cardinality:distribution_type:percent_null
- i = int
- l = long
- f = float
- d = double
- s = string
- m = map
- bx = bag of x, where x is a columntype
- u = uniform
- z = zipf
- average_size: average size for string types
- s:20:16000:z:7 specifies a string field whose average length is 20 and cardinality is 16000, with a Zipf distribution and about 7% NULL values.
- i:1:20:u:0 specifies an integer field whose cardinality is 20, with a uniform distribution and no NULL values.
To run it locally, besides using hadoop jar without -m, you can also use:
java -cp $pigjar:$zipfjar:$datagenjar org.apache.pig.test.utils.datagen.DataGenerator [options] colspec...
This implementation is constrained by memory availability. For now, we assume the cardinality of a field that needs a mapping file is less than 2M, and that there are no more than 5 such fields; in that case, the memory required should be less than 1G for most settings. To work with a bigger cardinality or more such fields, the DataGenerator would have to generate the data with raw random numbers and then do an explicit join between the mapping file and the data file.
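The 1G figure follows from a back-of-envelope estimate; the ~100-bytes-per-entry cost below is an assumption (short string plus boxed key and hash-table overhead), not a measured value:

```java
public class MemoryEstimate {
    // Rough mapping-table memory: entries-per-field * fields * bytes-per-entry.
    static long estimateMB(long cardinalityPerField, int mappedFields, long bytesPerEntry) {
        return cardinalityPerField * mappedFields * bytesPerEntry / (1024 * 1024);
    }

    public static void main(String[] args) {
        // 2M entries per field, 5 mapped fields, ~100 bytes per entry
        // (assumed): about 953 MB, i.e. just under 1G.
        System.out.println(estimateMB(2_000_000L, 5, 100) + " MB");
    }
}
```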