Make DataGenerator A Hadoop Job

Background

The data generator produces tuples with a specified number of fields for testing purposes. For each field you can configure the data type, length, cardinality, distribution, and percentage of NULL values; the generator then emits random values that match the configuration. Two distributions are supported: uniform and Zipf.

The current implementation runs on a single box and is single-threaded. Its execution time is linear in the amount of data to be generated, so once the data reaches hundreds of gigabytes, the time required becomes unacceptable. In other words, this application does not scale to large amounts of data.

The newer implementation can also generate data in Hadoop mode. You can specify the number of mappers, and each mapper generates only a fraction of the data, which greatly reduces the execution time.

Algorithm

Tuples generated by the data generator can contain fields that are uniformly distributed or Zipf distributed. Both kinds of fields can be split across multiple processors, with each processor generating a fraction of the total rows. If M rows are to be generated by N processors, then each processor generates M/N rows. When the output of all processors is combined, the result should still be uniformly or Zipf distributed.
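The split above can be sketched as follows. This is an illustrative snippet, not code from the DataGenerator itself; it handles the case where M is not evenly divisible by N by giving the first M mod N mappers one extra row, so the per-mapper counts always sum to exactly M.

```java
// Illustrative sketch: divide totalRows rows among numMappers mappers.
public class RowSplit {
    // Rows assigned to mapper mapperId (0-based). Each mapper gets
    // floor(M/N) rows; the first M % N mappers take one extra row.
    static long rowsForMapper(long totalRows, int numMappers, int mapperId) {
        long base = totalRows / numMappers;
        long remainder = totalRows % numMappers;
        return base + (mapperId < remainder ? 1 : 0);
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 7; i++) {
            total += rowsForMapper(1000, 7, i);
        }
        System.out.println(total); // prints 1000
    }
}
```

Because each mapper draws its values independently from the same uniform or Zipf distribution, concatenating the per-mapper outputs preserves the overall distribution.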

Design

The data generator is modified to run as a Hadoop job.

Usage

Define the following environment variable:

export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar

For Hadoop 0.18

hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...

For Hadoop 0.20 or Greater

hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file [options] colspec...

Examples:

To run it locally, besides using hadoop jar without the -m option, you can also invoke the class directly:

java -cp $pigjar:$zipfjar:$datagenjar org.apache.pig.test.utils.datagen.DataGenerator [options] colspec...

Future Work

This implementation is constrained by memory availability. For now, we assume that the cardinality of any field needing a mapping file is less than 2M, and that there are at most 5 such fields. Under these assumptions, the memory required should be less than 1 GB for most settings. To support larger cardinalities or more string fields, the DataGenerator would have to generate the data with random numbers and then perform an explicit join between the mapping file and the data file.
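A rough back-of-the-envelope check of the 1 GB figure. The per-entry byte cost below is an assumption for illustration (it is not taken from the implementation); the real cost depends on string lengths and JVM object overhead.

```java
// Illustrative memory estimate for in-memory mapping files.
public class MappingMemoryEstimate {
    public static void main(String[] args) {
        long maxCardinality = 2_000_000L; // assumed cap per mapping-file field
        int maxMappedFields = 5;          // assumed cap on number of such fields
        long bytesPerEntry = 100L;        // assumption: avg entry plus overhead
        long totalBytes = maxCardinality * maxMappedFields * bytesPerEntry;
        System.out.println(totalBytes);   // prints 1000000000, i.e. ~1 GB
    }
}
```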

DataGeneratorHadoop (last edited 2010-01-15 00:16:15 by RobStewart)