Differences between revisions 1 and 2
Revision 1 as of 2013-06-14 00:36:39
Size: 2439
Editor: AaronMcCurry
Comment:
Revision 2 as of 2013-06-14 01:41:03
Size: 2445
Editor: AaronMcCurry
Comment:
Deletions are marked like this. Additions are marked like this.
Line 40: Line 40:
Line 42: Line 43:
Line 44: Line 46:

MapReduce Indexing

Here is an example of the typical usage of the BlurOutputFormat. The Blur table has to be created before the MapReduce job is started. The setupJob method configures the following:

  • The reducer class to be DefaultBlurReducer

  • The number of reducers to be equal to the number of shards in the table.
  • The output key class to a standard Text writable from the Hadoop library
  • The output value class is a BlurMutate writable from the Blur library

  • The output format to be BlurOutputFormat

  • Sets the TableDescriptor in the Configuration

  • Sets the output path to the TableDescriptor.getTableUri() value

  • Also the job will use the BlurOutputCommitter class to commit or rollback the MapReduce job

Example Usage

Iface client = BlurClient.getClient("controller1:40010");

TableDescriptor tableDescriptor = client.describe(tableName);

Job job = new Job(jobConf, "blur index");
job.setJarByClass(BlurOutputFormatTest.class);
job.setMapperClass(CsvBlurMapper.class);
job.setInputFormatClass(TextInputFormat.class);

FileInputFormat.addInputPath(job, new Path(input));
CsvBlurMapper.addColumns(job, "cf1", "col");

BlurOutputFormat.setupJob(job, tableDescriptor);
BlurOutputFormat.setIndexLocally(job, true);
BlurOutputFormat.setOptimizeInFlight(job, false);

job.waitForCompletion(true);

Options

  • BlurOutputFormat.setIndexLocally(Job,boolean)

    • Enabled by default, this will enable local indexing on the machine where the task is running. Then when the RecordWriter closes the index is copied to the remote destination in HDFS.

  • BlurOutputFormat.setMaxDocumentBufferSize(Job,int)

    • Sets the maximum number of documents that the buffer will hold in memory before overflowing to disk. By default this is 1000 which will probably be very low for most systems.
  • BlurOutputFormat.setOptimizeInFlight(Job,boolean)

    • Enabled by default, this will optimize the index while copying from the local index to the remote destination in HDFS. Used in conjunction with the setIndexLocally.
  • BlurOutputFormat.setReducerMultiplier(Job,int)

    • This will multiple the number of reducers for this job. For example if the table has 256 shards the normal number of reducers is 256. However if the reducer multiplier is set to 4 then the number of reducers will be 1024 and each shard will get 4 new segments instead of the normal 1.

MapReduce (last edited 2013-06-14 01:41:03 by AaronMcCurry)