Using the default TimeStamp Parser

By default, Chukwa uses the TsProcessor parser.

This parser tries to extract the real timestamp from the log entry using the %d{ISO8601} date format. If that fails, it uses the time at which the chunk was written to disk (the collector timestamp).
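The fallback behaviour described above can be sketched in plain Java. This is only an illustration, not the TsProcessor source: the class name TimestampFallbackSketch, the method extractTime and the fallbackMillis parameter (standing in for the collector timestamp) are hypothetical. It relies on Log4j rendering %d{ISO8601} as a 23-character "yyyy-MM-dd HH:mm:ss,SSS" prefix.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Hypothetical helper, not part of Chukwa: shows "parse, else fall back".
class TimestampFallbackSketch {
    // Log4j's %d{ISO8601} renders as "yyyy-MM-dd HH:mm:ss,SSS" (23 characters).
    static final String ISO8601_PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";

    // Returns the timestamp found at the start of the entry, or
    // fallbackMillis (e.g. the collector timestamp) if none can be parsed.
    static long extractTime(String logEntry, long fallbackMillis) {
        if (logEntry == null || logEntry.length() < 23) {
            return fallbackMillis;
        }
        SimpleDateFormat sdf = new SimpleDateFormat(ISO8601_PATTERN);
        sdf.setLenient(false); // reject out-of-range values such as month 13
        try {
            return sdf.parse(logEntry.substring(0, 23)).getTime();
        } catch (ParseException e) {
            return fallbackMillis;
        }
    }

    public static void main(String[] args) {
        long collectorTime = System.currentTimeMillis();
        // Entry with a valid timestamp: the parsed time is used.
        System.out.println(extractTime("2009-11-10 22:30:05,123 INFO Foo: started", collectorTime));
        // Entry without a timestamp: the collector time is used.
        System.out.println(extractTime("free-form text without a timestamp", collectorTime));
    }
}
```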

Your log will automatically be available from the Web Log viewer under the <YourRecordTypeHere> directory.

Using a specific Parser

If you want to extract specific information and perform more processing, you need to write your own parser. As with any M/R program, you have to write at least the Map side of your parser. The Reduce side is Identity by default.

MAP side of the parser

You can write your own parser from scratch or extend the AbstractProcessor class, which hides all the low-level operations on the chunk. Then you have to register your parser with the demux (the link between the RecordType and the parser).

Parser registration

(Tip: you can use the same parser for several recordTypes.)

Parser implementation

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.AbstractProcessor;
    import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
    import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MyParser extends AbstractProcessor
    {
        protected void parse(String recordEntry,
                             OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                             Reporter reporter) throws Throwable
        {
            // The pattern must match the 23-character Log4j %d{ISO8601} timestamp
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");

            // Extract log timestamp & Log4j information, i.e. logLevel, logger, ...
            String dStr = recordEntry.substring(0, 23);
            int start = 24;
            int idx = recordEntry.indexOf(' ', start);
            String logLevel = recordEntry.substring(start, idx);
            start = idx + 1;
            idx = recordEntry.indexOf(' ', start);
            // idx - 1 drops the character before the space (e.g. a trailing ':')
            String className = recordEntry.substring(start, idx - 1);
            String body = recordEntry.substring(idx + 1);
            Date d = sdf.parse(dStr);

            key = new ChukwaRecordKey();
            key.setKey("<YOUR_KEY_HERE>");
            key.setReduceType("<YOUR_RECORD_TYPE_HERE>");

            ChukwaRecord record = new ChukwaRecord();
            record.setTime(d.getTime());

            // Parse your line here and extract useful information
            // Add your {key,value} pairs (key1, value1, ... are placeholders)
            record.add(key1, value1);
            record.add(key2, value2);
            record.add(key3, value3);

            // Output your record
            output.collect(key, record);
        }
    }

(Tip: see the org.apache.hadoop.chukwa.extraction.demux.processor.mapper.Df class for an example of a parser class.)
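To make the substring arithmetic of the parser listing easier to follow, here is a small stand-alone sketch that applies the same offsets to a sample line. It assumes a Log4j layout along the lines of "%d{ISO8601} %p %c: %m%n"; the class name Log4jLineSplitSketch, the split method and the sample line are illustrative only and not part of Chukwa.

```java
// Hypothetical stand-alone demo of the offsets used in the parser listing.
class Log4jLineSplitSketch {
    // Returns { timestamp, logLevel, className, body } for one log line.
    static String[] split(String recordEntry) {
        String dStr = recordEntry.substring(0, 23);      // 23-char ISO8601 timestamp
        int start = 24;                                  // skip the space after it
        int idx = recordEntry.indexOf(' ', start);
        String logLevel = recordEntry.substring(start, idx);
        start = idx + 1;
        idx = recordEntry.indexOf(' ', start);
        // idx - 1 drops the ':' that follows the logger name in this layout
        String className = recordEntry.substring(start, idx - 1);
        String body = recordEntry.substring(idx + 1);
        return new String[] { dStr, logLevel, className, body };
    }

    public static void main(String[] args) {
        String line = "2009-11-10 22:30:05,123 INFO org.example.Foo: started service";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

For the sample line, the four fields are "2009-11-10 22:30:05,123", "INFO", "org.example.Foo" and "started service". A line that does not follow this layout makes indexOf return -1 and substring throw, so a real parser should guard against that.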

REDUCE side of the parser

You only need to implement a reduce side if you need to group records together.

The link between the Map side and the Reduce side is made by setting your reduce class as the reduce type: key.setReduceType("<YourReduceClassHere>");

Here is the interface that you need to implement:

    public interface ReduceProcessor
    {
        public String getDataType();
        public void process(ChukwaRecordKey key, Iterator<ChukwaRecord> values,
                            OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                            Reporter reporter);
    }

(Tip: see the org.apache.hadoop.chukwa.extraction.demux.processor.reducer.SystemMetrics class for an example of a Reduce class.)

Parser key field

Your data is sorted by RecordType, then by the key field. The default implementation uses the following grouping for all records:

  1. Time partition (Time up to the hour)
  2. Machine name (physical input source)
  3. Record timestamp
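As an illustration of this grouping, the sketch below builds a key string from the three parts in that order. It is not the Chukwa implementation: the "/" separator, the use of UTC for the hour boundary, and the names RecordKeySketch and buildKey are assumptions made for the example.

```java
import java.util.Calendar;
import java.util.TimeZone;

// Hypothetical helper: composes "hourPartition/source/timestamp".
class RecordKeySketch {
    static String buildKey(long timestampMillis, String source) {
        // 1. Time partition: truncate the record time to the start of its hour
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(timestampMillis);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        // 2. Machine name, 3. full record timestamp
        return cal.getTimeInMillis() + "/" + source + "/" + timestampMillis;
    }

    public static void main(String[] args) {
        // 3723004 ms = 1 h 2 min 3.004 s after the epoch
        System.out.println(buildKey(3723004L, "host1"));
    }
}
```

Because the hour partition comes first, keys built this way sort all records of the same hour together, then by machine, then by exact timestamp within that machine.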

Output directory

The demux process will use the recordType to save similar records together (same recordType) to the same directory: <Your_Cluster_Information>/<Your_Record_Type>/

DemuxModification (last edited 2009-11-10 22:30:05 by dhcp-131-250)