Introduction

After the MetadataDiscussion page was created, Jukka Zitting offered an example of how to get to recursive metadata when parsing with an AutoDetectParser. In addition to sharing Jukka's example, this page also offers some additional details on how, if you are willing to write your own ContentHandler, you can capture both text and metadata for each recursive document.

NOTE - This discussion of recursive metadata is from the point of view of what might be an oddball use case. The assumption of this page is NOT that you would want to take a container file, maybe a zip file, and extract all of the text and metadata into a single mega-representation of all of the text and metadata found in that container. Instead, this page assumes that what you really want to do is to extract the text for each document in the container, and be able to see each of these nested documents as a separate entity with its own text and metadata.

Jukka's Example

Here is the full source for Jukka's example for how to get access to nested metadata. This example writes the metadata for each nested document to standard output. More details about how Jukka's example works are available in subsections below.

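   // Imports and the enclosing class (its name is chosen here for illustration)
   // are added so that the example compiles as a single file.
   import java.io.File;
   import java.io.IOException;
   import java.io.InputStream;

   import org.apache.tika.exception.TikaException;
   import org.apache.tika.io.TikaInputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.parser.Parser;
   import org.apache.tika.parser.ParserDecorator;
   import org.xml.sax.ContentHandler;
   import org.xml.sax.SAXException;
   import org.xml.sax.helpers.DefaultHandler;

   public class RecursiveMetadataExample {

       public static void main(String[] args) throws Exception {
           Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
           ParseContext context = new ParseContext();
           context.set(Parser.class, parser);

           ContentHandler handler = new DefaultHandler();
           Metadata metadata = new Metadata();

           InputStream stream = TikaInputStream.get(new File(args[0]));
           try {
               parser.parse(stream, handler, metadata, context);
           } finally {
               stream.close();
           }
       }

       private static class RecursiveMetadataParser extends ParserDecorator {

           public RecursiveMetadataParser(Parser parser) {
               super(parser);
           }

           @Override
           public void parse(
                   InputStream stream, ContentHandler handler,
                   Metadata metadata, ParseContext context)
                   throws IOException, SAXException, TikaException {
               super.parse(stream, handler, metadata, context);

               // Runs once per parsed document, nested documents included
               System.out.println("----");
               System.out.println(metadata);
           }

       }
   }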

Main from Jukka's Example

Setting up Recursive Parsing

  public static void main(String[] args) throws Exception {
       Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
       ParseContext context = new ParseContext();
       context.set(Parser.class, parser);

The example starts by setting up recursive parsing. If you are parsing text files, Word documents, etc., you'll never notice whether recursive parsing is enabled or not. If you are parsing containers like zip files and tar.gz files, the only way to get the text for the files inside the containers is to enable recursive parsing.

The way to enable recursive parsing is to create a ParseContext and add a parser to it as shown on the line context.set(Parser.class, parser). This is the parser that will be used to parse any nested documents.

In this case the parser is a RecursiveMetadataParser that wraps an AutoDetectParser. The RecursiveMetadataParser is part of Jukka's example, and more details are given below.

Parsing a File

       ContentHandler handler = new DefaultHandler();
       Metadata metadata = new Metadata();

       InputStream stream = TikaInputStream.get(new File(args[0]));
       try {
           parser.parse(stream, handler, metadata, context);
       } finally {
           stream.close();
       }

The rest of the main function parses a file. The parser used to parse the root document is the same parser that was added to the ParseContext as the parser to use for nested documents.

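Looking at the Tika API (http://tika.apache.org/0.7/api/), I don't see a TikaInputStream class; if your Tika release lacks it, you could use FileInputStream in its place. DefaultHandler is not part of Tika at all: it is org.xml.sax.helpers.DefaultHandler from the SAX API that ships with the JDK, and it ignores every event it receives. If you want the extracted text, you could use Tika's BodyContentHandler in place of DefaultHandler.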
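To make the role of the handler concrete, here is a minimal sketch that uses only the JDK's SAX classes, with no Tika involved. It feeds the same small XHTML string first to a plain DefaultHandler, which discards everything, and then to a subclass that overrides characters() to keep the text. The class names TextCollectingHandler and DefaultHandlerDemo are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// DefaultHandler is the JDK's no-op SAX handler: every callback does nothing.
// Overriding characters() is enough to start collecting text.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}

class DefaultHandlerDemo {
    // Parses a small XML string and sends the SAX events to the given handler.
    static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><p>Hello</p><p>World</p></body></html>";

        // A plain DefaultHandler accepts the events and silently drops them.
        parse(xml, new DefaultHandler());

        // The subclass sees the same events but keeps the character data.
        TextCollectingHandler collector = new TextCollectingHandler();
        parse(xml, collector);
        System.out.println(collector.getText());
    }
}
```

This is why the example above prints metadata but no text: the handler it passes in throws the text away.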

Jukka's RecursiveMetadata Parser

RecursiveMetadataParser Constructor

   private static class RecursiveMetadataParser extends ParserDecorator {

       public RecursiveMetadataParser(Parser parser) {
           super(parser);
       }

The RecursiveMetadataParser extends ParserDecorator. All the constructor has to do is let the ParserDecorator superclass know which parser object is being decorated.

RecursiveMetadataParser parse

       @Override
       public void parse(
               InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
               throws IOException, SAXException, TikaException {
           super.parse(stream, handler, metadata, context);

           System.out.println("----");
           System.out.println(metadata);
       }

   }

The parse method is where you get access to the metadata. When the parser set in ParseContext is used to parse a nested document, a new Metadata object is created and passed to the parse method. Since the example put a RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse method is called. Before calling super.parse, the metadata object is empty. After super.parse returns, the metadata object contains all of the metadata the decorated parser found and System.out.println(metadata) prints all of the metadata to standard output.
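The call flow described above can be modelled without Tika. The sketch below is a toy model, not Tika code: ToyDocument, ToyParser, ToyContainerParser and ToyRecordingParser are invented stand-ins, a plain Map plays the role of both Metadata and ParseContext, and the container parser simply looks up the registered parser for each nested document. It is only meant to show why the wrapper has to be the parser registered in the context, and in what order the metadata becomes available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in, NOT a Tika class: a document is a name plus nested documents.
class ToyDocument {
    final String name;
    final List<ToyDocument> children;

    ToyDocument(String name, ToyDocument... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

interface ToyParser {
    void parse(ToyDocument doc, Map<String, String> metadata, Map<Class<?>, Object> context);
}

// Plays the role of the decorated parser: fills in metadata, then hands each
// nested document to whatever parser was registered in the context.
class ToyContainerParser implements ToyParser {
    @Override
    public void parse(ToyDocument doc, Map<String, String> metadata, Map<Class<?>, Object> context) {
        metadata.put("name", doc.name);
        ToyParser nested = (ToyParser) context.get(ToyParser.class);
        if (nested == null) {
            return; // nothing registered: nested documents are skipped
        }
        for (ToyDocument child : doc.children) {
            // Each nested document gets its own, initially empty, metadata.
            nested.parse(child, new LinkedHashMap<String, String>(), context);
        }
    }
}

// Plays the role of RecursiveMetadataParser: delegate first, then record.
class ToyRecordingParser implements ToyParser {
    final List<String> seen = new ArrayList<String>();
    private final ToyParser delegate;

    ToyRecordingParser(ToyParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public void parse(ToyDocument doc, Map<String, String> metadata, Map<Class<?>, Object> context) {
        delegate.parse(doc, metadata, context);
        seen.add(metadata.get("name")); // populated only after the delegate returns
    }
}

class ToyDemo {
    static List<String> run() {
        ToyRecordingParser parser = new ToyRecordingParser(new ToyContainerParser());
        Map<Class<?>, Object> context = new HashMap<Class<?>, Object>();
        context.put(ToyParser.class, parser); // register the wrapper, not the wrapped parser

        ToyDocument zip = new ToyDocument("archive.zip",
                new ToyDocument("a.txt"), new ToyDocument("b.doc"));
        parser.parse(zip, new LinkedHashMap<String, String>(), context);
        return parser.seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this toy model the nested documents are recorded before the container, because the recording step runs only after the delegate has finished, and the delegate does not finish until all nested documents have been parsed.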

What's Missing from Jukka's Example?

Jukka's example shows how you can get metadata for a nested document, but it doesn't show how you can get that metadata along with the text for that nested document.

If you only need the metadata, then this example is great. If instead you want to extract complete documents from containers including both text and metadata, then you need to do more.

Extracting Text is an Exercise for the Reader

A way to match up the metadata for a document with its text requires you to write your own ContentHandler that is able to identify text for individual nested documents. Since this page is called RecursiveMetadata and not HowToGetASeparateTextBodyForEachNestedDocument, no details are offered for how to implement that ContentHandler. While I was hoping there would be help for this in Tika's library, after quickly scanning all the handlers I could find in http://tika.apache.org/0.7/api/ I didn't see any that offered easy ways to get to the text for each contained document as a separate set of text.

Until someone writes a page on how to get the text for each separate document in a container as a separate body of text, writing this ContentHandler is an exercise left to the reader. I have written a ContentHandler that does this for the kinds of files and containers I have tested with, and if no one comes forward with an easy way to write this kind of ContentHandler, my experiences might become the start of yet another wiki page.
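As a starting point only, here is one possible shape for such a handler. It rests on an assumption that you should verify against the XHTML your Tika version actually emits: that each nested document is wrapped in a div element whose class attribute is "package-entry". The class names PerEntryTextHandler and PerEntryTextDemo are invented for this sketch, and the demo drives the handler with hand-written SAX events rather than a real Tika parse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class PerEntryTextHandler extends DefaultHandler {
    // ASSUMPTION: nested documents are wrapped in <div class="package-entry">.
    private static final String ENTRY_CLASS = "package-entry";

    private final StringBuilder root = new StringBuilder();               // text outside any entry
    private final Deque<StringBuilder> open = new ArrayDeque<StringBuilder>(); // entries being read
    private final Deque<Boolean> divIsEntry = new ArrayDeque<Boolean>();  // one flag per open <div>
    private final List<String> entries = new ArrayList<String>();         // finished entry texts

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        boolean entry = ENTRY_CLASS.equals(atts.getValue("class"));
        divIsEntry.push(entry);
        if (entry) {
            open.push(new StringBuilder()); // a nested document starts here
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"div".equals(qName)) {
            return;
        }
        if (divIsEntry.pop()) {
            entries.add(open.pop().toString()); // the nested document is complete
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        StringBuilder target = open.isEmpty() ? root : open.peek();
        target.append(ch, start, length); // text goes to the innermost open entry
    }

    List<String> getEntryTexts() { return entries; }

    String getRootText() { return root.toString(); }
}

class PerEntryTextDemo {
    private static void text(PerEntryTextHandler h, String s) {
        h.characters(s.toCharArray(), 0, s.length());
    }

    // Feeds the handler a hand-written event stream with two entries.
    static PerEntryTextHandler run() {
        PerEntryTextHandler h = new PerEntryTextHandler();
        AttributesImpl plain = new AttributesImpl();
        AttributesImpl entry = new AttributesImpl();
        entry.addAttribute("", "class", "class", "CDATA", "package-entry");

        text(h, "root");
        h.startElement("", "div", "div", entry);
        text(h, "first");
        h.endElement("", "div", "div");
        h.startElement("", "div", "div", entry);
        text(h, "second");
        h.startElement("", "div", "div", plain); // an ordinary div inside an entry
        text(h, "!");
        h.endElement("", "div", "div");
        h.endElement("", "div", "div");
        return h;
    }

    public static void main(String[] args) {
        PerEntryTextHandler h = run();
        System.out.println(h.getEntryTexts() + " / " + h.getRootText());
    }
}
```

The per-div flag stack is there so that ordinary div elements inside a nested document do not close it early. A real handler would also have to decide what to do with whitespace, headings and entries that are themselves containers.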

How to get Metadata with Text

Assuming that you have written your own ContentHandler, and that ContentHandler can be used to get the text for individual documents in a container, how can you associate the metadata for a document with that document's text?

The solution I currently use is to create a RecursiveMetadataParser class that is constructed with a RecursiveParserListener. The listener is notified just before and just after each parse call, and my ContentHandler can implement both the ContentHandler and the RecursiveParserListener interfaces. Here is a rough example:

public interface RecursiveParserListener {
    void startSubDocument(Metadata metadata);
    void endSubDocument();
}

public class RecursiveMetadataParser extends ParserDecorator {
    private final RecursiveParserListener listener;

    public RecursiveMetadataParser(Parser parser, RecursiveParserListener listener) {
        super(parser);
        this.listener = listener;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, SAXException, TikaException {
        listener.startSubDocument(metadata);
        try {
            super.parse(stream, handler, metadata, context);
        } finally {
            // Always signal the end, so the listener stays balanced if parsing fails
            listener.endSubDocument();
        }
    }
}

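class TikaContentHandler implements ContentHandler, RecursiveParserListener {
    // Metadata of the documents currently being parsed, innermost on top
    // (uses java.util.Deque and java.util.ArrayDeque)
    private final Deque<Metadata> stack = new ArrayDeque<Metadata>();
    //...
    public void startSubDocument(Metadata metadata) {stack.push(metadata);}
    public void endSubDocument() {stack.pop();}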
    //...
    public void endElement(String uri, String localName, String qName) throws SAXException {
        //...
        // if this end element means a document is ending
        Metadata metadata = stack.peek();
        // do something with metadata and document text
    }

}

The basic idea is that if you have gone to the trouble of implementing a ContentHandler capable of identifying text for each individual nested document, then if you can also get notifications for when a subdocument with separate metadata starts and ends, you can keep track of this metadata and associate it with the text you extract.
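The bookkeeping side of this idea can be tried out without Tika. The sketch below is a simplified variant, not the handler I actually use: it pairs each metadata object with its own text buffer on a stack, and it uses the start and end notifications themselves as document boundaries instead of inspecting end elements. SubDocumentListener, CollectingHandler and CollectingDemo are invented names, the type parameter M stands in for Tika's Metadata, and the demo simulates the notifications and SAX events by hand.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical generic form of the listener; M stands in for Tika's Metadata.
interface SubDocumentListener<M> {
    void startSubDocument(M metadata);
    void endSubDocument();
}

class CollectingHandler<M> extends DefaultHandler implements SubDocumentListener<M> {
    // One document in progress: its metadata plus the text seen so far.
    static final class Doc<M> {
        final M metadata;
        final StringBuilder text = new StringBuilder();

        Doc(M metadata) {
            this.metadata = metadata;
        }
    }

    private final Deque<Doc<M>> open = new ArrayDeque<Doc<M>>();
    private final List<Doc<M>> finished = new ArrayList<Doc<M>>();

    @Override
    public void startSubDocument(M metadata) {
        // Keep the reference: the parser fills the metadata in later.
        open.push(new Doc<M>(metadata));
    }

    @Override
    public void endSubDocument() {
        finished.add(open.pop()); // metadata and text now belong together
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        Doc<M> current = open.peek();
        if (current != null) {
            current.text.append(ch, start, length); // text goes to the innermost document
        }
    }

    List<Doc<M>> getFinished() { return finished; }
}

class CollectingDemo {
    private static void text(CollectingHandler<?> h, String s) {
        h.characters(s.toCharArray(), 0, s.length());
    }

    // Simulates a container with one nested document.
    static List<String> run() {
        CollectingHandler<Map<String, String>> h = new CollectingHandler<Map<String, String>>();
        Map<String, String> zip = new HashMap<String, String>();
        Map<String, String> entry = new HashMap<String, String>();

        h.startSubDocument(zip);          // metadata is still empty at this point
        text(h, "listing");
        h.startSubDocument(entry);
        entry.put("name", "a.txt");       // filled in during the parse, as a parser would
        text(h, "alpha");
        h.endSubDocument();
        zip.put("name", "archive.zip");
        h.endSubDocument();

        List<String> out = new ArrayList<String>();
        for (CollectingHandler.Doc<Map<String, String>> d : h.getFinished()) {
            out.add(d.metadata.get("name") + ": " + d.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the handler stores a reference to the metadata object rather than a copy, anything the parser adds between the start and end notifications is visible by the time the document is finished. Whether the characters events of a nested document really reach the outer handler between those two notifications depends on how the nested parse is wired up, so treat that as something to check in your own setup.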

Hopefully this example offers an idea of what you would have to do to get both the text and metadata for a nested document.

A Possibly Misplaced or Inappropriate Wish for Tika

While it is possible to get the text for each nested document in a container using Tika, and it is possible to get the metadata for each nested document, it would be nice if Tika offered an easy way to get both the text and the metadata for a nested document together as a single entity.

Tika seems to want to turn any file you give it into a single XHTML document, or the stream of ContentHandler events you would get if you were parsing that single XHTML document. Containers that aren't logically a single document (containers that are logically single documents include OLE2 and .xlsx) don't live comfortably inside this single document model. Because Tika does a great job of identifying and parsing a wide variety of container types, and because Tika is being extended to identify when a container is logically a single document and when a container is logically many separate documents, it would be nice if there was a better way for Tika to return the metadata and text for containers that are logically many separate documents.