Differences between revisions 1 and 2
Revision 1 as of 2010-08-02 20:09:36
Size: 10664
Editor: cpe-72-181-242-67
Comment:
Revision 2 as of 2010-08-12 20:24:46
Size: 7634
Editor: cpe-72-181-242-67
Comment:

Introduction

After the MetadataDiscussion page was created, Jukka Zitting offered an example of how to reach recursive metadata when parsing with an AutoDetectParser, and later updated that example to show how to get both text and metadata for nested documents.

If you parse an archive (zip, tar, etc.), the parsed document contains other documents, and any of those documents could also be archives containing other documents, and so on. The example on this page shows you how to do the following:

  • Set up the parse context so nested documents will be parsed.
  • Wrap the AutoDetectParser so you can get the text and metadata for each nested document.

Jukka's Example

Here is the full source for Jukka's example of how to access nested metadata and document body text. This example writes the metadata and body text for each nested document to standard output. More details about how the example works are in the subsections below.

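   import java.io.File;
   import java.io.IOException;
   import java.io.InputStream;

   import org.apache.tika.exception.TikaException;
   import org.apache.tika.io.TikaInputStream;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.parser.ParseContext;
   import org.apache.tika.parser.Parser;
   import org.apache.tika.parser.ParserDecorator;
   import org.apache.tika.sax.BodyContentHandler;
   import org.xml.sax.ContentHandler;
   import org.xml.sax.SAXException;
   import org.xml.sax.helpers.DefaultHandler;

   // The imports and the enclosing class (its name is illustrative) were added
   // so that the example compiles as a single file.
   public class RecursiveMetadataExample {

       public static void main(String[] args) throws Exception {
           // Wrap AutoDetectParser and register the wrapper for nested documents.
           Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
           ParseContext context = new ParseContext();
           context.set(Parser.class, parser);

           // Events sent to this handler are discarded; the wrapper captures text itself.
           ContentHandler handler = new DefaultHandler();
           Metadata metadata = new Metadata();

           InputStream stream = TikaInputStream.get(new File(args[0]));
           try {
               parser.parse(stream, handler, metadata, context);
           } finally {
               stream.close();
           }
       }

       private static class RecursiveMetadataParser extends ParserDecorator {

           public RecursiveMetadataParser(Parser parser) {
               super(parser);
           }

           @Override
           public void parse(
                   InputStream stream, ContentHandler ignore,
                   Metadata metadata, ParseContext context)
                   throws IOException, SAXException, TikaException {
               // A fresh handler per call keeps each document's text separate.
               ContentHandler content = new BodyContentHandler();
               super.parse(stream, content, metadata, context);

               System.out.println("----");
               System.out.println(metadata);
               System.out.println("----");
               System.out.println(content.toString());
           }
       }
   }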

Main from Jukka's Example

Setting up Recursive Parsing

   public static void main(String[] args) throws Exception {
       Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
       ParseContext context = new ParseContext();
       context.set(Parser.class, parser);

The example starts by setting up recursive parsing. If you are parsing text files, Word documents, etc., you will never notice whether recursive parsing is enabled. If you are parsing containers like zip files and tar.gz files, the only way to get the text for the files inside those containers is to enable recursive parsing.

The way to enable recursive parsing is to create a ParseContext and add a parser to it as shown on the line context.set(Parser.class, parser). This is the parser that will be used to parse any nested documents.

In this case the parser is a RecursiveMetadataParser that wraps an AutoDetectParser. The RecursiveMetadataParser is part of Jukka's example, and more details are given below.
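
To make the role of the context concrete, here is a minimal sketch that uses no Tika classes at all. Every name in it (Doc, MiniParser, BaseParser, RecordingParser) is a hypothetical stand-in invented for illustration, and the way BaseParser hands children to the parser found in the context is an assumption about the general pattern, not Tika's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RecursiveContextSketch {

    /** Hypothetical document that may contain other documents. */
    static final class Doc {
        final String name;
        final List<Doc> children;

        Doc(String name, Doc... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Hypothetical stand-in for a parser interface. */
    interface MiniParser {
        void parse(Doc doc, Map<Class<?>, Object> context);
    }

    /** Plays the role of the wrapped parser: children go to the parser held by the context. */
    static final class BaseParser implements MiniParser {
        @Override
        public void parse(Doc doc, Map<Class<?>, Object> context) {
            Object nested = context.get(MiniParser.class);
            if (nested instanceof MiniParser) {
                for (Doc child : doc.children) {
                    ((MiniParser) nested).parse(child, context);
                }
            }
            // With no parser registered, nested documents are skipped.
        }
    }

    /** Plays the role of the decorator: delegates first, then records what was parsed. */
    static final class RecordingParser implements MiniParser {
        private final MiniParser delegate;
        final List<String> seen = new ArrayList<>();

        RecordingParser(MiniParser delegate) {
            this.delegate = delegate;
        }

        @Override
        public void parse(Doc doc, Map<Class<?>, Object> context) {
            delegate.parse(doc, context);
            seen.add(doc.name);
        }
    }

    /** Parses the root and returns the names the decorator saw, in order. */
    static List<String> visit(Doc root, boolean registerInContext) {
        RecordingParser parser = new RecordingParser(new BaseParser());
        Map<Class<?>, Object> context = new HashMap<>();
        if (registerInContext) {
            context.put(MiniParser.class, parser);
        }
        parser.parse(root, context);
        return parser.seen;
    }

    public static void main(String[] args) {
        Doc root = new Doc("outer.zip",
                new Doc("a.txt"),
                new Doc("inner.zip", new Doc("b.txt")));
        System.out.println(visit(root, true));   // [a.txt, b.txt, inner.zip, outer.zip]
        System.out.println(visit(root, false));  // [outer.zip]
    }
}
```

In this sketch the nested documents are recorded before their container because the decorator records only after its delegate returns, which matches the order of the println calls in the example's parse method.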

Parsing a File

       ContentHandler handler = new DefaultHandler();
       Metadata metadata = new Metadata();

       InputStream stream = TikaInputStream.get(new File(args[0]));
       try {
           parser.parse(stream, handler, metadata, context);
       } finally {
           stream.close();
       }

The rest of the main function parses a file. The parser used to parse the root document is the same parser that was added to the ParseContext as the parser to use for nested documents.

DefaultHandler does not appear in the Tika API (http://tika.apache.org/0.7/api/) because it is the standard SAX class org.xml.sax.helpers.DefaultHandler, which ships with the JDK. I also don't see a TikaInputStream in the 0.7 API. In place of DefaultHandler you could use BodyContentHandler, and in place of TikaInputStream you could use FileInputStream.
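
Since DefaultHandler is the JDK's org.xml.sax.helpers.DefaultHandler, it helps to see what it does with the events it receives: nothing. The sketch below uses only the JDK's SAX classes to compare it with a handler that keeps the character data found inside the body element. TextCollector and HandlerSketch are hypothetical names, and TextCollector is only a rough stand-in for the kind of work a body-text handler does, not Tika's BodyContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerSketch {

    /** Hypothetical handler that keeps only character data inside <body>. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the SAX events for the given markup to the handler. */
    static void parse(String xhtml, DefaultHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
    }

    /** Returns the text found inside the body element. */
    static String bodyText(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        parse(xhtml, collector);
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>ignored</title></head>"
                + "<body><p>hello</p></body></html>";
        parse(xhtml, new DefaultHandler());   // accepts every event and keeps nothing
        System.out.println(bodyText(xhtml));  // hello
    }
}
```

This is why passing a plain DefaultHandler from main costs nothing in the example: the decorator's parse method ignores that handler and supplies its own for every document.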

Jukka's RecursiveMetadata Parser

RecursiveMetadataParser Constructor

   private static class RecursiveMetadataParser extends ParserDecorator {

       public RecursiveMetadataParser(Parser parser) {
           super(parser);
       }

The RecursiveMetadataParser extends ParserDecorator. All the constructor has to do is let the ParserDecorator superclass know which parser object is being decorated.

RecursiveMetadataParser parse

       @Override
       public void parse(
               InputStream stream, ContentHandler ignore,
               Metadata metadata, ParseContext context)
               throws IOException, SAXException, TikaException {
           ContentHandler content = new BodyContentHandler();
           super.parse(stream, content, metadata, context);

           System.out.println("----");
           System.out.println(metadata);
           System.out.println("----");
           System.out.println(content.toString());
       }

   }

The parse method is where you get access to the metadata and the body text. When the parser set in ParseContext is used to parse a nested document, a new Metadata object is created and passed to the parse method. Since the example put a RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse method is called. Before calling super.parse, the metadata object is empty. After super.parse returns, the metadata object contains all of the metadata the decorated parser found and System.out.println(metadata) prints all of the metadata to standard output.

By creating a new BodyContentHandler and passing that to super.parse, the text for each document is captured without mixing it with text from other documents.

Surprise! Zips Have Text Too!

The great thing about AutoDetectParser is that it can parse and extract text from almost anything. In particular, it can parse zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, using Jukka's example code you can get the text and metadata for each file nested inside of the zip file. What you might not expect is that you also get metadata and body text for the zip file itself.

Maybe this doesn't surprise you at all. My first reaction when I saw both metadata AND text for the zip file itself was "What text could a zip file possibly have?" My naive assumption was that a zip file wouldn't contain any text, and my assumption was wrong.

I was thinking that a zip, tar, or other archive file was simply a container for other files, and so didn't have any text of its own. Tika looks at archives differently; Tika sees an archive as being like a directory in a file system, and the text for an archive is a list of the contents of the archive.

If you have a zip file that contains 100 text files, after using the code on this page to get the text and metadata for each file, you will get the text and metadata for 101 files: 100 text files, and 1 zip file. The text for the zip file will list the names for each of the 100 text files it contains.
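
The arithmetic can be checked without Tika. The sketch below uses only the JDK's java.util.zip package to build a zip in memory and list its entry names, which is the kind of directory-style listing described above. ZipListingSketch and the file names are made up for illustration, and the listing is not a reproduction of Tika's actual output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipListingSketch {

    /** Builds an in-memory zip holding the given number of small text files. */
    static byte[] buildZip(int fileCount) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 1; i <= fileCount; i++) {
                zip.putNextEntry(new ZipEntry("file" + i + ".txt"));
                zip.write(("text of file " + i).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Lists entry names, much as a directory listing would. */
    static List<String> entryNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        List<String> names = entryNames(buildZip(100));
        System.out.println("nested documents: " + names.size());
        System.out.println("documents counting the zip itself: " + (names.size() + 1));
    }
}
```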

If you aren't interested in seeing text and metadata for the zip file itself, check metadata.get(Metadata.CONTENT_TYPE) for each file Tika parses so you can skip the archives themselves. For a zip file, the content type is "application/zip".
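
As a sketch of that check, the helper below decides whether a content type string names an archive. It is plain Java with no Tika dependency; you would pass it the value read from the metadata under Metadata.CONTENT_TYPE. Only "application/zip" comes from this page. The class name ArchiveFilter and the other MIME types in the set are assumptions, so check what your Tika version actually reports before relying on them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ArchiveFilter {

    // Only "application/zip" is confirmed by this page; the rest are assumed values.
    private static final Set<String> ARCHIVE_TYPES = new HashSet<>(Arrays.asList(
            "application/zip",
            "application/x-tar",
            "application/gzip",
            "application/x-gzip",
            "application/x-bzip2"));

    /** Returns true when the content type, ignoring any parameters, is a known archive type. */
    static boolean isArchive(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return ARCHIVE_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isArchive("application/zip"));            // true
        System.out.println(isArchive("text/plain; charset=UTF-8"));  // false
    }
}
```

Inside the decorator's parse method you could call such a helper after super.parse returns and print the metadata and text only when it returns false.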

RecursiveMetadata (last edited 2014-12-19 16:38:49 by TimothyAllison)