Extracting Text from PowerPoint format

Here are several pointers to PPT text-extraction code. None of it is guaranteed to work; if you test any of it, please update this list with what you found.

Using Apache Tika: http://tika.apache.org/

Using POI HSLF: http://jakarta.apache.org/poi/hslf/index.html (see Quick Guide for details on text extraction)

From: poi-users: http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html

From: slide-dev: http://www.mail-archive.com/slide-dev@jakarta.apache.org/msg10445.html

From: http://nagoya.apache.org/eyebrowse/ReadMsg?listName=poi-dev@jakarta.apache.org&msgNo=4326

Here is some sample code that works with some PPT formats. It is essentially an implementation of POI's POIFSReaderListener. There are no guarantees on how well it works: for a start, it is known to ignore Unicode text records. It requires the POI libraries.
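
The sample code itself is not reproduced on this page. As a rough illustration of the kind of record scan such a listener might run over the bytes of the "PowerPoint Document" stream, here is a minimal sketch in plain Java. It is not the original code and it does not use POI: the class name PptTextScan and the method extractText are made up for this example, and the stream bytes are assumed to have been read into memory already (for instance inside a POIFSReaderListener callback). Like the code described above, it only picks up 8-bit text records (TextBytesAtom, record type 4008) and skips Unicode text records.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of POI: scans raw "PowerPoint Document" stream bytes. */
class PptTextScan {
    // Record type of TextBytesAtom (8-bit text) in the binary .ppt format.
    static final int TEXT_BYTES_ATOM = 4008;
    // Every record starts with an 8-byte header: version/instance, type, length.
    static final int HEADER_SIZE = 8;

    /** Returns the text of every TextBytesAtom found, one record per line. */
    static String extractText(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos + HEADER_SIZE <= stream.length) {
            int verAndInstance = buf.getShort(pos) & 0xFFFF;
            int type = buf.getShort(pos + 2) & 0xFFFF;
            long length = buf.getInt(pos + 4) & 0xFFFFFFFFL;

            // A version nibble of 0xF marks a container: step inside it
            // and keep reading its child records.
            if ((verAndInstance & 0x000F) == 0x000F) {
                pos += HEADER_SIZE;
                continue;
            }
            // Stop on a truncated or corrupt record instead of reading past the end.
            if (length > stream.length - pos - HEADER_SIZE) {
                break;
            }
            if (type == TEXT_BYTES_ATOM) {
                // Each byte is the low byte of a character, so Latin-1 decoding fits.
                out.append(new String(stream, pos + HEADER_SIZE, (int) length,
                        StandardCharsets.ISO_8859_1));
                out.append('\n');
            }
            pos += HEADER_SIZE + (int) length;
        }
        return out.toString();
    }
}
```

Calling PptTextScan.extractText(bytes) on the contents of the "PowerPoint Document" stream returns the 8-bit text runs it finds. Walking record headers rather than testing every byte offset avoids false matches inside record data, but the sketch still assumes a well-formed stream and has not been tested against real presentations.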


PowerPoint (last edited 2010-12-20 13:21:22 by 213-186-245-1)