Comment: update jar versions

...

  1. Clean up from any previous runs
    1. Remove tika-app-X.Y.jar from /data1/tools/tika/batch/bin, but leave the other "optional" jars in place: jai-imageio-jpeg2000-1.4.0.jar, sqlite-jdbc-3.4345.2.10.jar and zstd-jni-1.5.5-611.jar
    2. Remove or rename /data1/tools/tika/batch/logs
    3. Remove or rename /data1/tools/tika/batch/nohup.out
  2. Run the current "A" version
    1. Place the "A" version of tika-app-X.Y.jar in /data1/tools/tika/batch/bin
    2. Modify appBatchExecutor.sh to
      1. put the output in a new output directory -o /data1/extracts/pdfboxA
      2. if using a file list, confirm that the correct file list is specified -fileList fileLists/ccAndBugTracker_pdfs.txt
    3. Execute: nohup ./appBatchExecutor.sh &
    4. Wait for the "A" version to complete before starting the "B" version
  3. Build and run the "B" version
    1. Update PDFBox from SVN and build it: mvn clean install
    2. Update the PDFBox, Fontbox and jbig2-imageio versions in the Tika project tika-parsers/pom.xml
    3. Run mvn clean on the whole Tika project and make sure that your IDE has picked up the changes
    4. Run the PDFParser tests in tika-parsers/src/test/java/o.a.t.parsers.pdf.* to make sure that at least the Tika unit tests work.
    5. Build the entire Tika project (even though you'll only use tika-app.jar): mvn clean install
    6. On the VM, remove tika-app-A.jar from /data1/tools/tika/batch/bin, rename the existing nohup.out to nohup-A.out, and rename logs/ to logs-A/
    7. Drop the new tika-app-B.jar into (you guessed it!): /data1/tools/tika/batch/bin
    8. Modify appBatchExecutor.sh to
      1. put the output in a new output directory -o /data1/extracts/pdfboxB
      2. if using a file list, confirm that the correct file list is specified -fileList fileLists/ccAndBugTracker_pdfs.txt
    9. Execute: nohup ./appBatchExecutor.sh &
    10. Wait for the "B" version to complete before starting the comparisons and reports
  4. Make the comparisons and report
    1. In /data1/tools/tika/eval, remove or rename the existing db file pdfboxAvsB.mv.db.
    2. Execute: nohup java -jar tika-eval-app-X.Y.jar Compare -extractsA /data1/extracts/pdfboxA -extractsB /data1/extracts/pdfboxB -db pdfboxAvsB &
    3. When that completes,
      1. Remove any files left over from the last run in reports/: rm -r reports
      2. Write the reports: java -Djava.io.tmpdir=tmp -jar tika-eval-app-X.Y.jar Report -db pdfboxAvsB
         Note the -Djava.io.tmpdir=tmp: the tmp directory must be set to something writeable by 'collab'.
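
The rename-and-remove housekeeping between runs (steps 1 and 3.6) is easy to get wrong by hand, so it could be wrapped in a small helper. The following is only a sketch: the function name archive_run and the BATCH_DIR variable are hypothetical and are not part of the existing tooling. It assumes the directory layout described in the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: archive_run and BATCH_DIR are illustrative names, not existing tooling.
set -eu

# Defaults to the batch directory used in the steps above; override to try it elsewhere.
BATCH_DIR="${BATCH_DIR:-/data1/tools/tika/batch}"

# archive_run LABEL
# Renames nohup.out to nohup-LABEL.out and logs/ to logs-LABEL/, then removes
# the tika-app jar from bin/ so the next version can be dropped in.
archive_run() {
  label="$1"

  # Refuse to overwrite the archive of an earlier run with the same label.
  if [ -e "$BATCH_DIR/nohup-$label.out" ] || [ -e "$BATCH_DIR/logs-$label" ]; then
    echo "archive for run '$label' already exists in $BATCH_DIR" >&2
    return 1
  fi

  if [ -f "$BATCH_DIR/nohup.out" ]; then
    mv "$BATCH_DIR/nohup.out" "$BATCH_DIR/nohup-$label.out"
  fi
  if [ -d "$BATCH_DIR/logs" ]; then
    mv "$BATCH_DIR/logs" "$BATCH_DIR/logs-$label"
  fi

  # Only tika-app jars are removed; the "optional" jars stay in bin/.
  rm -f "$BATCH_DIR"/bin/tika-app-*.jar
}
```

For example, after the "A" run completes, calling archive_run A would leave nohup-A.out and logs-A/ behind and an empty slot in bin/ for tika-app-B.jar. The guard against an existing archive is a design choice of this sketch, so that a repeated call cannot silently move logs/ inside an older logs-A/.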

...