Notes on setting up the Tika Virtual Machine (hosted by Rackspace)

See TIKA-1331

install software

1. yum install java-1.7.0-openjdk-devel git svn httpd emacs ant

2. curl -s | bash

3. source "/root/.gvm/bin/"

4. gvm install groovy

5. mv groovy to /usr/share/groovy

6. added /usr/share/groovy/current/bin to personal path via .bashrc

7. Maven:

  1. wget

  2. unzip

  3. mv apache-maven-3.6.0/ /opt/maven

  4. ln -s /opt/maven/bin/mvn /usr/bin/mvn

  5. nano /etc/profile.d/

    • #!/bin/bash



      export PATH MAVEN_HOME

      export CLASSPATH=.

  6. chmod +x /etc/profile.d/

  7. . /etc/profile.d/

8. ExifTool: follow directions here
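
Step 7.5 above preserves only part of the profile script. A minimal sketch of what such a file might contain follows, assuming Maven lives in /opt/maven as in step 7.3. The two assignment lines are an assumption inferred from that step, not text recovered from the original notes; only the two export lines appear in the notes.

```shell
#!/bin/bash
# Sketch of a profile.d script for Maven.
# ASSUMPTION: the two assignments below are inferred from step 7.3 (/opt/maven);
# only the export lines come from the notes.
MAVEN_HOME=/opt/maven
PATH="$MAVEN_HOME/bin:$PATH"
export PATH MAVEN_HOME
export CLASSPATH=.
```

After sourcing the file, `mvn -version` should report the Maven installation if the paths are right.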

config/admin stuff

nano /etc/ssh/sshd_config

Install, configure and run fail2ban. Thanks to tecmint.

yum install fail2ban

vi /etc/fail2ban/jail.conf

systemctl start fail2ban

opening port 9998 for TIKA-1301

To open port 9998: firewall-cmd --zone=public --add-port=9998/tcp --permanent

firewall-cmd --reload
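
To confirm that the port actually ended up in the zone after the reload, something like the following could be used. This is a sketch: the helper name `port_is_listed` is made up for illustration, and it only inspects the space-separated list that `firewall-cmd --zone=public --list-ports` prints.

```shell
# Return success if the wanted port/protocol (e.g. 9998/tcp) appears
# in a space-separated list of open ports.
port_is_listed() {
  ports="$1"
  wanted="$2"
  for p in $ports; do
    if [ "$p" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# Usage on the VM (needs root and a running firewalld):
#   port_is_listed "$(firewall-cmd --zone=public --list-ports)" 9998/tcp && echo "9998 is open"
```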

permission management

1. adduser <user> sudo

2. passwd <user>

3. groupadd <admingroup>

4. usermod -g <admingroup> <user>

5. modify /etc/ssh/sshd_config to add user to AllowUsers

6. systemctl restart sshd.service (this restarts sshd on RHEL 7)
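
Step 5 can be scripted rather than edited by hand. The helper below is a sketch, not the procedure that was actually used: `add_allowed_user` is a hypothetical name, and it assumes a single AllowUsers line and no Match blocks at the end of the file. It is safest to run it on a copy of the config and validate the copy before installing it.

```shell
# Add a user to the AllowUsers line of an sshd_config-style file.
# Usage: add_allowed_user <config-file> <user>
add_allowed_user() {
  conf="$1"
  user="$2"
  if grep -q '^AllowUsers[[:space:]]' "$conf"; then
    # Leave the file alone if the user is already listed
    if grep -Eq "^AllowUsers[[:space:]](.*[[:space:]])?${user}([[:space:]]|\$)" "$conf"; then
      return 0
    fi
    # Append the user to the existing AllowUsers line
    sed "s/^AllowUsers[[:space:]].*/& ${user}/" "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
  else
    # No AllowUsers line yet: add one at the end of the file
    printf 'AllowUsers %s\n' "$user" >> "$conf"
  fi
}

# Usage on the VM (as root), working on a copy first:
#   cp /etc/ssh/sshd_config /root/sshd_config.new
#   add_allowed_user /root/sshd_config.new <user>
#   sshd -t -f /root/sshd_config.new    # syntax check before installing the copy
```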


/public/corpora/govdocs1 ...

prep govdocs1

1. cp zipfilelist.txt /public/corpora/govdocs1/archive

2. wget -i zipfilelist.txt

3. (go get some coffee)

4. cd govdocs1/scripts

5. groovy unzip.groovy 0

6. (go get some more coffee)

7. groovy rmBugged.groovy

prep nsfpolardata

1. scp -r <user> .

2. (go get some coffee)

3. scp -r <user> .

4. (go get some coffee)

5. scp -r <user> .

6. (go get some coffee)

7. cd /data1/public/archives/nsf-polar-data/

8. export NUTCH_OPTS="-Xmx8192m -XX:MaxPermSize=8192m"

9. ./bin/nutch dump -outputDir out -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl/segments/

10. ./bin/nutch dump -outputDir out2 -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl2/segments/

11. ./bin/nutch dump -outputDir out3 -segment /data1/public/archives/nsf-polar-data/nasa-amd/crawlId/segments/
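
Steps 9 to 11 repeat the same command with different arguments, so they can be generated from a small table. The sketch below only prints the commands; the function name `print_dump_cmds` is made up for illustration, and the paths are copied from the steps above.

```shell
# Print one "nutch dump" command per crawl; pipe the output to sh to run them.
BASE=/data1/public/archives/nsf-polar-data

print_dump_cmds() {
  # Each entry is "<output dir>:<segments path relative to BASE>" (steps 9-11).
  for pair in \
    "out:acadis/AcadisCrawl/segments/" \
    "out2:acadis/AcadisCrawl2/segments/" \
    "out3:nasa-amd/crawlId/segments/"
  do
    out=${pair%%:*}   # text before the first colon
    seg=${pair#*:}    # text after the first colon
    echo "./bin/nutch dump -outputDir $out -segment $BASE/$seg"
  done
}

# Usage, after steps 7 and 8:
#   print_dump_cmds         # review the commands
#   print_dump_cmds | sh    # run them one after another
```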

add more disk

From the Rackspace website, add a block storage volume and attach it to the server.

1. mkfs.ext3 /dev/xvdb

2. mkdir /data1

3. mount /dev/xvdb /data1

4. nano /etc/fstab

/dev/xvdb               /data1                  ext3    defaults        1 2

/dev/xvdc               /data2                  ext3    defaults        1 2

4b. When you wreck the fstab file and can't log into your system after a hard reboot and you are in recovery mode:

4c. Before you hit 4b, try mount -fav to see if there are any errors in your fstab file.
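
In the spirit of 4c, a quick field-count check can catch a mangled line before `mount -fav` is even run. This is a sketch under an assumption: it insists on six whitespace-separated fields per entry, which matches the entries in these notes, although fstab itself lets the last two fields be omitted. The name `check_fstab_fields` is hypothetical.

```shell
# Report fstab entries that do not have exactly six fields.
# Usage: check_fstab_fields <fstab-file>
check_fstab_fields() {
  awk '
    /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
    NF != 6 {
      printf "line %d: expected 6 fields, found %d\n", NR, NF
      bad = 1
    }
    END { exit bad }          # non-zero exit status if any line was flagged
  ' "$1"
}

# Usage on the VM, before rebooting:
#   check_fstab_fields /etc/fstab && mount -fav
```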


config file in the usual place: /etc/httpd/conf/httpd.conf

1. Set robots.txt to disallow all: /var/www/html

2. Link data dir(s) under: /var/www/html

3. Configure config file to allow links and to show directories

4. Show long file names...add to config file: IndexOptions FancyIndexing SuppressDescription NameWidth=*

5. start: apachectl start
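
Steps 1 and 2 can be sketched as a small helper. The function name `setup_docroot` is made up for illustration, and the data directory in the usage line is only an example path; the robots.txt content is the standard "disallow everything" form.

```shell
# Write a disallow-all robots.txt into the document root and
# symlink each given data directory under it.
# Usage: setup_docroot <docroot> <data-dir>...
setup_docroot() {
  docroot="$1"
  shift
  printf 'User-agent: *\nDisallow: /\n' > "$docroot/robots.txt"
  for dir in "$@"; do
    # -n replaces an existing link instead of descending into it
    ln -sfn "$dir" "$docroot/$(basename "$dir")"
  done
}

# Usage on the VM (as root); the data directory here is an example:
#   setup_docroot /var/www/html /data1/public
```

The links are only served if step 3 is done as well, since httpd must be allowed to follow symlinks and to generate directory indexes.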


Downloads from:

Current version: 4.00; Released: 2017 Aug 10

1. Downloaded 64-bit Linux XpdfReader; executed:; unpacked and cp xpdf to /usr/local/bin

2. Downloaded 64-bit Linux Xpdf tools; unpacked and cp bin64/* to /usr/local/bin

3. Downloaded language support packages: Arabic, Chinese/simplified, Chinese/traditional, Cyrillic, Greek, Hebrew, Japanese, Korean, Latin2, Thai and Turkish; unzipped them all, cat all add-to-xpdfrc >> tmp_xpdfrc and cp all to /usr/local/share/xpdf

4. cat xpdf-tools-linux-4.00/doc/sample-xpdfrc tmp_xpdfrc >> /usr/local/etc/xpdfrc

NOTE: pdftotext did not appear to read the xpdfrc file from this location: the extracted text was the same with the file in place and with it removed. Specifying the xpdfrc file on the command line with the -cfg option did change the output, especially for CJK PDFs.
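
Given that note, one way to make sure the config is always picked up is to wrap the call. This is a sketch: `extract_with_cfg` is a hypothetical helper, the xpdfrc path is the one from step 4, and `-enc UTF-8` is an addition that does not appear in the notes.

```shell
# Path to the xpdfrc assembled in step 4; override by setting XPDFRC.
XPDFRC="${XPDFRC:-/usr/local/etc/xpdfrc}"

# Run pdftotext with the config file passed explicitly via -cfg.
# Usage: extract_with_cfg <input.pdf> <output.txt>
extract_with_cfg() {
  pdftotext -cfg "$XPDFRC" -enc UTF-8 "$1" "$2"
}

# Usage:
#   extract_with_cfg input.pdf output.txt
```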


1. sudo yum install epel-release

2. sudo yum localinstall --nogpgcheck

3. sudo yum install ffmpeg ffmpeg-devel

Other data

See ApacheTikaHtmlEncodingStudy for a description of gathering data for TIKA-2038. See CommonCrawl3 for a description of refreshing data for TIKA-2750. See ComparisonTikaAndPDFToText201811 for a comparison of text extracted from PDFs by Apache Tika/Apache PDFBox and pdftotext.

