Notes on setting up the Tika Virtual Machine (hosted by Rackspace)

See TIKA-1331

install software

1. yum install java-1.7.0-openjdk-devel git svn httpd emacs ant

2. curl -s | bash

3. source "/root/.gvm/bin/"

4. gvm install groovy

5. move the groovy directory to /usr/share/groovy

6. add /usr/share/groovy/current/bin to your personal PATH via .bashrc

7. Maven:

  1. wget

  2. unzip

  3. mv apache-maven-3.2.5/ /opt/maven

  4. ln -s /opt/maven/bin/mvn /usr/bin/mvn

  5. nano /etc/profile.d/

    • #!/bin/bash



      export PATH MAVEN_HOME

      export CLASSPATH=.

  6. chmod +x /etc/profile.d/

  7. . /etc/profile.d/

8. ExifTool: follow directions here
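
The script body shown in Maven step 5 exports PATH and MAVEN_HOME without showing where they are set. A minimal sketch of what such a profile script could look like, assuming MAVEN_HOME is /opt/maven (the directory from Maven step 3). The function name and the target file name are hypothetical, since the real file name is not recorded in these notes:

```shell
# Sketch only: writes a Maven profile script to the path given as $1.
# Assumptions (not from the original notes): the target file name is the
# caller's choice, and MAVEN_HOME is /opt/maven as in Maven step 3.
write_maven_profile() {
  target="$1"
  cat > "$target" <<'EOF'
#!/bin/bash
MAVEN_HOME=/opt/maven
PATH=$MAVEN_HOME/bin:$PATH
export PATH MAVEN_HOME
export CLASSPATH=.
EOF
  chmod +x "$target"
}
```

Calling it with a path under /etc/profile.d/ and then sourcing that file should make `mvn -version` work in new login shells.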

config/admin stuff

nano /etc/ssh/sshd_config

Install, configure and run fail2ban. Thanks to tecmint.

yum install fail2ban

vi /etc/fail2ban/jail.conf

systemctl start fail2ban
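
These notes do not record what was changed in jail.conf. As a minimal sketch, assuming only the sshd jail is wanted and that local overrides go in a jail.local file next to jail.conf (which is where fail2ban expects local changes), something like the following could be used. The function name is hypothetical, the jail name sshd assumes fail2ban 0.9 or later, and the maxretry and bantime values are illustrative rather than the ones used on this VM:

```shell
# Sketch only: writes an sshd jail override into the directory given as $1
# (on the server this would be /etc/fail2ban).
# The values below are illustrative assumptions, not the VM's settings.
write_sshd_jail() {
  confdir="$1"
  cat > "$confdir/jail.local" <<'EOF'
[sshd]
enabled = true
maxretry = 5
bantime = 3600
EOF
}
```

After writing the file, `systemctl restart fail2ban` followed by `fail2ban-client status sshd` should show whether the jail is active.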

opening port 9998 for TIKA-1301

To open port 9998: firewall-cmd --zone=public --add-port=9998/tcp --permanent

firewall-cmd --reload
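
To confirm the port is open after the reload, a small check along these lines may help. This is a sketch: the function name port_status is hypothetical, and it relies only on the firewall-cmd --query-port option.

```shell
# Sketch only: reports whether a TCP port is open in the public zone.
# Prints "open", "closed", or "unknown" (when firewall-cmd is not installed).
# "closed" also covers the case where firewalld is not running.
port_status() {
  port="$1"
  if ! command -v firewall-cmd >/dev/null 2>&1; then
    echo "unknown"
    return 0
  fi
  if firewall-cmd --zone=public --query-port="${port}/tcp" >/dev/null 2>&1; then
    echo "open"
  else
    echo "closed"
  fi
}

port_status 9998
```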

permission management

1. adduser <user> sudo

2. passwd <user>

3. groupadd <admingroup>

4. usermod -g <admingroup> <user>

5. modify /etc/ssh/sshd_config to add user to AllowUsers

6. systemctl restart sshd.service (restarts sshd on RHEL 7)
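
Step 5 can be scripted rather than edited by hand. A minimal sketch, assuming the config has at most one AllowUsers line and that the user name contains no regex metacharacters. The function name add_allow_user is hypothetical:

```shell
# Sketch only: adds a user to the AllowUsers line of an sshd_config file.
# Assumes at most one AllowUsers line and a plain user name.
# $1 = config file, $2 = user name.
add_allow_user() {
  conf="$1"
  user="$2"
  # Nothing to do if the user is already listed.
  if grep -Eq "^AllowUsers([[:space:]].*)?[[:space:]]${user}([[:space:]]|\$)" "$conf"; then
    return 0
  fi
  if grep -q '^AllowUsers' "$conf"; then
    # Append to the existing line; write via a temp file, then copy back
    # so the original file keeps its owner and permissions.
    tmp="$(mktemp)"
    sed "s/^AllowUsers.*/& ${user}/" "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
  else
    # No AllowUsers line yet: add one at the end of the file.
    printf 'AllowUsers %s\n' "$user" >> "$conf"
  fi
}
```

Running `sshd -t` after the edit and before the restart checks the config for syntax errors, which reduces the risk of locking yourself out.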


/public/corpora/govdocs1 ...

prep govdocs1

1. cp zipfilelist.txt /public/corpora/govdocs1/archive

2. wget -i zipfilelist.txt

3. (go get some coffee)

4. cd govdocs1/scripts

5. groovy unzip.groovy 0

6. (go get some more coffee)

7. groovy rmBugged.groovy
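
A long wget run can drop files without anyone noticing. Before unzipping, a check like the following can list what is still missing. This is a sketch under two assumptions: the list holds one URL per line, and wget saved each file under the last path component of its URL. The function name missing_downloads is hypothetical:

```shell
# Sketch only: lists URLs from a wget input file whose target file is not
# yet present (or is empty) in a directory.
# $1 = URL list, $2 = download directory.
missing_downloads() {
  list="$1"
  dir="$2"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue            # skip blank lines
    name="${url##*/}"                    # last path component of the URL
    [ -s "$dir/$name" ] || echo "$url"   # report missing or empty files
  done < "$list"
}
```

The output can be redirected to a new list and passed to `wget -i` again to fetch only the missing archives.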

prep nsfpolardata

1. scp -r <user> .

2. (go get some coffee)

3. scp -r <user> .

4. (go get some coffee)

5. scp -r <user> .

6. (go get some coffee)

7. cd /data1/public/archives/nsf-polar-data/

8. export NUTCH_OPTS="-Xmx8192m -XX:MaxPermSize=8192m"

9. ./bin/nutch dump -outputDir out -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl/segments/

10. ./bin/nutch dump -outputDir out2 -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl2/segments/

11. ./bin/nutch dump -outputDir out3 -segment /data1/public/archives/nsf-polar-data/nasa-amd/crawlId/segments/

add more disc

From the Rackspace website, add a block storage volume and attach it to the server.

1. mkfs.ext3 /dev/xvdb

2. mkdir /data1

3. mount /dev/xvdb /data1

4. nano /etc/fstab

/dev/xvdb               /data1                  ext3    defaults        1 2

/dev/xvdc               /data2                  ext3    defaults        1 2

4b. When you wreck the fstab file and can't log into your system after a hard reboot and you are in recovery mode:

4c. Before you hit 4b, try mount -fav to see if there are any errors in your fstab file.
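
Alongside mount -fav, a rough field-count check can catch a mangled line before a reboot. The sketch below is not a substitute for mount -fav. It only flags lines with fewer than 4 or more than 6 whitespace-separated fields. The function name check_fstab_fields is hypothetical:

```shell
# Sketch only: a rough field-count check for an fstab-style file.
# It only catches lines with too few or too many fields.
# $1 = file to check. Returns non-zero if any line looks wrong.
check_fstab_fields() {
  awk '
    /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
    NF < 4 || NF > 6 {
      printf "line %d: expected 4 to 6 fields, found %d\n", NR, NF
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}
```

Copying /etc/fstab to a backup file before editing, then running this check and mount -fav on the edited file, gives two chances to catch a mistake before it matters.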


The httpd config file is in the usual place: /etc/httpd/conf/httpd.conf

1. Set robots.txt to disallow all; it goes under: /var/www/html

2. Link data dir(s) under: /var/www/html

3. Edit the config file to allow symlinks and to show directory listings

4. Show long file names...add to config file: IndexOptions FancyIndexing SuppressDescription NameWidth=*

5. start: apachectl start
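
Steps 1 to 4 can be sketched as follows. The function names are hypothetical, and the Directory block is an assumption about how the config might be written, not a copy of this VM's httpd.conf:

```shell
# Sketch only: prepares a document root for browsing a data directory.
# $1 = document root (on the server, /var/www/html)
# $2 = data directory to expose
publish_data_dir() {
  docroot="$1"
  datadir="$2"
  # Step 1: ask all crawlers to stay out of the whole site.
  printf 'User-agent: *\nDisallow: /\n' > "$docroot/robots.txt"
  # Step 2: link the data directory under the document root.
  ln -s "$datadir" "$docroot/$(basename "$datadir")"
}

# Steps 3 and 4: a candidate httpd.conf fragment, printed as text only.
# This is an assumed layout, not the VM's actual configuration.
print_httpd_fragment() {
  cat <<'EOF'
<Directory "/var/www/html">
    Options +Indexes +FollowSymLinks
    IndexOptions FancyIndexing SuppressDescription NameWidth=*
</Directory>
EOF
}
```

Running `apachectl configtest` after editing the config and before starting the server reports syntax errors without affecting a running instance.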

Other data

See ApacheTikaHtmlEncodingStudy for a description of gathering data for TIKA-2038.

VirtualMachine (last edited 2017-02-21 01:39:22 by TimothyAllison)