Install Tinyproxy (Ubuntu Dapper)

Install

sudo apt-get install tinyproxy

Configure

sudo vi /etc/tinyproxy/tinyproxy.conf

Sample configuration. Make sure you set Port and Allow for your own setup (here, I allow localhost and my local network):

Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
# Filters will act as a blacklist
FilterDefaultDeny No

User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
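
Before restarting, it can help to double-check which port and access rules are actually active. The helper below is only a sketch: show_access_settings is a hypothetical name, and it simply lists the uncommented Port and Allow lines of the file it is given.

```shell
# Hypothetical helper: print the active (uncommented) Port and Allow
# lines from a Tinyproxy configuration file.
show_access_settings() {
    conf_file=${1:-/etc/tinyproxy/tinyproxy.conf}
    # Anchor at the start of the line so that commented-out entries
    # and directives such as ConnectPort are not listed.
    grep -E '^[[:space:]]*(Port|Allow)[[:space:]]' "$conf_file"
}

# Usage (the path is the one used above):
#   show_access_settings /etc/tinyproxy/tinyproxy.conf
```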

Create filters

If necessary, create a filter file. It will act as a blacklist, because FilterDefaultDeny is set to No:

sudo vi /etc/tinyproxy/filter

and add the sites to be blocked, one per line:

google.com
apache.org
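
To get a rough idea of which URLs a filter file would catch, the sketch below imitates the matching with grep. It rests on assumptions: that each filter line is treated as a basic regular expression, matched case-insensitively against the whole URL (FilterURLs On), and that blank lines and lines starting with # are skipped. is_blocked is a hypothetical helper, not part of Tinyproxy.

```shell
# Hypothetical helper: succeed (exit status 0) if the URL matches any
# pattern in the filter file, fail (exit status 1) otherwise.
# Assumption: patterns are basic regular expressions, matched
# case-insensitively, one per line.
is_blocked() {
    url=$1
    filter_file=${2:-/etc/tinyproxy/filter}
    while IFS= read -r pattern; do
        case $pattern in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if printf '%s\n' "$url" | grep -q -i -e "$pattern"; then
            return 0
        fi
    done < "$filter_file"
    return 1
}

# Usage:
#   is_blocked "http://www.google.com/" /etc/tinyproxy/filter && echo blocked
```

Note that in a regular expression the dot matches any character, so a line like google.com is broader than it looks. Writing google\.com is stricter.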

Commands to stop, start, and restart

sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart

Test the proxy with your browser
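
If you prefer the command line over a browser, curl can send a request through the proxy. This is a minimal sketch, assuming curl is installed and the proxy listens on 127.0.0.1:5555 as in the sample configuration. check_proxy is a hypothetical name and http://example.com/ is only a placeholder target.

```shell
# Hypothetical helper: fetch a URL through the proxy and print the
# HTTP status code of the response.
check_proxy() {
    proxy_host=${1:-127.0.0.1}
    proxy_port=${2:-5555}
    target=${3:-http://example.com/}
    curl --silent --show-error --output /dev/null \
         --write-out '%{http_code}\n' \
         --proxy "http://${proxy_host}:${proxy_port}" \
         "$target"
}

# Usage:
#   check_proxy 127.0.0.1 5555 http://example.com/
```

A 200 suggests the request went through the proxy. For anything else, the Tinyproxy log is the first place to look.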

Configure Nutch (Nutch 0.8)

Copy the proxy configuration (see below) from conf/nutch-default.xml to conf/nutch-site.xml and fill in the values for your proxy:

<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
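
To confirm that conf/nutch-site.xml really carries the values you expect, a quick lookup like the one below may help. It is a sketch under a strong assumption: each name and value element sits on its own line, as in the sample above. nutch_property is a hypothetical helper, not a Nutch tool.

```shell
# Hypothetical helper: print the <value> that follows a given <name>
# in a Nutch configuration file.
# Assumption: <name> and <value> are each on their own line.
nutch_property() {
    awk -v name="$1" '
        # Remember that the wanted property name has been seen.
        index($0, "<name>" name "</name>") { found = 1; next }
        # Print the text between <value> and </value>, then stop.
        found && match($0, /<value>[^<]*<\/value>/) {
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$2"
}

# Usage:
#   nutch_property http.proxy.host conf/nutch-site.xml
#   nutch_property http.proxy.port conf/nutch-site.xml
```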

Now if you crawl sites, Nutch will use your proxy. You can monitor it by looking at the logs of Tinyproxy during a crawl:

sudo tail -f /var/log/tinyproxy.log

More resources

* http://ubuntuforums.org/showthread.php?t=122011
* http://doc.gwos.org/index.php/TinyProxy
* http://doc.ubuntu-fr.org/serveur/tinyproxy