Install Tinyproxy (Ubuntu Dapper)
Install
sudo apt-get install tinyproxy
Configure
sudo vi /etc/tinyproxy/tinyproxy.conf
Sample configuration. Make sure you set Port and Allow (here, I allow my localhost):
Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
# filters will act as a blacklist
FilterDefaultDeny No
User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
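A quick way to confirm that Port and Allow made it into the file is to read it back. The following is a minimal Python sketch, not part of Tinyproxy: the function names parse_tinyproxy_conf and check_basics are hypothetical, and the parser assumes the simple "Name value" layout shown above, with # starting a comment.

```python
def parse_tinyproxy_conf(text):
    """Collect directives from tinyproxy.conf-style text into a dict of lists."""
    directives = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment and never appears inside a value.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip().strip('"') if len(parts) > 1 else ""
        # Directives such as Allow and ConnectPort may repeat, so keep a list.
        directives.setdefault(name, []).append(value)
    return directives


def check_basics(directives):
    """Return a list of problems for the two settings this guide relies on."""
    problems = []
    if "Port" not in directives:
        problems.append("no Port directive")
    if "Allow" not in directives:
        problems.append("no Allow directive")
    return problems


# A shortened copy of the sample configuration, used here for illustration.
SAMPLE = """\
Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
# filters will act as a blacklist
FilterDefaultDeny No
Logfile "/var/log/tinyproxy.log"
"""
```

To check the real file, pass the contents of /etc/tinyproxy/tinyproxy.conf to parse_tinyproxy_conf and print the result of check_basics; an empty list means both settings are present.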
Create filters
This step is optional. Because FilterDefaultDeny is set to No, the filter file acts as a blacklist.
sudo vi /etc/tinyproxy/filter
and add the sites to be blocked, one per line:
google.com
apache.org
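Each filter line is treated as a regular expression, so a short pattern can match more than you expect. The sketch below approximates that behaviour with Python's re module; it assumes a pattern may match anywhere in the URL, and Python's regex syntax is not identical to the POSIX syntax Tinyproxy uses. The names load_patterns and is_blocked are hypothetical.

```python
import re


def load_patterns(text):
    """Compile one pattern per non-empty, non-comment line of a filter file."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def is_blocked(url, patterns):
    """Blacklist semantics: a URL is blocked if any pattern matches it."""
    return any(p.search(url) for p in patterns)


# The two entries from the filter file above.
FILTER_TEXT = "google.com\napache.org\n"
PATTERNS = load_patterns(FILTER_TEXT)
```

With these patterns, is_blocked("http://lucene.apache.org/nutch/", PATTERNS) is true, so a crawl of the Nutch site itself would be filtered. Because an unescaped dot matches any character, "googleXcom" would also match; write google\.com if you want a literal dot.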
Commands to stop, start, and restart
sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart
Test the proxy with your browser
For Firefox, open Preferences, go to the General tab, and click Connection Settings. Select Manual Proxy Configuration, then enter the proxy host and the port you defined above.
If you created the filter above and browse to google.com, the proxy should block the request.
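The same check can be done without a browser. This is a minimal sketch using Python's standard urllib; the helper names make_proxy_opener and fetch_status are hypothetical, and the host and port are whatever you configured (127.0.0.1 and 5555 in the sample above).

```python
import urllib.error
import urllib.request


def make_proxy_opener(host, port):
    """Build an opener that sends plain-HTTP requests through the given proxy."""
    proxy = urllib.request.ProxyHandler({"http": f"http://{host}:{port}"})
    return urllib.request.build_opener(proxy)


def fetch_status(opener, url, timeout=10):
    """Return the HTTP status seen through the proxy, or None if it is unreachable."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The proxy (or the site) answered with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Nothing listening on that host/port, or the connection timed out.
        return None
```

For example, fetch_status(make_proxy_opener("127.0.0.1", 5555), "http://google.com/") should come back with an error status rather than 200 when the filter is active, and None usually means Tinyproxy is not running or the client address is not covered by an Allow line.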
Configure Nutch (Nutch 0.8)
Copy the proxy configuration (see below) from conf/nutch-default.xml to conf/nutch-site.xml and fill in the values for your proxy:
<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
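Before starting a crawl, it can help to confirm which proxy values Nutch will actually read. The sketch below parses the property list with Python's xml.etree.ElementTree. The function name read_properties is hypothetical, and the sample wraps the two properties in a <configuration> root element, which is an assumption about the layout of your nutch-site.xml; the lookup itself does not depend on the root element's name.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property>'s <name> to its <value> (empty string if missing)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Illustration only: the two properties from above inside an assumed root element.
SAMPLE = """<configuration>
  <property>
    <name>http.proxy.host</name>
    <value>192.168.0.157</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>5555</value>
  </property>
</configuration>"""

PROPS = read_properties(SAMPLE)
```

Here PROPS["http.proxy.host"] and PROPS["http.proxy.port"] hold the host and port as strings; run the same function on the contents of conf/nutch-site.xml to check your own values against the Port line in tinyproxy.conf.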
Now if you crawl sites, Nutch will use your proxy. You can monitor it by looking at the logs of Tinyproxy during a crawl:
sudo tail -f /var/log/tinyproxy.log
More resources
* http://ubuntuforums.org/showthread.php?t=122011
* http://doc.gwos.org/index.php/TinyProxy
* http://doc.ubuntu-fr.org/serveur/tinyproxy