Install Tinyproxy (Ubuntu Dapper)
Install
sudo apt-get install tinyproxy
Configure
sudo vi /etc/tinyproxy/tinyproxy.conf
Sample configuration. Make sure you set Port and Allow (here, I'm allowing my localhost and my local subnet):
Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
# filters will act as a blacklist
FilterDefaultDeny No
User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
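To double-check that Port and Allow are really set, a small script can read the directives back. This is only a sketch: `parse_tinyproxy_conf` is a hypothetical helper, not part of Tinyproxy, and it assumes the simple "Directive value" layout shown above, with everything after a `#` treated as a comment.

```python
from collections import defaultdict


def parse_tinyproxy_conf(text):
    """Collect directives as {name: [values]}; repeated ones such as Allow accumulate."""
    directives = defaultdict(list)
    for raw in text.splitlines():
        # Simplification: anything after '#' is treated as a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip().strip('"') if len(parts) > 1 else ""
        directives[name].append(value)
    return dict(directives)


# Excerpt of the sample configuration above.
sample = """\
Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
# filters will act as a blacklist
FilterDefaultDeny No
"""

conf = parse_tinyproxy_conf(sample)
print("Port:", conf.get("Port"))
print("Allow:", conf.get("Allow"))
```

To check the real file, pass it the contents of /etc/tinyproxy/tinyproxy.conf instead of the sample string.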
Create filters
If necessary, create a filter file. It will act as a blacklist because FilterDefaultDeny is set to No.
sudo vi /etc/tinyproxy/filter
and add the sites to be blocked, one per line:
google.com
apache.org
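As far as I know, Tinyproxy treats each line of the filter file as a regular expression, and with FilterURLs On it is tested against the whole URL. The sketch below imitates that blacklist behaviour with Python's `re` module, which is not the same regex engine as Tinyproxy's, so take it as an approximation only. `is_blocked` is a hypothetical helper written for this page.

```python
import re


def is_blocked(url, patterns):
    """Blacklist mode: a URL is blocked if any pattern matches somewhere in it."""
    return any(re.search(pattern, url) for pattern in patterns)


# The two entries from the filter file above.
patterns = ["google.com", "apache.org"]

for url in ("http://www.google.com/search",
            "http://lucene.apache.org/nutch/",
            "http://example.com/"):
    print(url, "blocked" if is_blocked(url, patterns) else "allowed")
```

Note that an unescaped dot matches any character, so `google.com` would also match a string like `googleXcom`. Writing `google\.com` is stricter.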
Commands to stop, start, and restart
sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart
Test the proxy with your browser
- In Firefox, open Preferences, go to the General tab, and click Connection Settings. Then select Manual Proxy Configuration and enter the host and port you defined above.
- If you created the filter above and browse to google.com, the proxy should block the request.
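If you prefer to test without a browser, the same check can be scripted. This is a minimal sketch using Python's standard `urllib.request`; the host 127.0.0.1 and port 5555 come from the sample configuration above, and `build_proxy_opener` and `fetch_status` are hypothetical helper names chosen for this page.

```python
import urllib.error
import urllib.request


def build_proxy_opener(host, port):
    """Return an opener that sends plain HTTP requests through the given proxy."""
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_status(opener, url, timeout=10):
    """Return the HTTP status code for url, including error codes sent back by the proxy."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Example, only meaningful while Tinyproxy is running:
#   opener = build_proxy_opener("127.0.0.1", 5555)
#   print(fetch_status(opener, "http://example.com/"))
#   print(fetch_status(opener, "http://google.com/"))
```

With the filter in place, the second request should come back with an error status rather than 200. If the proxy is not running at all, `opener.open` raises `urllib.error.URLError`, which this sketch deliberately does not catch.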
Configure Nutch (Nutch 0.8)
Copy the proxy configuration (see below) from conf/nutch-default.xml to conf/nutch-site.xml and fill in the values of your proxy:
<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
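Before starting a crawl, it can be worth confirming that the two properties really ended up in conf/nutch-site.xml. The sketch below reads name/value pairs with Python's `xml.etree.ElementTree`. `read_properties` is a hypothetical helper, and the `<configuration>` wrapper in the sample string is an assumption about how your file is laid out; the code itself only looks for `<property>` elements.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Sample mirroring the properties above; the wrapper element is assumed.
sample = """<configuration>
  <property>
    <name>http.proxy.host</name>
    <value>192.168.0.157</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>5555</value>
  </property>
</configuration>"""

props = read_properties(sample)
print(props.get("http.proxy.host"), props.get("http.proxy.port"))
```

For the real file, `ET.parse("conf/nutch-site.xml").getroot()` can replace the `ET.fromstring` call.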
Now if you crawl sites, Nutch will use your proxy. You can monitor it by looking at the logs of Tinyproxy during a crawl:
sudo tail -f /var/log/tinyproxy.log
More resources
* http://ubuntuforums.org/showthread.php?t=122011
* http://doc.gwos.org/index.php/TinyProxy
* http://doc.ubuntu-fr.org/serveur/tinyproxy