Differences between revisions 1 and 3 (spanning 2 versions)
Revision 1 as of 2006-07-27 16:53:48
Size: 2253
Editor: pool-68-160-34-54
Comment: new page about setting up Nutch to use a proxy
Revision 3 as of 2009-09-20 23:09:48
Size: 2281
Editor: localhost
Comment: converted to 1.6 markup

Install Tinyproxy (Ubuntu Dapper)

Install

sudo apt-get install tinyproxy

Configure

sudo vi /etc/tinyproxy/tinyproxy.conf

Sample configuration. Make sure you set Port and Allow to match your setup (here the proxy accepts connections from localhost and from the local network):

Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
# With FilterDefaultDeny set to No, the filter file acts as a blacklist
FilterDefaultDeny No

User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
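
After editing, it can help to print back the two directives that decide who may connect and where. The following is only a minimal sketch: the function name show_proxy_access is made up for illustration, and the default path is the one used above.

```shell
# Hypothetical helper: print the active Port and Allow directives
# from a Tinyproxy configuration file.
# Usage: show_proxy_access [config-file]
show_proxy_access() {
    conf="${1:-/etc/tinyproxy/tinyproxy.conf}"
    # Match only lines that start with Port or Allow (optionally indented),
    # so commented-out entries are skipped.
    grep -E '^[[:space:]]*(Port|Allow)[[:space:]]' "$conf"
}
```

With the sample configuration above, calling show_proxy_access should list the Port line and the two Allow lines.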

Create filters

If necessary, create a filter file. Because FilterDefaultDeny is set to No, its entries act as a blacklist.

sudo vi /etc/tinyproxy/filter

and add the sites to be blocked, one per line

google.com
apache.org
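
Filter entries are regular expressions, so it is worth checking which URLs a pattern would catch before relying on it. The sketch below approximates the check with grep; it is not Tinyproxy's own matcher, which may differ in details such as case sensitivity. The function name would_block is made up for illustration.

```shell
# Hypothetical helper: succeed (exit 0) if a URL matches any pattern in the
# filter file. Approximation only: grep stands in for Tinyproxy's matcher.
# Usage: would_block URL [filter-file]
would_block() {
    url="$1"
    filter="${2:-/etc/tinyproxy/filter}"
    while IFS= read -r pattern || [ -n "$pattern" ]; do
        # Blank lines and lines starting with # are skipped here.
        case "$pattern" in
            ''|'#'*) continue ;;
        esac
        # Treat each remaining line as a basic regular expression.
        if printf '%s\n' "$url" | grep -q -e "$pattern"; then
            return 0
        fi
    done < "$filter"
    return 1
}
```

With the two patterns above, would_block http://www.google.com/ succeeds. Note that the dot in google.com is a regular-expression wildcard, so the pattern also matches strings such as googleXcom.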

Commands to stop, start, and restart

sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart

Test the proxy with your browser

  • For Firefox: open Preferences, go to the General tab, and click Connection Settings. Then select Manual Proxy Configuration and enter the proxy host and the port you defined above.
  • If you created the filter above and browse to google.com, the proxy should block the request.
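
The same check can be made from the command line, which is handy on a server without a browser. This is a minimal sketch, assuming curl is installed; proxy_status is a made-up name, and the exact status returned for a filtered site may depend on the Tinyproxy version.

```shell
# Hypothetical helper: fetch a URL through the proxy and print only the
# HTTP status code that comes back.
# Usage: proxy_status PROXY_URL TARGET_URL
proxy_status() {
    curl --silent --output /dev/null --write-out '%{http_code}' \
         --proxy "$1" "$2"
}
```

For example, proxy_status http://127.0.0.1:5555 http://example.com/ (example.com is just a placeholder target) would normally print 200 for an allowed site, while a filtered site such as google.com should come back with an error status instead.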

Configure Nutch (Nutch 0.8)

Copy the proxy properties shown below from conf/nutch-default.xml to conf/nutch-site.xml and fill in the values for your proxy:

<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
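
To confirm what ended up in the site file, the two proxy values can be printed back. The sketch below is a rough line-based check, not an XML parser: it assumes each name and value sits on its own line, as in the snippet above. The function name nutch_proxy_settings is made up for illustration.

```shell
# Hypothetical helper: print http.proxy.host and http.proxy.port as
# name=value pairs from a Nutch site file.
# Usage: nutch_proxy_settings [site-file]
nutch_proxy_settings() {
    site="${1:-conf/nutch-site.xml}"
    awk '
        # Remember the property name when it is one of the proxy settings.
        /<name>http\.proxy\.(host|port)<\/name>/ {
            name = $0
            sub(/.*<name>/, "", name); sub(/<\/name>.*/, "", name)
            next
        }
        # The next <value> line after such a name holds the setting.
        name != "" && /<value>/ {
            value = $0
            sub(/.*<value>/, "", value); sub(/<\/value>.*/, "", value)
            print name "=" value
            name = ""
        }
    ' "$site"
}
```

Run from the Nutch directory, it should print http.proxy.host=192.168.0.157 and http.proxy.port=5555 for the values shown above.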

Now if you crawl sites, Nutch will use your proxy. You can monitor it by looking at the logs of Tinyproxy during a crawl:

sudo tail -f /var/log/tinyproxy.log
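
After a crawl, a rough count of how often a given site shows up in the log gives a quick confirmation that requests went through the proxy. This is a minimal sketch that makes no assumption about the log format beyond the host name appearing in the line; count_log_hits is a made-up name, and reading the log may require sudo.

```shell
# Hypothetical helper: count log lines that mention a given host.
# Usage: count_log_hits HOST [log-file]
count_log_hits() {
    host="$1"
    log="${2:-/var/log/tinyproxy.log}"
    # -F treats the host as a literal string, -c prints the number of
    # matching lines.
    grep -c -F -e "$host" "$log"
}
```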

More resources

  • http://ubuntuforums.org/showthread.php?t=122011
  • http://doc.gwos.org/index.php/TinyProxy
  • http://doc.ubuntu-fr.org/serveur/tinyproxy

SetupProxyForNutch (last edited 2014-03-05 22:44:55 by LewisJohnMcgibbney)