Install Tinyproxy

(Ubuntu 11.04 Natty, Kernel Linux 2.6.38-10-generic, GNOME 2.32.1)

Introduction

Tinyproxy is a lightweight HTTP/HTTPS proxy daemon for POSIX operating systems. Designed from the ground up to be fast and small, it is well suited to use cases such as embedded deployments, where a full-featured HTTP proxy is required but the system resources for a larger proxy are unavailable. For more information see here.

Install

sudo apt-get install tinyproxy

Configure

sudo vi /etc/tinyproxy.conf

Sample configuration. Make sure you set the Port and Allow directives (here, I'm allowing connections from my localhost):

Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
FilterDefaultDeny No #filters will act as a blacklist

User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
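
Before restarting the daemon, it can help to confirm which Port and Allow values are actually in the file. The following is only a minimal sketch: it assumes each directive sits on its own line as "Name value", like the sample above, and the function name show_directive is made up for illustration.

```shell
#!/bin/sh
# show_directive is a hypothetical helper, not part of Tinyproxy.
# It prints the value(s) of one directive from a Tinyproxy-style config
# file, assuming "Name value" lines as in the sample configuration.
show_directive() {
    conf_file=$1
    name=$2
    # Match lines whose first field is the directive name and print the
    # second field. Commented-out lines such as "#Port 8888" do not match.
    awk -v name="$name" '$1 == name { print $2 }' "$conf_file"
}

# Example: show_directive /etc/tinyproxy.conf Port
# Example: show_directive /etc/tinyproxy.conf Allow
```

A directive that appears several times, such as Allow, is printed once per line.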

Create filters

If necessary, create a filter file. It will act as a blacklist because of FilterDefaultDeny No.

sudo vi /etc/tinyproxy/filter

and add the URLs of the sites to be blocked

google.com
apache.org

Commands to Stop, Start, and Restart

sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart

Test the proxy with your browser

  • For Firefox, menu Preferences, tab General, button Connection settings. Then select Manual Proxy Configuration and enter the host you defined above and the port.
  • If you have created the filter above, and browse to google.com, the proxy should block you.
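
The same test can be run from a terminal with curl instead of a browser. This is a minimal sketch, assuming curl is installed; proxy_url and check_proxy are hypothetical helper names, and the host and port are the ones from the sample configuration.

```shell
#!/bin/sh
# Both helpers are hypothetical names used only for this sketch.

# Build the proxy URL that curl expects for its -x option.
proxy_url() {
    printf 'http://%s:%s\n' "$1" "$2"
}

# Fetch a URL through the proxy and print only the HTTP status code.
check_proxy() {
    host=$1
    port=$2
    target=$3
    curl -s -o /dev/null -w '%{http_code}\n' \
        -x "$(proxy_url "$host" "$port")" "$target"
}

# Example: check_proxy 127.0.0.1 5555 http://apache.org/
```

For a site listed in the filter file, the printed status should be an error code rather than 200. The exact code may depend on the Tinyproxy version.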

Configure Nutch (Nutch 0.8)

Copy the proxy configuration (see below) from conf/nutch-default.xml to conf/nutch-site.xml and fill in the values for your proxy:

<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
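
To double-check which proxy values ended up in conf/nutch-site.xml, the properties can be read back from the command line. This is a sketch only: it assumes one tag per line, as in the snippet above, and the function name nutch_property is made up for illustration. A real XML parser would be more robust.

```shell
#!/bin/sh
# nutch_property is a hypothetical helper, not part of Nutch.
# It prints the <value> that follows a given <name> in the configuration
# file, assuming one tag per line as in the snippet above.
nutch_property() {
    conf_file=$1
    prop=$2
    awk -v prop="$prop" '
        # Remember that the wanted <name> line has been seen.
        index($0, "<name>" prop "</name>") { found = 1; next }
        # On the next <value> line, strip the tags and print the content.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$conf_file"
}

# Example: nutch_property conf/nutch-site.xml http.proxy.host
# Example: nutch_property conf/nutch-site.xml http.proxy.port
```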

Now if you crawl sites, Nutch will use your proxy. You can monitor it by looking at the logs of Tinyproxy during a crawl:

sudo tail -f /var/log/tinyproxy.log

More resources

  • http://ubuntuforums.org/showthread.php?t=122011
  • http://doc.gwos.org/index.php/TinyProxy
  • http://doc.ubuntu-fr.org/serveur/tinyproxy

SetupProxyForNutch (last edited 2014-03-05 22:44:55 by LewisJohnMcgibbney)