  • install Java
  • set JAVA_HOME
  • install Apache Ant: brew install ant on Mac OS X, or apt-get install ant on Ubuntu/Linux
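
Before building, it can help to confirm that the prerequisites are visible to the shell. The following is a minimal sketch, not part of the original instructions: check_tool is a hypothetical helper name, and it only tests whether a command is on the PATH, not whether the installed version suits Nutch.

```shell
# check_tool is a hypothetical helper: reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the two tools named in the prerequisites.
for tool in java ant; do
  check_tool "$tool" || true
done

# JAVA_HOME should point at the Java installation directory.
if [ -n "${JAVA_HOME:-}" ]; then
  echo "JAVA_HOME=$JAVA_HOME"
else
  echo "JAVA_HOME is not set"
fi
```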

Steps

  • create a new directory
  • cd to directory
  • git clone https://github.com/apache/nutch && cd nutch
  • run

     $ ant runtime && cd runtime/local/


  • edit conf/nutch-site.xml
  • add the property below inside the <configuration> element, replacing "Value_name" with the desired agent name

<property>
  <name>http.agent.name</name>
  <value>Value_name</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>
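
To confirm the edit took effect, the configured value can be read back from the file. This is a rough sketch under two assumptions: agent_value is a hypothetical helper name, and the <value> element sits on the line directly after the <name> element, as in the property above. It is plain text matching, not XML parsing.

```shell
# agent_value is a hypothetical helper: prints the http.agent.name value,
# assuming <value> is on the line right after the matching <name>.
agent_value() {
  grep -A1 '<name>http.agent.name</name>' "$1" \
    | sed -n 's#.*<value>\(.*\)</value>.*#\1#p'
}

# Run from runtime/local/, where conf/nutch-site.xml lives after the build.
if [ -f conf/nutch-site.xml ]; then
  agent_value conf/nutch-site.xml
fi
```

If nothing is printed, the property is missing or laid out differently from the snippet above.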
  • run parsechecker, for example against the NASA JPL website:

./bin/nutch parsechecker -dumpText https://www.jpl.nasa.gov > jpl_out.txt
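
A quick way to see whether the run produced anything is to look at the size and first lines of the output file. The sketch below assumes the file name jpl_out.txt from the command above; summarize_dump is a hypothetical helper name, and the exact content of the dump depends on the page fetched.

```shell
# summarize_dump is a hypothetical helper: prints the line count and the
# first 20 lines of a dump file, or reports that it is empty or missing.
summarize_dump() {
  if [ -s "$1" ]; then
    echo "lines: $(wc -l < "$1" | tr -d ' ')"
    head -n 20 "$1"
  else
    echo "empty or missing: $1"
    return 1
  fi
}

summarize_dump jpl_out.txt || true
```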
