Introduction

This page provides performance benchmarks of the regular-expression-based URLFilters in Nutch (currently urlfilter-regex and urlfilter-automaton). The urlfilter-regex plugin is based on the standard JDK java.util.regex implementation, whereas the urlfilter-automaton plugin is based on dk.brics.automaton, a finite-state automata library for Java.

Performance

Data set

These benchmarks were produced by collecting the results of each plugin's unit tests, run with the same rule file (Benchmarks.rules) and the same set of URLs to filter (Benchmarks.urls).

Raw results

The following table shows the processing time, in ms, of the urlfilter-regex and urlfilter-automaton plugins when filtering the Benchmarks.urls file, for several numbers of loops over the file.

 

Loops        50    100    200    400    800
regex       459    899   1917   3703   7873
automaton   335    419    657   1119   1997
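
The measurement itself can be sketched as a timed loop over the URL list. This is only an illustration of the idea, not the Nutch unit test: the class name RegexLoopBenchmark, the single rule and the three URLs are made up for the example, and only java.util.regex is exercised because dk.brics.automaton is an external library.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical harness: it mimics the "N loops over the URL file" idea only.
class RegexLoopBenchmark {

    // Counts the URLs in which the pattern is found somewhere (find(), not a whole-string match).
    static int countMatches(Pattern pattern, List<String> urls) {
        int matches = 0;
        for (String url : urls) {
            if (pattern.matcher(url).find()) {
                matches++;
            }
        }
        return matches;
    }

    // Runs `loops` full passes over the URL list and returns the elapsed time in ms.
    static long timeLoops(Pattern pattern, List<String> urls, int loops) {
        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < loops; i++) {
            sink += countMatches(pattern, urls);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (sink < 0) {
            throw new IllegalStateException("unreachable; keeps the result in use");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Placeholder data; the real runs used Benchmarks.rules and Benchmarks.urls.
        Pattern rule = Pattern.compile("\\.(gif|jpg|png)$");
        List<String> urls = List.of(
            "http://example.com/index.html",
            "http://example.com/logo.gif",
            "http://example.com/photo.jpg");
        for (int loops : new int[] {50, 100, 200, 400, 800}) {
            System.out.println(loops + " loops: " + timeLoops(rule, urls, loops) + " ms");
        }
    }
}
```

With so little data the timings will be close to zero and dominated by JVM warm-up, so the sketch shows the shape of the measurement rather than reproducing the figures above.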

Graphical representation

http://frutch.free.fr/images/nutch/regexfilters-benchs.png

Conclusion

urlfilter-automaton supports fewer operators than urlfilter-regex but performs considerably better: at 800 loops it takes 1997 ms against 7873 ms, roughly four times faster. It can probably be useful in contexts where its reduced syntax is sufficient.

A next step could be to combine the two plugins, using the urlfilter.order configuration property, in order to get the best of each.
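
The idea of ordering filters can be illustrated with a small chain in which a cheap filter runs first and a more expressive one runs only on the URLs that survive. This is a sketch under assumptions, not Nutch code: FilterChainSketch, applyInOrder and the two stand-in filters are hypothetical, and both stand-ins are written with plain Java and java.util.regex.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical illustration of running filters in a fixed order.
class FilterChainSketch {

    // Applies the filters in order; a filter returns the URL to keep it, or null to drop it.
    static String applyInOrder(List<UnaryOperator<String>> filters, String url) {
        String current = url;
        for (UnaryOperator<String> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null; // rejected: the remaining filters never run
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for a fast first-pass filter (the role urlfilter-automaton could play).
        UnaryOperator<String> cheap = url -> url.startsWith("http") ? url : null;

        // Stand-in for a more expressive filter (the role urlfilter-regex could play);
        // the negative lookahead used here is a java.util.regex feature.
        Pattern noSession = Pattern.compile("^(?!.*[?&]sessionid=).*$");
        UnaryOperator<String> expressive =
            url -> noSession.matcher(url).matches() ? url : null;

        List<UnaryOperator<String>> chain = List.of(cheap, expressive);
        System.out.println(applyInOrder(chain, "http://example.com/a"));
        System.out.println(applyInOrder(chain, "ftp://example.com/a"));
        System.out.println(applyInOrder(chain, "http://example.com/a?sessionid=1"));
    }
}
```

Putting the faster filter first means the slower one only sees URLs that have not already been rejected, which is where a mixed setup would be expected to save time.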

How to use

You need to enable the urlfilter-automaton plugin by editing your conf/nutch-site.xml, then edit automaton-urlfilter.txt and enter the rules. The syntax is explained here. The grammar is good and robust, but it has no greedy/lazy matching modes.
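
As a rough picture of how such a rule file behaves, the sketch below parses lines that start with + (accept) or - (reject) and lets the first matching rule decide. It rests on several assumptions: the rules shown are illustrative and are not the contents of Benchmarks.rules, the class RuleFileSketch is hypothetical, and java.util.regex with whole-string matching stands in for the automaton library, so operators that differ between the two syntaxes are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of "+" / "-" rule handling; not the plugin's implementation.
class RuleFileSketch {

    // One parsed rule: '+' accepts, '-' rejects.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parses rule-file lines; blank lines and '#' comments are skipped.
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // Returns the URL if the first matching rule accepts it, null if that rule
    // rejects it or if no rule matches. Assumption: a rule must match the whole URL.
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(List.of(
            "# illustrative rules only",
            "-(file|ftp|mailto):.*",
            "-.*\\.(gif|jpg|png)",
            "+.*"));
        System.out.println(filter(rules, "http://example.com/index.html"));
        System.out.println(filter(rules, "http://example.com/logo.gif"));
        System.out.println(filter(rules, "ftp://example.com/file"));
    }
}
```

Because every rule has to cover the whole URL in this sketch, patterns are written with explicit .* on either side of the part of interest, and a final +.* line accepts whatever the earlier rules did not reject.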
