This page provides performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (nutch-0.7-dev) and the patches NUTCH-60-050526.patch, NUTCH-60-050605.patch and NUTCH-60-050607.patch (see NewLanguageIdentifier for more details).
These data can be useful if you want to help improve the performance and/or precision of the LanguageIdentifierPlugin, or if you want to fine-tune your Nutch configuration.
These performance benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 French files with a total size of 171.3 MB. These files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2.
The following matrix shows the LanguageIdentifierPlugin processing time in ms for several versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, i.e. it uses 1-grams, 2-grams, 3-grams and 4-grams for the analysis. The Data Size column gives the number of bytes read from each file to perform the identification. The other columns represent the following configurations:
Nutch-0.7
: The nutch-0.7-dev LanguageIdentifierPlugin version (without patch).
NUTCH-60-050526
: The LanguageIdentifierPlugin code with NUTCH-60-050526.patch applied.
NUTCH-60-050607
: The LanguageIdentifierPlugin code with NUTCH-60-050607.patch applied.
| Data Size (bytes) | Nutch-0.7 time (ms) | NUTCH-60-050526 time (ms) | NUTCH-60-050526 time reduction (%) | NUTCH-60-050607 time (ms) | NUTCH-60-050607 time reduction (%) |
|---|---|---|---|---|---|
| 128 | 2410 | 1485 | 38.38 | 716 | 70.29 |
| 256 | 2842 | 1836 | 35.40 | 1048 | 63.12 |
| 512 | 3759 | 2305 | 38.68 | 1649 | 56.13 |
| 1024 | 5899 | 5130 | 13.04 | 2839 | 51.87 |
| 2048 | 8581 | 7462 | 13.04 | 4534 | 47.16 |
| 4096 | 12622 | 10513 | 16.71 | 8031 | 36.37 |
| 8192 | 21360 | 18289 | 14.38 | 13803 | 35.38 |
| 16384 | 32073 | 29488 | 8.06 | 23733 | 26.00 |
| 32768 | 58535 | 49417 | 15.58 | 41994 | 28.26 |
| 65536 | 99861 | 91285 | 8.59 | 81612 | 18.27 |
| 131072 | 184083 | 161258 | 12.40 | 140501 | 23.68 |
| 262144 | 309438 | 289395 | 6.48 | 244369 | 21.03 |
| 524288 | 504145 | 442028 | 12.32 | 377693 | 25.08 |
| Total | 1245608 | 1109891 | 10.90 | 942522 | 24.33 |
| Average | 95816 | 85376.23 | 10.90 | 72501.69 | 24.33 |
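The % figures in the matrix are consistent with the relative reduction in processing time compared with the Nutch-0.7 column. The following is a minimal sketch of that arithmetic; the function name `time_reduction_percent` is an assumption made for illustration and is not part of Nutch.

```python
def time_reduction_percent(baseline_ms, patched_ms):
    """Percentage of processing time saved relative to the baseline.

    Hypothetical helper, not part of Nutch: it only reproduces the
    arithmetic behind the % columns of the matrix.
    """
    return round((baseline_ms - patched_ms) / baseline_ms * 100, 2)


# Timings copied from the 128-byte row and the Total row of the matrix.
print(time_reduction_percent(2410, 1485))        # NUTCH-60-050526, 128 bytes
print(time_reduction_percent(2410, 716))         # NUTCH-60-050607, 128 bytes
print(time_reduction_percent(1245608, 942522))   # NUTCH-60-050607, Total
```

The Average row is the Total row divided by the 13 data sizes, which is why both rows show the same percentages.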
http://frutch.free.fr/images/nutch/langid-benchs03.jpg
http://frutch.free.fr/images/nutch/langid-benchs04.jpg
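With NUTCH-60-050607.patch applied, the processing time is reduced by 18.27% to 70.29% compared with nutch-0.7-dev, depending on the data size, with an average of 24.33%.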
.25%
of the whole process.
These precision benchmarks were produced by testing the LanguageIdentifierPlugin on the first Data Size bytes of each file from a set of French (fr), English (en) and German (de) files. (These files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2.)
| Data Size (bytes) | Nutch-0.7 avg | Nutch-0.7 fr | Nutch-0.7 en | Nutch-0.7 de | NUTCH-60-050605 avg | NUTCH-60-050605 fr | NUTCH-60-050605 en | NUTCH-60-050605 de | NUTCH-60-050607 avg | NUTCH-60-050607 fr | NUTCH-60-050607 en | NUTCH-60-050607 de |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 38.84 | 36.99 | 10.47 | 69.06 | 14.00 | 2.64 | 2.67 | 36.68 | 51.11 | 48.37 | 19.30 | 85.66 |
| 16 | 70.38 | 58.74 | 75.15 | 77.25 | 45.64 | 13.41 | 68.17 | 55.33 | 94.06 | 97.36 | 87.68 | 97.13 |
| 32 | 66.51 | 55.08 | 86.86 | 57.58 | 56.43 | 41.26 | 73.92 | 54.10 | 98.56 | 99.59 | 96.30 | 99.80 |
| 64 | 97.14 | 97.15 | 97.54 | 96.72 | 65.35 | 53.86 | 84.80 | 57.38 | 99.93 | 100 | 99.79 | 100 |
| 128 | 97.90 | 94.51 | 99.79 | 99.39 | 77.81 | 70.53 | 89.32 | 73.57 | 100 | 100 | 100 | 100 |
| 256 | 100 | 100 | 100 | 100 | 90.32 | 90.04 | 92.20 | 88.73 | 100 | 100 | 100 | 100 |
| 512 | 100 | 100 | 100 | 100 | 96.93 | 98.17 | 97.54 | 95.08 | 100 | 100 | 100 | 100 |
| 1024 | 100 | 100 | 100 | 100 | 99.59 | 99.80 | 99.79 | 99.18 | 100 | 100 | 100 | 100 |
| 2048 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
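As a rough illustration of how such precision figures can be measured, the sketch below computes the share of files whose first Data Size bytes are identified correctly. It is not the benchmark code used for this page: `precision_percent`, the `identify` callable and the toy identifier are hypothetical stand-ins for the LanguageIdentifierPlugin. The last lines show that the avg column is consistent with an unweighted mean of the fr, en and de columns.

```python
from statistics import mean


def precision_percent(samples, identify, data_size):
    """Share (in %) of samples whose first `data_size` bytes are identified correctly.

    samples  -- list of (content_bytes, expected_language) pairs
    identify -- callable taking bytes and returning a language code
                (hypothetical stand-in for the real identifier)
    """
    correct = sum(
        1 for content, expected in samples
        if identify(content[:data_size]) == expected
    )
    return 100.0 * correct / len(samples)


def toy_identify(data):
    """Toy identifier for demonstration only: guesses from the first word."""
    return "fr" if data.startswith(b"Le") else "en"


# Made-up sample data, only to exercise the function.
samples = [
    (b"Le parlement", "fr"),
    (b"The parliament", "en"),
    (b"Das Parlament", "de"),
]
print(precision_percent(samples, toy_identify, 8))

# Nutch-0.7 columns of the 8-byte row: fr, en, de.
avg_8_bytes = round(mean([36.99, 10.47, 69.06]), 2)
print(avg_8_bytes)
```

Truncating each file before identification keeps the measurement comparable across data sizes, since every version then sees exactly the same input.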
http://frutch.free.fr/images/nutch/langid-benchs05.jpg
TODO