Introduction

This page provides some performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (nutch-0.7-dev) and the patches NUTCH-60-050526.patch, NUTCH-60-050605.patch, NUTCH-60-050607.patch (see NewLanguageIdentifier for more details).

These data can be useful if you want to help improve the performance and/or precision of the LanguageIdentifierPlugin, or if you want to fine-tune your Nutch configuration.

Performance

Data set

These performance benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 French files with a total size of 171.3 MB. These files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2.

Raw results

The following table shows the LanguageIdentifierPlugin processing time in ms for several versions. Each patched version is configured to be comparable with the nutch-0.7-dev version, i.e. it uses 1-grams, 2-grams, 3-grams and 4-grams for the analysis. The Data Size column is the number of bytes of each file used to perform the identification. The other columns give the processing time of each version and, for the patched versions, the percentage of time saved relative to Nutch-0.7:

           Nutch-0.7  NUTCH-60-050526    NUTCH-60-050607
Data Size  time (ms)  time (ms)  %       time (ms)  %
128        2410       1485       38.38   716        70.29
256        2842       1836       35.40   1048       63.12
512        3759       2305       38.68   1649       56.13
1024       5899       5130       13.04   2839       51.87
2048       8581       7462       13.04   4534       47.16
4096       12622      10513      16.71   8031       36.37
8192       21360      18289      14.38   13803      35.38
16384      32073      29488      8.06    23733      26.00
32768      58535      49417      15.58   41994      28.26
65536      99861      91285      8.59    81612      18.27
131072     184083     161258     12.40   140501     23.68
262144     309438     289395     6.48    244369     21.03
524288     504145     442028     12.32   377693     25.08
Total      1245608    1109891    10.90   942522     24.33
Average    95816      85376.23   10.90   72501.69   24.33
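
The % values in the raw results appear to be the share of processing time saved relative to Nutch-0.7, since (2410 - 716) / 2410 gives the 70.29 reported for the 128-byte case. The following is a minimal sketch of that calculation. The class and method names are hypothetical and are not part of the plugin.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: recomputes the "%" values
// of the raw results from two processing times.
final class TimeReduction {

    /** Percentage of processing time saved relative to the baseline. */
    static double reductionPercent(long baselineMs, long patchedMs) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return 100.0 * (baselineMs - patchedMs) / baselineMs;
    }

    public static void main(String[] args) {
        // 128-byte case: Nutch-0.7 = 2410 ms, NUTCH-60-050607 = 716 ms
        System.out.println(String.format(Locale.ROOT, "%.2f", reductionPercent(2410, 716))); // prints 70.29
    }
}
```

Applied to the Total times (1245608 ms and 942522 ms), the same formula gives the 24.33 reported for NUTCH-60-050607.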

Graphical representation

http://frutch.free.fr/images/nutch/langid-benchs03.jpg

Graphical representation (log axis)

http://frutch.free.fr/images/nutch/langid-benchs04.jpg

Discussion

  • The NUTCH-60-050607.patch reduces processing time by between 18.27% and 70.29% depending on the data size, with an overall reduction of 24.33%.

  • Profiling the code confirms what SamiSiren suggested in a previous message: "the most timeconsuming part of language identifier is splitting the text into ngrams and propably the biggest optimization could be done there". It shows that splitting the text takes around 25% of the whole process.
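
To make the splitting step concrete, here is a minimal sketch of cutting a text into character n-grams and counting them. It is not the plugin's actual code: the class name, the method name and the use of a HashMap are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a naive n-gram splitter, not the code of
// LanguageIdentifierPlugin.
final class NGramSketch {

    /** Counts every character n-gram of length minLen..maxLen in the text. */
    static Map<String, Integer> countNGrams(CharSequence text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < text.length(); start++) {
            // For each start position, emit one gram per length that still fits.
            for (int n = minLen; n <= maxLen && start + n <= text.length(); n++) {
                String gram = text.subSequence(start, start + n).toString();
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `NGramSketch.countNGrams(text, 1, 4)` corresponds to the 1-gram to 4-gram configuration used in these benchmarks. In this naive form every n-gram allocates a new String and triggers a map lookup, which is the kind of cost an optimization of this step would target.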

Precision

Data set

These precision benchmarks were produced by running the LanguageIdentifierPlugin on the first Data Size bytes of each file in a set of:

  • 492 French files,
  • 487 English files,
  • 488 German files.

(These files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2).
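
The measurement described above can be sketched as follows. This assumes that precision is the percentage of files whose language is identified correctly. The `identifier` function is a hypothetical stand-in for the plugin, and decoding the bytes as ISO-8859-1 is an assumption chosen so that truncating at a byte boundary never splits a character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative only: the identifier argument stands in for the plugin.
final class PrecisionSketch {

    /**
     * Percentage of files identified as expectedLang when only the
     * first dataSize bytes of each file are given to the identifier.
     */
    static double precisionPercent(List<byte[]> files, int dataSize,
                                   String expectedLang,
                                   Function<String, String> identifier) {
        if (files.isEmpty()) {
            throw new IllegalArgumentException("no files to test");
        }
        int correct = 0;
        for (byte[] content : files) {
            // Keep at most dataSize bytes from the start of the file.
            byte[] head = Arrays.copyOf(content, Math.min(dataSize, content.length));
            // Assumed encoding; every byte maps to exactly one character.
            String sample = new String(head, StandardCharsets.ISO_8859_1);
            if (expectedLang.equals(identifier.apply(sample))) {
                correct++;
            }
        }
        return 100.0 * correct / files.size();
    }
}
```

Running this once per language and per Data Size value would give one fr, en or de cell of the results.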

Raw results

           Nutch-0.7                   NUTCH-60-050605             NUTCH-60-050607
Data Size  avg    fr     en     de     avg    fr     en     de     avg    fr     en     de
8          38.84  36.99  10.47  69.06  14.00  2.64   2.67   36.68  51.11  48.37  19.30  85.66
16         70.38  58.74  75.15  77.25  45.64  13.41  68.17  55.33  94.06  97.36  87.68  97.13
32         66.51  55.08  86.86  57.58  56.43  41.26  73.92  54.10  98.56  99.59  96.30  99.80
64         97.14  97.15  97.54  96.72  65.35  53.86  84.80  57.38  99.93  100    99.79  100
128        97.90  94.51  99.79  99.39  77.81  70.53  89.32  73.57  100    100    100    100
256        100    100    100    100    90.32  90.04  92.20  88.73  100    100    100    100
512        100    100    100    100    96.93  98.17  97.54  95.08  100    100    100    100
1024       100    100    100    100    99.59  99.80  99.79  99.18  100    100    100    100
2048       100    100    100    100    100    100    100    100    100    100    100    100

Graphical representation

http://frutch.free.fr/images/nutch/langid-benchs05.jpg

Discussion

TODO

LanguageIdentifierBenchs (last edited 2009-09-20 23:09:32 by localhost)