  * [[http://www.issco.unige.ch/ewg95/node216.html|a search-engine course]]
  * [[http://www.tbray.org/ongoing/When/200x/2003/06/22/PandR|Tim Bray comments]]

Precision and Recall

The traditional method of MeasuringAccuracy in the information-retrieval field uses a two-figure scheme: Precision and Recall.

Given the usual set of 4 numbers (see FpFnPercentages):

  nspam   = number of known-to-be-spam messages in the corpus
  nham    = number of known-to-be-ham (nonspam) messages in the corpus
  fp      = number of ham messages incorrectly marked as spam
  fn      = number of spam messages incorrectly marked as ham

Precision and Recall can be computed as follows:

  nspamspam  = nspam - fn

  recall     = (nspamspam / nspam) * 100
  precision  = (nspamspam / (nspamspam + fp)) * 100
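As a sketch, the arithmetic above can be written as a small Python function (the function name is illustrative, not part of SpamAssassin):

```python
def precision_and_recall(nspam, fp, fn):
    """Spam precision and recall, as percentages.

    nspam: number of known-to-be-spam messages in the corpus
    fp:    ham messages incorrectly marked as spam
    fn:    spam messages incorrectly marked as ham
    """
    # spam messages correctly marked as spam
    nspamspam = nspam - fn
    recall = (nspamspam / nspam) * 100
    precision = (nspamspam / (nspamspam + fp)) * 100
    return precision, recall

# Counts from the STATISTICS.txt excerpt below:
# nspam = 27220 correctly-spam + 688 false negatives = 27908
print(precision_and_recall(27908, 9, 688))  # ~ (99.967, 97.535)
```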

Precision and Recall are part of the standard SpamAssassin statistics reported with every release. The 'STATISTICS.txt' files distributed with SpamAssassin since about version 2.30 include these figures, measuring the ruleset's accuracy against a validation corpus:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  29443  99.97%
# Correctly spam:      27220  97.53%
# False positives:         9  0.03%
# False negatives:       688  2.47%
# TCR(l=50): 24.523726  SpamRecall: 97.535%  SpamPrec: 99.967%
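
The TCR figure in the summary is the Total Cost Ratio (see MeasuringAccuracy). Assuming the standard weighted definition TCR(l) = nspam / (l*fp + fn), the reported numbers fit together as follows:

```python
def tcr(nspam, fp, fn, lam=50):
    """Total Cost Ratio: the cost of using no filter (every spam
    gets through) divided by the filter's weighted error cost,
    where a false positive counts lam times a false negative."""
    return nspam / (lam * fp + fn)

# nspam = 27220 correctly-spam + 688 false negatives = 27908
print(tcr(27908, 9, 688))  # ~ 24.5237, matching TCR(l=50) above
```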

See also MeasuringAccuracy for other methods.

PrecisionAndRecall (last edited 2009-09-20 23:16:42 by localhost)