 * [a search-engine course](http://www.issco.unige.ch/ewg95/node216.html)
 * [Tim Bray comments](http://www.tbray.org/ongoing/When/200x/2003/06/22/PandR)

# Precision and Recall

The traditional method of MeasuringAccuracy in the information-retrieval field uses a two-figure scheme: Precision and Recall.

Given the usual set of 4 numbers (see FpFnPercentages):

```
nspam   = number of known-to-be-spam messages in the corpus
nham    = number of known-to-be-ham (nonspam) messages in the corpus
fp      = number of ham messages incorrectly marked as spam
fn      = number of spam messages incorrectly marked as ham
```

Precision and Recall can be computed as follows:

```
nspamspam  = nspam - fn        # spam messages correctly marked as spam

recall     = (nspamspam / nspam) * 100
precision  = (nspamspam / (nspamspam + fp)) * 100
```
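As a sanity check, here is that computation in Python, using counts taken from the STATISTICS.txt sample shown below for threshold 5.0 (27,908 spam and 29,452 ham, derived from its correct/incorrect counts):

```python
# Precision/Recall from the four corpus counts (illustrative numbers
# taken from the STATISTICS.txt sample for threshold 5.0).
nspam = 27908   # known-to-be-spam messages in the corpus
nham  = 29452   # known-to-be-ham messages in the corpus
fp    = 9       # ham incorrectly marked as spam
fn    = 688     # spam incorrectly marked as ham

nspamspam = nspam - fn                    # spam correctly marked as spam

recall    = (nspamspam / nspam) * 100
precision = (nspamspam / (nspamspam + fp)) * 100

print(f"SpamRecall: {recall:.3f}%")       # 97.535%
print(f"SpamPrec:   {precision:.3f}%")    # 99.967%
```

Both figures match the SpamRecall and SpamPrec values in the sample summary.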

Precision and Recall are part of the standard SpamAssassin statistics reported with every release. The 'STATISTICS.txt' files distributed with SpamAssassin since about version 2.30 include this data, measuring the ruleset's accuracy against a validation corpus:

```
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29443  99.97%
# Correctly spam:      27220  97.53%
# False positives:         9  0.03%
# False negatives:       688  2.47%
# TCR(l=50): 24.523726  SpamRecall: 97.535%  SpamPrec: 99.967%
```