Here are a few methods used to deal with common forms of corpus pollution – messages in a mail corpus that aren't suitable for use in a MassCheck.

What A Corpus Needs To Look Like

SpamAssassin relies on corpus data to generate optimal scores. This is the policy that all corpora accepted by the SpamAssassin project must follow (moved here from 'masses/CORPUS_POLICY'):

Hand-verified: all mail must be hand-verified into "spam" and "ham" (non-spam) collections, by its recipient. It may not be solely classified using automated spam-classification algorithms such as SpamAssassin and other spam filters; we need the human decision (although it may be aided by SpamAssassin, of course). Also, we can't use data that's been collected from third-party accounts, since we don't know what the recipient may have signed up for.
Reliable source: Ensure that the mails were classified by a trustworthy source; mails marked as spam by users at your ISP, for example, are not reliable enough for use as a SpamAssassin corpus. (It's pretty well-known that some users will use the "report as spam" button instead of unsubscribing from legit newsletters.)
No old spam: please try to avoid including spam older than 6 months, and ham older than 18 months. (We'll filter it out on the server-side anyway.)
Representative mix: each side of the corpus must contain a representative mix of mail of that type. For ham, that includes commercial ham messages, legitimate business discussions, and verified opt-in mail newsletters. A ham corpus consisting of nothing but reported false-positives will produce bad data (especially for Bayes) – and the same applies for spam. If at all possible, include as much as you can of the day-to-day spam you receive, including the "easy stuff".
No viruses: SpamAssassin is a spam filter, not a virus scanner – remove viruses from your corpora. (Phishes, however, fall under "spam", even though some virus scanners mark them as viruses.)
No faked bounces: exclude bounces of viruses or spam sent back to forged or faked From addresses (so-called blowback or joe-job bounces). These typically have an envelope sender of <> or <MAILER-DAEMON.*>. Please do include any valid bounces if you can.
No moderation mails: exclude mailing list moderation and administrative messages that contain spam subject lines or excerpts.
No spam discussion: exclude anti-spam or anti-virus mailing lists, especially SpamAssassin's, which frequently include spam and virus elements. Even though they are technically ham, these messages often appear to be spam and will skew the results. Rewriting the tests to avoid triggering on these messages is not realistic at this time.
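
Two of the rules above (message age and bounce envelope senders) are mechanical enough to pre-screen with a script before hand-checking. The following is only a rough sketch in Python, not part of the SpamAssassin tooling. The function names is_too_old and looks_like_bounce are hypothetical, a "month" is approximated as 30 days, and the envelope sender is assumed to have been recorded in a Return-Path header by the delivering mail server. Since valid bounces should stay in the corpus, a match on the bounce check means "look at this one by hand", not "delete".

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Age limits from the policy: spam 6 months, ham 18 months.
# Assumption: a month is approximated as 30 days.
MAX_AGE = {"spam": timedelta(days=6 * 30), "ham": timedelta(days=18 * 30)}


def is_too_old(raw_message, kind, now=None):
    """Return True if the Date header is older than the limit for `kind`."""
    now = now or datetime.now(timezone.utc)
    date_header = message_from_string(raw_message).get("Date")
    if date_header is None:
        return False  # no date to judge by; leave for manual review
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable date; leave for manual review
    if sent.tzinfo is None:
        # Assumption: treat a date without a usable zone as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > MAX_AGE[kind]


def looks_like_bounce(raw_message):
    """Return True if the recorded envelope sender is <> or <MAILER-DAEMON...>.

    Assumption: the envelope sender was saved in a Return-Path header.
    A True result only flags the message for hand review.
    """
    return_path = message_from_string(raw_message).get("Return-Path", "").strip()
    return return_path == "<>" or return_path.upper().startswith("<MAILER-DAEMON")
```

Running both checks over each file in a corpus directory and printing the matching filenames gives a short list to inspect by hand, which keeps the final decision with the human as the policy requires.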

Cleaning Out False Positives

...
