Like a Bayesian learning system, SpamAssassin's GeneticAlgorithm requires a corpus of hand-classified mail. Our guidelines are (quoting and expanding on "masses/CORPUS_POLICY"):

After you run MassCheck, follow the instructions in CorpusCleaning to verify that the top-scoring messages are not spam that slipped through by accident.
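As a rough illustration of that check, the sketch below lists the highest-scoring entries in a results log so they can be reviewed by hand. It is not the procedure from CorpusCleaning. It assumes each log line has the whitespace-separated form `<flag> <score> <path> <rules>` and that comment lines start with `#`; the function name `top_scorers` and the file name `ham.log` are hypothetical, so adjust the parsing to match the log your MassCheck run actually produces.

```python
def top_scorers(log_lines, limit=10):
    """Return up to `limit` (score, path) pairs, highest score first.

    Assumed line layout (an assumption, not a documented format):
        <flag> <score> <path> <rules...>
    Blank lines, '#' comments, and unparseable lines are skipped.
    """
    scored = []
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        try:
            score = float(fields[1])
        except ValueError:
            continue
        scored.append((score, fields[2]))
    # Highest scores first: in a ham log these are the likeliest misfiled spam.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


if __name__ == "__main__":
    # "ham.log" is a placeholder name for the log of a ham-only run.
    with open("ham.log", encoding="utf-8", errors="replace") as handle:
        for score, path in top_scorers(handle, limit=20):
            print(f"{score:8.1f}  {path}")
```

Each path printed is a candidate to open and inspect manually; the script only ranks messages and makes no judgment about whether one is really spam.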

(Aside: yes, the plural is "corpora". See PluralOfCorpus.)

Mail to NOT Include in Ham or Spam

Minor things that are nice to have