Comment: [Original edit by JustinMason] update with a good bit more documentation, and correct some out-of-date stuff

Preflight mass-checks buildbot

The preflight mass-check buildbot is up and running at http://buildbot.spamassassin.org/preflight/ .

Every time something is checked into SVN, this will wake up and immediately start running mass-checks using that latest code and rules.

...

Progress of the mass-checks is visible on [http://buildbot.spamassassin.org:8011/preflight/ the Buildbot 'waterfall']; as they complete, their results become visible on the [RuleQaApp].

The preflight mass-check corpus

This corpus is built from a selection of mail rsync'd up from various people; it's then "smoothed out" into several differently-sized subsets. These use differing amounts of mail, starting with a small set of mail in the "mc-fast" chunk, and gradually increasing until we get to the largest block in "mc-slower". This division means that the early "fast" results can arrive quickly, with less to scan; as time goes on, more and more of the "slower" slaves complete their mass-checks and upload the results.

What happens during the preflight buildbot process

[http://buildbot.spamassassin.org/preflight/ As you can see], there are four steps performed by each buildbot slave, as follows:

Update: This performs an 'svn update' to load the latest code.

Configure: runs 'perl Makefile.PL' and 'make' to compile the rules.

Test: the mass-check takes place here. This is usually the time-consuming part.

Configure: a final summarisation step. First off, a 'FAST FREQS REPORT' is output, giving the HitFrequencies from the mass-check. Next, the logs from the mass-check are copied to a safe location, and the 'corpus-hourly' script is run to generate various reports from them for the RuleQaApp. The URL for viewing the results in the RuleQaApp is printed prominently.

Administrivia: how the corpus is laid out

The filesystem layout of the corpora rsynced up to the server is like this:

/home/bbmass/rawcor/WHO/TYPE/FOLDER

"WHO" is the person who submitted it via rsync, e.g. "doc", "jm", "zmi".

Under that, we have "TYPE", which is either "ham" or "spam".

And under that is another level of directories, "FOLDER", which is whatever the person feels is appropriate. For example, I use date-stamped dirs here.
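To make the layout concrete, here is a minimal Python sketch that splits an uploaded path into its WHO, TYPE and FOLDER parts. It is an illustration only, not part of the buildbot scripts: the function name `parse_raw_path`, the `RawCorpusPath` type and the example root directory are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class RawCorpusPath(NamedTuple):
    who: str        # uploader's rsync username, e.g. "jm"
    mail_type: str  # "ham" or "spam"
    folder: str     # free-form folder name chosen by the uploader


def parse_raw_path(path: str, root: str) -> RawCorpusPath:
    """Split ROOT/WHO/TYPE/FOLDER[/...] into its three named components."""
    # relative_to() raises ValueError if path is not under root.
    rel = PurePosixPath(path).relative_to(PurePosixPath(root))
    if len(rel.parts) < 3:
        raise ValueError(f"expected WHO/TYPE/FOLDER under {root}: {path}")
    who, mail_type, folder = rel.parts[:3]
    if mail_type not in ("ham", "spam"):
        raise ValueError(f"TYPE must be 'ham' or 'spam', got {mail_type!r}")
    return RawCorpusPath(who, mail_type, folder)
```

For example, `parse_raw_path("/tmp/rawcor/jm/ham/20051018a/1", "/tmp/rawcor")` yields `who="jm"`, `mail_type="ham"`, `folder="20051018a"`; the `/tmp/rawcor` root is just a stand-in.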

The result is e.g.:

/home/bbmass/tmpfs/cor/mc-fast/ham/jm/20051018a
/home/bbmass/tmpfs/cor/mc-fast/spam/jm/20051018a
/home/bbmass/tmpfs/cor/mc-fast/spam/jm/20051018b

It is also possible to use mboxes, as long as they are files and their filenames end in ".mbox".
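As a rough illustration of the mbox rule, the following Python sketch checks whether a path counts as an mbox and splits one into a directory of one file per message, using the standard-library `mailbox` module. The function names and the numbered output filenames are assumptions made for this example; the real scripts may do this quite differently.

```python
import mailbox
from pathlib import Path
from typing import List


def is_mbox(path: Path) -> bool:
    """The rule from the text: a regular file whose name ends in '.mbox'."""
    return path.is_file() and path.name.endswith(".mbox")


def extract_mbox(mbox_path: Path, out_dir: Path) -> List[Path]:
    """Write each message of an mbox to its own numbered file in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: fail rather than silently create an empty mbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            # Hypothetical naming scheme: zero-padded message counter.
            target = out_dir / f"{index:06d}"
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written
```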

Then, the script 'populate_cor' is run from cron periodically to rebuild the mass-checkable corpus from this. It attempts to 'smooth out' the multiple corpora into several new corpora, named "mc-fast", "mc-med", "mc-slow", "mc-slower", matching the buildbot slave names at http://buildbot.spamassassin.org/preflight/ .

It does this by:

  • extracting mboxes into mail directories of one file per message
  • creating symbolic links to those files in new corpus directories
  • for each new corpus dir, creating a 'targets' file for mass-check listing what files it's created for that corpus.
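The last two steps in the list above can be sketched roughly as follows in Python. This is not the real 'populate_cor' script: the function name, the link-naming scheme and the choice of one target line per TYPE directory are assumptions for illustration. Only the TYPE:dir:PATH shape of the target lines is taken from this page.

```python
import os
from pathlib import Path
from typing import Dict, Iterable


def build_corpus(corpus_dir: Path, messages: Dict[str, Iterable[Path]]) -> Path:
    """Symlink messages into corpus_dir/TYPE/ and write a 'targets' file.

    `messages` maps "ham" / "spam" to one-file-per-message source paths.
    """
    target_lines = []
    for mail_type in ("ham", "spam"):
        type_dir = corpus_dir / mail_type
        type_dir.mkdir(parents=True, exist_ok=True)
        for index, src in enumerate(messages.get(mail_type, ()), start=1):
            # Hypothetical link name: a counter plus the source file name.
            link = type_dir / f"{index:06d}.{src.name}"
            if not link.is_symlink():
                os.symlink(src.resolve(), link)
        # One mass-check target per type (an assumption of this sketch).
        target_lines.append(f"{mail_type}:dir:{type_dir}/*")
    targets = corpus_dir / "targets"
    targets.write_text("\n".join(target_lines) + "\n")
    return targets
```

Using symlinks rather than copies means the same source message can appear in more than one output corpus without taking up extra space.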

It attempts to use one person's corpus for each output corpus, but seeing as there's usually a glut of spam and a limited quantity of ham, it's not always anywhere near a one-to-one correlation. All the same, by looking at [http://buildbot.spamassassin.org/bbmass/corpus_makeup.txt the logs from the build process], you can see where the correlations lie.

The output looks like this on-disk:

/home/bbmass/tmpfs/cor/CORPUSNAME/TYPE/LINKNAME

Each "CORPUSNAME" directory corresponds to one of the slave names, "mc-fast", "mc-med", etc. Under that, we have "TYPE", which is either "ham" or "spam". Next, "LINKNAME". This is a readable filename for the symbolic link, which gives the reader an idea of where the message came from in the source corpora.

The 'targets' file in each "CORPUSNAME" dir, e.g. /home/bbmass/tmpfs/cor/mc-fast/targets, looks something like this:

ham:dir:/home/bbmass/tmpfs/cor/mc-fast/ham/jm/*
spam:dir:/home/bbmass/tmpfs/cor/mc-fast/spam/jm/*
ham:dir:/home/bbmass/tmpfs/cor/mc-fast/ham/username/*
spam:dir:/home/bbmass/tmpfs/cor/mc-fast/spam/username/*

i.e. a file listing all the targets to mass-check.
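For anyone wanting to inspect a 'targets' file by hand, here is a small Python sketch that reads lines of the TYPE:dir:PATH shape into named fields. The `Target` type and `parse_targets` function are hypothetical helpers for this page, not something mass-check itself provides.

```python
from typing import List, NamedTuple


class Target(NamedTuple):
    mail_type: str  # "ham" or "spam"
    fmt: str        # storage format, e.g. "dir"
    location: str   # path or glob to scan


def parse_targets(text: str) -> List[Target]:
    """Parse 'TYPE:FORMAT:LOCATION' lines, skipping blank lines."""
    targets = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first two colons only, so the location stays intact.
        parts = line.split(":", 2)
        if len(parts) != 3:
            raise ValueError(f"malformed target line: {line!r}")
        targets.append(Target(*parts))
    return targets
```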

Uploading corpora

This is done via rsync. Give somebody on the PMC a shout, since they have privileges to create an rsync area for you to upload stuff to. (If you're on the PMC, just SSH in and copy over a tarball yourself, or create yourself an rsync account using a random password.)

Once they've done this, they'll send you the username and password; you can then sync your files like so:

...