Preflight mass-checks buildbot
The preflight mass-check buildbot is up and running at http://buildbot.spamassassin.org:8011/ and is also proxied at http://buildbot.spamassassin.org/preflight/ (although the proxied version has buffering issues due to a bug in httpd mod_proxy).
Every time something is checked into SVN, the buildbot wakes up and immediately starts running mass-checks using the latest code and rules.
The corpus it mass-checks is split in a certain way so that results will be available very quickly (typically in under 10 minutes), with increasing quantities of results becoming available as time elapses.
Progress of the mass-checks is visible on the Buildbot 'waterfall' at http://buildbot.spamassassin.org:8011/; as they complete, their results become visible in the RuleQaApp.
The preflight mass-check corpus
The idea is that this corpus is split into multiple differently sized chunks. It is built from a selection of mail rsync'd up from various people, which is then "smoothed out" into several subsets. These use differing amounts of mail, starting with a small set in the "mc-fast" chunk and gradually increasing up to the largest block in "mc-slower". This division means that early "fast" results arrive quickly, with less to scan; as time goes on, more and more of the "slower" slaves complete their mass-checks and upload their results.
The filesystem layout is like this:
No Format |
---|
/home/bbmass/tmpfs/cor/CORPUSNAME/TYPE/WHO
|
Each "CORPUSNAME" directory corresponds to one of the 'slaves' listed on http://buildbot.spamassassin.org:8011/: "mc-fast", "mc-med", "mc-slow", or "mc-slower".
Under that, we have "TYPE", which is either "ham" or "spam".
Next, "WHO". This is the username of the person whose corpus it is!
Under that is another level of directories, organised however that person feels is appropriate. For example, I use date-stamped dirs here.
The result looks like this, for example:
No Format |
---|
/home/bbmass/tmpfs/cor/mc-fast/ham/jm/20051018a
/home/bbmass/tmpfs/cor/mc-fast/spam/jm/20051018a
/home/bbmass/tmpfs/cor/mc-fast/spam/jm/20051018b
|
How does mass-check discover this? At the selection level, every "CORPUSNAME" dir has a 'targets' file. For example, /home/bbmass/tmpfs/cor/mc-fast/targets contains something like the following:
No Format |
---|
ham:dir:/home/bbmass/tmpfs/cor/mc-fast/ham/jm/*
spam:dir:/home/bbmass/tmpfs/cor/mc-fast/spam/jm/*
ham:dir:/home/bbmass/tmpfs/cor/mc-fast/ham/username/*
spam:dir:/home/bbmass/tmpfs/cor/mc-fast/spam/username/*
|
That is, the file lists all the targets to mass-check.
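Since the targets file follows directly from the directory layout, it can be generated rather than maintained by hand. The following is a minimal sketch under that assumption; make_targets is a hypothetical helper name, not something shipped with mass-check, and it emits the lines grouped by type (all ham, then all spam) rather than by user.

```shell
# make_targets: print one "TYPE:dir:PATH/*" line per contributor directory
# found under CORPUSDIR/ham and CORPUSDIR/spam.
# Hypothetical helper for illustration; not part of mass-check.
make_targets() {
    corpusdir=$1
    for type in ham spam; do
        for who in "$corpusdir/$type"/*; do
            # Skip plain files, and the unexpanded glob when a dir is empty.
            [ -d "$who" ] || continue
            printf '%s:dir:%s/*\n' "$type" "$who"
        done
    done
}
```

For example, running make_targets /home/bbmass/tmpfs/cor/mc-fast prints lines of the form shown above, which could then be redirected into that chunk's targets file.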
Uploading corpora
Getting corpora in there is done via rsync. Give somebody on the PMC a shout, since they have privileges to create an rsync area for you to upload stuff to. (If you're on the PMC, just SSH in and copy over a tarball yourself!)
Once they've done this, they'll send you the username and password; you can then sync your files like so:
No Format |
---|
export RSYNC_PASSWORD=$YOURPASS
rsync -vr /path/to/your/files \
rsync://$YOURUSER@rsync.spamassassin.org/mailcorpus_$YOURUSER
|
(where $YOURPASS and $YOURUSER are whatever the PMC guy mailed to you.)
It's important that you have 2 dirs in the /path/to/your/files directory, ham and spam. Any files ending in .mbox inside those dirs will be treated as UNIX mbox-format files; any other files will be treated as individual messages (one message per file).
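Before uploading, it can be useful to confirm that the directory has the expected shape. This is a rough sketch of such a check, assuming only the rules just described; check_upload_tree is a hypothetical name, and nothing like it is provided by the rsync server.

```shell
# check_upload_tree: verify that DIR contains ham/ and spam/, then report how
# each file under them would be read: as an mbox (name ends in .mbox) or as
# a single message (anything else).
# Hypothetical pre-upload check for illustration only.
check_upload_tree() {
    dir=$1
    for type in ham spam; do
        if [ ! -d "$dir/$type" ]; then
            echo "missing directory: $dir/$type" >&2
            return 1
        fi
    done
    find "$dir/ham" "$dir/spam" -type f | sort | while IFS= read -r file; do
        case $file in
            *.mbox) echo "mbox    $file" ;;
            *)      echo "message $file" ;;
        esac
    done
}
```

Running check_upload_tree /path/to/your/files fails with a message on stderr if either directory is missing; otherwise it lists every file along with the way it would be interpreted.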
Administrivia
Some stuff for PMC people hacking on this...
Admin: Creating a new rsync area for someone to upload corpora
No Format |
---|
CORPUSUSER="[username you want to give out]"
sudo vi /etc/rsyncd.conf
cd /home/bbmass/rawcor/
mkdir $CORPUSUSER
chmod 1777 $CORPUSUSER
|
Then create a random password string, and add a line to /home/corpus-rsync/secrets with $CORPUSUSER and that password.
Finally, let the submitter know their new username and password.
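The steps above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the module settings shown are assumptions rather than a copy of the real /etc/rsyncd.conf, and the module name simply follows the mailcorpus_$YOURUSER pattern used in the upload instructions. The secrets line uses rsync's usual user:password format.

```shell
# new_password: print a random 24-character hex string (96 bits of entropy).
new_password() {
    od -An -N12 -tx1 /dev/urandom | tr -d ' \n'
    echo
}

# rsync_module_stanza: print a candidate rsyncd.conf module for CORPUSUSER.
# The settings are assumed, not taken from the live configuration.
rsync_module_stanza() {
    user=$1
    cat <<EOF
[mailcorpus_$user]
    path = /home/bbmass/rawcor/$user
    auth users = $user
    secrets file = /home/corpus-rsync/secrets
    read only = false
EOF
}

# secrets_line: print the "user:password" line for the rsync secrets file.
secrets_line() {
    printf '%s:%s\n' "$1" "$2"
}
```

The output of rsync_module_stanza and secrets_line would still need to be reviewed and added by hand to /etc/rsyncd.conf and /home/corpus-rsync/secrets respectively.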
"smoothing" and subset selection happens in mass-check nowadays.
What happens during the preflight buildbot process
As you can see, there are four steps performed by each buildbot slave, as follows:
Update: performs an 'svn update' to load the latest code.
Configure: runs 'perl Makefile.PL' and 'make' to compile the rules.
Test: the mass-check takes place here. This is usually the time-consuming part.
Configure: a final summarisation step. First, a 'FAST FREQS REPORT' (the HitFrequencies from the mass-check) is output. Next, the logs from the mass-check are copied to a safe location, and the 'corpus-hourly' script is run to generate various reports from them for the RuleQaApp. The URL for viewing the results in the RuleQaApp is printed prominently.
Uploading corpora
See UploadedCorpora.
Admin: Creating a new buildbot slave to perform mass-checks
...
No Format |
---|
PASSWORD=[randompassword]
NAME=mc-new
sudo mkdir -p /home/bbmass/slaves/$NAME
sudo chown bbmassbuildbot /home/bbmass/slaves/$NAME
cd /home/bbmass/slaves/$NAME
sudo su bbmassbuildbot -c \
  "mktap buildbot slave --basedir /home/bbmass/slaves/$NAME \
  --master buildbot.spamassassin.org:9988 --name $NAME \
  --passwd $PASSWORD --usepty=0"
[on newer Buildbot versions, 'buildbot create-slave --usepty=0 ...' replaces 'mktap buildbot slave']
echo $PASSWORD > $HOME/pwd
sudo mv $HOME/pwd /home/buildbot/pwds/$NAME
sudo chown buildbot /home/buildbot/pwds/$NAME
sudo chmod 600 /home/buildbot/pwds/$NAME
sudo vi /home/buildbot/bots/bbmass/master.cfg
[search for mc-fast and add new lines/entries for $NAME]
[don't forget the 'scheduler' part!]
sudo vi /etc/init.d/buildbotbbmass
[search for mc-fast and add new lines/entries for $NAME]
|
(history: this was planned at RulesProjBuildBot)