Comment: [Original edit by JustinMason] refactor shared stuff into new page

...

Configure; a final summarisation step: first, a 'FAST FREQS REPORT' is output, giving the HitFrequencies from the mass-check. Next, the logs from the mass-check are copied to a safe location, and the 'corpus-hourly' script is run to generate various reports from them for the RuleQaApp. The URL for viewing the results in the RuleQaApp is printed prominently.

Administrivia: how the corpus is generated

The filesystem layout of the corpora rsynced up to the server looks like this:

No Format

/home/bbmass/rawcor/WHO/TYPE/FOLDER

"WHO" is the person who submitted it via rsync, e.g. "doc", "jm", "zmi".

Under that, we have "TYPE", which is either "ham" or "spam".

Under that, "FOLDER", which is whatever the person feels is appropriate. For example, I use date-stamped dirs here. It is also possible to use mboxes, as long as they are files and their filename ends in ".mbox".
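The layout above can be checked mechanically. The following is a minimal sketch, not part of the real tooling: the function name `describe_corpus_path` and the `RAWCOR_ROOT` variable are hypothetical, and it only assumes the WHO/TYPE/FOLDER convention described here.

```shell
#!/bin/sh
# Hypothetical helper: split a raw-corpus path into WHO, TYPE and FOLDER.
# RAWCOR_ROOT is an assumed variable; it defaults to the path shown above.
RAWCOR_ROOT="${RAWCOR_ROOT:-/home/bbmass/rawcor}"

describe_corpus_path() {
    path=$1
    case "$path" in
        "$RAWCOR_ROOT"/*) rel=${path#"$RAWCOR_ROOT"/} ;;
        *) echo "not under $RAWCOR_ROOT: $path" >&2; return 1 ;;
    esac

    who=${rel%%/*}          # first component: the submitter
    rest=${rel#*/}
    if [ "$rest" = "$rel" ]; then
        echo "missing TYPE in: $path" >&2; return 1
    fi

    class=${rest%%/*}       # second component: ham or spam
    folder=${rest#*/}       # everything below that
    if [ "$folder" = "$rest" ]; then
        echo "missing FOLDER in: $path" >&2; return 1
    fi

    case "$class" in
        ham|spam) ;;
        *) echo "TYPE must be ham or spam, got: $class" >&2; return 1 ;;
    esac

    # Names ending in .mbox are mboxes; anything else is a plain folder.
    case "$folder" in
        *.mbox) kind=mbox ;;
        *)      kind=folder ;;
    esac

    echo "who=$who type=$class folder=$folder kind=$kind"
}
```

For example, `describe_corpus_path /home/bbmass/rawcor/jm/spam/2006-01-15` prints `who=jm type=spam folder=2006-01-15 kind=folder`.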

Then, the corpus is created from the UploadedCorpora. The script 'populate_cor' is run from cron periodically to rebuild the mass-checkable corpus from this. It attempts to 'smooth out' the multiple corpora into several new corpora, named "mc-fast", "mc-med", "mc-slow", "mc-slower", matching the buildbot slave names at http://buildbot.spamassassin.org/preflight/ .

...

Each "CORPUSNAME" directory corresponds to one of the slave names, "mc-fast", "mc-med", etc. Under that, we have "TYPE", which is either "ham" or "spam". Next, "LINKNAME". This is a readable filename for the symbolic link, which gives the reader an idea of where the message came from in the source corpora.

Uploading corpora

This is done via rsync.

Give somebody on the PMC a shout, since they have privileges to create an rsync area for you to upload stuff to. (If you're on the PMC, just SSH in and copy over a tarball yourself, or create yourself an rsync account using a random password.)

Once they've done this, they'll send you the username and password; you can then sync your files like so:

No Format

  export RSYNC_PASSWORD=$YOURPASS
  rsync -vr /path/to/your/files \
      rsync://$YOURUSER@rsync.spamassassin.org/mailcorpus_$YOURUSER

(where $YOURPASS and $YOURUSER are whatever the PMC guy mailed to you.)

It's important that you have two dirs in the /path/to/your/files directory: ham and spam. Any files ending in .mbox inside those dirs will be treated as UNIX mbox-format files; any other files will be treated as individual messages (one message per file).
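Before syncing, it may help to confirm that the local directory has the expected shape. This is a rough sketch only: `check_upload_layout` is a hypothetical name, and the script does nothing beyond checking for the ham and spam dirs and counting which files would be treated as mboxes and which as single messages.

```shell
#!/bin/sh
# Hypothetical pre-upload check: verify that ham/ and spam/ exist under the
# given directory and report how the files in each would be interpreted.
check_upload_layout() {
    root=$1
    status=0
    for class in ham spam; do
        if [ ! -d "$root/$class" ]; then
            echo "missing directory: $root/$class" >&2
            status=1
            continue
        fi
        # Files ending in .mbox are mbox-format; all others are one message each.
        mboxes=$(find "$root/$class" -type f -name '*.mbox' | wc -l)
        singles=$(find "$root/$class" -type f ! -name '*.mbox' | wc -l)
        # $(( ... )) strips the padding some wc implementations add.
        echo "$class: $((mboxes)) mbox file(s), $((singles)) single-message file(s)"
    done
    return $status
}
```

Running `check_upload_layout /path/to/your/files` returns non-zero if either directory is missing, so it can gate the rsync command.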

Administrivia

Some stuff for PMC people hacking on this...

Admin: Creating a new rsync area for someone to upload corpora

No Format

sudo vi /etc/rsyncd.conf

Add something like this to the end, changing "CORPUSUSER" to the username you want to give out:

No Format

[mailcorpus_CORPUSUSER]
        path = /home/bbmass/rawcor/CORPUSUSER
        read only = false
        auth users = CORPUSUSER
        secrets file = /home/corpus-rsync/secrets

Next, create the upload directory for that user:

No Format

CORPUSUSER="[username you want to give out]"
cd /home/bbmass/rawcor/
mkdir $CORPUSUSER
chmod 1777 $CORPUSUSER

Then create a random password string, and add a line to /home/corpus-rsync/secrets with $CORPUSUSER and that password.
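One way to do the password step is sketched below. The function name `add_rsync_secret` is hypothetical, and the sketch assumes `head -c` and `base64` are available on the server. It relies on the rsyncd secrets file format of one `username:password` pair per line, and keeps the file unreadable by other users, which rsyncd requires when "strict modes" is enabled.

```shell
#!/bin/sh
# Hypothetical helper: generate a random password for a corpus user and
# append a "user:password" line to the given rsyncd secrets file.
add_rsync_secret() {
    user=$1
    secrets=$2

    # Refuse to add a second entry for the same user.
    if [ -f "$secrets" ] && grep -q "^$user:" "$secrets"; then
        echo "user already present in $secrets: $user" >&2
        return 1
    fi

    # 18 random bytes encode to exactly 24 base64 characters, with no padding.
    pass=$(head -c 18 /dev/urandom | base64)

    # Create or append with restrictive permissions.
    ( umask 077; printf '%s:%s\n' "$user" "$pass" >> "$secrets" ) || return 1
    chmod 600 "$secrets"

    # Print the password so it can be passed on to the submitter.
    echo "$pass"
}
```

On the server this would be called as something like `add_rsync_secret "$CORPUSUSER" /home/corpus-rsync/secrets`, run with whatever privileges are needed to write that file.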

Finally, let the submitter know their new username and password. See UploadedCorpora.

Admin: Creating a new buildbot slave to perform mass-checks

...