Differences between revisions 7 and 8
Revision 7 as of 2008-01-30 23:30:50
Size: 10198
Editor: MattKettler
Comment: Added note about SVN for masses in 3.2.0+
Revision 8 as of 2009-09-20 23:16:30
Size: 10195
Editor: localhost
Comment: converted to 1.6 markup

Tools in the SpamAssassin masses folder

The masses folder was included in the source tarball prior to SA 3.2.0. Since few users actually use these tools, they have been dropped from the distribution. If you have need of them, they can be publicly fetched from SVN.

This is an overview of the scripts in the SpamAssassin masses folder. In brief, these scripts are used to mass-check hand-classified corpora and to calculate new scores with the perceptron approach from the results of a mass check. Four different scoresets must be calculated for the rules, depending on whether the bayes or the net option is used:

set0: no bayes, no net
set1: no bayes, net
set2: bayes, no net
set3: bayes, net

A scoreset is one of the 4 columns in a score file like "../rules/50_scores.cf".
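
As a rough illustration of the column idea, the following Python sketch picks the value for one scoreset out of a single "score" line. The function name and the rule name EXAMPLE_RULE are hypothetical and not part of the masses tools. The sketch assumes that a line with only one value applies that value to every scoreset.

```python
def score_for_scoreset(score_line: str, scoreset: int) -> float:
    """Return the score that applies to `scoreset` (0-3) from one 'score' line."""
    if not 0 <= scoreset <= 3:
        raise ValueError("scoreset must be 0, 1, 2 or 3")
    # Drop a trailing comment, then split into whitespace-separated fields.
    fields = score_line.split("#", 1)[0].split()
    if len(fields) < 3 or fields[0] != "score":
        raise ValueError(f"not a score line: {score_line!r}")
    values = [float(v) for v in fields[2:]]
    # Assumption: a single value is used for all four scoresets.
    if len(values) == 1:
        return values[0]
    if len(values) != 4:
        raise ValueError("expected 1 or 4 score values")
    return values[scoreset]


line = "score EXAMPLE_RULE 1.0 0.5 2.0 1.5"  # hypothetical rule
print(score_for_scoreset(line, 2))  # set2 (bayes, no net): prints 2.0
```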

cpucount

This script counts the number of CPUs in your system.

usage:
cpucount

cpucount calls:
no other scripts

fp-fn-statistics

Tests a scoreset and *.log files for false-positives and false-negatives and returns a statistic.

usage:
fp-fn-statistics [options]

--cffile=file

path to *.cf files. Default: "../rules"

--lambda=value

lambda value, default: 50

--threshold=value

mails above the threshold are classified as spam

--spam=file

spam logfile, default: "spam.log"

--ham=file

ham logfile, default: "ham.log"

--scoreset=value

scoreset (0-3), default: 0

--fplog=file

logfile for false positives (ham classified as spam)

--fnlog=file

logfile for false negatives (spam classified as ham)

fp-fn-statistics calls:
logs-to-c with --count option
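
The classification that this statistic is based on can be sketched in Python as follows. This is only an illustration of the false-positive and false-negative definitions, not the script's code. The Message tuple and the function name are hypothetical, the real tool reads mass-check logfiles rather than in-memory tuples, and the lambda option is not modeled. How a score exactly equal to the threshold is treated is an assumption here.

```python
from typing import List, NamedTuple, Tuple


class Message(NamedTuple):
    name: str
    score: float
    is_spam: bool  # hand classification from the corpus


def fp_fn(messages: List[Message], threshold: float) -> Tuple[List[str], List[str]]:
    """Return (false positives, false negatives) for one threshold."""
    # False positive: ham that scores above the threshold.
    false_positives = [m.name for m in messages
                       if not m.is_spam and m.score > threshold]
    # False negative: spam that does not score above the threshold.
    false_negatives = [m.name for m in messages
                       if m.is_spam and m.score <= threshold]
    return false_positives, false_negatives


messages = [
    Message("ham-1", 1.2, False),
    Message("ham-2", 6.3, False),
    Message("spam-1", 9.8, True),
    Message("spam-2", 3.1, True),
]
fps, fns = fp_fn(messages, threshold=5.0)
print("false positives:", fps)
print("false negatives:", fns)
```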

hit-frequencies

see HitFrequencies.

hit-frequencies calls:
parse-rules-for-masses

lint-rules-from-freqs

This script checks the rules for usability. It reads a freqs file generated by hit-frequencies (with the -x -p options) and a scoreset, and returns the bad rules. The following rules are marked as bad:
Rules that rarely hit (below 0.03%) or do not hit at all, rules with a negative score that have a higher spam-hit rate than ham-hit rate, rules with a positive score that have a higher ham-hit rate than spam-hit rate, and rules with score = 0.

usage:
lint-rules-from-freqs [-f falsefreqs] [-s scoreset] < freqs > badtests

-f falsefreqs

also use a "falsefreqs" file for the analysis, generated with hit-frequencies and the -x -p -f options.

-s scoreset

scoreset (0-3).

lint-rules-from-freqs calls:
no other scripts
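
The criteria for bad rules listed above can be expressed as a small Python check. This is a sketch under assumptions: the RuleFreq record and its field names are hypothetical, the 0.03% limit is applied to an overall hit percentage, and the real script reads a freqs file rather than in-memory records.

```python
from typing import List, NamedTuple

RARE_HIT_PCT = 0.03  # rules hitting less often than this are considered rare


class RuleFreq(NamedTuple):
    name: str
    score: float
    overall_pct: float  # % of all messages hit (assumed basis for "rarely hit")
    spam_pct: float     # % of spam messages hit
    ham_pct: float      # % of ham messages hit


def bad_reasons(rule: RuleFreq) -> List[str]:
    """Return the reasons a rule would be marked bad (empty list if none)."""
    reasons = []
    if rule.overall_pct < RARE_HIT_PCT:
        reasons.append("hits rarely or not at all")
    if rule.score == 0:
        reasons.append("score is 0")
    elif rule.score < 0 and rule.spam_pct > rule.ham_pct:
        reasons.append("negative score but hits more spam than ham")
    elif rule.score > 0 and rule.ham_pct > rule.spam_pct:
        reasons.append("positive score but hits more ham than spam")
    return reasons


rules = [
    RuleFreq("GOOD_RULE", 2.5, 4.0, 7.5, 0.1),
    RuleFreq("RARE_RULE", 1.0, 0.01, 0.02, 0.0),
    RuleFreq("BACKWARDS_RULE", 1.5, 3.0, 0.5, 6.0),
]
for rule in rules:
    print(rule.name, bad_reasons(rule))
```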

logs-to-c

Generates several files in the /tmp folder: "ranges.data", "scores.data", "scores.h", "tests.data", "tests.h". Those files are later used by the perceptron script. This script is also used to test scoresets and *.log files for false-positives and false-negatives (use --count).

usage:
logs-to-c [options]

--cffile=file

path to *.cf files. Default: "../rules"

--count

create fp-fn statistic

--lambda=value

lambda value, default: 50

--threshold=value

mails above the threshold are classified as spam

--spam=file

spam logfile, default: "spam.log"

--ham=file

ham logfile, default: "ham.log"

--scoreset=value

scoreset (0-3), default: 0

--fplog=file

logfile for false positives (ham classified as spam)

--fnlog=file

logfile for false negatives (spam classified as ham)

logs-to-c calls:
parse-rules-for-masses
score-ranges-from-freqs

mass-check

see MassCheck.

mass-check calls:
no other scripts in the masses folder

mk-baseline-results

Shell script that tests a scoreset and the files "ham-test.log" and "spam-test.log" for false-positives and false-negatives with various thresholds ranging from -4 up to 20. Returns a statistic for all thresholds.

usage:
mk-baseline-results scoreset

scoreset

desired scoreset (0-3)

mk-baseline-results calls:
logs-to-c
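
To show what a per-threshold statistic looks like, here is a small Python sketch that sweeps thresholds from -4 to 20 over a list of scored messages. It is an illustration only: the function name and sample data are hypothetical, the step size of 1 is an assumption, and the real script delegates the counting to logs-to-c.

```python
from typing import Iterable, List, Tuple

Scored = Tuple[float, bool]  # (message score, hand-classified as spam?)


def baseline_table(scored: List[Scored],
                   thresholds: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Return one (threshold, false positives, false negatives) row per threshold."""
    rows = []
    for t in thresholds:
        fp = sum(1 for score, is_spam in scored if not is_spam and score > t)
        fn = sum(1 for score, is_spam in scored if is_spam and score <= t)
        rows.append((t, fp, fn))
    return rows


scored = [(0.5, False), (6.0, False), (3.0, True), (12.0, True)]
for t, fp, fn in baseline_table(scored, range(-4, 21)):
    print(f"threshold {t:3d}: {fp} false positives, {fn} false negatives")
```

Raising the threshold trades false positives for false negatives, which is why the statistic is reported for the whole range rather than for a single value.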

parse-rules-for-masses

Parses the rules in all *.cf files that begin with a digit and are located in the "../rules" folder. It generates a file called "/tmp/rules.pl" that contains a dump of two Perl hashes, %rules and %scores, which other Perl scripts can include directly with the require command.
The %rules hash holds a set of data for every rule: its score, a description, the type, whether the rule is mutable, and whether it is a subrule. The %scores hash holds one score for every rule.

usage:
parse-rules-for-masses [-d rulesdir] [-o outputfile] [-s scoreset]

-d

directory of the rules, default: ../rules

-o

output file, default: ./tmp/rules.pl

-s

scoreset (0-3), default: 0

parse-rules-for-masses calls:
no other scripts

perceptron

Calculates new scores with the perceptron approach and generates a perceptron.scores file. Needs the following files in the /tmp folder: "ranges.data", "scores.data", "scores.h", "tests.data", "tests.h", "rules.pl"

usage:
perceptron [options]

-p ham_preference

adds extra ham to training set multiplied by number of tests hit (2.0 default)

-e num_epochs

number of epochs to train (15 default)

-l learning_rate

learning rate for gradient descent (2.0 default)

-t threshold

minimum threshold for spam (5.0 default)

-w weight_decay

per-epoch decay of learned weight and bias (1.0 default)

-h

print help

perceptron calls:
no other scripts
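
For readers unfamiliar with the approach, the following Python sketch shows a textbook perceptron that learns one score per rule and keeps each score inside an allowed range. It is not the code of the perceptron program: the function name, rule names, ranges and data are hypothetical, and the ham preference and weight decay options are not modeled.

```python
from typing import Dict, List, Set, Tuple

Example = Tuple[Set[str], bool]  # (rules hit by the message, is it spam?)


def train_scores(examples: List[Example],
                 ranges: Dict[str, Tuple[float, float]],
                 threshold: float = 5.0,
                 learning_rate: float = 0.5,
                 epochs: int = 15) -> Dict[str, float]:
    """Learn per-rule scores with the classic perceptron update rule."""
    # Assumption: start every score in the middle of its allowed range.
    scores = {rule: (lo + hi) / 2 for rule, (lo, hi) in ranges.items()}
    for _ in range(epochs):
        for hits, is_spam in examples:
            total = sum(scores[rule] for rule in hits)
            predicted_spam = total > threshold
            if predicted_spam == is_spam:
                continue  # correctly classified, no update
            # Push scores up for missed spam, down for misclassified ham.
            step = learning_rate if is_spam else -learning_rate
            for rule in hits:
                lo, hi = ranges[rule]
                scores[rule] = min(hi, max(lo, scores[rule] + step))
    return scores


ranges = {"RULE_A": (0.0, 6.0), "RULE_B": (-3.0, 0.0)}  # hypothetical rules
examples = [({"RULE_A"}, True), ({"RULE_A", "RULE_B"}, False), ({"RULE_B"}, False)]
print(train_scores(examples, ranges))
```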

rewrite-cf-with-new-scores

Rewrites a cf file with new scores. Only the area with the generated scores is changed. The argument scoreset is the number of the scoreset (0-3) that is rewritten. The new cf-file is returned on the standard output.

usage:
rewrite-cf-with-new-scores [scoreset] [oldscores.cf] [newscores.cf]

scoreset

desired scoreset to write (0-3)

oldscores.cf

old scores

newscores.cf

new scores

rewrite-cf-with-new-scores calls:
no other scripts

runGA

Shell script that compiles and runs the perceptron script. New scores are calculated with the perceptron approach from a random 9/10 of the examples in the "*.log" files. The scores are then tested for false-positives and false-negatives on the remaining 1/10 of the examples.
Needs a "config" file in the "./" folder that contains some parameters:

SCORESET=value

number of the scoreset (0-3)

HAM_PREFERENCE=value

ham preference for the perceptron

THRESHOLD=value

minimum threshold for spam

EPOCHS=value

number of epochs to train the perceptron

The "*.log" files for the chosen scoreset X (named "ham-setX.log" and "spam-setX.log") are required in the "/ORIG" folder. The script generates several files in the "/tmp" folder by calling logs-to-c, and creates a new folder ("gen*") named after the options in the config file. This folder contains: a "scores" file with the generated scores and corresponding ranges; the "*.log" files that were used for score generation and for testing (in the "/NSBASE" and "/SPBASE" folders); lists of the false-negatives and false-positives found in the test; a logfile with the parameters used for score generation; the output of the makefile ("make.output"); and a false-positives vs. false-negatives statistic ("test").
The runGA script also generates a "badrules" file by calling lint-rules-from-freqs. It lists rules that are not useful for various reasons (most of them hit too rarely or not at all).
Note that the generated scores may vary somewhat if runGA is run twice, due to the random selection of the training examples.
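
The effect of the random selection can be illustrated with a short Python sketch of a 9/10 training and 1/10 test split. This is not how runGA itself is implemented: the function name, the seed parameter and the sample lines are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def split_train_test(lines: List[str],
                     test_fraction: float = 0.1,
                     seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Randomly split log lines into (training, test) parts."""
    rng = random.Random(seed)  # seed=None gives a different split on every run
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


lines = [f"message-{i}" for i in range(100)]  # stand-ins for log entries
train, test = split_train_test(lines)
print(len(train), "training examples,", len(test), "test examples")
```

Because a different 9/10 of the examples ends up in the training part on each unseeded run, the learned scores can differ slightly between runs.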

usage:
runGA (parameters are saved in a "config" file)

runGA calls:
fp-fn-statistics
lint-rules-from-freqs
logs-to-c
mk-baseline-results
numcpus
parse-rules-for-masses
perceptron
rewrite-cf-with-new-scores
score-ranges-from-freqs
tenpass/split-log-into-buckets-random

score-ranges-from-freqs

Calculates a score-range for the rules. The magnitude of the range depends on the ranking (generated by hit-frequencies) of a rule. Immutable rules get fixed ranges at their scores. The ranges are later used by the perceptron script that tries to find the optimal scores within these ranges.

usage:
score-ranges-from-freqs [cffiledir] [scoreset] < freqs

cffiledir

directory of the rules, default: "../rules"

scoreset

desired scoreset (0-3)

score-ranges-from-freqs calls:
parse-rules-for-masses

split-log-into-buckets-random

Splits a mass-check log into n identically-sized buckets, evenly taking messages from all checked corpora and preserving comments. Creates n files named "split-n.log".

usage:
split-log-into-buckets-random [n] < LOGFILE

n

number of buckets, default: 10

split-log-into-buckets-random calls:
no other scripts
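
The bucket idea can be sketched in Python as follows. This is a simplified illustration, not the script's code: it assumes that comment lines start with "#", copies every comment into each bucket, and does not balance messages per corpus as the real script is described to do. The function name and sample lines are hypothetical.

```python
import random
from typing import List, Optional


def split_into_buckets(lines: List[str], n: int = 10,
                       seed: Optional[int] = None) -> List[List[str]]:
    """Distribute log lines randomly over n buckets of near-equal size."""
    comments = [line for line in lines if line.startswith("#")]
    messages = [line for line in lines if not line.startswith("#")]
    rng = random.Random(seed)
    rng.shuffle(messages)
    # Every bucket starts with a copy of the comment lines.
    buckets = [list(comments) for _ in range(n)]
    # Deal the shuffled messages out round-robin so sizes differ by at most 1.
    for i, line in enumerate(messages):
        buckets[i % n].append(line)
    return buckets


lines = ["# hypothetical log header"] + [f"message-{i}" for i in range(25)]
for number, bucket in enumerate(split_into_buckets(lines, n=10), start=1):
    print(f"split-{number}.log would get {len(bucket) - 1} messages")
```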

MassesOverview (last edited 2009-09-20 23:16:30 by localhost)