Tools in the SpamAssassin /masses folder
This is an overview of the scripts in the SpamAssassin /masses folder. In brief, these scripts are used to mass-check hand-classified corpora and to calculate new scores with the perceptron approach from the results of a mass check. Four different scoresets have to be calculated for the rules, depending on whether the bayes and net options are used:
set0: no bayes, no net
set1: no bayes, net
set2: bayes, no net
set3: bayes, net
A scoreset is one of the 4 columns in a score file like "../rules/50_scores.cf"
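The mapping from the two options to a scoreset number, and from a scoreset number to a column, can be sketched as follows. This is only an illustration: the function names and the rule name EXAMPLE_RULE are hypothetical, and the sketch assumes a score line that lists all four columns.

```python
def scoreset_index(use_bayes: bool, use_net: bool) -> int:
    """Map the bayes/net combination to a scoreset number (0-3)."""
    return (2 if use_bayes else 0) + (1 if use_net else 0)


def score_for(line: str, scoreset: int) -> float:
    """Pick one column from a line of the form 'score NAME s0 s1 s2 s3'."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "score":
        raise ValueError(f"not a four-column score line: {line!r}")
    return float(fields[2 + scoreset])


# EXAMPLE_RULE and its values are made up for illustration.
line = "score EXAMPLE_RULE 1.0 1.5 0.8 1.2"
set2 = scoreset_index(use_bayes=True, use_net=False)
print(set2, score_for(line, set2))
```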
cpucount
This script counts the number of CPUs in your system.
usage:
cpucount
cpucount calls:
no other scripts
fp-fn-statistics
Tests a scoreset and *.log files for false-positives and false-negatives and returns a statistic.
usage:
fp-fn-statistics [options]
--cffile=file | path to *.cf files, default: "../rules"
--lambda=value | lambda value, default: 50
--threshold=value | mails above the threshold are classified as spam
--spam=file | spam logfile, default: "spam.log"
--ham=file | ham logfile, default: "ham.log"
--scoreset=value | scoreset (0-3), default: 0
--fplog=file | false-positives logfile (list of false positives)
--fnlog=file | false-negatives logfile (list of false negatives)
fp-fn-statistics calls:
logs-to-c with --count option
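The kind of statistic described here can be sketched in a few lines. This is a minimal illustration, not the script's own code: it assumes the total score of every message is already known, and the names Counts and fp_fn_counts are hypothetical.

```python
from typing import Iterable, NamedTuple


class Counts(NamedTuple):
    false_positives: int  # ham classified as spam
    false_negatives: int  # spam classified as ham
    ham_total: int
    spam_total: int


def fp_fn_counts(ham_scores: Iterable[float],
                 spam_scores: Iterable[float],
                 threshold: float) -> Counts:
    """Count misclassifications; mails above the threshold count as spam."""
    ham = list(ham_scores)
    spam = list(spam_scores)
    fp = sum(1 for s in ham if s > threshold)
    fn = sum(1 for s in spam if s <= threshold)
    return Counts(fp, fn, len(ham), len(spam))


counts = fp_fn_counts([0.1, 5.0, 6.2], [7.5, 4.9, 5.0], threshold=5.0)
print(counts)
```

A message that scores exactly at the threshold is treated as ham here, following the wording "above the threshold".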
hit-frequencies
This script counts the occurrences of rules in *.log files and calculates some details of their accuracy in hitting spam vs. ham. The rules are taken from the *.cf files in the "../rules" folder whose names begin with a digit.
A statistic with the following columns is returned:
OVERALL% | the percentage of all mail that the rule hits
SPAM% | the percentage of spam mails hit by the rule
HAM% | the percentage of ham mails hit by the rule
S/O | "spam over overall": the probability that, when the rule fires, it hits on a spam message
RANK | a ranking that indicates how "good" the rule is
IG | information gain of the rule, normalized to a value between 0 and 1. Intuitively, this shows how much knowing whether the rule hit helps to guess the correct classification of an e-mail
SCORE | the score listed in "../rules/50_scores.cf" for that rule
NAME | the rule's name
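The percentage columns and S/O can be illustrated with a short sketch. The function name rule_stats is hypothetical, and S/O is taken here as SPAM% / (SPAM% + HAM%), which weights both corpora equally; that is an assumption, and the script's exact formula may differ. RANK and IG are not reproduced.

```python
from typing import Optional, Tuple


def rule_stats(spam_hits: int, ham_hits: int,
               spam_total: int, ham_total: int
               ) -> Tuple[float, float, float, Optional[float]]:
    """Return (OVERALL%, SPAM%, HAM%, S/O) for one rule."""
    spam_pct = 100.0 * spam_hits / spam_total
    ham_pct = 100.0 * ham_hits / ham_total
    overall_pct = 100.0 * (spam_hits + ham_hits) / (spam_total + ham_total)
    # Assumption: S/O is computed from the two percentages, not raw counts.
    if spam_pct + ham_pct > 0:
        s_over_o = spam_pct / (spam_pct + ham_pct)
    else:
        s_over_o = None  # the rule never fired, so S/O is undefined
    return overall_pct, spam_pct, ham_pct, s_over_o


print(rule_stats(spam_hits=50, ham_hits=1, spam_total=1000, ham_total=1000))
```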
usage:
hit-frequencies [-c rules dir] [-f] [-m RE] [-M RE] [-X RE] [-l LC] [-s SC] [-a] [-p] [-x] [-i] [spam log] [ham log]
-c p | use p as the rules directory, default: "../rules"
-f | falses. count only false-negative or false-positive matches
-m RE | print rules matching regular expression
-t RE | print rules with tflags matching regular expression
-M RE | only consider log entries matching regular expression
-X RE | don't consider log entries matching regular expression
-l LC | also print language specific rules for lang code LC (or 'all')
-L LC | only print language specific rules for lang code LC (or 'all')
-a | display all tests
-p | percentages. implies -x
-x | extended output, with S/O ratio and scores
-s SC | which scoreset to use
-i | use IG (information gain) for ranking
Options -l and -L are mutually exclusive.
Options -M and -X are *not* mutually exclusive.
If either the spam log or the ham log is unspecified, the defaults are "spam.log" and "ham.log" in the current working directory.
hit-frequencies calls:
parse-rules-for-masses
lint-rules-from-freqs
This script analyzes the rules for usability. It uses a freqs file generated by hit-frequencies (with the -x -p options) and a scoreset, and returns the bad rules. The following rules are marked as bad:
rules that rarely hit (below 0.03%) or don't hit at all
rules with a negative score that have a higher spam-hit rate than ham-hit rate
rules with a positive score that have a higher ham-hit rate than spam-hit rate
rules with score = 0
usage:
lint-rules-from-freqs [-f falsefreqs] [-s scoreset] < freqs > badtests
-f falsefreqs | also use a "falsefreqs" file for the analysis, generated by hit-frequencies with the -x -p -f options
-s scoreset | scoreset (0-3)
lint-rules-from-freqs calls:
no other scripts
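The four conditions for a bad rule can be written down as a small check. This is a sketch under assumptions: the function name is hypothetical, the 0.03% limit is applied to the overall hit rate, and the inputs are the percentages from a freqs file.

```python
from typing import List

RARE_HIT_PCT = 0.03  # rules hitting less often than this are "rare"


def bad_rule_reasons(score: float, overall_pct: float,
                     spam_pct: float, ham_pct: float) -> List[str]:
    """Return the reasons a rule would be flagged; an empty list means it is fine."""
    reasons = []
    if overall_pct < RARE_HIT_PCT:
        reasons.append("rarely hits or never hits")
    if score < 0 and spam_pct > ham_pct:
        reasons.append("negative score but hits more spam than ham")
    if score > 0 and ham_pct > spam_pct:
        reasons.append("positive score but hits more ham than spam")
    if score == 0:
        reasons.append("score is 0")
    return reasons


print(bad_rule_reasons(score=-1.0, overall_pct=2.0, spam_pct=3.0, ham_pct=1.0))
```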
logs-to-c
Generates several files in the /tmp folder: "ranges.data", "scores.data", "scores.h", "tests.data", "tests.h". These files are later used by the perceptron script. This script is also used to test scoresets and *.log files for false-positives and false-negatives (use --count).
usage:
logs-to-c [options]
--cffile=file | path to *.cf files, default: "../rules"
--count | create fp-fn statistic
--lambda=value | lambda value, default: 50
--threshold=value | mails above the threshold are classified as spam
--spam=file | spam logfile, default: "spam.log"
--ham=file | ham logfile, default: "ham.log"
--scoreset=value | scoreset (0-3), default: 0
--fplog=file | false-positives logfile (list of false positives)
--fnlog=file | false-negatives logfile (list of false negatives)
logs-to-c calls:
parse-rules-for-masses
score-ranges-from-freqs
mass-check
Checks hand-classified corpora and creates two files, "ham.log" and "spam.log", containing a code and the hitting rules for every tested e-mail.
usage:
mass-check [options] target ...
-c=file | set configuration/rules directory
-p=dir | set user-prefs directory
-f=file | read list of targets from <file>
-j=jobs | specify the number of processes to run simultaneously
--net | turn on network checks!
--mid | report Message-ID from each message
--debug | report debugging information
--progress | show progress updates during check
--rewrite=OUT | save rewritten message to OUT (default is /tmp/out)
--showdots | print a dot for each scanned message
--rules=RE | only test rules matching the given regexp RE
--restart=N | restart all of the children after processing N messages
--deencap=RE | extract SpamAssassin-encapsulated spam mails only if they were encapsulated by servers matching the regexp RE (default = extract all SpamAssassin-encapsulated mails)
log options
-o | write all logs to stdout
--loghits | log the text hit for patterns (useful for debugging)
--loguris | log the URIs found
--hamlog=log | use <log> as ham log ('ham.log' is default)
--spamlog=log | use <log> as spam log ('spam.log' is default)
message selection options
-n | no date sorting or spam/ham interleaving
--after=N | only test mails received after time_t N (negative values are an offset from current time, e.g. -86400 = last day) or after date as parsed by Time::Parsedate (e.g. '-6 months')
--before=N | same as --after, except received times are before time_t N
--all | don't skip big messages
--head=N | only check first N ham and N spam (N messages if -n used)
--tail=N | only check last N ham and N spam (N messages if -n used)
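The numeric form of the --after and --before arguments can be illustrated as below. The helper name resolve_time_arg is hypothetical, and the sketch covers only integer time_t values; date strings such as '-6 months' are handled by the Perl date parser and are not reproduced here.

```python
import time
from typing import Optional


def resolve_time_arg(n: int, now: Optional[int] = None) -> int:
    """Turn a --after/--before style integer into an absolute time_t.

    Negative values are offsets from the current time; other values are
    taken as absolute timestamps.
    """
    if now is None:
        now = int(time.time())
    return now + n if n < 0 else n


# -86400 means "one day before now".
print(resolve_time_arg(-86400))
```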
simple target options (implies -o and no ham/spam classification)
--dir | subsequent targets are directories
--file | subsequent targets are files in RFC 822 format
--mbox | subsequent targets are mbox files
--mbx | subsequent targets are mbx files
leftover options that should be removed at some point
--bayes | report score from Bayesian classifier
Non-option arguments are used as target names (mail files and folders). The target format is: <class>:<format>:<location>
class | is "spam" or "ham"
format | is "dir", "file", "mbx", or "mbox"
location | is a file or directory name. Globbing of ~ and * is supported.
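A target string in this format can be taken apart as in the following sketch. The names Target and parse_target and the example path are hypothetical; globbing of ~ and * is left out.

```python
from typing import NamedTuple

CLASSES = {"spam", "ham"}
FORMATS = {"dir", "file", "mbx", "mbox"}


class Target(NamedTuple):
    cls: str
    fmt: str
    location: str


def parse_target(spec: str) -> Target:
    """Split '<class>:<format>:<location>' and validate class and format."""
    parts = spec.split(":", 2)  # the location itself may contain colons
    if len(parts) != 3:
        raise ValueError(f"expected <class>:<format>:<location>, got {spec!r}")
    cls, fmt, location = parts
    if cls not in CLASSES:
        raise ValueError(f"unknown class {cls!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}")
    return Target(cls, fmt, location)


# The path is a made-up example.
print(parse_target("spam:dir:/home/user/Mail/spam"))
```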
mass-check calls:
no other scripts in the masses folder
mk-baseline-results
Shell script that tests a scoreset and the files "ham-test.log" and "spam-test.log" for false-positives and false-negatives with various thresholds ranging from -4 up to 20. Returns a statistic for all thresholds.
usage:
mk-baseline-results scoreset
scoreset | desired scoreset (0-3)
mk-baseline-results calls:
logs-to-c
parse-rules-for-masses
Parses the rules in all *.cf files that begin with a digit and are located in the "../rules" folder. It generates a file called "/tmp/rules.pl" that contains a dump of two hashes called %rules and %scores, which can be included directly by other Perl scripts using the require command.
The %rules hash holds a set of data for every rule: the score of the rule, a description, the type, whether the rule is mutable, and whether it is a subrule. The %scores hash holds one score for every rule.
usage:
parse-rules-for-masses [-d rulesdir] [-o outputfile] [-s scoreset]
-d | directory of the rules, default: ../rules
-o | output file, default: ./tmp/rules.pl
-s | scoreset (0-3), default: 0
parse-rules-for-masses calls:
no other scripts
perceptron
Calculates new scores with the perceptron approach and generates a perceptron.scores file. It needs the following files in the /tmp folder: "ranges.data", "scores.data", "scores.h", "tests.data", "tests.h", "rules.pl"
usage:
perceptron [options]
-p ham_preference | adds extra ham to training set multiplied by number of tests hit (2.0 default)
-e num_epochs | number of epochs to train (15 default)
-l learning_rate | learning rate for gradient descent (2.0 default)
-t threshold | minimum threshold for spam (5.0 default)
-w weight_decay | per-epoch decay of learned weight and bias (1.0 default)
-h | print help
perceptron calls:
no other scripts
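To give a feeling for what the epochs, learning rate, threshold and weight decay parameters control, here is a toy sketch. It uses the classic mistake-driven perceptron update with a fixed threshold, which is a textbook stand-in and not the training loop of the perceptron script itself. The function name, the data layout, and the reading of weight decay as a per-epoch multiplier are assumptions; ham preference and score ranges are left out.

```python
import random


def train_perceptron(examples, n_rules, epochs=15, learning_rate=2.0,
                     threshold=5.0, weight_decay=1.0, seed=0):
    """Learn one score per rule with the classic perceptron update.

    examples: list of (hit_rule_indices, is_spam) pairs.
    A message counts as spam when the sum of its rule scores exceeds
    the threshold.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_rules
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)
        for hits, is_spam in order:
            predicted_spam = sum(weights[i] for i in hits) > threshold
            if predicted_spam == is_spam:
                continue  # correctly classified, no update
            # Push the scores of the hitting rules toward the right side.
            step = learning_rate if is_spam else -learning_rate
            for i in hits:
                weights[i] += step
        # Assumption: decay is a per-epoch multiplier, so 1.0 means no decay.
        weights = [w * weight_decay for w in weights]
    return weights


# Three hypothetical rules; each example lists the rules that hit.
examples = [([0, 1], True), ([2], False), ([1, 2], True)]
print(train_perceptron(examples, n_rules=3))
```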
rewrite-cf-with-new-scores
Rewrites a cf file with new scores. Only the area with the generated scores is changed. The argument scoreset is the number of the scoreset (0-3) that is rewritten. The new cf-file is returned on the standard output.
usage:
rewrite-cf-with-new-scores [scoreset] [oldscores.cf] [newscores.cf]
rewrite-cf-with-new-scores calls:
no other scripts
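The idea of rewriting a single scoreset column can be sketched as follows. This is an illustration only: the function name and rule names are hypothetical, the sketch assumes four-column score lines, and it does not locate the generated-scores area of the file but simply touches the lines of rules for which a new score is given.

```python
from typing import Dict


def rewrite_scores(cf_text: str, new_scores: Dict[str, float],
                   scoreset: int) -> str:
    """Replace column `scoreset` (0-3) on 'score NAME s0 s1 s2 s3' lines."""
    out = []
    for line in cf_text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] == "score" and fields[1] in new_scores:
            fields[2 + scoreset] = f"{new_scores[fields[1]]:.3f}"
            line = " ".join(fields)
        out.append(line)  # all other lines pass through unchanged
    return "\n".join(out) + "\n"


old = "# comment\nscore RULE_A 1.0 1.0 1.0 1.0\nscore RULE_B 2.0 2.0 2.0 2.0\n"
print(rewrite_scores(old, {"RULE_A": 3.5}, scoreset=2), end="")
```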
runGA
Shell script that compiles and runs the perceptron script. New scores are calculated with the perceptron approach using a random 9/10 of the examples in the "*.log" files. The scores are then tested for false-positives and false-negatives with the remaining 1/10 of the examples.
Needs a "config" file in the "./" folder that contains some parameters:
SCORESET=value | number of the scoreset (0-3)
HAM_PREFERENCE=value | ham preference for the perceptron
THRESHOLD=value | minimum threshold for spam
EPOCHS=value | number of epochs to train the perceptron
The "*.log" files corresponding to the chosen scoreset X (named "ham-setX.log" and "spam-setX.log") are required in the "/ORIG" folder. The script generates several files in the "/tmp" folder by calling logs-to-c, and a new folder ("gen*") named after the options in the config file. This folder contains a "scores" file with the generated scores and corresponding ranges, the "*.log" files that were used for the score generation and for the testing (in "/NSBASE" and "/SPBASE" folders), lists of false-negatives and false-positives that were found in the test, a logfile that contains the parameters used for the score generation, the output of the makefile ("make.output") and a false-positives vs. false-negatives statistic ("test").
The runGA script also generates a "badrules" file by calling lint-rules-from-freqs. It contains rules that are not useful for various reasons (most of them hit too rarely or not at all).
Note that the generated scores may vary somewhat between two runs of runGA, due to the random selection of the training examples.
usage:
runGA (parameters are saved in a "config" file)
runGA calls:
fp-fn-statistics
lint-rules-from-freqs
logs-to-c
mk-baseline-results
numcpus
parse-rules-for-masses
perceptron
rewrite-cf-with-new-scores
score-ranges-from-freqs
tenpass/split-log-into-buckets-random
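For illustration, a "config" file with the four parameters named in this section could be read as below. The values are made up, the function name parse_config is hypothetical, and the sketch assumes plain KEY=value lines with optional # comments.

```python
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Read KEY=value lines; blank lines and # comments are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Example values only; they are not recommendations.
example = """\
SCORESET=3
HAM_PREFERENCE=2.0
THRESHOLD=5.0
EPOCHS=15
"""
print(parse_config(example))
```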
score-ranges-from-freqs
Calculates a score range for the rules. The magnitude of the range depends on the ranking of a rule (generated by hit-frequencies). Immutable rules get fixed ranges at their scores. The ranges are later used by the perceptron script, which tries to find the optimal scores within these ranges.
usage:
score-ranges-from-freqs [cffiledir] [scoreset] < freqs
score-ranges-from-freqs calls:
parse-rules-for-masses
split-log-into-buckets-random
Splits a mass-check log into n identically-sized buckets, evenly taking messages from all checked corpora and preserving comments. Creates n files named "split-n.log".
usage:
split-log-into-buckets-random [n] < LOGFILE
n | number of buckets, default: 10
split-log-into-buckets-random calls:
no other scripts
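A random split into equally sized buckets can be sketched as follows. This is a simplified illustration with a hypothetical function name: it assumes comment lines start with "#" and copies them into every bucket, and it does not balance the buckets across corpora as the script is described to do.

```python
import random
from typing import List, Optional


def split_into_buckets(lines: List[str], n: int = 10,
                       seed: Optional[int] = None) -> List[List[str]]:
    """Shuffle log entries and deal them round-robin into n buckets."""
    rng = random.Random(seed)
    # Assumption: comment lines start with '#'.
    comments = [line for line in lines if line.startswith("#")]
    entries = [line for line in lines if not line.startswith("#")]
    rng.shuffle(entries)
    buckets = [list(comments) for _ in range(n)]
    for i, entry in enumerate(entries):
        buckets[i % n].append(entry)  # bucket sizes differ by at most one
    return buckets


log = ["# example header"] + [f"entry {i}" for i in range(25)]
for number, bucket in enumerate(split_into_buckets(log, n=10, seed=1), start=1):
    print(number, len(bucket))
```

Taking nine of the ten buckets for training and the remaining one for testing gives the 9/10 versus 1/10 division used for score generation.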