Here's some advanced information on what happens behind the scenes with body/rawbody rules.

Body rules: body RULENAME /foo/

Let's assume we have this message content, after any normal decoding of base64/quoted-printable and HTML to text rendering:

Subject: This is subject clause.\n
\n
First clause of body.     Second clause in first paragraph.\n
Third clause in first paragraph.\n
\n
First clause of second paragraph. Etc.\n

(Line breaks / newlines are shown as \n for extra clarity)

Body rules are processed in paragraphs (blocks of text separated by at least two newlines), each normalized into a single line.

  • All paragraphs are whitespace-normalized: any run of one or more whitespace characters (newline, space, tab, etc.) is turned into a single space.
  • The newline that ends a paragraph is preserved if it exists (a message is not required to end in a newline).
  • A space at the beginning or end of a normalized paragraph is not removed, so a purely anchored pattern such as /^foo/ might not match " foo".
    • Common practice is to use word-boundary matching (/\bfoo/) unless there is a specific reason to anchor to the start of a paragraph.
  • If the resulting normalized line is longer than 2048 bytes, it is split into smaller individual lines, at the nearest space boundary if possible.

These three strings (paragraphs turned into lines) will be individually tested against the body pattern /foo/, until a match is found.

1)

This is subject clause.\n

2)

First clause of body. Second clause in first paragraph. Third clause in first paragraph.\n

3)

First clause of second paragraph. Etc.\n
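The normalization described above can be approximated with a short Python sketch. This is not SpamAssassin's actual code (which is written in Perl); the function name body_lines is hypothetical, the 2048-byte splitting is left out, and a trailing newline is always appended for simplicity.

```python
import re

def body_lines(subject, body):
    """Approximate the strings a body rule is tested against (simplified sketch)."""
    # The Subject is always the first string and always ends in a newline.
    lines = [subject + "\n"]
    # Paragraphs are blocks of text separated by two or more newlines.
    for para in re.split(r"\n{2,}", body):
        if not para.strip():
            continue  # skip empty blocks
        # Collapse every run of whitespace into one space.
        # Assumption: the paragraph-ending newline is always re-added here.
        lines.append(re.sub(r"\s+", " ", para.rstrip("\n")) + "\n")
    return lines

body = (
    "First clause of body.     Second clause in first paragraph.\n"
    "Third clause in first paragraph.\n"
    "\n"
    "First clause of second paragraph. Etc.\n"
)
for line in body_lines("This is subject clause.", body):
    print(repr(line))
```

Run against the example message, this prints the three strings listed above.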

Since each string is matched separately, a pattern like /first paragraph.*Etc/s won't work, because the two parts are in different paragraphs. You would need to split it into separate subrules and combine them with a meta rule:

  • body __FOO1 /first paragraph/
  • body __FOO2 /Etc/
  • meta FOO (__FOO1 && __FOO2)
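To illustrate why the combined pattern fails while the meta rule works, here is a small Python simulation using the three example strings. The helper body_hit is hypothetical and only mimics "test each string until one matches"; it is not how SpamAssassin is implemented.

```python
import re

# The three normalized strings from the example message.
lines = [
    "This is subject clause.\n",
    "First clause of body. Second clause in first paragraph. Third clause in first paragraph.\n",
    "First clause of second paragraph. Etc.\n",
]

def body_hit(pattern, lines, flags=0):
    """True if the pattern matches any single string on its own."""
    return any(re.search(pattern, line, flags) for line in lines)

# One pattern spanning two paragraphs: no single string contains both parts.
combined = body_hit(r"first paragraph.*Etc", lines, re.S)

# Separate subrules, combined the way a meta rule would combine them.
foo1 = body_hit(r"first paragraph", lines)   # like body __FOO1
foo2 = body_hit(r"Etc", lines)               # like body __FOO2
foo = foo1 and foo2                          # like meta FOO

print(combined, foo)  # False True
```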

Note that the Subject header is considered part of the body to simplify rule writing; it is always the first string to be tested.

  • The Subject string always ends with a newline (\n).
  • Even if the Subject header is missing or empty, the first string will be a single newline (\n).
  • Starting from SpamAssassin 3.4.3, it's possible to use tflags nosubject to skip matching the Subject completely.

When using tflags multiple, the process is exactly the same: all lines are searched for all matches (until maxhits=x is reached, if defined). For the example message above:

  • Pattern /clause/ would result in 5 rule hits.
  • Pattern /^./ would result in 3 rule hits.
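The hit counts above can be checked with a rough Python sketch. The function count_hits and its maxhits parameter are illustrative stand-ins for tflags multiple and maxhits=x, not SpamAssassin's real implementation.

```python
import re

# The three normalized strings from the example message.
lines = [
    "This is subject clause.\n",
    "First clause of body. Second clause in first paragraph. Third clause in first paragraph.\n",
    "First clause of second paragraph. Etc.\n",
]

def count_hits(pattern, lines, maxhits=None):
    """Count every match in every string, stopping once maxhits is reached."""
    hits = 0
    for line in lines:
        for _ in re.finditer(pattern, line):
            hits += 1
            if maxhits is not None and hits >= maxhits:
                return hits
    return hits

print(count_hits(r"clause", lines))             # 5
print(count_hits(r"^.", lines))                 # 3 (one per string)
print(count_hits(r"clause", lines, maxhits=2))  # 2
```

The anchored pattern /^./ can match only once per string, which is why it yields exactly one hit for each of the three strings.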

Rawbody rules: rawbody RULENAME /foo/

Rawbody rules are processed similarly to body rules, with these main differences:

  • Instead of paragraphs/lines, text is split into 1-4kB chunks (depending on SA version and the whitespace boundaries found).
  • All textual MIME parts of the message are used, and HTML is not rendered.
  • No normalization of text is done, aside from the normal base64/quoted-printable decoding.

It's important to remember that each chunk of text is again tested individually against the pattern.

  • When using anchoring (/^foo/), it will only match at the start of a chunk.
    • I.e. it's not possible to match the beginning of a MIME part with 100% accuracy if the part is larger than one chunk (1-4kB), since later chunks of the same part also have a start that the anchor can match.
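The anchoring caveat can be shown with a toy Python chunker. The function split_chunks and the tiny 15-character limit are assumptions made for illustration only; SpamAssassin's real chunking logic and sizes differ between versions.

```python
import re

def split_chunks(text, limit):
    """Hypothetical chunker: cut at the last space before the limit if possible."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit)
        # Fall back to a hard cut when no usable space boundary exists.
        cut = limit if cut <= 0 else cut + 1
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks

# "foo" is in the middle of this part, not at its beginning.
part = "word word word foo bar"
chunks = split_chunks(part, limit=15)
anchored = [bool(re.search(r"^foo", chunk)) for chunk in chunks]

print(chunks)    # ['word word word ', 'foo bar']
print(anchored)  # [False, True]
```

Here /^foo/ matches the second chunk even though "foo" is not at the start of the part, simply because a chunk boundary happened to fall just before it.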
