Rspamd regexp module
This is a core module that deals with regular expressions, internal functions and Lua code to filter messages.
Principles of work
The regexp module operates on expressions composed of logical sequences of atoms. Atoms can be regular expressions, Rspamd functions, or Lua functions. Rspamd supports the following operators in expressions:
- && - logical AND (and or & may also be used)
- || - logical OR (or, |)
- ! - logical NOT (not)
- + - logical PLUS, usually used with the following comparisons: > (more than), < (less than), >= (more or equal), <= (less or equal)
The PLUS operator in Rspamd connects multiple atoms or sub-expressions and compares them to a specific number:
- A + B + C + D > 2 - evaluates to true if at least 3 operands are true
- (A & B) + C + D + E >= 2 - evaluates to true if at least 2 operands are true
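The counting behaviour of PLUS can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not Rspamd's implementation; the helper name `plus` and the sample truth values are assumptions made for the example.

```python
def plus(*operands):
    """Count how many operands are true (illustrative model of PLUS)."""
    return sum(1 for value in operands if value)

# Sample truth values for the atoms (assumed for this example).
A, B, C, D, E = True, True, False, True, False

# A + B + C + D > 2: three operands are true, so the comparison holds.
first = plus(A, B, C, D) > 2

# (A & B) + C + D + E >= 2: the sub-expression (A & B) counts as one operand.
second = plus(A and B, C, D, E) >= 2

# Only two true operands: "> 2" fails because it needs at least three.
third = plus(True, False, False, True) > 2

print(first, second, third)
```

Each sub-expression in parentheses contributes at most 1 to the sum, which is why `(A & B)` counts once rather than twice.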
Operator priority:
- NOT
- PLUS
- COMPARE
- AND
- OR
Use parentheses to change priority. In Rspamd, all operations are right associative. During expression evaluation, Rspamd optimizes execution time by reordering operations and skipping branches that cannot change the result.
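Skipping unnecessary branches can be illustrated with ordinary short-circuit evaluation. The Python sketch below is not Rspamd's optimizer; the `atom` helper and the atom names are hypothetical and only record which atoms were actually evaluated.

```python
calls = []  # names of the atoms that were actually evaluated

def atom(name, value):
    """Build a lazy atom (hypothetical helper) that logs its evaluation."""
    def run():
        calls.append(name)
        return value
    return run

cheap_false = atom("cheap_false", False)
cheap_true = atom("cheap_true", True)
expensive = atom("expensive", True)

# AND: once one operand is false, the other operand cannot change the result.
and_result = cheap_false() and expensive()
and_calls = list(calls)

calls.clear()

# OR: once one operand is true, the other operand cannot change the result.
or_result = cheap_true() or expensive()
or_calls = list(calls)

print(and_result, and_calls)
print(or_result, or_calls)
```

In both cases the `expensive` atom is never run, which is the saving that placing cheap atoms first is meant to achieve.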
Expression components
Rspamd supports the following components within expressions:
- Regular expressions
- Internal functions
- Lua global functions (not widely used)
Regular expressions
In Rspamd, regular expressions can be used to examine different parts of the message:
- Headers (should be Header-Name=/regexp/iumxs{header}), MIME part headers
- Full headers string
- Textual MIME parts
- Raw messages
- URLs
- Strings returned by a selector (re_selector_name=/regexp/iumxs{selector})
The match type is defined by a flag that appears after the last / symbol. This can be a single letter or a long type enclosed in curly braces, which has been available since Rspamd 1.3:
| Type | Long type | Tested content |
|---|---|---|
| H | {header} | Header value; if the header contains encoded words they are decoded and converted to UTF-8. All invalid UTF-8 bytes are replaced by a ? |
| X | {raw_header} | Raw header value (encoded words are not decoded, but folding is removed) |
| B | {mime_header} | MIME header value extracted for headers in MIME parts that are not message/rfc822 and that are enclosed in multipart containers only |
| R | {all_headers} | Full headers content (applied for all headers in their original form and for the message only - not including MIME headers) |
| M | {body} | Full message (with all headers) as it was sent to Rspamd |
| P | {mime} | Text MIME part content; base64/quoted-printable is decoded, HTML tags are stripped; if charset is not UTF-8 Rspamd tries to convert it to UTF-8, but if conversion fails the original text is examined |
| Q | {raw_mime} | Text MIME part raw content (unmodified by Rspamd) |
| C | {sa_body} | SpamAssassin body analogue (see body pattern test description in SpamAssassin documentation); if charset is not UTF-8, Rspamd tries to convert text to UTF-8 |
| D | {sa_raw_body} | SpamAssassin rawbody analogue (raw data inside text parts, base64/quoted-printable is decoded, but HTML tags and line breaks are preserved) |
| U | {url} | URLs (before 2.4 also email addresses extracted from the message body, in the same form as returned by url:tostring()) |
| $ | {selector} | Strings returned by a selector (from 1.8) |
| | {email} | Emails extracted from the message body (from 2.4) |
| | {words} | Unicode normalized (to NFKC) and lower-cased words extracted from the text (excluding URLs), subject and From displayed name |
| | {raw_words} | The same words, but without normalization (converted to utf8 however) |
| | {stem_words} | Unicode normalized, lower-cased and stemmed words extracted from the text (excluding URLs), Subject and From display name |
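To make the syntax concrete, the following Python sketch splits a regexp atom into its optional name, pattern, flags and match type. It is a hypothetical helper written for illustration, not Rspamd's parser: the `RULE` pattern and the `split_rule` function are assumptions, and the short-to-long mapping is copied from the table above. Letters that are not type letters are simply treated as modifiers.

```python
import re

# Short type letters and their long names, copied from the table above.
SHORT_TO_LONG = {
    "H": "header", "X": "raw_header", "B": "mime_header", "R": "all_headers",
    "M": "body", "P": "mime", "Q": "raw_mime", "C": "sa_body",
    "D": "sa_raw_body", "U": "url", "$": "selector",
}

# Hypothetical pattern for "Name=/regexp/flags{long_type}". The greedy ".*"
# makes the last "/" the delimiter, so slashes inside the regexp are kept.
RULE = re.compile(
    r"^(?:(?P<name>[^=/]+)=)?/(?P<pattern>.*)/(?P<flags>[A-Za-z$]*)"
    r"(?:\{(?P<long>\w+)\})?$"
)

def split_rule(rule):
    """Split a regexp atom into name, pattern, modifiers and match type."""
    m = RULE.match(rule)
    if m is None:
        raise ValueError(f"not a regexp atom: {rule!r}")
    flags = m.group("flags")
    short_types = [f for f in flags if f in SHORT_TO_LONG]
    # A long type in curly braces wins; otherwise use the short type letter.
    match_type = m.group("long")
    if match_type is None and short_types:
        match_type = SHORT_TO_LONG[short_types[-1]]
    return {
        "name": m.group("name"),
        "pattern": m.group("pattern"),
        "modifiers": "".join(f for f in flags if f not in SHORT_TO_LONG),
        "type": match_type,
    }

print(split_rule("Subject=/^hello/i{header}"))
print(split_rule(r"/a\/b/iP"))
```

The sketch relies on case to tell type letters from modifiers, for example `X` (raw header) versus `x` (extended regular expression).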
Each regexp also supports the following modifiers:
- i - ignore case
- u - use UTF-8 regexp
- m - multi-line regular expression - this flag causes the string to be treated as multiple lines. This means that the ^ and $ symbols match the start and end of each line within the string, rather than only the start and end of the whole string.
- x - extended regular expression - this flag instructs the regular expression parser to ignore most white-space that is not escaped (\) or within a bracketed character class. This makes it possible to break up the regular expression into more readable parts. Additionally, the # character is treated as a meta-character that introduces a comment which runs up to the pattern’s closing delimiter or to the end of the current line if the pattern extends onto the next line.
- s - dot-all regular expression - this flag causes the string to be treated as a single line. This means that the . symbol matches any character whatsoever, including a newline, which it would not normally match. When the m and s flags are used together as /ms, . matches any character while ^ and $ still match just after and just before newlines within the string.
- O - do not optimize regexp (Rspamd optimizes regexps by default)
- r - use non-UTF-8 regular expressions (raw bytes). Defaults to true if raw_mode is set to true in the options section.
- A - return and process all matches (useful for Lua prefilters)
- L - match left part of regexp (useful for Lua prefilters in conjunction with Hyperscan)
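The effect of the i, m, s and x modifiers can be tried out with Python's `re` module, which has analogous flags (`re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`, `re.VERBOSE`). This is only an approximation for experimenting: Python's engine is not the one Rspamd uses, so details may differ, and the sample text is made up for the example.

```python
import re

text = "first line\nsecond line"

# i: ignore case.
ignore_case = re.findall(r"FIRST", text, re.IGNORECASE)

# m: ^ matches at the start of every line, not only at the start of the string.
without_m = re.findall(r"^\w+", text)
with_m = re.findall(r"^\w+", text, re.MULTILINE)

# s: . also matches a newline, so the pattern can span both lines.
without_s = re.search(r"first.*second", text)
with_s = re.search(r"first.*second", text, re.DOTALL)

# m and s together: . crosses the newline while $ and ^ match around it.
with_ms = re.search(r"line$.^second", text, re.MULTILINE | re.DOTALL)

# x: unescaped white-space and # comments inside the pattern are ignored.
verbose = re.compile(r"""
    (\w+)   # a word, captured
    \s+     # separating white-space
    line    # the literal text "line"
""", re.VERBOSE)
with_x = verbose.findall(text)

print(ignore_case, without_m, with_m)
print(without_s is None, with_s is not None, with_ms is not None, with_x)
```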