Rspamd architecture

Introduction

Rspamd is a universal spam filtering system based on an event-driven processing model, which means that Rspamd is designed not to block anywhere in its code. To process messages, Rspamd uses a set of rules. Each rule is a symbolic name associated with a message property. For example, we can define the following rules:

  • SPF_ALLOW - means that a message is validated by SPF;
  • BAYES_SPAM - means that a message is statistically considered as spam;
  • FORGED_OUTLOOK_MID - message ID seems to be forged for the Outlook MUA.

Rules are defined by modules. For instance, if there is a module that performs SPF checks, it may define several rules based on SPF policy:

  • SPF_ALLOW - a sender is allowed to send messages for this domain;
  • SPF_DENY - a sender is denied by SPF policy;
  • SPF_SOFTFAIL - a sender is probably not authorized by SPF policy (soft fail).

Rspamd supports two main types of modules: internal modules written in C and external modules written in Lua. There is no real difference between the two types, except that C modules are embedded and can be enabled through the filters attribute in the options section of the config:

options {
 filters = "chartable,dkim,surbl,regexp,fuzzy_check";
 ...
}

Protocol

Rspamd uses the HTTP protocol for all operations. This protocol is described in the protocol section.

Metrics

In Rspamd, rules determine the logic of checks, but it is necessary to assign weights to each rule. In Rspamd, weight represents the ‘significance’ of a rule. Rules with a higher absolute weight value are considered more important. Rule weights are specified within ‘metrics.’ Each metric is a collection of grouped rules, each with its specific weight. For instance, you can define the following weights for SPF rules:

  • SPF_ALLOW: -1
  • SPF_DENY: 2
  • SPF_SOFTFAIL: 0.5

A positive weight means that the rule increases a message’s ‘spamminess’, while a negative weight means the opposite.
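
As a minimal sketch of how weights combine, the following Python snippet sums the weights of the rules that fired for a message. The WEIGHTS table reuses the SPF values listed above; the function name message_score and the dictionary layout are illustrative assumptions, not Rspamd's actual data structures.

```python
# Illustrative only: weights taken from the SPF example above.
WEIGHTS = {
    "SPF_ALLOW": -1.0,
    "SPF_DENY": 2.0,
    "SPF_SOFTFAIL": 0.5,
}


def message_score(triggered, weights=WEIGHTS):
    """Sum the weights of all rules that fired for a message.

    Rules missing from the table contribute nothing (an assumption of this sketch).
    """
    return sum(weights.get(name, 0.0) for name in triggered)


if __name__ == "__main__":
    print(message_score(["SPF_DENY"]))   # 2.0
    print(message_score(["SPF_ALLOW"]))  # -1.0
```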

Rules scheduler

To prevent unnecessary checks, Rspamd employs a rule scheduler for each message. If a message is definitively classified as spam, further checks are skipped. This scheduler follows a straightforward logic:

  • select negative rules before positive ones to prevent false positives;
  • prefer rules with the following characteristics:
    • frequent rules;
    • rules with more weight;
    • faster rules.

These optimizations enable quicker identification of definite spam compared to a generic queue.
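
The ordering described above can be sketched in Python as a sort followed by an early exit. Everything named here is hypothetical: the Rule fields, the lexicographic way the three preferences are combined, and the reject_score parameter are assumptions made for illustration, since the text does not say how Rspamd weighs these criteria against each other.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # Hypothetical fields for this sketch; not Rspamd's internal structure.
    name: str
    weight: float     # configured weight; negative lowers the score
    frequency: float  # how often the rule fires, e.g. hits per message
    avg_time: float   # average execution time in seconds


def schedule(rules):
    """Order rules: negative first, then frequent, heavy, and fast ones."""
    return sorted(
        rules,
        key=lambda r: (
            r.weight >= 0,   # False sorts first, so negative rules lead
            -r.frequency,    # more frequent rules earlier
            -abs(r.weight),  # larger absolute weight earlier
            r.avg_time,      # faster rules earlier
        ),
    )


def run(rules, fires, reject_score):
    """Check rules in scheduled order; stop once the score reaches reject_score."""
    score = 0.0
    checked = []
    for rule in schedule(rules):
        if score >= reject_score:
            break  # definite spam: skip the remaining checks
        checked.append(rule.name)
        if fires(rule):
            score += rule.weight
    return score, checked
```

Because negative rules run first, a message cannot cross the reject threshold before every rule that could lower its score has been checked.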

Since Rspamd-0.9 there are further optimizations for rules and expressions that are described generally in the following presentation.

Actions

Another crucial aspect of metrics is their set of actions. This set establishes the recommended actions for a message based on the cumulative score generated by all the rules that have been triggered. Rspamd defines the following actions:

  • No action: a message is likely to be ham;
  • Greylist: greylist a message if it is not certainly ham;
  • Add header: a message is likely spam, so add a specific header;
  • Rewrite subject: a message is likely spam, so rewrite its subject;
  • Reject: a message is very likely spam, so reject it completely.

These actions serve as recommendations for the Mail Transfer Agent (MTA) and are not intended to be followed blindly. When the score equals or exceeds the greylist threshold, explicit greylisting is suggested. The Add header and Rewrite subject actions carry similar meanings: both imply that a message is likely spam. Reject, on the other hand, is a stringent action, usually indicating that the MTA should reject the message outright. The score thresholds for these actions should follow the order of their severity. If two actions share the same threshold, the resulting action is undefined.
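
To illustrate how a cumulative score maps to a recommendation, here is a small Python sketch. The threshold values and the relative order of Add header and Rewrite subject are hypothetical and chosen only for the example; real thresholds come from the configuration.

```python
# Hypothetical thresholds for illustration; real values come from the config.
THRESHOLDS = [
    ("reject", 15.0),
    ("rewrite subject", 8.0),
    ("add header", 6.0),
    ("greylist", 4.0),
]


def recommend_action(score, thresholds=THRESHOLDS):
    """Return the action with the highest threshold that the score reaches."""
    for action, limit in sorted(thresholds, key=lambda t: t[1], reverse=True):
        if score >= limit:
            return action
    return "no action"
```

With these values, a score of 4.0 already recommends greylisting, because the comparison is "equals or exceeds", while 3.9 yields no action. The sketch also shows why equal thresholds are a problem: the winner would depend only on sort order.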

Rules weight

The weight assigned to a rule is not necessarily fixed. For statistical rules, for instance, there’s no absolute certainty about whether a message is spam; there’s only a measure of probability. To accommodate such probabilistic rules, Rspamd introduces the concept of dynamic weights: a rule can contribute a weight ranging from 0 to the value predefined in the metric. So, if we define the symbol BAYES_SPAM with a weight of 5.0, this rule can insert the symbol with a weight anywhere between 0 and 5.0. To distribute these values, Rspamd employs a variation of the Sigma function, creating a fair distribution curve. Note that most Rspamd rules, apart from statistical and fuzzy rules, use static weights.
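
The exact curve is not given here, so the following Python sketch uses a plain logistic (sigmoid) function as a stand-in to show the idea of a dynamic weight. The function name dynamic_weight, the steepness parameter, and the centring on a probability of 0.5 are all assumptions of this example.

```python
import math


def dynamic_weight(probability, max_weight, steepness=10.0):
    """Map a spam probability in [0, 1] to a weight in [0, max_weight].

    Uses a logistic curve centred on 0.5; the steepness value and the
    centring are assumptions of this sketch, not Rspamd's actual function.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    return max_weight / (1.0 + math.exp(-steepness * (probability - 0.5)))


if __name__ == "__main__":
    # An undecided classifier contributes half of the configured weight.
    print(dynamic_weight(0.5, 5.0))  # 2.5
```

The contributed weight never leaves the 0 to max_weight range, and it grows monotonically with the probability, which is the property the BAYES_SPAM example relies on.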

Statistics

Rspamd employs statistical algorithms to precisely compute the final score of a message. Presently, the sole algorithm defined is OSB-Bayes. You can find comprehensive details regarding this algorithm in the following paper. Rspamd uses a window size of 5 words for classification. During classification, Rspamd splits a message into a collection of tokens. Tokens are separated by punctuation or whitespace characters, and short tokens (fewer than 3 characters) are disregarded. For each token, Rspamd computes two non-cryptographic hashes, which are subsequently used as indices. All tokens are stored in statistics backends, which can be implemented as mmapped files, SQLite3 databases, or Redis servers. Currently, the recommended backend for statistics is Redis.
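
The tokenization steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than Rspamd's implementation: the pairing of each word with the earlier words in its window reflects the general idea of orthogonal sparse bigrams, hashing those pairs rather than single words is an assumption, and CRC32 and Adler-32 merely stand in for whichever non-cryptographic hashes Rspamd actually uses.

```python
import re
import zlib

WINDOW = 5   # window size stated in the text
MIN_LEN = 3  # tokens shorter than this are dropped


def tokenize(text):
    """Split on whitespace and punctuation, dropping short tokens."""
    return [t for t in re.split(r"[\W_]+", text) if len(t) >= MIN_LEN]


def osb_features(tokens, window=WINDOW):
    """Yield (earlier, later, gap) pairs inside a sliding window."""
    for i, later in enumerate(tokens):
        for gap in range(1, window):
            if i - gap < 0:
                break  # no more earlier tokens to pair with
            yield tokens[i - gap], later, gap


def hash_pair(feature):
    """Return two non-cryptographic hashes of one feature (stand-in hashes)."""
    earlier, later, gap = feature
    data = f"{earlier}|{gap}|{later}".encode("utf-8")
    return zlib.crc32(data), zlib.adler32(data)


if __name__ == "__main__":
    for feature in osb_features(tokenize("Buy cheap pills, no rx!")):
        print(feature, hash_pair(feature))
```

In a real backend the two hash values would serve as the lookup indices for per-class counters; that storage layer is omitted here.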

Running rspamd

Rspamd provides several command-line options that can be supplied when running the program. You can view a list of all these options by using the --help argument.

No option is required: by default, rspamd attempts to read the etc/rspamd.conf configuration file and runs as a daemon. There is also a test mode, activated with the -t argument, in which rspamd reads the configuration file and checks its syntax. If the configuration file is valid, the exit code is zero. Test mode is valuable for verifying a new configuration file without restarting rspamd.
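
A deployment script might wrap test mode like this. The sketch below only relies on the -t argument and the zero exit code described above; the helper name config_is_valid is hypothetical, and it assumes the rspamd binary is on the PATH (otherwise subprocess.run raises FileNotFoundError).

```python
import subprocess


def config_is_valid(command=("rspamd", "-t")):
    """Run the given command and report whether it exited with status zero."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    if config_is_valid():
        print("configuration is valid")
    else:
        print("configuration check failed; keep the running instance as is")
```

The command is a parameter so that the same helper can be exercised without rspamd installed, or pointed at a differently named binary.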

Managing rspamd using signals

It’s crucial to remember that all user signals should be sent to the rspamd main process, not to its child processes, because these signals can carry different meanings for child processes. You can identify the main process in either of two ways:

  • by reading the pidfile:

      $ cat pidfile
    
  • by getting process info:

      $ ps auxwww | grep rspamd
      nobody 28378  0.0  0.2 49744  9424   rspamd: main process
      nobody 64082  0.0  0.2 50784  9520   rspamd: worker process
      nobody 64083  0.0  0.3 51792 11036   rspamd: worker process
      nobody 64084  0.0  2.7 158288 114200 rspamd: controller process
      nobody 64085  0.0  1.8 116304 75228  rspamd: fuzzy storage
    
      $ ps auxwww | grep rspamd | grep main
      nobody 28378  0.0  0.2 49744  9424   rspamd: main process
    

Once you have obtained the PID of the main process, you can manage rspamd using signals, as outlined below:

  • SIGHUP - restart rspamd: reread the config file, start new workers (as well as the controller and other processes), stop accepting connections in old workers, and reopen all log files. Old workers are terminated after one minute, which should allow all pending requests to be processed. All new requests to rspamd are processed by the newly started workers.
  • SIGTERM - terminate rspamd.
  • SIGUSR1 - reopen log files (useful for log file rotation).

These signals may be used in rc-style scripts. Restarting rspamd is graceful: no connections are dropped, and if the new config is incorrect, the old config stays in use.
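
For scripted management, the signals listed above can be sent from Python on a POSIX system. This is a minimal sketch: the helper names and the ACTIONS table are hypothetical, and the pidfile location is left as a parameter because it depends on the installation.

```python
import os
import signal

# Signal meanings as listed above.
ACTIONS = {
    "restart": signal.SIGHUP,       # reread config, start new workers
    "terminate": signal.SIGTERM,    # stop rspamd
    "reopen_logs": signal.SIGUSR1,  # reopen log files after rotation
}


def read_main_pid(pidfile):
    """Read the main process PID from a pidfile."""
    with open(pidfile, encoding="ascii") as handle:
        return int(handle.read().strip())


def signal_main_process(pidfile, action):
    """Send the signal for `action` to the PID stored in `pidfile`."""
    pid = read_main_pid(pidfile)
    os.kill(pid, ACTIONS[action])
    return pid
```

Reading the PID from the pidfile, rather than matching process names, ensures the signal reaches the main process and not one of its children.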