Rspamd bayes engine benchmark

2016-10-14 00:00:00 +0000

I recently decided to compare the Bayes classifier in Rspamd with its closest analogues. I tested three engines:

Rspamd (version 1.4 git master)
Bogofilter - classical bayesian filter
Dspam - the most advanced bayesian filter used by many projects and people

For Dspam, I tested both the chain and osb tokenization modes. I also tried to test the chi-square probability combiner (since the same algorithm is used in Rspamd), but I could not get it to work.

Testing methodology

First of all, I collected a corpus of about 1k spam messages and 1k ham messages. All messages were carefully selected and manually checked. Then I wrote a small script that performs the following steps:

Split the corpus randomly into two equal parts, each with about 500 ham and 500 spam messages.
Learn the Bayes classifier using the desired spam filtering engine (-d for Dspam, -b for Bogofilter).
Use the remaining messages to test the classifier after the learning procedure.
Use a 95% confidence threshold for Rspamd and Dspam (i.e. when the probability of spam is less than 95%, the classifier is considered to be in an undefined state). Bogofilter, in turn, automatically provides three results: spam, ham and undefined.
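The steps above can be sketched in a few lines of Python. This is not the original benchmark script: the function names are hypothetical, and the symmetric 5% threshold for ham is an assumption made for illustration.

```python
import random

def split_corpus(messages, seed=0):
    """Shuffle messages and split them into two halves: (train, test)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def verdict(spam_probability, confidence=0.95):
    """Map a spam probability to 'spam', 'ham' or 'undefined'.

    Assumption: ham is reported when the spam probability is at most
    1 - confidence; everything in between is treated as undefined.
    """
    if spam_probability >= confidence:
        return "spam"
    if spam_probability <= 1.0 - confidence:
        return "ham"
    return "undefined"
```

The split would be applied to the spam and ham corpora separately, so that each half keeps roughly 500 messages of each class.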

This script collects 6 main values for each classifier:

Spam/Ham detection rate - number of messages that are correctly recognized as spam and ham
Spam FP rate - number of false positives for Spam: HAM messages that are recognized as SPAM
Ham FP rate - number of false positives for Ham: SPAM messages that are recognized as HAM
Ham and Spam FN rate - number of messages that are not recognized as Ham or Spam (but not classified as the opposite class, meaning uncertainty for a classifier)
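As a rough illustration of how these six values relate to each other, the following sketch counts them from pairs of (actual class, classifier verdict). The key names are hypothetical and are not taken from the original script.

```python
from collections import Counter

def score(results):
    """Count the six benchmark values.

    results: iterable of (actual, predicted) pairs, where actual is
    'spam' or 'ham' and predicted is 'spam', 'ham' or 'undefined'.
    """
    counts = Counter(results)
    return {
        "spam_detected": counts[("spam", "spam")],
        "ham_detected": counts[("ham", "ham")],
        "spam_fp": counts[("ham", "spam")],       # ham recognized as spam
        "ham_fp": counts[("spam", "ham")],        # spam recognized as ham
        "spam_fn": counts[("spam", "undefined")], # spam left undecided
        "ham_fn": counts[("ham", "undefined")],   # ham left undecided
    }
```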

The worst error for a classifier is a Spam false positive, since it marks an innocent message as spam. Ham FP and false negatives are more tolerable: they just mean that you receive more spam than you want.

Results

The raw results are pasted at the following gist.

Here are the corresponding graphs for detection rate and errors for the competitors.

Conclusions

Rspamd Bayes performs very well compared to the competitors. It provides a higher spam detection rate than both Dspam and Bogofilter. All competitors demonstrated a similar spam false positive rate. However, Dspam is more aggressive in marking messages as Ham (which is not bad because Bayes is the only check Dspam provides).

Rspamd is also much faster in learning and testing. With the Redis backend, it learns 1k messages in less than 5 seconds. Dspam and Bogofilter both require about 30 seconds to learn.

I have not included SpamAssassin in the comparison since it uses a naive Bayes classifier similar to Bogofilter. Hence, its quality is very close to that of Bogofilter.

Furthermore, unlike the competitors, Rspamd provides a lot of other checks and features. The goal of this particular benchmark was to compare only the Bayesian engines of different spam filters. To summarise, I can conclude that the quality of the Bayes classifier in Rspamd is high enough to recommend it for use in production environments or as a replacement for Dspam or Bogofilter in your email system.

Rspamd 1.3.5 has been released

2016-09-01 00:00:00 +0000

The next stable version of Rspamd is now available for download. This release contains a couple of bugfixes and minor improvements.

Termination handlers

Rspamd can now perform some actions on termination of worker processes. For example, it is useful for the neural network plugin to save training data on exit. It was also essential for RRD statistics to synchronize RRD on the controller’s termination to avoid negative message rates on graphs.

Minimum learns has been fixed

This option was improperly configured previously, so it didn’t work as desired. However, it is indeed useful to hold back statistical classification until the Bayes classifier has received enough training. With the 1.3.5 release, this option has been fixed.

Rspamd on OpenBSD

There were a couple of bug fixes that allowed Rspamd to run on OpenBSD again. These bugs were masked on other systems; however, they were potentially dangerous for those systems as well.

DMARC and DKIM improvements

Andrew Lewis has added various improvements for DKIM, DMARC and SPF plugins to handle cases when the corresponding policies are not listed by senders: e.g. when there is no SPF record or DKIM key for some domain.

Ratelimits improvements

It is now possible to disable ratelimits for specific users.

Mailbox messages and `rspamc`

Rspamd command line client rspamc can now work with messages in UNIX mailbox format which is sometimes used to store messages on the disk.

Spamhaus DROP Support

Rspamd now supports the Spamhaus DROP DNS block list that is used to block large botnets around the world.

DKIM verification improvements

Some bugs related to canonicalization of empty messages are fixed in the DKIM plugin.

Fix critical issue with line endings finding

There was a critical bug in Rspamd related to parsing of newline offsets in a message. In certain cases it could lead to serious malfunction in the URL detector and some other crucial parts of Rspamd.

Minor bugfixes

There are a couple of minor bugfixes in this release, for example, parsing of \0 symbol in lua_tcp module. HFILTER_URL_ONLY is fixed not to produce overly high scores. All invocations of table.maxn have been removed from Lua plugins as this function is deprecated in Lua.

Rspamd 1.3.4 has been released

2016-08-22 00:00:00 +0000

The new stable versions of Rspamd and Rmilter have been released: 1.3.4 and 1.9.2 respectively. There are a couple of improvements and some important bugfixes. Please note that in the unlikely case you have used regexp rules in Rmilter, you SHOULD NOT upgrade Rmilter and should file a bug report instead (however, I’m pretty sure that this feature is not used by anybody since it has never been documented). Here is a list of notable changes in Rmilter and Rspamd.

Rspamd reload command has been fixed

It is now possible to gracefully reload Rspamd configuration by sending the HUP signal or by using the reload subcommand of the init scripts. Graceful reload is useful when it’s required to update configuration without stopping email processing. During this process, Rspamd starts new worker processes with the new configuration whilst the existing ones process the pending messages.

Better ASN/country support

ASN/country detection module has been split from the ip_score module allowing use of this data in other modules, for example, in the multimap module to match maps based on country or ASN.

Variable maps in the multimap module

It’s now possible to create maps based on the results of other Lua or internal Rspamd modules. This is particularly useful to link different modules with multimap.

DNSSEC stub resolver support

It’s now possible to enable DNSSEC checks in Rspamd through use of a DNSSEC compatible recursive resolver (e.g. Unbound) and check for DNSSEC authentication results in Lua DNS module.

DMARC and DKIM module fixes

There are some important fixes for DMARC and DKIM modules in this version of Rspamd that are related to canonicalization in DKIM and subdomains policies in DMARC.

Redis backend configuration

The Redis backend in the statistical module can now use the global Redis settings, similar to other modules.

Tasks checksums

Each task and each MIME part now has its own checksum that could be used to detect the same message or the same attachment.
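To illustrate how such checksums can be used, here is a minimal sketch that groups identical attachments by digest. The hash algorithm (SHA-256) and the function names are assumptions made for this example; the release notes do not state which digest Rspamd uses internally.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a hex digest for a message or a MIME part (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def group_duplicates(parts):
    """Group part names by checksum; parts is an iterable of (name, bytes) pairs."""
    groups = {}
    for name, data in parts:
        groups.setdefault(checksum(data), []).append(name)
    # Keep only the checksums that were seen more than once
    return [names for names in groups.values() if len(names) > 1]
```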

DKIM signature header is now folded by Rspamd

Since the DKIM signature header might be quite long, Rspamd now folds it to fit the 80-character line width common for MIME messages.

Ratelimit module fixed

This release of Rspamd fixes a regression introduced in 1.3.3 which prevented the ratelimit module from working properly.

Fixed X-Forwarded-For header processing

Processing of X-Forwarded-For header in the controller has been fixed.

Rmilter configuration improvements

It is now possible to use += operator to append elements to Rmilter lists (e.g. whitelists) and = to redefine the parameter completely. Hosts lists now can contain hostnames along with IP addresses. List parameters can now be empty to clear lists that are non-empty by default. DKIM signing can be completely disabled in the configuration.

Rmilter regexp rules are removed

Support for regexp rules has been removed from Rmilter. This is an old feature which has never been documented nor used by any users. It was likely broken so I have decided to remove it from Rmilter completely to simplify configuration parser and the overall processing logic. If you are using it then do not update Rmilter and please file a bug report in the GitHub issue tracker.

Rmilter bugfixes

Unconditional greylisting support is now restored in Rmilter. Headers added or removed by Rspamd are now treated by Rmilter correctly.

Rspamd 1.3.3 has been released

2016-08-15 00:00:00 +0000

The new stable version of Rspamd is available: 1.3.3. This release includes a couple of critical bug fixes and important improvements. We recommend updating Rspamd to version 1.3.3 as soon as possible due to a serious error in fuzzy hashes processing.

Important fuzzy hashes incompatibility

There was a serious bug in fuzzy check plugin when using transport encryption and the default fuzzy key. Since encryption is enabled in the default configuration, all users should consider updating their Rspamd packages. Another important consequence of this bug is that the private fuzzy storages should be relearned because of a wrong key used for hashing. To summarize, here is a list of issues for different types of fuzzy storages:

For private fuzzy storages that are used without encryption_key nothing has changed and everything should work as desired
For storages with explicit fuzzy_key everything should work as desired with the exception of attachments hashes: they no longer use custom fuzzy_key for performance and architecture reasons. So it is recommended to relearn such a storage after upgrading to 1.3.3
For storages without explicit fuzzy_key but with encryption_key the recommended action is to upgrade to 1.3.3 and relearn a storage since all old hashes won’t be recognized (there won’t be any false positive hits however).

Rspamd.com fuzzy storage has changed hashing algorithm

Users of rspamd.com storage should either use the provided default configuration for fuzzy_check plugin or update their custom configuration to include the following line in the rule for rspamd.com storage:

rule "rspamd.com" {
  algorithm = "mumhash";
  # The rest of the configuration
}

The default rule provided by Rspamd distribution is now the following:

rule "rspamd.com" {
  algorithm = "mumhash";
  servers = "rspamd.com:11335";
  encryption_key = "icy63itbhhni8bq15ntp5n5symuixf73s1kpjh6skaq4e7nx5fiy";
  symbol = "FUZZY_UNKNOWN";
  mime_types = ["application/*"];
  max_score = 20.0;
  read_only = yes;
  skip_unknown = yes;
  fuzzy_map = {
    FUZZY_DENIED {
      max_score = 20.0;
      flag = 1;
    }
    FUZZY_PROB {
      max_score = 10.0;
      flag = 2;
    }
    FUZZY_WHITE {
      max_score = 2.0;
      flag = 3;
    }
  }
}

Failure to update hashing algorithm will cause Rspamd not to find any hits in the rspamd.com storage.

Support for Redis maps in the Multimap plugin

There is now Redis support in the Multimap plugin. With this feature you can create maps that can be easily scaled and frequently modified. For example, you could use it for temporary records that work as DNS blacklists but using Redis storage.

Hyperscan cache important fix

This version contains an important fix for a Hyperscan caching inconsistency. After a rules change, the regexp IDs stored in the cache were not checked for validity. In turn, that caused random regexp misdetections and false positive detections. In version 1.3.3, Rspamd checks every single ID using a checksum and recompiles the whole cached file when a checksum is invalid.
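The general idea of validating a cache by checksum can be sketched as follows. This is a simplified illustration and not Rspamd's actual implementation: the cache layout, the use of SHA-256 and all function names are assumptions.

```python
import hashlib

def ids_checksum(regexp_ids):
    """Compute a digest over the ordered list of regexp IDs."""
    h = hashlib.sha256()
    for rid in regexp_ids:
        h.update(str(rid).encode("utf-8"))
        h.update(b"\0")  # separator, so that [1, 23] differs from [12, 3]
    return h.hexdigest()

def load_or_recompile(cached, current_ids, compile_fn):
    """Return (database, recompiled) for the current set of regexp IDs.

    cached is either None or a dict with 'checksum' and 'db' keys
    (a hypothetical layout). The cached database is reused only when
    its checksum matches the current IDs; otherwise it is rebuilt.
    """
    expected = ids_checksum(current_ids)
    if cached is not None and cached.get("checksum") == expected:
        return cached["db"], False
    return compile_fn(current_ids), True
```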

SARBL URL black list support has been added to the default configuration

Rspamd now checks URLs against the SARBL list to detect bad or phishing domains in messages.

Lua API improvements

There are a number of improvements in the Lua API shipped with Rspamd:

util.get_tld function has been fixed to find the longest possible TLD
rspamd_url now allows initialization of the library and provides a simpler API to parse URLs in strings
rspamd_cryptobox now provides a one step hashing API
util.strequal_caseless function now works as intended
rspamd_redis now returns nil for data when Redis returns NIL (e.g. when a key is not found)
rspamd_http now always performs DNS request even when maximum number of DNS requests for a message has been reached

Prefilters and postfilters registered in Rspamd are now executed in order defined by their priority:

Prefilters with higher priority are executed first
Postfilters with higher priority are executed last
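The ordering rule above can be expressed as a small sketch. It only models the ordering and does not use the Rspamd Lua API; the filter names in the usage example are hypothetical.

```python
def execution_order(filters, kind):
    """Return filter names in execution order.

    filters: list of (name, priority) pairs.
    kind: 'prefilter' (higher priority runs first) or
          'postfilter' (higher priority runs last).
    """
    if kind == "prefilter":
        ordered = sorted(filters, key=lambda f: f[1], reverse=True)
    elif kind == "postfilter":
        ordered = sorted(filters, key=lambda f: f[1])
    else:
        raise ValueError("kind must be 'prefilter' or 'postfilter'")
    return [name for name, _priority in ordered]
```

For example, with priorities `[("settings", 10), ("whitelist", 5), ("log", 1)]`, prefilters run as settings, whitelist, log, while postfilters with the same priorities run as log, whitelist, settings.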

Rspamd 1.3.2 has been released

2016-08-08 00:00:00 +0000

The next stable version of Rspamd, 1.3.2, is now available. It contains many improvements, bug fixes and another integration method: Communigate Pro helper. Here are the main improvements added in this version:

Important bug fixes

There was a bug introduced in 1.3.0 related to multiple value header processing which broke, for instance, the processing of multiple SMTP recipients. This issue has been fixed in 1.3.2.

Attributes in HTML tags are now HTML decoded to avoid polluting other elements.

The last element in DMARC records is now correctly parsed.

Hfilter module has been reworked to reduce false positive hits rate for hostnames and SMTP helo values.

SPF plugin features

Rspamd now recognizes DNS failures when resolving SPF records and abstains from caching failed lookups. There is now a new symbol R_SPF_DNSFAIL that is inserted when there was a DNS error during resolving of a SPF record. Furthermore, Rspamd will not insert R_SPF_DENY if there was an error looking up records required by policy.
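The caching behaviour described above can be illustrated with a minimal sketch. The class, the resolver callback and the use of OSError to signal a DNS failure are assumptions made for this example; only the R_SPF_DNSFAIL symbol name comes from Rspamd.

```python
class SpfCache:
    """Cache SPF records, but never cache failed DNS lookups."""

    def __init__(self, resolver):
        # resolver: callable taking a domain and returning an SPF record
        # string, or raising OSError on a DNS failure (an assumption).
        self._resolver = resolver
        self._cache = {}

    def lookup(self, domain):
        """Return (record, symbol); symbol is set only on DNS failure."""
        if domain in self._cache:
            return self._cache[domain], None
        try:
            record = self._resolver(domain)
        except OSError:
            # Do not store anything: the next lookup retries the DNS query
            return None, "R_SPF_DNSFAIL"
        self._cache[domain] = record
        return record, None
```

Because a failed lookup leaves the cache untouched, a temporary DNS outage does not cause a stale failure to be reused for later messages from the same domain.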

Better HTML support

There are a couple of features and important bug fixes for the HTML parser introduced in Rspamd 1.3.2. First, there is a new HTML block elements parser that can deal with colors in HTML documents and, in particular, with named colors. Secondly, parsed HTML tags now contain the length of the content enclosed within the tag. And finally, the Lua API is improved with a new foreach_tag method that allows traversing particular HTML tags and examining their content. The existing HTML related rules are updated accordingly to deal with HTML spam better using the new API.

Improved settings matches for authorized users

It is now possible to match any authorized user when applying user settings. It might be useful for applying different settings to authenticated users:

outbound {
  priority = high;
  id = "outbound";
  authenticated = true;

  apply {
    groups_disabled = ["hfilter", "spf", "dkim", "rbl"];
  }
}

Redis integration fix

Rspamd no longer uses the KEYS pattern* command for getting statistics. It was found that this command literally kills Redis performance on large data sets. This mechanism has been reworked to avoid KEYS command.

DKIM support improvements

DKIM signing has been fixed with additional tests added. rspamc utility can now use a DKIM signature header passed by Rspamd to sign email in --mime mode. DKIM header folding issues have been found and fixed in Rspamd 1.3.2.

URL detection fixes

Rspamd now tries to search for the longest possible suffix when matching TLD parts to distinguish between common suffixes of different length, for example .net and .in.net.
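A simple way to picture longest-suffix matching is shown below. It is an illustrative sketch only, with a hypothetical function name and a tiny hand-written suffix set; Rspamd's own matcher is implemented differently.

```python
def longest_suffix(hostname, suffixes):
    """Return the longest entry of suffixes that hostname ends with, or None.

    Candidates are tried from the longest (the full hostname) to the
    shortest (the last label), so the first match is the longest one.
    """
    labels = hostname.lower().rstrip(".").split(".")
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            return candidate
    return None
```

With the suffix set `{"net", "in.net"}`, the host `example.in.net` matches `in.net` rather than `net`, while `example.net` still matches `net`.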