\input texinfo
@settitle "Rspamd Spam Filtering System"
@titlepage

@title Rspamd Spam Filtering System
@subtitle A User's Guide for Rspamd

@author Vsevolod Stakhov


@end titlepage
@contents
@chapter Rspamd purposes and features.

@node introduction
@section Introduction.
The rspamd filtering system was created as a replacement for the popular
@code{spamassassin} spamd and is designed to be a fast, modular and easily
extendable system. The rspamd core is written in @code{C} using the
event-driven paradigm. Plugins for rspamd can be written in @code{lua}. Rspamd
is designed to process connections completely asynchronously and not to block
anywhere in the code. The spam filtering system consists of several processes,
among them:
@itemize @bullet
@item Main process
@item Worker processes
@item Controller process
@item Other processes
@end itemize
The main process manages all other processes: it accepts signals from the OS
(for example SIGHUP) and respawns processes of any type if they die. Worker
processes do all the work of filtering e-mail (or HTML messages when rspamd is
used as a non-MIME filter). The controller process is designed to manage rspamd
itself (for example, to get statistics or to train rspamd). Other processes can
do different jobs; currently implemented are an @code{LMTP} worker that
implements the @code{LMTP} protocol for filtering mail and a fuzzy hashes
storage server.
@node features
@section Features.
The main features of rspamd are:
@itemize @bullet
@item Completely asynchronous filtering that allows a large number of
simultaneous connections.
@item Easily extendable architecture that can be extended by plugins written in
@code{lua} and by dynamically loaded plugins written in @code{c}.
@item Ability to work in a cluster: rspamd is able to synchronize statfiles,
load lists dynamically via HTTP and use distributed fuzzy hashes storage.
@item Advanced statistics: rspamd is currently shipped with the winnow-osb
classifier, which provides more accurate statistics than traditional bayesian
algorithms based on single words.
@item Internal optimizer: rspamd first checks the rules that matched most
often, so during huge spam storms it works very fast, as it checks only the
rules that @emph{can} match and skips all others.
@item Ability to manage the whole cluster by using the controller process.
@item Compatibility with the existing @code{spamassassin} SPAMC protocol.
@item Extended @code{RSPAMC} protocol that allows passing additional data
from the SMTP dialog to the rspamd filter.
@item Internal support of IMAP in the rspamc client for automated learning.
@item Internal support of many anti-spam technologies, among them
@code{SPF} and @code{SURBL}.
@item Active support and development of new features.
@end itemize
@chapter Installation of rspamd.

@node obtaining
@section Obtaining rspamd.

The main rspamd site is @url{http://rspamd.sourceforge.net/, sourceforge}. There
you can obtain the source code package as well as pre-built packages for
different operating systems and architectures. You can also use the
@url{http://mercurial.selenic.com, mercurial} SCM to access the rspamd
development repository, which can be found here:
@url{http://rspamd.hg.sourceforge.net:8000/hgroot/rspamd/rspamd}. Rspamd is
shipped with all modules and a sample config by default. However, there are some
requirements for building and running rspamd.
@node requirements
@section Requirements.

To build rspamd from sources you need the @code{CMake} build system. CMake is a
very nice source building system and I decided to use it instead of GNU
autotools. CMake can be obtained here: @url{http://cmake.org}. Rspamd also uses
gmime and glib for MIME parsing and many other purposes (note that you are NOT
required to install any GUI libraries: neither glib nor gmime is a GUI library).
Gmime and glib can be obtained from the gnome site: @url{http://ftp.gnome.org/}.
For plugins and the configuration system you also need the lua language
interpreter and libraries. They can be easily obtained from the
@url{http://lua.org, official lua site}. The rspamc client also requires the
@code{perl} interpreter, which can be installed from @url{http://www.perl.org}.
@node building
@section Building and Installation.

The build process of rspamd is rather simple:
@itemize @bullet
@item Configure the rspamd build environment using cmake:
@example
$ cmake .
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/cebka/rspamd
@end example
@noindent
For special configuration options you can use
@example
$ ccmake .
CMAKE_BUILD_TYPE
CMAKE_INSTALL_PREFIX /usr/local
DEBUG_MODE ON
ENABLE_GPERF_TOOLS OFF
ENABLE_OPTIMIZATION OFF
ENABLE_PERL OFF
ENABLE_PROFILING OFF
ENABLE_REDIRECTOR OFF
ENABLE_STATIC OFF
@end example
@noindent
These options allow building rspamd as a static module (note that in this case
dynamically loaded plugins are @strong{NOT} supported), linking rspamd with
google performance tools for benchmarking, and including some other flags while
building.
@item Build rspamd sources:
@example
$ make
[ 6%] Built target rspamd_lua
[ 11%] Built target rspamd_json
[ 12%] Built target rspamd_evdns
[ 12%] Built target perlmodule
[ 58%] Built target rspamd
[ 76%] Built target test/rspamd-test
[ 85%] Built target utils/expression-parser
[ 94%] Built target utils/url-extracter
[ 97%] Built target rspamd_ipmark
[100%] Built target rspamd_regmark
@end example
@noindent
@item Install rspamd (as superuser):
@example
# make install
Install the project...
...
@end example
@noindent
@end itemize
After installation you will have several new files installed:
@itemize @bullet

@item Binaries:
@itemize @bullet
@item PREFIX/bin/rspamd - main rspamd executable
@item PREFIX/bin/rspamc - rspamd client program
@end itemize
@item Sample configuration files and rules:
@itemize @bullet
@item PREFIX/etc/rspamd.xml.sample - sample main config file
@item PREFIX/etc/rspamd/lua/*.lua - rspamd rules
@end itemize
@item Lua plugins:
@itemize @bullet
@item PREFIX/etc/rspamd/plugins/lua/*.lua - lua plugins
@end itemize

@end itemize
On @code{FreeBSD} systems a start script for running rspamd is also installed as
@emph{PREFIX/etc/rc.d/rspamd.sh}.
@node running
@section Running rspamd.

Rspamd can be started by running the main rspamd executable,
@code{PREFIX/bin/rspamd}. There are several command-line options that can be
passed to rspamd. All of them can be displayed by passing the @option{--help}
argument:
@example
$ rspamd --help
Usage:
rspamd [OPTION...] - run rspamd daemon

Summary:
Rspamd daemon version 0.3.0

Help Options:
-?, --help Show help options

Application Options:
-t, --config-test Do config test and exit
-f, --no-fork Do not daemonize main process
-c, --config Specify config file
-u, --user User to run rspamd as
-g, --group Group to run rspamd as
-p, --pid Path to pidfile
-V, --dump-vars Print all rspamd variables and exit
-C, --dump-cache Dump symbols cache stats and exit
-X, --convert-config Convert old style of config to xml one
@end example
@noindent

All options are optional: by default rspamd tries to read the
@code{PREFIX/etc/rspamd.xml} config file and runs as a daemon. There is also a
test mode that can be turned on by passing the @option{-t} argument. In test
mode rspamd reads the config file and checks its syntax; if the config file is
OK the exit code is zero, and non-zero otherwise. Test mode is useful for
testing a new config file without restarting rspamd. With the @option{-V} and
@option{-C} arguments it is possible to dump variables or symbols cache data.
The latter can be used to determine which symbols are most frequent and which
are slowest, and to see the real order of rules inside rspamd. The @option{-X}
option can be used to convert an old-style (pre 0.3.0) config to an xml one:
@example
$ rspamd -c ./rspamd.conf -X ./rspamd.xml
@end example
@noindent
After this command the new xml config is dumped to the rspamd.xml file.
@node signals
@section Managing rspamd with signals.
First of all, it is important to note that all user signals should be sent to
the rspamd main process and not to its children (for child processes these
signals may have other meanings). There are two ways to determine which process
is the main one:
@itemize @bullet
@item by reading the pidfile:
@example
$ cat pidfile
@end example
@noindent
@item by getting process info:
@example
$ ps auxwww | grep rspamd
nobody 28378 0.0 0.2 49744 9424 rspamd: main process (rspamd)
nobody 64082 0.0 0.2 50784 9520 rspamd: worker process (rspamd)
nobody 64083 0.0 0.3 51792 11036 rspamd: worker process (rspamd)
nobody 64084 0.0 2.7 158288 114200 rspamd: controller process (rspamd)
nobody 64085 0.0 1.8 116304 75228 rspamd: fuzzy storage (rspamd)

$ ps auxwww | grep rspamd | grep main
nobody 28378 0.0 0.2 49744 9424 rspamd: main process (rspamd)
@end example
@noindent
@end itemize

Once you know the pid of the main process, you can manage rspamd with signals:
@itemize @bullet
@item SIGHUP - restart rspamd: reread the config file, start new workers (as
well as the controller and other processes), stop accepting connections in old
workers, and reopen all log files. Note that old workers are terminated after
one minute, which should allow them to process all pending requests. All new
requests to rspamd are processed by the newly started workers.
@item SIGTERM - terminate the rspamd system.
@end itemize

These signals may be used in start scripts, as is done in the @code{FreeBSD}
start script. Rspamd restarts gracefully: no connections are dropped, and if the
new config is syntactically incorrect the old config is kept.
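As an illustration of the pidfile approach described above, the following
Python sketch reads the pid of the main process and sends it SIGHUP. It is only
a sketch: the function names and the pidfile path mentioned in the usage note
are assumptions made for this example and are not part of rspamd.

```python
import os
import signal


def read_main_pid(pidfile_path):
    """Return the pid stored in the pidfile as an integer."""
    with open(pidfile_path) as handle:
        return int(handle.read().strip())


def reload_rspamd(pidfile_path, send=os.kill):
    """Send SIGHUP to the main process named in the pidfile.

    The send argument exists only so that the sketch can be exercised
    without signalling a real process.
    """
    pid = read_main_pid(pidfile_path)
    send(pid, signal.SIGHUP)
    return pid
```

Calling reload_rspamd with the path of your pidfile (for instance a
hypothetical /var/run/rspamd.pid) is meant to have the same effect as running
kill -HUP on the pid printed by cat pidfile.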
@chapter Configuring rspamd.

@node principles
@section Principles of work.

We need to define several terms to explain the configuration of rspamd. Rspamd
operates with @strong{rules}; each rule defines some actions that should be
applied to a message to obtain a result. The result is called a @strong{symbol} -
a symbolic representation of the rule. For example, if we have a rule that
checks the DNS record for a url contained in the message, we may insert the
resulting symbol if this DNS record is found. Each symbol has several
attributes:
@itemize @bullet
@item name - symbolic name of the symbol (usually uppercase, e.g. MIME_HTML_ONLY)
@item weight - numeric weight of this symbol (that is, how important this rule
is); may be negative
@item options - list of symbolic options that carry additional information about
the processing of this rule
@end itemize

Weights of symbols are called @strong{factors}. When a symbol is inserted it
is also possible to define an additional multiplier for its factor. This can be
used for rules that have dynamic weights, for example statistical rules (the
higher the probability, the higher the weight must be).

All symbols and corresponding rules are combined in @strong{metrics}. A metric
defines a group of symbols that are designed for a common purpose. Each metric
has a maximum weight: if the sum of all rules' results (symbols) is bigger than
this limit then the message is considered spam in this metric. The default
metric is called @emph{default}, and rules that do not explicitly specify a
metric insert their results into this default metric.

Let's see how this technique works:
@enumerate 1
@item First of all, when rspamd starts, each module (lua, internal or external
dynamic module) can register symbols in any defined metric. After this step
rspamd has a cache of symbols for each metric. This cache can be saved to a file
to speed up the process of optimizing the order in which symbols are called.
@item Rspamd gets a message from a client, parses it with the mime parser and
does other parsing jobs such as extracting text parts and urls and stripping
html tags.
@item For each metric rspamd looks into the metric's cache and selects rules to
check according to their order (this order depends on the frequency of the
symbol, its weight and its execution time).
@item Rspamd calls the rules of a metric while the total weight of symbols in
the metric is less than its limit.
@item If the total weight of symbols exceeds the limit, the processing of rules
is stopped and the message is counted as spam in this metric.
@end enumerate
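The last two steps above can be sketched in a few lines of Python. This is a
simplified model rather than rspamd code: the function name, the tuple layout
of a rule and the assumption that the rules arrive already sorted by the
symbols cache are all made up for the illustration.

```python
def process_metric(message, rules, limit):
    """Check rules in cache order until the metric limit is exceeded.

    rules is a list of (symbol, weight, callback) tuples that is assumed
    to be sorted already; callback(message) returns True on a match.
    """
    total = 0.0
    symbols = []
    for symbol, weight, callback in rules:
        if total > limit:
            break  # limit exceeded: stop calling further rules
        if callback(message):
            symbols.append(symbol)
            total += weight
    return total > limit, total, symbols
```

With a limit of 2.0 and three matching rules of weight 1.5, 1.0 and 5.0, the
third rule is never called: the first two already bring the total to 2.5 and
the message is counted as spam in this metric.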
After processing rules rspamd also performs a statistical check of the message.
The rspamd statistics module is presented as a set of @strong{classifiers}. Each
classifier defines an algorithm for the statistical checking of messages. A
classifier definition also contains definitions of @strong{statistic files} (or
@strong{statfiles} for short). Each statfile contains a number of patterns that
are extracted from messages. These patterns are put into statfiles during the
learning process. A short example: you define a classifier that contains two
statfiles: @emph{ham} and @emph{spam}. Then you find 10000 messages that are
spam and 10000 messages that are ham, and you train rspamd with these messages.
After this process the @emph{ham} statfile contains patterns from ham messages
and the @emph{spam} statfile contains patterns from spam messages. When you then
check a message against these statfiles, a message that looks like spam gets a
higher probability/weight in the @emph{spam} statfile than in the @emph{ham}
statfile, and the classifier inserts the symbol of the @emph{spam} statfile and
calculates how closely the message resembles the patterns contained in the
@emph{spam} statfile. Rspamd does not limit you to one classifier or two
statfiles, however. It is possible to define a number of classifiers and a
number of statfiles inside a classifier. This can be useful for personal
statistics or for specific spam patterns. Note that each classifier can insert
only one symbol - the symbol of the statfile with the maximum
weight/probability. Also note that the statfiles check is always done after all
rules, so statistics can @strong{correct} the result of rules.

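The rule that a classifier inserts only the symbol of the best matching
statfile can be illustrated with the following Python sketch. The weight used
here is a plain count of shared tokens, which is only a stand-in for the real
classifier algorithm; the function name and the SPAM and HAM symbol names are
assumptions of the example.

```python
def classify(tokens, statfiles):
    """Pick the statfile symbol with the highest weight for a message.

    statfiles maps a symbol name to the set of patterns learned for it.
    Returns (None, 0) when no statfile matches at all.
    """
    best_symbol = None
    best_weight = 0
    for symbol, patterns in statfiles.items():
        # Toy weight: number of message tokens known to this statfile.
        weight = sum(1 for token in tokens if token in patterns)
        if weight > best_weight:
            best_symbol, best_weight = symbol, weight
    return best_symbol, best_weight
```

A message whose tokens overlap mostly with the patterns learned from spam gets
the SPAM symbol, and a message that matches nothing gets no symbol from this
classifier.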
Now some words about @strong{modules}. All rspamd rules are contained in
modules. Modules can be internal (like SURBL, SPF, fuzzy check, email and
others) or external, written in the @code{lua} language. In fact there is no
difference in the way the rules of these modules are called:
@enumerate 1
@item Rspamd loads the config and loads the specified modules.
@item Rspamd calls the init function of each module, passing it the
configuration arguments.
@item Each module examines the configuration arguments and registers its rules
(or does not register them, depending on the configuration) in rspamd metrics
(or in a single metric).
@item During metric processing rspamd calls the registered callbacks for the
module's rules.
@item These rules may insert results into the metric.
@end enumerate

So there is no actual difference between lua and internal modules: both just
provide callbacks for processing messages. Inside a callback it is also possible
to change the state of message processing. For example, this can be done when it
is required to make a DNS or other network request and wait for the result. So
modules can pause message processing while waiting for some event. This is true
for lua modules as well.
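The module life cycle listed above (load, init, register, callback, insert
result) can be modelled with a short Python sketch. Everything in it is
hypothetical: the Metric class, the init function signature and the toy HTML
check are assumptions chosen for the illustration and do not mirror the real
rspamd module API.

```python
class Metric(object):
    """Holds the callbacks that modules register for their rules."""

    def __init__(self, name):
        self.name = name
        self.rules = []

    def register(self, symbol, callback):
        self.rules.append((symbol, callback))

    def process(self, message):
        """Call every registered callback and collect inserted symbols."""
        results = []
        for symbol, callback in self.rules:
            if callback(message):
                results.append(symbol)
        return results


def init_html_module(metric, options):
    """Init function of a hypothetical module: register rules if enabled."""
    if options.get("enabled", True):
        metric.register("MIME_HTML_ONLY",
                        lambda message: message.lstrip().startswith("<html"))


def load_modules(metric, modules, config):
    """Call the init function of each configured module."""
    for name, init in modules:
        init(metric, config.get(name, dict()))
```

The core only sees registered callbacks, so in this model a module written in
another language would plug in the same way as long as it provides an init
function.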
@node config structure
@section Rspamd config file structure.

The rspamd config file is placed in PREFIX/etc/rspamd.xml by default. You can
specify another location by passing the @option{-c} option to rspamd. The rspamd
config file contains configuration parameters in XML format. XML was selected
because it is rather simple to edit manually, simple to generate automatically,
and suitable for dynamic configuration. I've decided to move the rules logic out
of the XML file to keep it small and simple. So rules are defined in the
@code{lua} language and rspamd parameters are defined in the xml file
(rspamd.xml). Configuration rules are included with the @strong{<lua>} tag,
whose @strong{src} attribute defines the path to the lua file relative to the
location of rspamd.xml:
@example
<lua src="rspamd/lua/rspamd.lua">fake</lua>
@end example
@noindent
Note that it is not currently possible to have empty tags. I hope this
restriction will be lifted in the future. The rspamd xml config consists of
several sections:
@itemize @bullet
@item Main section - section where main config parameters are placed.
@item Workers section - section where workers are described.
@item Classifiers section - section where you define your classification logic
@item Modules section - a set of sections that describe module rules (in fact
these rules should be in lua code)
@item Factors section - a section where you can set numeric values for symbols
@item Logging section - a section that describes rspamd logging
@item Views section - a section that defines rspamd views
@end itemize

So the common structure of rspamd.xml can be described this way:
@example
<?xml version="1.0" encoding="utf-8"?>
<rspamd>
<!-- Main section directives -->
...
<!-- Workers directives -->
<worker>
...
</worker>
...
<!-- Classifiers directives -->
<classifier>
...
</classifier>
...
<!-- Factors -->
<factors>
<factor name="MIME_HTML_ONLY">1.1</factor>
...
</factors>
<!-- Logging section -->
<logging>
<type>console</type>
<level>info</level>
...
</logging>
<!-- Views section -->
<view>
...
</view>
...
<!-- Modules settings -->
<module name="regexp">
<option name="test">test</option>
...
</module>
...
</rspamd>
@end example

Each of these sections is described in detail below.
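Because the config is plain XML, generic tools can read it. The following
Python sketch uses the standard xml.etree.ElementTree module to pull factors,
the logging level and module options out of a reduced sample. The sample text
and the helper names are assumptions made for the example; this is not the
parser rspamd itself uses.

```python
import xml.etree.ElementTree as ET

# Reduced sample: the elided parts of the structure shown above are left out.
SAMPLE_CONFIG = """
<rspamd>
  <factors>
    <factor name="MIME_HTML_ONLY">1.1</factor>
  </factors>
  <logging>
    <type>console</type>
    <level>info</level>
  </logging>
  <module name="regexp">
    <option name="test">test</option>
  </module>
</rspamd>
"""


def read_factors(root):
    """Map each factor name to its numeric value."""
    return dict((node.get("name"), float(node.text))
                for node in root.findall("./factors/factor"))


def read_module_options(root, module_name):
    """Collect the option name/value pairs of one module section."""
    options = dict()
    for module in root.findall("./module"):
        if module.get("name") == module_name:
            for option in module.findall("option"):
                options[option.get("name")] = option.text
    return options


root = ET.fromstring(SAMPLE_CONFIG.strip())
```

A script like this can be handy for generating reports about a config or for
checking a generated file before passing it to rspamd.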
@node config atoms
@section Rspamd configuration atoms.

There are several primitive types of rspamd configuration parameters:
@itemize @bullet
@item String - common string that defines an option.
@item Number - integer or fractional number (e.g.: 10 or -1.5).
@item Time - amount of time in milliseconds; may have a suffix:
@itemize @bullet
@item @emph{s} - for seconds (e.g. @emph{10s});
@item @emph{m} - for minutes (e.g. @emph{10m});
@item @emph{h} - for hours (e.g. @emph{10h});
@item @emph{d} - for days (e.g. @emph{10d});
@end itemize
@item Size - like a number, a numeric representation of size, but may have a
suffix:
@itemize @bullet
@item @emph{k} - 'kilo' - number * 1024 (e.g. @emph{10k});
@item @emph{m} - 'mega' - number * 1024 * 1024 (e.g. @emph{10m});
@item @emph{g} - 'giga' - number * 1024 * 1024 * 1024 (e.g. @emph{1g});
@end itemize
@noindent
Size atoms are used, for example, for memory limits.
@item Lists - path to a dynamic rspamd list (e.g. @emph{http://some.host/some/path}).
@end itemize
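The time and size atoms listed above can be converted to plain numbers as in
the following Python sketch. The function names are hypothetical, and treating
the suffix as case-sensitive and allowing fractional values are assumptions of
the example rather than documented rspamd behaviour.

```python
TIME_MULTIPLIERS = dict(s=1000, m=60 * 1000, h=60 * 60 * 1000,
                        d=24 * 60 * 60 * 1000)
SIZE_MULTIPLIERS = dict(k=1024, m=1024 ** 2, g=1024 ** 3)


def _parse_with_suffix(text, multipliers):
    """Parse a number with an optional one-letter suffix."""
    text = text.strip()
    suffix = text[-1:]
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)


def parse_time(text):
    """Return the time atom converted to milliseconds."""
    return _parse_with_suffix(text, TIME_MULTIPLIERS)


def parse_size(text):
    """Return the size atom converted to bytes."""
    return _parse_with_suffix(text, SIZE_MULTIPLIERS)
```

For example, parse_time applied to 10s gives 10000 milliseconds and parse_size
applied to 10k gives 10240 bytes. Note that the suffix m means minutes for a
time atom but 'mega' for a size atom, so the type of the parameter decides
which table is used.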
While practically all atoms are rather trivial to understand, rspamd lists may
cause some confusion. Lists are widely used in rspamd for data that changes
often, for example white or black lists, lists of ip addresses, or lists of
domains. For such purposes it is possible to use files that are fetched either
from the local filesystem (e.g. @code{file:///var/run/rspamd/whitelist}) or by
HTTP (e.g. @code{http://some.host/some/path/list.txt}). Rspamd constantly
watches these files for changes; when using HTTP it also sets the
@emph{If-Modified-Since} header and checks for a @emph{Not modified} reply. This
causes no overhead when lists are not modified and makes it possible to store
huge lists and to distribute them over HTTP. Monitoring of lists is done with
some random delay (jitter), so if you have many rspamd servers in a cluster that
are monitoring a single list, they check or download it at slightly different
times. The two most common list formats are the @emph{IP list} and the
@emph{domains list}. An IP list contains ip addresses in dot notation (e.g.
@code{192.168.1.1}) or ip/network pairs in CIDR notation (e.g.
@code{172.16.0.0/16}). Items in lists are separated by newlines. Lines that
begin with the @emph{#} symbol are treated as comments and are ignored while
parsing. A domains list is very similar to an ip list, except that it contains
domain names.
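The IP list format described above is simple enough to parse with the Python
standard library, as the following sketch shows. The function names are
hypothetical, and skipping empty lines is an assumption of the example; only
the comment rule and the two address notations come from the description.

```python
import ipaddress


def parse_ip_list(text):
    """Parse list text into network objects, skipping comment lines."""
    networks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A single address becomes a network that holds just that address.
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def list_contains(networks, address):
    """Return True if the address belongs to any entry of the list."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)
```

A list holding 192.168.1.1 and 172.16.0.0/16 matches 172.16.5.4 through the
CIDR entry, while 10.0.0.1 is not matched by either entry.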
@section Main rspamd configuration section.

The main rspamd configuration section contains several definitions that
determine the main parameters of rspamd, for example the path to the pidfile,
the temporary directory, lua includes, several limits, etc. Here is the list of
these directives:

@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning

@item @var{<tempdir>}
@tab Defines the temporary directory for rspamd. The default is to use the
@env{TEMP} environment variable or @code{/tmp}.

@item @var{<pidfile>}
@tab Path to the rspamd pidfile, where the pid of the main process is stored.
The pidfile is used to manage rspamd from start scripts.

@item @var{<statfile_pool_size>}
@tab Limit of the statfile pool size: the total number of bytes that can be used
for mapping statistic files. Rspamd uses an LRU scheme and unmaps the least
recently used statfile when this limit is reached. It makes sense to set this
variable equal to the total size of all statfiles, but it can be less than that
in the case of dynamic statfiles (for per-user statistics).

@item @var{<filters>}
@tab List of enabled internal filters. Items in this list can be separated by
spaces, semicolons or commas. If an internal filter is not specified in this
line it is not loaded or enabled.

@item @var{<raw_mode>}
@tab Boolean flag that specifies whether rspamd should try to convert all
messages to UTF8 or not. If @var{raw_mode} is enabled all messages are
processed @emph{as is} and are not converted. Raw mode is faster than utf mode
but it may confuse statistics and regular expressions.

@item @var{<lua>}
@tab Defines the path to a lua file that should be loaded for configuration. The
path to this file is defined in the @strong{src} attribute. Text inside the tag
is required but is not parsed (this is a stupid limitation of the parser's
design).
@end multitable
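The LRU behaviour of the statfile pool can be pictured with the following
Python sketch. It only tracks sizes and access order and does not map any
files; the class name, its methods and the byte figures in the usage note are
assumptions made for the illustration.

```python
from collections import OrderedDict


class StatfilePool(object):
    """Tracks mapped statfiles and evicts the least recently used ones."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.mapped = OrderedDict()  # path -> size in bytes, oldest first

    def used_bytes(self):
        return sum(self.mapped.values())

    def access(self, path, size):
        """Mark a statfile as used, mapping it if necessary.

        Returns the list of paths that had to be unmapped to make room.
        """
        if path in self.mapped:
            self.mapped.move_to_end(path)
            return []
        evicted = []
        while self.mapped and self.used_bytes() + size > self.max_bytes:
            oldest, _ = self.mapped.popitem(last=False)
            evicted.append(oldest)
        self.mapped[path] = size
        return evicted
```

With a pool of 100 bytes and statfiles of 40 bytes each, mapping a third
statfile unmaps whichever of the first two was used least recently. This is
why a pool smaller than the total size of all statfiles can still work for
per-user statistics, where only some statfiles are in use at a time.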
@section Rspamd logging configuration.

Rspamd has a number of logging options. First of all, three types of logs are
supported by rspamd: console logging (just output log messages to the console),
file logging (output log messages to a file) and logging via syslog. It is also
possible to restrict logging to a specific level:
@itemize @bullet
@item error - log only critical errors
@item warning - log errors and warnings
@item info - log all non-debug messages
@item debug - log everything including debug messages (huge amount of logging)
@end itemize
It is also possible to turn on debug messages for specific ip addresses. This
ability is useful for testing.

Each logging type has its own mandatory parameters: the log facility for syslog
(read the @emph{syslog (3)} manual page for details about facilities) and the
log file for file logging. File logging may also be buffered to speed it up. To
reduce logging noise rspamd detects sequential identical log messages and
replaces them with the total number of repeats:
@example
#81123(fuzzy): May 11 19:41:54 rspamd file_log_function: Last message repeated 155 times
#81123(fuzzy): May 11 19:41:54 rspamd process_write_command: fuzzy hash was successfully added
@end example
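The idea of collapsing repeated log messages can be sketched in Python as
follows. This is not the rspamd implementation: the function name is
hypothetical, and the choice to print the first occurrence and then count only
the additional repeats is an assumption of the example.

```python
def collapse_repeats(messages):
    """Replace runs of identical messages with a repeat counter line."""
    output = []
    last = None
    repeats = 0
    for message in messages:
        if message == last:
            repeats += 1
            continue
        if repeats:
            output.append("Last message repeated %d times" % repeats)
        output.append(message)
        last = message
        repeats = 0
    if repeats:
        output.append("Last message repeated %d times" % repeats)
    return output
```

Only consecutive duplicates are merged, so the order of distinct messages in
the log is preserved.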
Here is a summary of the logging parameters:


@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
@item @var{<type>}
@tab Defines the logging type (file, console or syslog). Depending on the type,
a mandatory attribute must be present:
@itemize @bullet
@item @emph{filename} - path to the log file for the file logging type;
@item @emph{facility} - syslog logging facility.
@end itemize

@item @var{<level>}
@tab Defines the logging level (error, warning, info or debug).

@item @var{<log_buffer>}
@tab For file and console logging, defines the size of the buffer in bytes
(kilo, mega or giga bytes) that is used for logging output.

@item @var{<log_urls>}
@tab Flag that defines whether all urls in a message are logged. Useful for
testing.

@item @var{<debug_ip>}
@tab List of ip addresses for which debugging is turned on. For more
information about ip lists see @ref{config atoms}.
@end multitable
@section Factors configuration.

Setting rspamd factors is the main way to change the weights of rules. Factors
set the weights for all rules: for those that have static weights (for example
simple regexp rules) and for those that have dynamic weights (for example
statistical rules). In all cases the base weight of a rule is multiplied by the
factor value. For static rules the base weight is usually 1.0. So we have:
@itemize @bullet
@item @math{w_{symbol} = w_{static} * factor} - for static rules
@item @math{w_{symbol} = w_{dynamic} * factor} - for dynamic rules
@end itemize
It is also possible to set a so-called "grow factor" - an additional multiplier
that is used when there is more than one symbol in a metric. For each added
symbol the power of this factor is incremented. This can be written as:
@math{w_{total} = w_1 * gf ^ 0 + w_2 * gf ^ 1 + ... + w_n * gf ^ {n - 1}}
The grow multiplier is used to increase the weight of rules when a message gets
many symbols (and is therefore likely spam). Note that only rules with positive
weights increase the power of the grow factor; those with negative weights are
just added. Also note that the grow factor can be less than 1, but this is
uncommon (in that case the weight is lowered when there are many symbols for the
message). Factors can be set up in the @emph{factors} config section:
@example
<factors>
<factor name="MIME_HTML_ONLY">0.1</factor>
<grow_factor>1.1</grow_factor>
</factors>
@end example
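The weight formulas above can be checked with a small Python sketch. The
function names are hypothetical, and the way negative weights are interleaved
(added as they come, without touching the current power of the grow factor) is
one reading of the description, not a statement about the rspamd source.

```python
def symbol_weight(base_weight, factor):
    """Weight of a single symbol: base weight times the configured factor."""
    return base_weight * factor


def total_weight(weights, grow_factor=1.0):
    """Sum symbol weights, applying the grow factor to positive ones."""
    total = 0.0
    multiplier = 1.0  # gf ^ 0 for the first positive symbol
    for weight in weights:
        if weight > 0:
            total += weight * multiplier
            multiplier *= grow_factor
        else:
            total += weight  # negative weights are added unchanged
    return total
```

With weights 1.0, 2.0, -1.0 and 1.0 and a grow factor of 2.0 the total is
1.0 + 2.0 * 2 - 1.0 + 1.0 * 4 = 8.0, while with the default grow factor of 1.0
the same symbols simply add up to 3.0.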
Note that you generally need to add a factor whenever you add a rule. The weight
to give a newly added rule depends mainly on its importance. For example, if you
are absolutely sure that some rule adds a symbol only to spam messages, you can
increase the weight of that rule so that it filters such spam. But if you
increase the weight of a rule you should be reasonably sure that this does not
raise the false positive rate to an unacceptable level (false positives are
errors where good mail is treated as spam). Rspamd comes with a set of default
rules, and the default weights of those rules are placed in rspamd.xml.sample.
In most cases it is reasonable to adjust them for your mail system, for example
to increase the weights of some rules and decrease the weights of others. Also
note that the default grow factor is 1.0, which means that the weights of rules
do not depend on the number of added symbols. In some situations it is useful to
set the grow factor to a value greater than 1.0. By modifying factors it is also
possible to manage the static multiplier for dynamic rules.
@section Workers configuration.

Workers are rspamd processes that do specific jobs. Four types of workers are
currently supported:
@enumerate 1
@item Normal worker - a typical worker that processes messages.
@item Controller worker - a worker that manages rspamd, gets statistics and
performs learning tasks.
@item Fuzzy storage worker - a worker that holds a collection of fuzzy
hashes.
@item LMTP worker - an experimental worker that acts as an LMTP server.
@end enumerate

These types of workers have some common parameters:
@multitable @columnfractions .2 .8
@headitem Parameter @tab Meaning
@item @emph{<type>}
@tab Type of worker (normal, controller, lmtp or fuzzy)
@item @emph{<bind_socket>}
@tab Socket address to bind this worker to. Inet and unix sockets are supported:
@example
<bind_socket>localhost:11333</bind_socket>
<bind_socket>/var/run/rspamd.sock</bind_socket>
@end example
@noindent
For inet sockets you may also specify @code{*} as the address to bind to all
available inet interfaces:
@example
<bind_socket>*:11333</bind_socket>
@end example
@noindent
@item @emph{<count>}
@tab Number of worker processes of this type. By default this number is
equal to the number of logical processors in the system.
@item @emph{<maxfiles>}
@tab Maximum number of file descriptors available to this worker process.
@item @emph{<maxcore>}
@tab Maximum size of the core file that is dumped in case of critical errors
(in mega/kilo/giga bytes).
@end multitable
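The three forms of the bind_socket value shown above can be told apart as in
the following Python sketch. The function name and the returned tuple layout
are hypothetical, and the rule that a value starting with a slash is a unix
socket path is an assumption drawn from the examples, not from the rspamd
source.

```python
def parse_bind_socket(value):
    """Return ("unix", path) or ("inet", host, port) for a bind_socket value."""
    value = value.strip()
    if value.startswith("/"):
        return ("unix", value)
    host, separator, port = value.rpartition(":")
    if not separator:
        raise ValueError("expected host:port or an absolute path")
    if host == "*":
        host = ""  # Python's socket.bind treats "" as all interfaces
    return ("inet", host, int(port))
```

The inet result can be passed on to socket.bind as the pair (host, port), and
the unix result names the path for an AF_UNIX socket.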
Also each of workers types can have specific parameters:
|
|
@itemize @bullet
|
|
@item Normal worker:
|
|
@itemize @bullet
|
|
@item @var{<custom_filters>} - path to dynamically loaded plugins that would do real
|
|
check of incoming messages. These modules are described further.
|
|
@item @var{<mime>} - if this parameter is "no" than this worker assumes that incoming
|
|
messages are in non-mime format (e.g. forum's messages) and standart mime
|
|
headers are added to them.
|
|
@end itemize
|
|
@item Controller worker:
|
|
@itemize @bullet
|
|
@item @var{<password>} - a password that would be used to access to contorller's
|
|
privilleged commands.
|
|
@end itemize
|
|
@item Fuzzy worker:
|
|
@itemize @bullet
|
|
@item @var{<hashfile>} - a path to file where fuzzy hashes would be permamently stored.
|
|
@item @var{<use_judy>} - if libJudy is present in system use it for faster storage.
|
|
@item @var{<frequent_score>} - if judy is not turned on use this score to place hashes
|
|
with score that is more than this value to special faster list (this is designed
|
|
to increase lookup speed for frequent hashes).
|
|
@item @var{<expire>} - time to expire of fuzzy hashes after their placement in storage.
|
|
@end itemize
|
|
@end itemize
|
|
|
|
These parameters can be set inside the worker's definition:
@example
<worker>
  <type>fuzzy</type>
  <bind_socket>*:11335</bind_socket>
  <count>1</count>
  <maxfiles>2048</maxfiles>
  <maxcore>0</maxcore>
  <!-- Other params -->
  <param name="use_judy">yes</param>
  <param name="hashfile">/spool/rspamd/fuzzy.db</param>
  <param name="expire">10d</param>
</worker>
@end example
@noindent
The purpose of each worker type is described later. The main parameters
that can be defined are the bind sockets for workers, their count, the password
for the controller's commands and the parameters of the fuzzy storage. The
default config provides reasonable values for these parameters (except the
password, of course), so for a basic configuration you only need to replace the
controller's password with a more secure one.

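
For instance, a controller worker with its own password could be sketched as
follows. This is only an illustration: the worker type name @code{controller},
the bind address and the password value are assumptions, not values taken from
the default config.
@example
<worker>
  <type>controller</type>
  <!-- Assumed address: adjust it to your setup -->
  <bind_socket>localhost:11334</bind_socket>
  <count>1</count>
  <!-- Placeholder: replace with your own secret -->
  <param name="password">CHANGE_ME</param>
</worker>
@end example
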
@section Classifiers configuration.

@subsection Common classifiers options.

Each classifier has a mandatory option @var{type} that defines the internal
algorithm used for classifying. Currently only @code{winnow} is supported. A
theoretical description of the algorithm can be found here:
@url{http://www.siefkes.net/papers/winnow-spam.pdf}

The common classifier configuration consists of base classifier parameters and
definitions of two (or more) statfiles. During classification rspamd
checks each statfile in the classifier and selects the one that has a higher
probability/weight than the others. If all statfiles have zero weight, the
classifier does not add any symbols. The common classifier options are:
@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
@item @var{<tokenizer>}
@tab Tokenizer used to extract tokens from messages. Currently only the
@emph{osb} tokenizer is supported (the examples in this guide specify it as
@code{osb-text}).
@item @var{<metric>}
@tab Metric into which this classifier inserts its symbol.
@end multitable

The option @var{min_tokens} is also supported; it specifies the minimum number
of tokens to work with (this is useful to avoid classifying short messages, as
statistics are practically useless for a small number of tokens). Here is an
example of a base classifier config:
@example
<classifier type="winnow">
  <tokenizer>osb-text</tokenizer>
  <metric>default</metric>
  <option name="min_tokens">20</option>
  <statfile>
  ...
  </statfile>
</classifier>
@end example

@subsection Statfiles options.

The most common statfile options are @var{symbol} and @var{size}. The first one
defines which symbol is inserted if this statfile has the maximal weight inside
the classifier, and @var{size} defines the statfile size on disk and in memory.
Note that statfiles are mapped directly into memory, so you should pay attention
to the @var{statfile_pool_size} parameter of the main section, which defines the
maximum amount of memory for mapping statistic files. Also note that statistic
files have a constant size: if you define a 100 megabyte statfile, it occupies
100 megabytes of disk space and 100 megabytes of memory when it is used (mapped).
Each statfile is indexed by tokens and contains so-called "token chains". This
mechanism is described later, but note that each statfile has a "free tokens"
parameter that defines how much space is available for new tokens. If a statfile
has no free space, the least used tokens are removed from it.

Here is the list of common statfile options:
@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
@item @var{<symbol>}
@tab Defines the symbol to insert for this statfile.
@item @var{<size>}
@tab Size of this statfile in bytes (or in kilo/mega/gigabytes, e.g. @code{100M}).
@item @var{<path>}
@tab Filesystem path to the statistic file.
@item @var{<normalizer>}
@tab Defines the weight normalization scheme. Can be a lua function name or the
internal normalizer. The internal normalizer is specified in the format
"internal:<max_weight>", where max_weight is a fractional number that limits the
maximum weight of this statfile's symbol (the so-called dynamic weight).
@item @var{<binlog>}
@tab Defines binlog affinity: master or slave. This option is used for the
binary synchronization of statfiles, which is described later.
@item @var{<binlog_master>}
@tab Defines the credentials of the binlog master for this statfile.
@item @var{<binlog_rotate>}
@tab Defines the rotation time for the binlog.
@end multitable

Internal normalization of the statfile weight works in this way:
@itemize @bullet
@item @math{R_{score} = 1} when @math{W_{statfile} < 1}
@item @math{R_{score} = W_{statfile}^2} when @math{1 < W_{statfile} < max / 2}
@item @math{R_{score} = W_{statfile}} when @math{max / 2 < W_{statfile} < max}
@item @math{R_{score} = max} when @math{W_{statfile} > max}
@end itemize

The final result weight is: @math{weight = R_{score} * W_{factor}}.
Here is a sample classifier configuration with two statfiles that can be used for
spam/ham classifying:

@example
<factors>
  <factor name="WINNOW_HAM">-1.00</factor>
  <factor name="WINNOW_SPAM">1.00</factor>
  ...
</factors>

<!-- Classifiers section -->
<classifier type="winnow">
  <tokenizer>osb-text</tokenizer>
  <metric>default</metric>
  <option name="min_tokens">20</option>
  <statfile>
    <symbol>WINNOW_HAM</symbol>
    <size>100M</size>
    <path>/var/run/rspamd/data.ham</path>
    <normalizer>internal:3</normalizer>
  </statfile>
  <statfile>
    <symbol>WINNOW_SPAM</symbol>
    <size>100M</size>
    <path>/var/run/rspamd/data.spam</path>
    <normalizer>internal:3</normalizer>
  </statfile>
</classifier>
<!-- End of classifiers section -->
@end example
@noindent
In this sample we define a classifier that contains two statfiles:
@emph{WINNOW_SPAM} and @emph{WINNOW_HAM}. Each statfile is 100 megabytes in size
(so together they occupy 200 megabytes while classifying). Each statfile also has
a maximum weight of 3, so with these factors (-1 for WINNOW_HAM and 1 for
WINNOW_SPAM) the resulting weight of the symbols is 0..3 for @emph{WINNOW_SPAM}
and 0..-3 for @emph{WINNOW_HAM}.

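
The internal normalization rules and the final weight formula described in this
section can be sketched in Python as follows. This is an illustration only, not
rspamd's actual code: the function names are hypothetical, and the handling of
the boundary values (exactly 1, max / 2 and max), which the rules leave
unspecified, is an assumption.
@example
def internal_normalize(weight, max_weight):
    """Piecewise normalization of a raw statfile weight (R_score)."""
    if weight < 1:
        return 1.0
    if weight < max_weight / 2.0:
        # Assumption: a weight of exactly 1 falls into the squared branch.
        return float(weight) ** 2
    if weight < max_weight:
        # Assumption: a weight of exactly max / 2 is passed through unchanged.
        return float(weight)
    # Assumption: a weight of exactly max is treated like any larger weight.
    return float(max_weight)


def final_weight(weight, max_weight, factor):
    """weight = R_score * W_factor, as in the formula of this section."""
    return internal_normalize(weight, max_weight) * factor


# The sample config uses internal:3 with factors 1.00 and -1.00.
print(final_weight(10.0, 3.0, 1.00))   # a raw weight above max is capped
print(final_weight(2.0, 3.0, -1.00))   # the ham factor makes the result negative
@end example
@noindent
Keeping the normalizer separate from the factor multiplication mirrors the text:
the normalizer belongs to the statfile, while the factor belongs to the symbol.
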
@section Modules config.

@subsection Lua modules loading.
For loading custom lua modules you should use the @emph{<modules>} section:
@example
<modules>
  <module>/usr/local/etc/rspamd/plugins/lua</module>
</modules>
@end example
@noindent
Each @emph{<module>} directive defines a path to lua modules. If the path is a
directory, all @code{*.lua} files inside that directory are loaded. If
it is a file, it is loaded directly.

@subsection Modules configuration.
Each module can have its own config section (this is true not only for internal
modules but also for lua modules). Such a section is called @emph{<module>} and
has a mandatory @emph{"name"} attribute. Each module is configured by
@emph{<option>} directives, which must also have a @emph{"name"}
attribute. So module configuration is done in @code{param = value} style:
@example
<module name="fuzzy_check">
  <option name="servers">localhost:11335</option>
  <option name="symbol">R_FUZZY</option>
  <option name="min_length">300</option>
  <option name="max_score">10</option>
  <option name="metric">default</option>
</module>
@end example
@noindent
The common parameters are:
@itemize @bullet
@item symbol - the symbol that this module should insert.
@item metric - the metric in which this module should work.
@end itemize
Each module can also have its own unique parameters; these are discussed
later in the detailed module descriptions. Also note that for internal modules
you should edit the @emph{<filters>} parameter in the main section: this
parameter defines which internal modules are turned on in this configuration.

@chapter Rspamd clients interaction.

@section Introduction.
Once you have a basic config file, you can test rspamd functionality using
either a telnet-like utility or the @emph{rspamc} client. To test a newly
installed config, you can run the config file test:
@example
$ rspamd -t
syntax OK
@end example

The rspamc utility is written in @code{perl} and uses perl modules that are
shipped with rspamd: @emph{Mail::Rspamd::Client} for the client protocol and
@emph{Mail::Rspamd::Config} for reading and writing configuration. The
documentation for these modules can be displayed with the commands:
@example
$ perldoc Mail::Rspamd::Client
$ perldoc Mail::Rspamd::Config
@end example

Another way to access rspamd is to use the perl client API:
@example
use Mail::Rspamd::Client;

# Connection settings are passed as a hash reference
my $config = @{
    hosts => ['localhost:11333'],
@};

my $client = Mail::Rspamd::Client->new($config);

if (! $client->ping()) @{
    die "Cannot ping rspamd: $client->@{error@}";
@}

# $testmsg must already contain the raw text of the message to check
my $result = $client->check($testmsg);

if ($result->@{'default'@}->@{isspam@} eq 'True') @{
    # do something with spam message here
@}
@end example

@section Rspamc protocol.
The rspamc protocol is an extension of the traditional spamc protocol used by
spamassassin. It looks like a traditional HTTP session: the first line is the
method with the protocol version, headers follow on the next lines, and the
message itself is expected after an empty line:
@example
<REQUEST>
SYMBOLS RSPAMC/1.1
Content-Length: 2200

<message octets>

<REPLY>
RSPAMD/1.1 0 OK
Metric: default; True; 10.40 / 10.00 / 0.00
Symbol: R_UNDISC_RCPT
Symbol: ONCE_RECEIVED
Symbol: R_MISSING_CHARSET
Urls:
@end example
@noindent
The format of the method line is:
@example
<COMMAND> RSPAMC/<version>
@end example
@noindent
The version can be 1.0 or 1.1. The main difference is that in 1.1 the metrics
output also includes the @emph{reject score}, the hard score limit for a
metric. This is discussed in the description of user's options. The commands are:
@multitable @columnfractions .2 .8
@headitem Command @tab Meaning
@item CHECK
@tab Check a message and output the results for each metric, but do not output
symbols.
@item SYMBOLS
@tab Same as @emph{CHECK}, but also output symbols.
@item PROCESS
@tab Same as @emph{SYMBOLS}, but also output the original message with inserted
X-Spam headers.
@item PING
@tab Do not do any processing, just check rspamd state:
@example
$ telnet localhost 11333
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
PING RSPAMC/1.1

RSPAMD/1.1 0 PONG
Connection closed by foreign host.
@end example
@noindent
@end multitable

After the command there must be one mandatory header, @strong{Content-Length},
which defines the message's length in bytes, followed by optional headers:
@multitable @columnfractions .2 .8
@headitem Header @tab Meaning
@item @var{Deliver-To:}
@tab Defines the actual delivery recipient of the message. Can be used for
personalized statistics and for user specific options.
@item @var{IP:}
@tab Defines the IP address from which this message was received.
@item @var{Helo:}
@tab Defines the SMTP helo.
@item @var{From:}
@tab Defines the data of the SMTP mail from command.
@item @var{Queue-Id:}
@tab Defines the SMTP queue id of the message (can be used instead of the
message id in logging).
@item @var{Rcpt:}
@tab Defines an SMTP recipient (there may be several @emph{Rcpt:} headers).
@item @var{Pass:}
@tab If this header has the value @emph{"all"}, all filters are checked for
this message.
@item @var{Subject:}
@tab Defines the subject of the message (used for non-MIME messages).
@item @var{User:}
@tab Defines the SMTP user (currently unused in rspamd, however).
@end multitable
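
The request layout described above (method line, mandatory
@strong{Content-Length} header, optional headers, empty line, message) can be
sketched in Python as follows. This is a minimal illustration, not the rspamc
client: the function name @code{build_request} is hypothetical, and the CRLF
line endings and the placement of optional headers after
@strong{Content-Length} are assumptions, since the text above does not specify
them.
@example
def build_request(command, message, headers=None, version="1.1"):
    """Build the raw bytes of one rspamc request.

    message is the message as bytes; headers is an optional list of
    (name, value) pairs, so that a header such as Rcpt can be repeated.
    """
    lines = ["%s RSPAMC/%s" % (command, version)]
    # Content-Length is the only mandatory header: message length in bytes.
    lines.append("Content-Length: %d" % len(message))
    for name, value in (headers or []):
        lines.append("%s: %s" % (name, value))
    # Assumption: CRLF line endings, with an empty line before the message.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + message


msg = b"Subject: test\r\n\r\nhello\r\n"
request = build_request("SYMBOLS", msg,
                        [("IP", "127.0.0.1"), ("Helo", "mail.example.com")])
print(request.decode("ascii"))
@end example
@noindent
The resulting bytes could then be written to a socket connected to a normal
worker, for example the @code{localhost:11333} address used in the telnet
session above.
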
So the rspamc protocol allows a lot of data to be passed from the MTA to rspamd.
This is used to increase processing speed and for building filters (like the SPF
filter). Also note that rspamd supports the spamassassin spamc protocol, and you
can even pass rspamc headers in spamc mode, but the reply of rspamd in spamc mode
is much shorter: it only uses the "default" metric and does not show additional
options for symbols. An rspamc reply looks like this:
@example
RSPAMD/1.1 0 OK
Metric: default; True; 10.40 / 10.00 / 0.00
Symbol: R_UNDISC_RCPT
Symbol: ONCE_RECEIVED
Symbol: R_MISSING_CHARSET
Urls:
@end example
@noindent
The first line is the method reply: @code{<PROTOCOL>/<VERSION> <ERROR_CODE> <ERROR_REPLY>}.
The error code is 0 when no error occurred. The first reply line is followed by
the metrics output. For the @emph{SYMBOLS} and @emph{PROCESS} commands there are
symbol lines after each metric, and for the @emph{PROCESS} command the original
message follows all metric results. A metric result line looks like this:
@example
Metric: <name>; <result>; <score> / <required_score> / <reject_score>
@end example
@noindent
For version 1.0 of the rspamc protocol the @emph{reject_score} parameter is not
printed. A symbol line looks like this:
@example
Symbol: <Name>[; param1[, param2...]]
@end example
@noindent
Some symbols can have parameters attached. This is useful, for example, for RBL
checks (you can insert additional data after the symbol name), for statistics
and for fuzzy checks. Rspamd also inserts a @emph{Urls} line, in which all urls
contained in the message are printed as a comma-separated list.
Note that this protocol is used by normal workers. The controller, fuzzy storage
and lmtp/smtp workers use other protocols. For example, the controller's
protocol is oriented towards interactive sessions: you can pass many commands to
the controller before disconnecting. The fuzzy storage uses UDP to make
interaction with the storage faster. LMTP/SMTP workers use the lmtp and smtp
protocols. All of these protocols are described in later chapters about
rspamd workers.

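
The reply format described in this section can be parsed with a short Python
sketch such as the one below. It is only an illustration under stated
assumptions: the function name @code{parse_reply} and the layout of the returned
dictionary are hypothetical, and the sketch covers replies to @emph{CHECK} and
@emph{SYMBOLS} only (it does not handle the original message that follows the
metrics in a @emph{PROCESS} reply).
@example
def parse_reply(text):
    """Parse an rspamc reply into a dictionary (sketch, see assumptions)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # First line: <PROTOCOL>/<VERSION> <ERROR_CODE> <ERROR_REPLY>
    proto, code, status = lines[0].split(None, 2)
    reply = dict(protocol=proto, code=int(code), status=status,
                 metrics=[], urls=[])
    for line in lines[1:]:
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "Metric":
            name, result, scores = [p.strip() for p in value.split(";")]
            nums = [float(s) for s in scores.split("/")]
            reply["metrics"].append(dict(
                name=name, is_spam=(result == "True"),
                score=nums[0], required=nums[1],
                # reject_score is not printed by protocol version 1.0
                reject=nums[2] if len(nums) > 2 else None,
                symbols=[]))
        elif key == "Symbol" and reply["metrics"]:
            # Symbol: <Name>[; param1[, param2...]]
            sym, _, params = value.partition(";")
            opts = [p.strip() for p in params.split(",") if p.strip()]
            reply["metrics"][-1]["symbols"].append((sym.strip(), opts))
        elif key == "Urls":
            reply["urls"] = [u.strip() for u in value.split(",") if u.strip()]
    return reply


sample = ("RSPAMD/1.1 0 OK\r\n"
          "Metric: default; True; 10.40 / 10.00 / 0.00\r\n"
          "Symbol: R_UNDISC_RCPT\r\n"
          "Symbol: ONCE_RECEIVED\r\n"
          "Symbol: R_MISSING_CHARSET\r\n"
          "Urls: \r\n")
parsed = parse_reply(sample)
print(parsed["metrics"][0]["name"], parsed["metrics"][0]["is_spam"])
@end example
@noindent
Symbols are attached to the most recently seen metric, following the rule that
symbol lines come after each metric line.
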
@bye