

								\input texinfo

								@settitle "Rspamd Spam Filtering System"

								@titlepage


								@title Rspamd Spam Filtering System

								@subtitle A User's Guide for Rspamd


								@author Vsevolod Stakhov


								@end titlepage

								@contents


								@chapter Rspamd purposes and features.


								@node introduction

								@section Introduction.

The Rspamd filtering system was created as a replacement for the popular
@code{spamassassin} spamd and is designed to be a fast, modular and easily
extensible system. The Rspamd core is written in @code{C} using an
event-driven paradigm. Plugins for rspamd can be written in @code{lua}. Rspamd
is designed to process connections completely asynchronously and never to
block anywhere in the code. The spam filtering system consists of several
processes, among them:

								@itemize @bullet

								@item Main process

@item Worker processes

								@item Controller process

								@item Other processes

								@end itemize

The main process manages all other processes: it accepts signals from the OS
(for example SIGHUP) and respawns any type of process if one of them dies.
Worker processes do all the tasks of filtering e-mail (or HTML messages when
rspamd is used as a non-MIME filter). The controller process is designed to
manage rspamd itself (for example, to get statistics or to teach rspamd).
Other processes can do different jobs; currently implemented are an
@code{LMTP} worker, which implements the @code{LMTP} protocol for filtering
mail, and a fuzzy hash storage server.


								@node features

								@section Features.

								The main features of rspamd are:

								@itemize @bullet

@item Completely asynchronous filtering that allows a large number of
simultaneous connections.
@item An easily extensible architecture that can be extended by plugins
written in @code{lua} and by dynamically loaded plugins written in @code{C}.
@item Ability to work in a cluster: rspamd can synchronize statfiles, load
lists dynamically via HTTP and use a distributed fuzzy hash storage.
@item Advanced statistics: rspamd is currently shipped with a winnow-osb
classifier that provides more accurate statistics than traditional Bayesian
algorithms based on single words.
@item Internal optimizer: rspamd first checks the rules that matched most
often, so during huge spam storms it works very fast, as it checks only those
rules that @emph{can} match and skips all others.
@item Ability to manage the whole cluster by using the controller process.
@item Compatibility with the existing @code{spamassassin} SPAMC protocol.
@item An extended @code{RSPAMC} protocol that allows passing additional data
from the SMTP dialog to the rspamd filter.
@item Internal support of IMAP in the rspamc client for automated learning.
@item Internal support of many anti-spam technologies, among them
@code{SPF} and @code{SURBL}.

								@item Active support and development of new features.

								@end itemize


								@chapter Installation of rspamd.


								@node obtaining

								@section Obtaining of rspamd.


The main rspamd site is @url{http://rspamd.sourceforge.net/, sourceforge}.
There you can obtain the source code package as well as prebuilt packages for
different operating systems and architectures. You can also use the
@url{http://mercurial.selenic.com, mercurial} SCM to access the rspamd
development repository, which can be found here:
@url{http://rspamd.hg.sourceforge.net:8000/hgroot/rspamd/rspamd}. Rspamd is
shipped with all modules and a sample config by default. However, there are
some requirements for building and running rspamd.


								@node requirements

								@section Requirements.


To build rspamd from sources you need the @code{CMake} build system. CMake is
a very nice build system and I decided to use it instead of GNU autotools.
CMake can be obtained here: @url{http://cmake.org}. Rspamd also uses gmime and
glib for MIME parsing and many other purposes (note that you are NOT required
to install any GUI libraries: neither glib nor gmime is a GUI library). Gmime
and glib can be obtained from the GNOME site: @url{http://ftp.gnome.org/}. For
plugins and the configuration system you also need the lua language
interpreter and libraries. They can be obtained from the
@url{http://lua.org, official lua site}. For the rspamc client you also need a
@code{perl} interpreter, which can be installed from
@url{http://www.perl.org}.


								@node building

								@section Building and Installation.


The build process of rspamd is rather simple:

								@itemize @bullet

								@item Configure rspamd build environment, using cmake:

								@example

								$ cmake .

								...

								-- Configuring done

								-- Generating done

								-- Build files have been written to: /home/cebka/rspamd

								@end example

								@noindent

For special configuration options you can use

								@example

								$ ccmake .

								 CMAKE_BUILD_TYPE

								 CMAKE_INSTALL_PREFIX             /usr/local

								 DEBUG_MODE                       ON

								 ENABLE_GPERF_TOOLS               OFF

								 ENABLE_OPTIMIZATION              OFF

								 ENABLE_PERL                      OFF

								 ENABLE_PROFILING                 OFF

								 ENABLE_REDIRECTOR                OFF

								 ENABLE_STATIC                    OFF

								@end example

								@noindent

These options allow building rspamd as a static module (note that in this
case dynamically loaded plugins are @strong{NOT} supported), linking rspamd
with google performance tools for benchmarking, and setting some other build
flags.

								@item Build rspamd sources:

								@example

								$ make

								[  6%] Built target rspamd_lua

								[ 11%] Built target rspamd_json

								[ 12%] Built target rspamd_evdns

								[ 12%] Built target perlmodule

								[ 58%] Built target rspamd

								[ 76%] Built target test/rspamd-test

								[ 85%] Built target utils/expression-parser

								[ 94%] Built target utils/url-extracter

								[ 97%] Built target rspamd_ipmark

								[100%] Built target rspamd_regmark

								@end example

								@noindent

								@item Install rspamd (as superuser):

								@example

								# make install

								Install the project...

								...

								@end example

								@noindent

								@end itemize


After installation you will have several new files installed:

								@itemize @bullet


								@item Binaries:

								@itemize @bullet

								@item PREFIX/bin/rspamd - main rspamd executable

								@item PREFIX/bin/rspamc - rspamd client program

								@end itemize

								@item Sample configuration files and rules:

								@itemize @bullet

								@item PREFIX/etc/rspamd.xml.sample - sample main config file

								@item PREFIX/etc/rspamd/lua/*.lua - rspamd rules

								@end itemize

								@item Lua plugins:

								@itemize @bullet

								@item PREFIX/etc/rspamd/plugins/lua/*.lua - lua plugins

								@end itemize


								@end itemize

On @code{FreeBSD} systems a start script for running rspamd is also installed
as @emph{PREFIX/etc/rc.d/rspamd.sh}.


								@node running

								@section Running rspamd.


Rspamd can be started by running the main rspamd executable,
@code{PREFIX/bin/rspamd}. There are several command-line options that can be
passed to rspamd. All of them can be displayed by passing the @option{--help}
argument:

								@example

								$ rspamd --help

								Usage:

								  rspamd [OPTION...] - run rspamd daemon


								Summary:

								  Rspamd daemon version 0.3.0


								Help Options:

								  -?, --help               Show help options


								Application Options:

								  -t, --config-test        Do config test and exit

								  -f, --no-fork            Do not daemonize main process

								  -c, --config             Specify config file

								  -u, --user               User to run rspamd as

								  -g, --group              Group to run rspamd as

								  -p, --pid                Path to pidfile

								  -V, --dump-vars          Print all rspamd variables and exit

								  -C, --dump-cache         Dump symbols cache stats and exit

								  -X, --convert-config     Convert old style of config to xml one

								@end example

								@noindent


All options are optional: by default rspamd tries to read the
@code{PREFIX/etc/rspamd.xml} config file and runs as a daemon. There is also a
test mode that can be turned on by passing the @option{-t} argument. In test
mode rspamd reads the config file and checks its syntax; if the config file is
OK the exit code is zero, otherwise it is non-zero. Test mode is useful for
testing a new config file without restarting rspamd. With the @option{-V} and
@option{-C} arguments it is possible to dump variables or symbols cache data.
The latter can be used to determine which symbols are the most frequent, which
are the slowest, and to see the real order of rules inside rspamd. The
@option{-X} option can be used to convert an old style (pre 0.3.0) config to
an XML one:

								@example

								$ rspamd -c ./rspamd.conf -X ./rspamd.xml

								@end example

								@noindent

After this command the new XML config is written to the rspamd.xml file.


								@node signals

								@section Managing rspamd with signals.

First of all it is important to note that all user signals should be sent to
the rspamd main process and not to its children (for child processes these
signals may have other meanings). There are two ways to determine which
process is the main one:

								@itemize @bullet

								@item by reading pidfile:

								@example

								$ cat pidfile

								@end example

								@noindent

								@item by getting process info:

								@example

								$ ps auxwww | grep rspamd

								nobody 28378  0.0  0.2 49744  9424   rspamd: main process (rspamd)

								nobody 64082  0.0  0.2 50784  9520   rspamd: worker process (rspamd)

								nobody 64083  0.0  0.3 51792 11036   rspamd: worker process (rspamd)

								nobody 64084  0.0  2.7 158288 114200 rspamd: controller process (rspamd)

								nobody 64085  0.0  1.8 116304 75228  rspamd: fuzzy storage (rspamd)


								$ ps auxwww | grep rspamd | grep main

								nobody 28378  0.0  0.2 49744  9424   rspamd: main process (rspamd)

								@end example

								@noindent

								@end itemize


Once you have the pid of the main process it is possible to manage rspamd with
signals:
@itemize @bullet
@item SIGHUP - restart rspamd: reread the config file, start new workers (as
well as the controller and other processes), stop accepting connections in the
old workers, and reopen all log files. Note that old workers are terminated
after one minute, which should be enough to process all pending requests. All
new requests to rspamd are processed by the newly started workers.

								@item SIGTERM - terminate rspamd system.

								@end itemize


These signals may be used in start scripts, as is done in the @code{FreeBSD}
start script. Restarting rspamd is done softly: no connections are dropped,
and if the new config is syntactically incorrect the old config is kept.


								@chapter Configuring of rspamd.


								@node principles

								@section Principles of work.


We need to define several terms to explain the configuration of rspamd. Rspamd
operates with @strong{rules}; each rule defines some actions that should be
performed on a message to obtain a result. The result is called a
@strong{symbol}: a symbolic representation of the rule. For example, if we
have a rule that checks the DNS record for a URL contained in the message, we
may insert the resulting symbol if this DNS record is found. Each symbol has
several attributes:

								@itemize @bullet

@item name - symbolic name of the symbol (usually uppercase, e.g. MIME_HTML_ONLY)
@item weight - numeric weight of this symbol (how important this rule is); may
be negative
@item options - list of symbolic options that provide additional information
about the processing of this rule

								@end itemize


Weights of symbols are called @strong{factors}. When a symbol is inserted it
is also possible to define an additional multiplier for its factor. This can
be used for rules that have dynamic weights, for example statistical rules
(the higher the probability, the higher the weight must be).

All symbols and their corresponding rules are combined in @strong{metrics}. A
metric defines a group of symbols that serve a common purpose. Each metric has
a maximum weight: if the sum of all rule results (symbols) is bigger than this
limit, the message is considered spam in this metric. The default metric is
called @emph{default}, and rules that do not explicitly specify a metric
insert their results into this default metric.


Let us illustrate how this technique works:
@enumerate 1
@item First of all, when rspamd starts, each module (lua, internal or external
dynamic module) can register symbols in any defined metric. After this step
rspamd has a cache of symbols for each metric. This cache can be saved to a
file to speed up the optimization of the order in which symbols are called.
@item Rspamd gets a message from a client, parses it with the MIME parser and
does other parsing jobs such as extracting text parts and URLs and stripping
HTML tags.
@item For each metric rspamd looks into the metric's cache and selects rules to
check according to their order (this order depends on the frequency of the
symbol, its weight and its execution time).
@item Rspamd calls the rules of the metric as long as the summed weight of the
symbols in the metric is less than its limit.
@item If the summed weight of the symbols is more than the limit, the
processing of rules is stopped and the message is counted as spam in this
metric.

								@end enumerate
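
As a worked illustration of the last two steps, assume a metric with a limit
of 10 and three hypothetical symbols (the names and weights below are invented
for this example and are not part of the default rules), checked in this order:

@example
RULE_A   weight 4.0   running sum  4.0   (below limit, continue)
RULE_B   weight 5.0   running sum  9.0   (below limit, continue)
RULE_C   weight 3.5   running sum 12.5   (above limit, stop)
@end example

@noindent
After RULE_C the sum exceeds the limit, so any remaining rules of the metric
are not called and the message is counted as spam in this metric.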


After processing rules rspamd also performs a statistical check of the
message. The rspamd statistics module is presented as a set of
@strong{classifiers}. Each classifier defines an algorithm for the statistical
checking of messages. A classifier definition also contains definitions of
@strong{statistic files} (or @strong{statfiles} for short). Each statfile
contains a number of patterns extracted from messages. These patterns are put
into statfiles during the learning process. A short example: you define a
classifier that contains two statfiles, @emph{ham} and @emph{spam}. Then you
find 10000 messages that are spam and 10000 messages that are ham, and you
teach rspamd with these messages. After this process the @emph{ham} statfile
contains patterns from ham messages and the @emph{spam} statfile contains
patterns from spam messages. When you then check a message against these
statfiles, a message that looks like spam has a higher probability/weight in
the @emph{spam} statfile than in the @emph{ham} statfile, so the classifier
inserts the symbol of the @emph{spam} statfile and calculates how similar the
message is to the patterns contained in the @emph{spam} statfile. Rspamd does
not limit you to one classifier or two statfiles, however. It is possible to
define any number of classifiers and any number of statfiles inside a
classifier. This can be useful for personal statistics or for specific spam
patterns. Note that each classifier can insert only one symbol: the symbol of
the statfile with the maximum weight/probability. Also note that the statfile
check is always done after all rules, so statistics can @strong{correct} the
result of the rules.


Now some words about @strong{modules}. All rspamd rules are contained in
modules. Modules can be internal (like SURBL, SPF, fuzzy check, email and
others) or external, written in the @code{lua} language. In fact there is no
difference in the way the rules of these modules are called:

								@enumerate 1

								@item Rspamd loads config and loads specified modules.

@item Rspamd calls the init function of each module, passing it the
configuration arguments.
@item Each module examines the configuration arguments and registers its rules
(or does not, depending on the configuration) in rspamd metrics (or in a
single metric).
@item While processing metrics rspamd calls the registered callbacks for the
module's rules.
@item These rules may insert results into the metric.

								@end enumerate


So there is no actual difference between lua and internal modules: both just
provide callbacks for processing messages. Inside a callback it is also
possible to change the state of message processing. For example, this can be
done when it is necessary to make a DNS or other network request and to wait
for the result. So modules can pause the processing of a message while waiting
for some event. This is true for lua modules as well.


								@node config structure

								@section Rspamd config file structure.


The rspamd config file is placed in PREFIX/etc/rspamd.xml by default. You can
specify another location by passing the @option{-c} option to rspamd. The
rspamd config file contains configuration parameters in XML format. XML was
selected because it is rather simple to edit manually and simple to generate
automatically, and because it suits dynamic configuration. I've decided to
move the rules logic out of the XML file to keep it small and simple. So rules
are defined in the @code{lua} language and rspamd parameters are defined in
the XML file (rspamd.xml). Configuration rules are included with the
@strong{<lua>} tag, which has a @strong{src} attribute that defines the
relative path to the lua file (relative to the location of rspamd.xml):

								@example

								<lua src="rspamd/lua/rspamd.lua">fake</lua>

								@end example

								@noindent

Note that it is not currently possible to have empty tags. I hope this
restriction will be lifted in the future. The rspamd XML config consists of
several sections:

								@itemize @bullet

								@item Main section - section where main config parameters are placed.

								@item Workers section - section where workers are described.

@item Classifiers section - section where you define your classification logic.
@item Modules section - a set of sections that describe the rules of modules
(in fact these rules should be in lua code).

								@item Factors section - a section where you can set numeric values for symbols

								@item Logging section - a section that describes rspamd logging

								@item Views section - a section that defines rspamd views

								@end itemize


So the common structure of rspamd.xml can be described this way:

								@example

								<?xml version="1.0" encoding="utf-8"?>

								<rspamd>

								 <!-- Main section directives -->

								 ...

								 <!-- Workers directives -->

								 <worker>

								  ...

								 </worker>

								 ...

								 <!-- Classifiers directives -->

								 <classifier>

								  ...

								 </classifier>

								 ...

								 <!-- Factors -->

								 <factors>

								  <factor name="MIME_HTML_ONLY">1.1</factor>

								  ...

								 </factors>

								 <!-- Logging section -->

								 <logging>

								  <type>console</type>

								  <level>info</level>

								  ...

								 </logging>

								 <!-- Views section -->

								 <view>

								  ...

								 </view>

								 ...

								 <!-- Modules settings -->

								 <module name="regexp">

								  <option name="test">test</option>

								  ...

								 </module>

								 ...

								</rspamd>

								@end example


Each of these sections is described in detail below.


@node config atoms

@section Rspamd configuration atoms.


								There are several primitive types of rspamd configuration parameters:

								@itemize @bullet

								@item String - common string that defines option.

								@item Number - integer or fractional number (e.g.: 10 or -1.5).

@item Time - amount of time in milliseconds; may have suffixes:

								@itemize @bullet

								@item @emph{s} - for seconds (e.g. @emph{10s});

								@item @emph{m} - for minutes (e.g. @emph{10m});

								@item @emph{h} - for hours (e.g. @emph{10h});

								@item @emph{d} - for days (e.g. @emph{10d});

								@end itemize

@item Size - a numeric representation of size, like Number, but may have a suffix:

								@itemize @bullet

								@item @emph{k} - 'kilo' - number * 1024 (e.g. @emph{10k});

								@item @emph{m} - 'mega' - number * 1024 * 1024 (e.g. @emph{10m});

								@item @emph{g} - 'giga' - number * 1024 * 1024 * 1024 (e.g. @emph{1g});

								@end itemize

								@noindent

								Size atoms are used for memory limits for example.

								@item Lists - path to dynamic rspamd list (e.g. @emph{http://some.host/some/path}).

								@end itemize
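
As a quick check of the suffix arithmetic above (a worked example that follows
directly from the definitions in the list; the values are arbitrary):

@example
Time (base unit is milliseconds):
10s = 10 * 1000              =     10000 ms
10m = 10 * 60 * 1000         =    600000 ms
10h = 10 * 60 * 60 * 1000    =  36000000 ms
10d = 10 * 24 * 3600 * 1000  = 864000000 ms

Size:
10k = 10 * 1024              =      10240
10m = 10 * 1024 * 1024       =   10485760
1g  = 1 * 1024 * 1024 * 1024 = 1073741824
@end example

@noindent
Note that the suffix @emph{m} means minutes in a Time atom but 'mega' in a Size
atom, so its meaning depends on the type of the parameter.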


While practically all atoms are rather trivial to understand, rspamd lists may
cause some confusion. Lists are widely used in rspamd for data that can change
often, for example white or black lists, lists of IP addresses and lists of
domains. For such purposes it is possible to use files that are fetched either
from the local filesystem (e.g. @code{file:///var/run/rspamd/whitelist}) or
over HTTP (e.g. @code{http://some.host/some/path/list.txt}). Rspamd constantly
watches these files for changes; when using HTTP it also sets the
@emph{If-Modified-Since} header and checks for a @emph{Not modified} reply.
This causes no overhead when lists are not modified, and makes it possible to
store huge lists and to distribute them over HTTP. Lists are monitored with
some random delay (jitter), so if you have many rspamd servers in a cluster
that monitor a single list, they check or download it at slightly different
times. The two most common list formats are the @emph{IP list} and the
@emph{domains list}. An IP list consists of IP addresses in dot notation (e.g.
@code{192.168.1.1}) or ip/network pairs in CIDR notation (e.g.
@code{172.16.0.0/16}). Items in lists are separated by newlines. Lines that
begin with the @emph{#} symbol are considered comments and are ignored while
parsing. A domains list is very much like an IP list, with the difference that
it contains domain names.
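
For illustration, a small IP list in the format just described might look like
this (the addresses and comments are arbitrary examples, not a recommended
whitelist):

@example
# office network
192.168.1.1
# all hosts of an internal range, in CIDR notation
172.16.0.0/16
@end example

@noindent
A domains list has the same layout, with one domain name per line instead of
an address.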


								@section Main rspamd configuration section.


The main rspamd configuration section contains several definitions that
determine the main parameters of rspamd, for example the path to the pidfile,
the temporary directory, lua includes, several limits, etc. These directives
are explained below:


								@multitable @columnfractions .2 .8

@headitem Tag @tab Meaning

@item @var{<tempdir>}
@tab Defines the temporary directory for rspamd. The default is to use the
@env{TEMP} environment variable or @code{/tmp}.

@item @var{<pidfile>}
@tab Path to the rspamd pidfile, where the pid of the main process is stored.
The pidfile is used to manage rspamd from start scripts.

@item @var{<statfile_pool_size>}
@tab Limit of the statfile pool size: the total number of bytes that can be
used for mapping statistic files. Rspamd uses an LRU scheme and unmaps the
least recently used statfile when this limit is reached. It is sensible to set
this variable equal to the total size of all statfiles, but it can be smaller
in the case of dynamic statfiles (for per-user statistics).

@item @var{<filters>}
@tab List of enabled internal filters. Items in this list can be separated by
spaces, semicolons or commas. If an internal filter is not specified in this
line it is not loaded or enabled.

@item @var{<raw_mode>}
@tab Boolean flag that specifies whether rspamd should try to convert all
messages to UTF-8 or not. If @var{raw_mode} is enabled all messages are
processed @emph{as is} and are not converted. Raw mode is faster than UTF mode
but it may confuse statistics and regular expressions.

@item @var{<lua>}
@tab Defines the path to a lua file that should be loaded for configuration.
The path to this file is defined in the @strong{src} attribute. Text inside
the tag is required but is not parsed (this is a stupid limitation of the
parser's design).

								@end multitable
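
Put together, the main section directives might look like the fragment below.
This is only a sketch: the paths, the pool size and the filter names are
placeholders chosen for illustration, and the @code{no} value for
@var{<raw_mode>} assumes that boolean flags are spelled the same way as the
"no" value this guide mentions for the @var{<mime>} worker parameter; check
rspamd.xml.sample for the values used by your version.

@example
<!-- Main section directives (illustrative values) -->
<tempdir>/tmp</tempdir>
<pidfile>/var/run/rspamd.pid</pidfile>
<!-- Size atom: 40 * 1024 * 1024 bytes -->
<statfile_pool_size>40m</statfile_pool_size>
<!-- Only the filters listed here are loaded (names are placeholders) -->
<filters>surbl,regexp</filters>
<raw_mode>no</raw_mode>
<lua src="rspamd/lua/rspamd.lua">fake</lua>
@end example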


								@section Rspamd logging configuration.


Rspamd has a number of logging variants. First of all, three types of logs are
supported by rspamd: console logging (log messages are just written to the
console), file logging (log messages are written to a file) and logging via
syslog. It is also possible to restrict logging to a specific level:

								@itemize @bullet

								@item error - log only critical errors

								@item warning - log errors and warnings

								@item info - log all non-debug messages

								@item debug - log all including debug messages (huge amount of logging)

								@end itemize

It is also possible to turn on debug messages for specific IP addresses. This
ability is useful for testing.

Each logging type has its own mandatory parameter: the log facility for
syslog (read the @emph{syslog (3)} manual page for details about facilities)
and the log file for file logging. File logging may also be buffered for
speed. To reduce logging noise rspamd detects sequential identical log
messages and replaces them with the total number of repeats:

								@example

								#81123(fuzzy): May 11 19:41:54 rspamd file_log_function: Last message repeated 155 times

								#81123(fuzzy): May 11 19:41:54 rspamd process_write_command: fuzzy hash was successfully added

								@end example


Here is a summary of the logging parameters:


								@multitable @columnfractions .2 .8

@headitem Tag @tab Meaning
@item @var{<type>}
@tab Defines the logging type (file, console or syslog). For the file and
syslog types a mandatory attribute must be present:
@itemize @bullet
@item @emph{filename} - path to the log file for the file logging type;
@item @emph{facility} - syslog logging facility.
@end itemize

@item @var{<level>}
@tab Defines the logging level (error, warning, info or debug).

@item @var{<log_buffer>}
@tab For file and console logging, defines the size of the buffer in bytes
(or kilo-, mega- or gigabytes) that is used for logging output.

@item @var{<log_urls>}
@tab Flag that defines whether all URLs in a message are logged. Useful for
testing.

@item @var{<debug_ip>}
@tab List that contains the IP addresses for which debugging is turned on. For
more information about IP lists see @ref{config atoms}.

								@end multitable
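
As an illustration, a file logging setup could be written as follows. This is
a sketch under assumptions: the log path, buffer size and list path are
placeholders, and the @emph{filename} attribute is assumed to be attached to
the @var{<type>} tag itself.

@example
<logging>
 <!-- assumed placement of the mandatory filename attribute -->
 <type filename="/var/log/rspamd.log">file</type>
 <level>info</level>
 <!-- Size atom: buffer of 8 * 1024 bytes -->
 <log_buffer>8k</log_buffer>
 <!-- List atom: IP list of clients to debug -->
 <debug_ip>file:///etc/rspamd/debug_ip.list</debug_ip>
</logging>
@end example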


								@section Factors configuration.


Setting rspamd factors is the main way to change the weights of rules. Factors
set up weights for all rules: for those that have static weights (for example
simple regexp rules) and for those that have dynamic weights (for example
statistical rules). In all cases the base weight of the rule is multiplied by
the factor value. For static rules the base weight is usually 1.0. So we have:

								@itemize @bullet

								@item @math{w_{symbol} = w_{static} * factor} - for static rules

								@item @math{w_{symbol} = w_{dynamic} * factor} - for dynamic rules

								@end itemize

There is also the ability to add a so-called "grow factor": an additional
multiplier that is used when we have more than one symbol in a metric. For
each added symbol the power of this factor is incremented. This can be written
as:

@math{w_{total} = w_1 * gf ^ 0 + w_2 * gf ^ 1 + ... + w_n * gf ^ {n - 1}}

The grow multiplier is used to increase the weight of rules when a message has
got many symbols (and is likely spammy). Note that only rules with positive
weights increase the grow factor; those with negative weights are just added.
Also note that the grow factor can be less than 1, but this is an uncommon use
(in this case the weight is lowered when there are many symbols for the
message). Factors can be set up in the @emph{factors} config section:

								@example

								<factors>

								 <factor name="MIME_HTML_ONLY">0.1</factor>

								 <grow_factor>1.1</grow_factor>

								</factors>

								@end example
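
As a worked example of the grow factor formula, assume three symbols with
weights 2.0, 1.0 and 3.0 are inserted in this order and the grow factor is 1.1
as in the fragment above (the weights are invented for illustration):

@example
w_total = 2.0 * 1.1^0 + 1.0 * 1.1^1 + 3.0 * 1.1^2
        = 2.0 * 1.0   + 1.0 * 1.1   + 3.0 * 1.21
        = 2.0 + 1.1 + 3.63
        = 6.73
@end example

@noindent
With a grow factor of 1.0 the same symbols would simply sum to 6.0.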


Note that you generally need to add a factor whenever you add a rule. The
choice of weight for a newly added rule depends mainly on its importance. For
example, if you are absolutely sure that some rule adds a symbol only to spam
messages, you can increase the weight of this rule so that it filters such
spam. But when you increase the weight of a rule you should be reasonably sure
that this does not raise the false positive rate to an unacceptable level
(false positives are errors where good mail is treated as spam). Rspamd comes
with a set of default rules, and the default weights of those rules are placed
in rspamd.xml.sample. In most cases it is reasonable to adjust them for your
mail system, for example to increase the weights of some rules and decrease
the weights of others. Also note that the default grow factor is 1.0, which
means that the weights of rules do not depend on the number of added symbols.
In some situations it is useful to set the grow factor to a value greater than
1.0. By modifying factors it is also possible to control the static multiplier
for dynamic rules.


								@section Workers configuration.


Workers are rspamd processes that do specific jobs. Four types of workers are
currently supported:

								@enumerate 1

@item Normal worker - a typical worker that processes messages.
@item Controller worker - a worker that manages rspamd, gets statistics and
does learning tasks.
@item Fuzzy storage worker - a worker that contains a collection of fuzzy
hashes.
@item LMTP worker - an experimental worker that acts as an LMTP server.

								@end enumerate


All these types of workers have some common parameters:
@multitable @columnfractions .2 .8
@headitem Parameter @tab Meaning
@item @emph{<type>}
@tab Type of worker (normal, controller, lmtp or fuzzy)
@item @emph{<bind_socket>}
@tab Socket address to bind this worker to. Inet and unix sockets are supported:

								@example

								<bind_socket>localhost:11333</bind_socket>

								<bind_socket>/var/run/rspamd.sock</bind_socket>

								@end example

								@noindent

								Also for inet sockets you may specify @code{*} as address to bind to all

								available inet interfaces:

								@example

								<bind_socket>*:11333</bind_socket>

								@end example

								@noindent

								@item @emph{<count>}

								@tab Number of worker processes of this type. By default this number is

								equal to the number of logical processors in the system.

								@item @emph{<maxfiles>}

								@tab Maximum number of file descriptors available to this worker process.

								@item @emph{<maxcore>}

								@tab Maximum size of the core file that is dumped in case of critical errors

								(in kilo/mega/giga bytes).

								@end multitable
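
The three forms of @code{bind_socket} value shown in the table can be told apart mechanically. The following Python sketch is only an illustration and is not part of rspamd: the helper name @code{parse_bind_socket} and the rule that a value starting with a slash is a unix socket path are assumptions.

```python
from typing import NamedTuple, Optional


class BindSocket(NamedTuple):
    family: str            # "inet" or "unix"
    host: Optional[str]    # None for unix sockets; "*" means all interfaces
    port: Optional[int]    # None for unix sockets
    path: Optional[str]    # None for inet sockets


def parse_bind_socket(value: str) -> BindSocket:
    """Classify a bind_socket value (hypothetical helper, not rspamd code)."""
    value = value.strip()
    # Assumption: anything that starts with "/" is a unix socket path.
    if value.startswith("/"):
        return BindSocket("unix", None, None, value)
    # Otherwise expect host:port and split on the last colon.
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("cannot parse bind_socket value: %r" % value)
    return BindSocket("inet", host, int(port), None)


print(parse_bind_socket("*:11333"))
```

A host of @code{*} is returned unchanged, so the caller decides how to bind to all interfaces.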


								Each worker type can also have specific parameters:

								@itemize @bullet

								@item Normal worker:

								@itemize @bullet

								@item @var{<custom_filters>} - path to dynamically loaded plugins that perform the actual

								checks of incoming messages. These modules are described later.

								@item @var{<mime>} - if this parameter is "no" then this worker assumes that incoming

								messages are in non-mime format (e.g. forum messages) and standard mime

								headers are added to them.

								@end itemize

								@item Controller worker:

								@itemize @bullet

								@item @var{<password>} - a password that is used to access the controller's

								privileged commands.

								@end itemize

								@item Fuzzy worker:

								@itemize @bullet

								@item @var{<hashfile>} - a path to the file where fuzzy hashes are permanently stored.

								@item @var{<use_judy>} - if libJudy is present in the system, use it for faster storage.

								@item @var{<frequent_score>} - if judy is not turned on, hashes whose score is higher

								than this value are placed in a special faster list (this is designed

								to increase lookup speed for frequent hashes).

								@item @var{<expire>} - expiration time of fuzzy hashes, counted from their placement in storage.

								@end itemize

								@end itemize


								These parameters can be set inside a worker's definition:

								@example

								<worker>

								  <type>fuzzy</type>

								  <bind_socket>*:11335</bind_socket>

								  <count>1</count>

								  <maxfiles>2048</maxfiles>

								  <maxcore>0</maxcore>

								<!-- Other params -->

								    <param name="use_judy">yes</param>

								    <param name="hashfile">/spool/rspamd/fuzzy.db</param>

								    <param name="expire">10d</param>

								</worker>

								@end example

								@noindent


								The purpose of each worker type is described later. The main parameters

								that can be defined are the bind sockets for workers, their count, the password for

								controller commands and the parameters for fuzzy storage. The default config provides

								reasonable values for these parameters (except the password, of course), so for a basic

								configuration you only need to replace the controller's password with a more secure one.
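
To inspect a worker definition outside of rspamd, it can be read with any XML parser. The sketch below uses Python's standard @code{xml.etree.ElementTree} module on the fuzzy worker sample from this section. The function name @code{read_worker} is hypothetical, and splitting the result into common tags and named @code{param} entries is only a convenient assumption about how to present the data.

```python
import xml.etree.ElementTree as ET

# The fuzzy worker definition from the example in this section.
WORKER_XML = """
<worker>
  <type>fuzzy</type>
  <bind_socket>*:11335</bind_socket>
  <count>1</count>
  <maxfiles>2048</maxfiles>
  <maxcore>0</maxcore>
  <!-- Other params -->
  <param name="use_judy">yes</param>
  <param name="hashfile">/spool/rspamd/fuzzy.db</param>
  <param name="expire">10d</param>
</worker>
"""


def read_worker(xml_text):
    """Return (common, params) dictionaries for one worker element."""
    root = ET.fromstring(xml_text)
    # Common parameters are plain child tags such as type or count.
    common = {child.tag: (child.text or "").strip()
              for child in root if child.tag != "param"}
    # Worker-specific parameters are param elements keyed by their name.
    params = {p.get("name"): (p.text or "").strip()
              for p in root.findall("param")}
    return common, params


common, params = read_worker(WORKER_XML)
print(common["type"], common["bind_socket"], params["expire"])
```

All values are kept as strings; interpreting suffixes such as the one in @code{10d} is left to the caller.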


								@section Classifiers configuration.


								@subsection Common classifiers options.


								Each classifier has a mandatory option @var{type} that defines the internal algorithm

								used for classifying. Currently only @code{winnow} is supported. A

								theoretical description of the algorithm can be found here:

								@url{http://www.siefkes.net/papers/winnow-spam.pdf}


								The common classifier configuration consists of base classifier parameters and

								definitions of two or more statfiles. During classification rspamd

								checks each statfile in the classifier and selects the one that has a higher

								probability/weight than the others. If all statfiles have zero weight, the classifier

								does not add any symbols. The common classifier options are:

								@multitable @columnfractions .2 .8

								@headitem Tag @tab Meaning

								@item @var{<tokenizer>}

								@tab Tokenizer used to extract tokens from messages. Currently only the @emph{osb}

								tokenizer is supported.

								@item @var{<metric>}

								@tab Metric into which this classifier inserts its symbol.

								@end multitable


								The option @var{min_tokens} is also supported; it specifies the minimum number of tokens to

								work with (this is useful to avoid classifying short messages, as statistics

								are practically useless for a small number of tokens). Here is an example of a base

								classifier config:

								@example

								<classifier type="winnow">

								 <tokenizer>osb-text</tokenizer>

								 <metric>default</metric>

								 <option name="min_tokens">20</option>

								 <statfile>

								  ...

								 </statfile>

								</classifier>

								@end example
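
The selection rule described in this subsection can be sketched in Python as follows. This is a simplified illustration, not rspamd's implementation: the per-statfile weights are taken as given (the winnow computation itself is not shown), the function name @code{select_symbol} is hypothetical, and resolving ties in favour of the first statfile is an assumption.

```python
from typing import Dict, Optional, Sequence


def select_symbol(tokens: Sequence[str],
                  statfile_weights: Dict[str, float],
                  min_tokens: int = 0) -> Optional[str]:
    """Pick the symbol of the statfile with the highest weight, if any."""
    # Too few tokens: skip classification entirely (min_tokens option).
    if len(tokens) < min_tokens:
        return None
    # All statfiles have zero weight: the classifier adds no symbol.
    if not statfile_weights or all(w == 0 for w in statfile_weights.values()):
        return None
    # Assumption: on a tie the first statfile in the mapping wins.
    return max(statfile_weights, key=statfile_weights.get)


weights = {"WINNOW_HAM": 0.4, "WINNOW_SPAM": 2.5}
print(select_symbol(["token"] * 30, weights, min_tokens=20))
```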


								@subsection Statfiles options.


								The most common statfile options are @var{symbol} and @var{size}. The first one defines

								which symbol is inserted if this statfile has the maximum weight inside the

								classifier, and @var{size} defines the statfile size on disk and in memory. Note that

								statfiles are mapped directly into memory, so you should pay attention to the

								parameter @var{statfile_pool_size} of the main section, which defines the maximum amount

								of memory for mapping statistic files. Also note that statistic files are

								of constant size: if you define a 100 megabyte statfile it occupies 100

								megabytes of disk space and 100 megabytes of memory when it is used (mapped).

								Each statfile is indexed by tokens and contains so-called "token chains". This

								mechanism is described later, but note that each statfile has a

								"free tokens" parameter that defines how much space is available for new tokens. If

								a statfile has no free space, the least used tokens are removed from the

								statfile.
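
Because statfiles have a constant size and are mapped into memory, it can help to compare their total size with @var{statfile_pool_size} before deploying a config. The Python sketch below is a rough illustration under two assumptions that this guide does not state: that the K, M and G suffixes are multiples of 1024, and that all statfiles of a classifier need to fit into the pool at the same time. The helper names are hypothetical.

```python
# Assumption: suffixes are binary multiples (1K = 1024 bytes).
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as "100M" or "4096" to a number of bytes."""
    text = text.strip()
    suffix = text[-1:].lower()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    return int(text)


def fits_in_pool(statfile_sizes, pool_size) -> bool:
    """True if the statfiles together do not exceed the pool size."""
    total = sum(parse_size(size) for size in statfile_sizes)
    return total <= parse_size(pool_size)


print(fits_in_pool(["100M", "100M"], "256M"))
```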


								Here is a list of common statfile options:

								@multitable @columnfractions .2 .8

								@headitem Tag @tab Meaning

								@item @var{<symbol>}

								@tab Defines symbol to insert for this statfile.

								@item @var{<size>}

								@tab Size of this statfile in bytes (or in kilo/mega/giga bytes).

								@item @var{<path>}

								@tab Filesystem path to statistic file.

								@item @var{<normalizer>}

								@tab Defines the weight normalization scheme. Can be a lua function name or an

								internal normalizer. The internal normalizer is defined in the format

								"internal:<max_weight>", where max_weight is a fractional number that limits the

								maximum weight of this statfile's symbol (this is the so-called dynamic weight).

								@item @var{<binlog>}

								@tab Defines binlog affinity: master or slave. This option is used for statfile

								binary sync, which is described later.

								@item @var{<binlog_master>}

								@tab Defines the connection details of the binlog master for this statfile.

								@item @var{<binlog_rotate>}

								@tab Defines the rotation time for the binlog.

								@end multitable


								Internal normalization of statfile weight works in this way:

								@itemize @bullet

								@item @math{R_{score} = 1} when @math{W_{statfile} < 1}

								@item @math{R_{score} = W_{statfile} ^ 2} when @math{1 < W_{statfile} < max / 2}

								@item @math{R_{score} = W_{statfile}} when @math{max / 2 < W_{statfile} < max}

								@item @math{R_{score} = max} when @math{W_{statfile} > max}

								@end itemize


								The final resulting weight is: @math{weight = R_{score} * W_{factor}}.

								Here is a sample classifier configuration with two statfiles that can be used for

								spam/ham classification:


								@example

								<factors>

								   <factor name="WINNOW_HAM">-1.00</factor>

								   <factor name="WINNOW_SPAM">1.00</factor>

								...

								</factors>


								<!-- Classifiers section -->

								<classifier type="winnow">

								 <tokenizer>osb-text</tokenizer>

								 <metric>default</metric>

								 <option name="min_tokens">20</option>

								 <statfile>

								  <symbol>WINNOW_HAM</symbol>

								  <size>100M</size>

								  <path>/var/run/rspamd/data.ham</path>

								  <normalizer>internal:3</normalizer>

								 </statfile>

								 <statfile>

								  <symbol>WINNOW_SPAM</symbol>

								  <size>100M</size>

								  <path>/var/run/rspamd/data.spam</path>

								  <normalizer>internal:3</normalizer>

								 </statfile>

								</classifier>

								<!-- End of classifiers section -->

								@end example

								@noindent

								In this sample we define a classifier that contains two statfiles:

								@emph{WINNOW_SPAM} and @emph{WINNOW_HAM}. Each statfile is 100 megabytes in size

								(so together they occupy 200 megabytes while classifying). Each statfile also has a maximum

								weight of 3, so with these factors (-1 for WINNOW_HAM and 1 for WINNOW_SPAM) the

								resulting weight of the symbols is 0..3 for @emph{WINNOW_SPAM} and 0..-3 for

								@emph{WINNOW_HAM}.
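
The internal normalization rules and the factor multiplication described above can be written out as a short Python sketch. It follows the four listed rules literally and is not taken from the rspamd sources. The listed rules use strict inequalities, so assigning each boundary value to the higher branch is an assumption, and the function names are hypothetical.

```python
def internal_normalize(weight: float, max_weight: float) -> float:
    """Apply the internal normalizer rules to a raw statfile weight."""
    if weight < 1:
        return 1.0
    if weight < max_weight / 2:
        return weight ** 2
    if weight < max_weight:
        return weight
    # Weights at or above the limit are clamped to max_weight.
    return max_weight


def symbol_weight(statfile_weight: float, max_weight: float, factor: float) -> float:
    """Final weight: normalized score multiplied by the symbol's factor."""
    return internal_normalize(statfile_weight, max_weight) * factor


# "internal:3" with the factors from the sample configuration.
print(symbol_weight(5.0, 3.0, 1.0))    # WINNOW_SPAM, clamped to the maximum
print(symbol_weight(5.0, 3.0, -1.0))   # WINNOW_HAM, same magnitude, negative
```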


								@section Modules config.


								@subsection Lua modules loading.

								To load custom lua modules, use the @emph{<modules>} section:

								@example

								<modules>

								 <module>/usr/local/etc/rspamd/plugins/lua</module>

								</modules>

								@end example

								@noindent

								Each @emph{<module>} directive defines a path to lua modules. If it is a

								directory, all @code{*.lua} files inside that directory are loaded. If

								it is a file, it is loaded directly.
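
The path resolution rule for @emph{<module>} directives can be illustrated with a small Python sketch. It only mirrors the rule stated above and is not rspamd code: the function name @code{resolve_lua_modules} is hypothetical, and both the sorted order and the non-recursive directory scan are assumptions.

```python
from pathlib import Path
from typing import List


def resolve_lua_modules(module_path: str) -> List[Path]:
    """Return the lua files that a single module directive would load."""
    path = Path(module_path)
    if path.is_dir():
        # Directory: every *.lua file directly inside it (not recursive).
        return sorted(path.glob("*.lua"))
    if path.is_file():
        # Plain file: loaded directly, whatever its extension.
        return [path]
    # Nothing exists at this path, so nothing would be loaded.
    return []


for lua_file in resolve_lua_modules("/usr/local/etc/rspamd/plugins/lua"):
    print(lua_file)
```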


								@subsection Modules configuration.

								Each module can have its own config section (this is true not only for internal

								modules but also for lua modules). Such a section is called @emph{<module>} and has a

								mandatory attribute @emph{"name"}. Each module can be configured by

								@emph{<option>} directives. These directives must also have a @emph{"name"}

								attribute. So module configuration is done in @code{param = value} style:

								@example

								<module name="fuzzy_check">

								  <option name="servers">localhost:11335</option>

								  <option name="symbol">R_FUZZY</option>

								  <option name="min_length">300</option>

								  <option name="max_score">10</option>

								  <option name="metric">default</option>

								</module>

								@end example

								@noindent

								The common parameters are:

								@itemize @bullet

								@item symbol - the symbol that this module should insert.

								@item metric - the metric in which this module should work.

								@end itemize

								Each module can also have its own unique parameters; these are discussed

								further in the detailed module descriptions. Also note that for internal modules you

								should edit the @emph{<filters>} parameter in the main section: this parameter defines

								which internal modules are turned on in this configuration.


								@chapter Rspamd clients interaction.


								@section Introduction.

								Once you have a basic config file you can test rspamd functionality by using

								either a telnet-like utility or the @emph{rspamc} client. To check a newly installed

								config you can run the config file test:

								@example

								$ rspamd -t

								syntax OK

								@end example


								The rspamc utility is written in @code{perl} and uses perl modules that are

								shipped with rspamd: @emph{Mail::Rspamd::Client} for the client protocol and

								@emph{Mail::Rspamd::Config} for reading and writing configuration. The

								documentation for these modules can be viewed with the commands:

								@example

								$ perldoc Mail::Rspamd::Client

								$ perldoc Mail::Rspamd::Config

								@end example


								Another way to access rspamd is to use the perl client API:

								@example

								use Mail::Rspamd::Client;

								my $config = {

									hosts => ['localhost:11333'],

								};


								my $client = new Mail::Rspamd::Client($config);


								if (! $client->ping()) {

									die "Cannot ping rspamd: $client->{error}";

								}


								my $result = $client->check($testmsg);


								if ($result->{'default'}->{isspam} eq 'True') {

									# do something with spam message here

								}

								@end example


								@section Rspamc protocol.

								The rspamc protocol is an extension of the traditional spamc protocol used by

								spamassassin. The protocol resembles a traditional HTTP session: the first line is the

								method with version, headers can be passed on the following lines, and the message itself

								follows an empty line:

								@example

								<REQUEST>

								SYMBOLS RSPAMC/1.1

								Content-Length: 2200


								<message octets>


								<REPLY>

								RSPAMD/1.1 0 OK

								Metric: default; True; 10.40 / 10.00 / 0.00

								Symbol: R_UNDISC_RCPT

								Symbol: ONCE_RECEIVED

								Symbol: R_MISSING_CHARSET

								Urls:

								@end example

								@noindent

								The format of method line can be presented as:

								@example

								<COMMAND> RSPAMC/<version>

								@end example

								@noindent

								Version can be 1.0 or 1.1. The main difference is that in 1.1 the metrics output also

								has a @emph{reject score} - the hard limit of the score for a metric. This is

								discussed in the description of user options. The commands are:

								@multitable @columnfractions .2 .8

								@headitem Command @tab Meaning

								@item CHECK

								@tab Check a message and output results for each metric, but do not output

								symbols.

								@item SYMBOLS

								@tab Same as @emph{CHECK} but also output symbols.

								@item PROCESS

								@tab Same as @emph{SYMBOLS} but also output the original message with inserted

								X-Spam headers.

								@item PING

								@tab Do not do any processing, just check rspamd state:

								@example

								$ telnet localhost 11333

								Trying 127.0.0.1...

								Connected to localhost.

								Escape character is '^]'.

								PING RSPAMC/1.1


								RSPAMD/1.1 0 PONG

								Connection closed by foreign host.

								@end example

								@noindent

								@end multitable
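
The telnet session shown for @emph{PING} can also be reproduced programmatically. The following Python sketch sends the same request line over an already connected socket and checks the first reply line. It is only an illustration: the function name @code{ping} is hypothetical, and the CRLF line endings are an assumption, since this guide does not specify the line terminator.

```python
import socket


def ping(sock: socket.socket, version: str = "1.1") -> bool:
    """Send PING over a connected socket and check for a PONG reply."""
    # Request line followed by an empty line, as in the telnet session.
    request = "PING RSPAMC/%s\r\n\r\n" % version
    sock.sendall(request.encode("ascii"))
    reply = b""
    # Read until the first reply line is complete or the peer closes.
    while b"\n" not in reply:
        chunk = sock.recv(4096)
        if not chunk:
            break
        reply += chunk
    parts = reply.split(b"\n", 1)[0].decode("ascii", "replace").split()
    return (len(parts) >= 3 and parts[0].startswith("RSPAMD/")
            and parts[1] == "0" and parts[2] == "PONG")
```

With a running normal worker, the function could be called as @code{ping(socket.create_connection(("localhost", 11333)))}, using the address from the telnet example.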


								After the command there must be one mandatory header, @strong{Content-Length}, that

								defines the message length in bytes, followed by optional headers:

								@multitable @columnfractions .2 .8

								@headitem Header @tab Meaning

								@item @var{Deliver-To:}

								@tab Defines actual delivery recipient of message. Can be used for personalized

								statistic and for user specific options.

								@item @var{IP:}

								@tab Defines IP from which this message is received.

								@item @var{Helo:}

								@tab Defines SMTP helo.

								@item @var{From:}

								@tab Defines SMTP mail from command data.

								@item @var{Queue-Id:}

								@tab Defines SMTP queue id for message (can be used instead of message id in

								logging).

								@item @var{Rcpt:}

								@tab Defines an SMTP recipient (there may be several @emph{Rcpt:} headers).

								@item @var{Pass:}

								@tab If this header has the value @emph{"all"}, all filters are checked for

								this message.

								@item @var{Subject:}

								@tab Defines the subject of the message (used for non-mime messages).

								@item @var{User:}

								@tab Defines the SMTP user (currently unused in rspamd).

								@end multitable
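
A request with the headers from the table above can be assembled as in the following Python sketch. It is an illustration only: the function name @code{build_request} is hypothetical, and CRLF line endings are assumed by analogy with HTTP because this guide does not specify them. Headers are passed as a list of pairs so that @emph{Rcpt:} can be repeated.

```python
from typing import Iterable, Optional, Tuple


def build_request(command: str,
                  message: bytes = b"",
                  headers: Optional[Iterable[Tuple[str, str]]] = None,
                  version: str = "1.1") -> bytes:
    """Build an rspamc request: method line, headers, empty line, message."""
    lines = ["%s RSPAMC/%s" % (command, version),
             # Content-Length is the only mandatory header.
             "Content-Length: %d" % len(message)]
    for name, value in headers or ():
        lines.append("%s: %s" % (name, value))
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("utf-8") + message


request = build_request("SYMBOLS", b"Subject: test\r\n\r\nhello",
                        [("IP", "192.0.2.1"), ("Helo", "mail.example.com")])
print(request.decode("utf-8"))
```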

								So the rspamc protocol allows a lot of data to be passed from the MTA to rspamd. This is used to

								increase the speed of processing and for building filters (like the SPF filter). Also

								note that rspamd supports the spamassassin spamc protocol and you can even pass

								rspamc headers in spamc mode, but the reply of rspamd in spamc mode is much

								shorter: it only uses the "default" metric and does not show additional options

								for symbols. An rspamc reply looks like this:

								@example

								RSPAMD/1.1 0 OK

								Metric: default; True; 10.40 / 10.00 / 0.00

								Symbol: R_UNDISC_RCPT

								Symbol: ONCE_RECEIVED

								Symbol: R_MISSING_CHARSET

								Urls:

								@end example

								@noindent

								The first line is the method reply: @code{<PROTOCOL>/<VERSION> <ERROR_CODE> <ERROR_REPLY>}.

								The error code is 0 when no error occurred. After the first reply line comes the metrics

								output. For the @emph{SYMBOLS} and @emph{PROCESS} commands there are symbol lines

								after each metric, and for the @emph{PROCESS} command the original

								message follows all metric results. A metric result line looks like this:

								@example

								Metric: <name>; <result>; <score> / <required_score> / <reject_score>

								@end example

								@noindent

								In version 1.0 of the rspamc protocol the @emph{reject_score} parameter is not printed.

								A symbol line looks like this:

								@example

								Symbol: <Name>[; param1[, param2...]]

								@end example

								@noindent

								Some symbols can have parameters attached. This is useful, for example, for RBL

								checks (you can insert additional data after the symbol name), and for statistic and

								fuzzy checks. Rspamd also inserts a @emph{Urls} line in which all urls

								contained in the message are printed as a comma-separated list.
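
The reply format described above is simple enough to parse by hand. The Python sketch below handles replies to @emph{CHECK} and @emph{SYMBOLS}; it does not handle the message body that follows a @emph{PROCESS} reply. The function name @code{parse_reply} and the layout of the returned dictionary are hypothetical and serve only as an illustration.

```python
def parse_reply(text: str) -> dict:
    """Parse the status, Metric, Symbol and Urls lines of an rspamc reply."""
    lines = text.splitlines()
    protocol, code, status = lines[0].split(None, 2)
    reply = {"protocol": protocol, "code": int(code), "status": status,
             "metrics": [], "urls": []}
    current = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        value = value.strip()
        if name == "Metric" and sep:
            fields = [field.strip() for field in value.split(";")]
            scores = [float(score) for score in fields[2].split("/")]
            current = {"name": fields[0], "is_spam": fields[1] == "True",
                       "score": scores[0], "required_score": scores[1],
                       # Version 1.0 replies carry no reject score.
                       "reject_score": scores[2] if len(scores) > 2 else None,
                       "symbols": []}
            reply["metrics"].append(current)
        elif name == "Symbol" and sep and current is not None:
            symbol, _, params = value.partition(";")
            current["symbols"].append(
                {"name": symbol.strip(),
                 "params": [p.strip() for p in params.split(",") if p.strip()]})
        elif name == "Urls" and sep:
            reply["urls"] = [u.strip() for u in value.split(",") if u.strip()]
    return reply
```

Symbols are attached to the metric line that precedes them, which matches the statement that symbol lines follow each metric.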

								Note that this protocol is used by normal workers. Controller, fuzzy storage

								and lmtp/smtp workers use other protocols. For example, the controller's

								protocol is oriented towards interactive sessions: you can pass many commands to

								the controller before disconnecting. Fuzzy storage uses UDP to make

								interaction with the storage faster. LMTP/SMTP workers use the lmtp and smtp

								protocols. All of these protocols are described in later chapters about

								rspamd workers.


								@bye