Plain table

Plain table is a basic element for non-percolate searching. It can be defined only in a configuration file using the Plain mode, and is not supported in the RT mode. It is typically used in conjunction with a source to process data from the external storage and can later be attached to a real-time table.

👍 What you can do with a plain table:

⛔ What you cannot do with a plain table:

  • Insert additional data into the table once it has been built
  • Delete data from the table
  • Create, delete, or alter the table online
  • Use UUID for automatic ID generation (data from external storage must include a unique identifier)

Numeric attributes, including MVAs, are the only elements that can be updated in a plain table. All other data in the table is immutable. If updates or new records are required, the table must be rebuilt. During the rebuilding process, the existing table remains available to serve requests, and a process called rotation is performed when the new version is ready, bringing it online and discarding the old version.

‹›
  • Plain table example
Plain table example
📋

To create a plain table, you'll need to define it in a configuration file. It's not supported by the CREATE TABLE command.

Here's an example of a plain table configuration and a source for fetching data from a MySQL database:

source source {
  type             = mysql
  sql_host         = localhost
  sql_user         = myuser
  sql_pass         = mypass
  sql_db           = mydb
  sql_query        = SELECT id, title, description, category_id  from mytable
  sql_attr_uint    = category_id
  sql_field_string = title
 }

table tbl {
  type   = plain
  source = source
  path   = /path/to/table
 }

Plain table building performance

The speed at which a plain table is indexed depends on several factors, including:

  • Data source retrieval speed
  • tokenization settings
  • The hardware specifications (such as CPU, RAM, and disk performance)

Plain table building scenarios

Rebuild fully when needed

For small data sets, the simplest option is to have a single plain table that is fully rebuilt as needed. This approach is acceptable when:

  • The data in the table is not as fresh as the data in the source
  • The time it takes to build the table increases as the data set grows
Main+delta scenario

For larger data sets, a plain table can be used instead of a Real-Time. The main+delta scenario involves:

  • Creating a smaller table for incremental indexing
  • Combining the two tables using a distributed table

This approach allows for infrequent rebuilding of the larger table and more frequent processing of updates from the source. The smaller table can be rebuilt more often (e.g. every minute or even every few seconds).

However, as time goes on, the indexing duration for the smaller table will become too long, requiring a rebuild of the larger table and the emptying of the smaller one.

The main+delta schema is explained in detail in this interactive course.

The mechanism of kill list and killlist_target directive is used to ensure that documents from the current table take precedence over those from the other table.

For more information on this topic, see here.

Plain table files structure

The following table outlines the various file extensions used in a plain table and their respective descriptions:

Extension Description
.spa stores document attributes in row-wise mode
.spb stores blob attributes in row-wise mode: strings, MVA, json
.spc stores document attributes in columnar mode
.spd stores matching document ID lists for each word ID
.sph stores table header information
.sphi stores histograms of attribute values
.spi stores word lists (word IDs and pointers to .spd file)
.spidx stores secondary indexes data
.spk stores kill-lists
.spl lock file
.spm stores a bitmap of killed documents
.spp stores hit (aka posting, aka word occurrence) lists for each word ID
.spt stores additional data structures to speed up lookups by document ids
.spe stores skip-lists to speed up doc-list filtering
.spds stores document texts
.tmp* temporary files during index_settings_and_status
.new.sp* new version of a plain table before rotation
.old.sp* old version of a plain table after rotation

Plain and real-time table settings

Defining table schema in a configuration file

table <index_name>[:<parent table name>] {
...
}
‹›
  • Plain
  • Real-time
📋
table <table name> {
  type = plain
  path = /path/to/table
  source = <source_name>
  source = <another source_name>
  [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
}

Common plain and real-time tables settings

type

type = plain

type = rt

Table type: "plain" or "rt" (real-time)

Value: plain (default), rt

path

path = path/to/table

The path to where the table will be stored or located, either absolute or relative, without the extension.

Value: The path to the table, mandatory

stored_fields

stored_fields = title, content

By default, the original content of full-text fields is indexed and stored when a table is defined in a configuration file. This setting allows you to specify the fields that should have their original values stored.

Value: A comma-separated list of full-text fields that should be stored. An empty value (i.e. stored_fields = ) disables the storage of original values for all fields.

Note: In the case of a real-time table, the fields listed in stored_fields should also be declared as rt_field.

Also, note that you don't need to list attributes in stored_fields, since their original values are stored anyway. stored_fields can only be used for full-text fields.

See also docstore_block_size, docstore_compression for document storage compression options.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • Javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)

stored_only_fields

stored_only_fields = title,content

List of fields that will be stored in the table, but not indexed. This setting works similarly to stored_fields except that when a field is specified in stored_only_fields t will only be stored, not indexed, and cannot be searched using full-text queries. It can only be retrieved in search results.

The value is a comma-separated list of fields that should be stored only, not indexed. By default, this value is empty. If a real-time table is being defined, the fields listed in stored_only_fields must also be declared as rt_field.

Note also, that you don't need to list attributes instored_only_fields,since their original values are stored anyway. If to compare stored_only_fields to string attributes the former (stored field):

  • is stored on disk and does not take up memory
  • is stored in a compressed format
  • can only be fetched, it cannot be used for sorting, filtering or grouping

In contrast, the latter (string attribute):

  • is stored on disk and in memory
  • is stored in an uncompressed format
  • can be used for sorting, grouping, filtering, and any other actions you want to take with attributes.

Real-time table settings:

optimize_cutoff

The maximum number of disk chunks for the RT table. Learn more here.

rt_field

rt_field = subject

This field declaration determines the full-text fields that will be indexed. The field names must be unique, and the order is preserved. When inserting data, the field values must be in the same order as specified in the configuration.

This is a multi-value, optional field.

rt_attr_uint

rt_attr_uint = gid

This declaration defines an unsigned integer attribute.

Value: the field name or field_name:N (where N is the maximum number of bits to keep).

rt_attr_bigint

rt_attr_bigint = gid

This declaration defines a BIGINT attribute.

Value: field name, multiple records allowed.

rt_attr_multi

rt_attr_multi = tags

Declares a multi-valued attribute (MVA) with unsigned 32-bit integer values.

Value: field name. Multiple records allowed.

rt_attr_multi_64

rt_attr_multi_64 = wide_tags

Declares a multi-valued attribute (MVA) with signed 64-bit BIGINT values.

Value: field name. Multiple records allowed.

rt_attr_float

rt_attr_float = lat
rt_attr_float = lon

Declares floating point attributes with single precision, 32-bit IEEE 754 format.

Value: field name. Multiple records allowed.

rt_attr_bool

rt_attr_bool = available

Declares a boolean attribute with 1-bit unsigned integer values.

Value: field name.

rt_attr_string

rt_attr_string = title

String attribute declaration.

Value: field name.

rt_attr_json

rt_attr_json = properties

Declares a JSON attribute.

Value: field name.

rt_attr_timestamp

rt_attr_timestamp = date_added

Declares a timestamp attribute.

Value: field name.

rt_mem_limit

rt_mem_limit = 512M

Memory limit for a RAM chunk of the table. Optional, default is 128M.

RT tables store some data in memory, known as the "RAM chunk," and also maintain a number of on-disk tables, referred to as "disk chunks." This directive allows you to control the size of the RAM chunk. When there is too much data to keep in memory, RT tables will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.

Please note that the limit is strict, and RT tables will never allocate more memory than what is specified in the rt_mem_limit. Additionally, memory is not preallocated, so specifying a 512MB limit and only inserting 3MB of data will result in allocating only 3MB, not 512MB.

The rt_mem_limit is never exceeded, but the actual RAM chunk size can be significantly lower than the limit. RT tables adapt to the data insertion pace and adjust the actual limit dynamically to minimize memory usage and maximize data write speed. This is how it works:

  • By default, the RAM chunk size is 50% of the rt_mem_limit, referred to as the "rt_mem_limit".
  • As soon as the RAM chunk accumulates data equivalent to rt_mem_limit * rate data (50% of rt_mem_limit by default), Manticore starts saving the RAM chunk as a new disk chunk.
  • While a new disk chunk is being saved, Manticore assesses the number of new/updated documents.
  • After saving a new disk chunk, the rt_mem_limit rate is updated.
  • The rate is reset to 50% each time you restart the searchd.

For instance, if 90MB of data is saved to a disk chunk and an additional 10MB of data arrives while the save is in progress, the rate would be 90%. Next time, the RT table will collect up to 90% of rt_mem_limit before flushing the data. The faster the insertion pace, the lower the rt_mem_limit rate. The rate varies between 33.3% to 95%. You can view the current rate of a table using the SHOW TABLE STATUS command.

How to change rt_mem_limit and optimize_cutoff

In real-time mode, you can adjust the size limit of RAM chunks and the maximum number of disk chunks using the ALTER TABLE statement. To set rt_mem_limit to 1 gigabyte for the table "t," run the following query: ALTER TABLE t rt_mem_limit='1G'. To change the maximum number of disk chunks, run the query: ALTER TABLE t optimize_cutoff='5'.

In the plain mode, you can change the values of rt_mem_limit and optimize_cutoff by updating the table configuration or running the command ALTER TABLE <index_name> RECONFIGURE

Important notes about RAM chunks
  • Real-time tables are similar to distributed consisting of multiple local tables, also known as disk chunks.
  • Each RAM chunk is made up of multiple segments, which are special RAM-only tables.
  • While disk chunks are stored on disk, RAM chunks are stored in memory.
  • Each transaction made to a real-time table generates a new segment, and RAM chunk segments are merged after each transaction commit. It is more efficient to perform bulk INSERTs of hundreds or thousands of documents rather than multiple separate INSERTs with one document to reduce the overhead from merging RAM chunk segments.
  • When the number of segments exceeds 32, they will be merged to keep the count below 32.
  • Real-time tables always have one RAM chunk (which may be empty) and one or more disk chunks.
  • Merging larger segments takes longer, so it's best to avoid having a very large RAM chunk (and therefore rt_mem_limit).
  • The number of disk chunks depends on the data in the table and the rt_mem_limit setting.
  • Searchd flushes the RAM chunk to disk (as a persisted file, not as a disk chunk) on shutdown and periodically according to the rt_flush_period setting. Flushing several gigabytes to disk may take some time.
  • A large RAM chunk puts more pressure on storage, both when flushing to disk into the .ram file and when the RAM chunk is full and dumped to disk as a disk chunk.
  • The RAM chunk is backed up by a binary log until it is flushed to disk, and a larger rt_mem_limit, setting will increase the time it takes to replay the binary log and recover the RAM chunk.
  • The RAM chunk may be slightly slower than a disk chunk.
  • Although the RAM chunk itself doesn't take up more memory than rt_mem_limit Manticore may take up more memory in some cases, such as when you start a transaction to insert data and don't commit it for a while. In this case, the data you have already transmitted within the transaction will remain in memory.

Plain table settings:

source

source = srcpart1
source = srcpart2
source = srcpart3

The source field specifies the source from which documents will be obtained during indexing of the current table. There must be at least one source. The sources can be of different types (e.g. one could be MySQL, another PostgreSQL). For more information on indexing from external storages, indexing from external storages here

Value: The name of the source is mandatory. Multiple values are allowed.

killlist_target

killlist_target = main:kl

This setting determines the table(s) to which the kill-list will be applied. Matches in the targeted table that are updated or deleted in the current table will be suppressed. In :kl mode, the documents to suppress are taken from the kill-list. In :id mode, all document IDs from the current table are suppressed in the targeted one. If neither is specified, both modes will take effect. Learn more about kill-lists here

Value: not specified (default), target_index_name:kl, target_index_name:id, target_index_name. Multiple values are allowed

columnar_attrs

columnar_attrs = *
columnar_attrs = id, attr1, attr2, attr3

This configuration setting determines which attributes should be stored in the columnar storage instead of the row-wise storage.

You can set columnar_attrs = * to store all supported data types in the columnar storage.

Additionally, id is a supported attribute to store in the columnar storage.

columnar_strings_no_hash

columnar_strings_no_hash = attr1, attr2, attr3

By default, all string attributes stored in columnar storage store pre-calculated hashes. These hashes are used for grouping and filtering. However, they occupy extra space, and if you don't need to group by that attribute, you can save space by disabling hash generation.

Creating a real-time table online via CREATE TABLE

General syntax of CREATE TABLE
CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]
Data types:

For more information on data types, see more about data types here.

Type Equivalent in a configuration file Notes Aliases
text rt_field Options: indexed, stored. Default: both. To keep text stored, but indexed, specify "stored" only. To keep text indexed only, specify "indexed" only. string
integer rt_attr_uint integer int, uint
bigint rt_attr_bigint big integer
float rt_attr_float float
multi rt_attr_multi multi-integer
multi64 rt_attr_multi_64 multi-bigint
bool rt_attr_bool boolean
json rt_attr_json JSON
string rt_attr_string string. Option indexed, attribute will make the value full-text indexed and filterable, sortable and groupable at the same time
timestamp rt_attr_timestamp timestamp
bit(n) rt_attr_uint field_name:N N is the max number of bits to keep
‹›
  • SQL
SQL
📋
CREATE TABLE products (title text, price float) morphology='stem_en'

This creates the "products" table with two fields: "title" (full-text) and "price" (float), and sets the "morphology" to "stem_en".

CREATE TABLE products (title text indexed, description text stored, author text, price float)

This creates the "products" table with three fields:

  • "title" is indexed, but not stored.
  • "description" is stored, but not indexed.
  • "author" is both stored and indexed.

Engine

create table ... engine='columnar';
create table ... engine='rowwise';

The engine setting changes the default attribute storage for all attributes in the table. You can also specify engine separately for each attribute.

For information on how to enable columnar storage for a plain table, see columnar_attrs .

Values:

  • columnar - Enables columnar storage for all table attributes, except for json
  • rowwise (default) - Doesn't change anything and uses the traditional row-wise storage for the table.

Other settings

The following settings are applicable for both real-time and plain tables, regardless of whether they are specified in a configuration file or set online using the CREATE or ALTER command.

Performance related

Accessing table files

Manticore supports two access modes for reading table data: seek+read and mmap.

In seek+read mode, the server uses the pread system call to read document lists and keyword positions, represented by the*.spd and *.spp files. The server uses internal read buffers to optimize the reading process, and the size of these buffers can be adjusted using the options read_buffer_docs and read_buffer_hits.There is also the option preopen that controls how Manticore opens files at start.

In mmap access mode, the search server maps the table's file into memory using the mmap system call, and the OS caches the file contents. The options read_buffer_docs and read_buffer_hits have no effect for corresponding files in this mode. The mmap reader can also lock the table's data in memory using themlock privileged call, which prevents the OS from swapping the cached data out to disk.

To control which access mode to use, the options access_plain_attrs, access_blob_attrs, access_doclists and access_hitlists are available, with the following values:

Value Description
file server reads the table files from disk with seek+read using internal buffers on file access
mmap server maps the table files into memory and OS caches up its contents on file access
mmap_preread server maps the table files into memory and a background thread reads it once to warm up the cache
mlock server maps the table files into memory and then executes the mlock() system call to cache up the file contents and lock it into memory to prevent it being swapped out
Setting Values Description
access_plain_attrs mmap, mmap_preread (default), mlock controls how *.spa (plain attributes) *.spe (skip lists) *.spi (word lists) *.spt (lookups) *.spm (killed docs) will be read
access_blob_attrs mmap, mmap_preread (default), mlock controls how *.spb (blob attributes) (string, mva and json attributes) will be read
access_doclists file (default), mmap, mlock controls how *.spd (doc lists) data will be read
access_hitlists file (default), mmap, mlock controls how *.spp (hit lists) data will be read

Here is a table which can help you select your desired mode:

table part keep it on disk keep it in memory cached in memory on server start lock it in memory
plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs mmap mmap mmap_preread (default) mlock
row-wise string, multi-value attributes (MVA) and json attributes mmap mmap mmap_preread (default) mlock
columnar numeric, string and multi-value attributes always only by means of OS no not supported
doc lists file (default) mmap no mlock
hit lists file (default) mmap no mlock
The recommendations are:
  • For the fastest search response time and ample memory availability, use row-wise attributes and lock them in memory using mlock. Additionally, use mlock for doclists/hitlists.
  • If you prioritize can't afford lower performance after start and are willing to sacrifice longer startup time, use the --force-preread. option. If you desire faster searchd restart, stick to the default mmap_preread option.
  • If you are looking to conserve memory, while still having enough memory for all attributes, skip the use of mlock. The operating system will determine what should be kept in memory based on frequent disk reads.
  • If row-wise attributes do not fit into memory, opt for columnar attributes
  • If full-text search performance is not a concern, and you wish to save memory, use access_doclists/access_hitlists=file

The default mode offers a balance of:

  • mmap,
  • Prereading non-columnar attributes,
  • Seeking and reading columnar attributes with no preread,
  • Seeking and reading doclists/hitlists with no preread.

This provides a decent search performance, optimal memory utilization, and faster searchd restart in most scenarios.

Other performance related settings

attr_update_reserve

attr_update_reserve = 256k

This setting reserves extra space for updates to blob attributes such as multi-value attributes (MVA), strings, and JSON. The default value is 128k. When updating these attributes, their length may change. If the updated string is shorter than the previous one, it will overwrite the old data in the *.spb file. If the updated string is longer, it will be written to the end of the *.spb file. This file is memory-mapped, making resizing it a potentially slow process, depending on the operating system's memory-mapped file implementation. To avoid frequent resizing, you can use this setting to reserve extra space at the end of the .spb file.

Value: size, default 128k.

docstore_block_size

docstore_block_size = 32k

This setting controls the size of blocks used by the document storage. The default value is 16kb. When original document text is stored using stored_fields or stored_only_fields, it is stored within the table and compressed for efficiency. To optimize disk access and compression ratios for small documents, these documents are concatenated into blocks. The indexing process collects documents until their total size reaches the threshold specified by this option. At that point, the block of documents is compressed. This option can be adjusted to achieve better compression ratios (by increasing the block size) or faster access to document text (by decreasing the block size).

Value: size, default 16k.

docstore_compression

docstore_compression = lz4hc

This setting determines the type of compression used for compressing blocks of documents stored in document storage. If stored_fields or stored_only_fields are specified, the document storage stores compressed document blocks. 'lz4' offers fast compression and decompression speeds, while 'lz4hc' (high compression) sacrifices some compression speed for a better compression ratio. 'none' disables compression completely.

Values: lz4 (default), lz4hc, none.

docstore_compression_level

docstore_compression_level = 12

The compression level used when 'lz4hc' compression is applied in document storage. By adjusting the compression level, you can find the right balance between performance and compression ratio when using 'lz4hc' compression. Note that this option is not applicable when using 'lz4' compression.

Value: An integer between 1 and 12, with a default of 9.

preopen

preopen = 1

This setting indicates that searchd should open all table files on startup or rotation, and keep them open while running. By default, the files are not pre-opened. Pre-opened tables require a few file descriptors per table, but they eliminate the need for per-query open() calls and are immune to race conditions that might occur during table rotation under high load. However, if you are serving many tables, it may still be more efficient to open them on a per-query basis in order to conserve file descriptors.

Value: 0 (default), or 1.

read_buffer_docs

read_buffer_docs = 1M

Buffer size for storing the list of documents per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.

Value: size, default 256k, minimum value is 8k.

read_buffer_hits

read_buffer_hits = 1M

Buffer size for storing the list of hits per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.

Value: size, default 256k, minimum value is 8k.

Plain table disk footprint settings

inplace_enable

inplace_enable = {0|1}

Enables in-place table inversion. Optional, default is 0 (uses separate temporary files).

The inplace_enable option reduces the disk footprint during indexing of plain tables, while slightly slowing down indexing (it uses approximately 2 times less disk, but yields around 90-95% of the original performance).

Indexing is comprised of two primary phases. During the first phase, documents are collected, processed, and partially sorted by keyword, and the intermediate results are written to temporary files (.tmp*). During the second phase, the documents are fully sorted and the final table files are created. Rebuilding a production table on-the-fly requires approximately 3 times the peak disk footprint: first for the intermediate temporary files, second for the newly constructed copy, and third for the old table that will be serving production queries in the meantime. (Intermediate data is comparable in size to the final table.) This may be too much disk footprint for large data collections, and the inplace_enable option can be used to reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them upon completion. However, this may require additional temporary data chunk relocation, which is where the performance impact comes from.

This directive has no effect on searchd, it only affects the indexer.

‹›
  • CONFIG
CONFIG
📋
table products {
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_hit_gap

inplace_hit_gap = size

The option In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.

This directive only affects the searchd tool, and does not have any impact on the indexer.

‹›
  • CONFIG
CONFIG
📋
table products {
  inplace_hit_gap = 1M
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_reloc_factor

inplace_reloc_factor = 0.1

The inplace_reloc_factor setting determines the size of the relocation buffer within the memory arena used during indexing. The default value is 0.1.

This option is optional and only affects the indexer tool, not the searchd server.

‹›
  • CONFIG
CONFIG
📋
table products {
  inplace_reloc_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_write_factor

inplace_write_factor = 0.1

Controls the size of the buffer used for in-place writing during indexing. Optional, with a default value of 0.1.

It's important to note that this directive only impacts the indexer tool and not the searchd server.

‹›
  • CONFIG
CONFIG
📋
table products {
  inplace_write_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

Natural language processing specific settings

The following settings are supported. They are all described in section NLP and tokenization.

Percolate table

A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."

The schema of a percolate table is fixed and contains the following fields:

Field Description
ID An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule
Query Full-text query of the rule, which can be thought of as the value of MATCH clause or JSON /search. If per field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table.
Filters Optional. Filters are an optional string containing attribute filters and/or expressions, defined the same way as in the WHERE clause or JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table.
Tags Optional. Tags represent a list of string labels separated by commas that can be used for filtering/deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query

Note that you do not need to add the above fields when creating a percolate table.

What you need to keep in mind when creating a new percolate table is to specify the expected schema of a document, which will be checked against the rules you will add later. This is done in the same way as for any other local table.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • typescript
  • go
  • CONFIG
📋
Creating a percolate table via MySQL protocol:
CREATE TABLE products(title text, meta json) type='pq';
‹›
Response
Query OK, 0 rows affected (0.00 sec)