Transaction data sharding

Transaction data sharding means splitting the transaction journal into multiple files. Tackler supports journal sharding with both the Filesystem and Git storage backends.

It is also possible to store each transaction in its own file; this is the so-called "single transaction - single file" mode. This sharding mode is used by the performance tests, and it is recommended if transaction data is generated by an automated system.

With the "single transaction - single file" mode, it is also recommended to add a UUID to the transaction metadata and to use the same UUID as part of the file name. The transaction UUID is printed in the Register Report, so using UUIDs makes it easier to find the actual transaction file should the need arise.
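
As a minimal sketch of why the shared UUID helps, the lookup below finds a journal file by the UUID shown in a report. It assumes a naming convention of "<uuid>.txn" somewhere under a journal root directory; the function name find_txn_file and that exact naming are illustrative assumptions, not something Tackler requires.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the path of the journal file whose name is
# "<uuid>.txn", searching all shards below the given root directory.
# Assumption: one transaction per file, file named after the txn UUID.
find_txn_file() {
    local root=$1 uuid=$2
    find "$root" -type f -name "${uuid}.txn"
}
```

For example, find_txn_file txns 506a2d55-2375-4d51-af3a-cf5021f04de9 would print the matching file path regardless of which shard directory holds it.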

Sharding schemes

The two most common sharding schemes are time-based and topic-based sharding.

Examples of time-based shards:

  • year/month (e.g. txns/2019/01)

  • year/month/day (e.g. txns/2019/01/31)

  • year/iso-week (e.g. txns/2019/W10)

  • year/iso-week/iso-week-date (e.g. Monday 2017-01-02 → txns/2017/W01/1)
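
The time-based layouts above can be derived from a transaction date. The following is a sketch only: it assumes GNU date (for the -d option), and the function name shard_path, the scheme names and the "txns" root are illustrative. Note the use of %G (ISO week-based year) rather than %Y for the week schemes, so that dates around New Year land in the correct ISO year.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map a date (YYYY-MM-DD) to a shard directory.
# Assumes GNU date; scheme names and the "txns" root are illustrative.
shard_path() {
    local scheme=$1 day=$2 fmt
    case "$scheme" in
        month)    fmt='%Y/%m' ;;       # year/month
        day)      fmt='%Y/%m/%d' ;;    # year/month/day
        week)     fmt='%G/W%V' ;;      # ISO year / ISO week
        week-day) fmt='%G/W%V/%u' ;;   # ISO year / ISO week / ISO weekday (1 = Monday)
        *) echo "unknown scheme: $scheme" >&2; return 1 ;;
    esac
    # -u evaluates the date in UTC, which avoids local-time DST gaps at midnight
    date -u -d "$day" "+txns/$fmt"
}
```

With this sketch, shard_path week-day 2017-01-02 yields txns/2017/W01/1, matching the Monday example in the list.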

Examples of topic-based shards, by customer:

  • txns/Customers/ACME

  • txns/Customers/Initech

Tackler doesn’t care how you shard txn data, or whether you shard it at all. However, sharding makes a lot of sense with the Git storage backend and when there is a lot of data. If transactions are generated automatically, it's recommended to use the single transaction - single file model and to shard the data.

Regardless of the sharding scheme used, it is possible to group txns with the different group-by operators of the Balance Group report.

Using shards to select a subset of transaction data

A subset of transactions can be selected either with Transaction Filters or with shards.

The major difference is that with Transaction Filters all data is parsed first and filtered afterwards. With a sharding scheme, the "filtering" happens before the journal files are even parsed. On the other hand, sharding lacks the fancy filtering options of Transaction Filters.

File scanning and glob matching start from input.fs.dir and descend from there.
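
To preview which files a shard selection would cover before running a report, a recursive listing can help. This is a rough sketch, not Tackler's own scanning logic: it uses find to approximate a recursive "**.txn" match below a starting directory, and the function name list_shard_txns is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list all *.txn files below a shard directory,
# descending into subdirectories. This only approximates what a
# recursive glob starting from that directory would select.
list_shard_txns() {
    local shard_dir=$1
    # LC_ALL=C gives a stable, byte-wise sort order
    find "$shard_dir" -type f -name '*.txn' | LC_ALL=C sort
}
```

For instance, list_shard_txns txns/2019/W10 lists only the journal files of that week; files in other shards are never touched.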

From a performance point of view, sharding becomes beneficial perhaps after tens or hundreds of thousands of transactions. This is heavily affected by the operating system, filesystem and hardware used. See Performance Testing for further details.

Example of week-based sharding

With sharded data and a glob pattern it is very easy to generate reports from only a selected set of accounting data. For example, with shards based on ISO week it is possible to generate weekly reports with the following piece of Bash shell code:

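# exe_dir is not defined in the original snippet; it is assumed here
# to be the directory that contains this script.
exe_dir=$(cd "$(dirname "$0")" && pwd)

report_year=$1
report_week=$2
# Drop year and week so that "$@" holds only the remaining arguments
shift 2

# Read only the shard of the selected ISO week, e.g. txns/2019/W10
java \
   -jar tackler-cli.jar \
   --basedir="$exe_dir/.." \
   --input.fs.dir="txns/$report_year/W$report_week" \
   --input.fs.glob="**.txn" \
   "$@"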

It is possible to combine the use of environment variables with Tackler's configuration system.

See the external documentation of the configuration library for examples and details: Config: system or environment variable overrides.