Transaction Data Sharding
Transaction data sharding means splitting the transaction journal into multiple files. Tackler supports journal sharding with both the Filesystem and Git storage backends.
It is also possible to store each transaction in its own file; this is the so-called "single transaction - single file" mode. This sharding mode is used by the performance tests, and it is recommended if transaction data is generated by an automated system.
With "single transaction - single file" mode, it is also recommended to add a UUID to the transaction metadata and to use the same UUID as part of the file name. The transaction UUID is printed in the Register Report, so using UUIDs makes it easier to find the actual transaction file, should there be any need to do so.
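As a rough illustration of the "single transaction - single file" idea, the sketch below writes one transaction into a file named after its UUID. The helper name write_txn is hypothetical and not part of Tackler. The header layout (date, quoted description, "# uuid:" metadata line) is an assumption about the journal format and should be checked against the Journal Format documentation. The ".txn" extension matches the input.fs.ext value used elsewhere on this page.

```shell
#!/bin/sh
# write_txn: hypothetical helper, not part of Tackler.
# Writes a single transaction into <shard-dir>/<uuid>.txn and prints the path.
# The posting lines are read from stdin and copied verbatim.
# Usage: write_txn <shard-dir> <uuid> <date> <description> < postings
write_txn() {
    shard_dir=$1
    uuid=$2
    txn_date=$3
    description=$4

    mkdir -p "$shard_dir" || return 1
    txn_file="${shard_dir}/${uuid}.txn"

    {
        # Transaction header; the layout is an assumption, see the lead-in.
        printf "%s '%s\n" "$txn_date" "$description"
        # The same UUID goes into the metadata and into the file name.
        printf ' # uuid: %s\n' "$uuid"
        # Postings from stdin.
        cat
    } > "$txn_file" || return 1

    printf '%s\n' "$txn_file"
}
```

A generator could obtain the UUID from a tool such as uuidgen (assuming it is installed) and pipe the postings in, for example: printf ' Expenses:Food  10\n Assets:Cash\n' | write_txn txns/2019/01 "$(uuidgen)" 2019-01-31 Lunch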
Sharding schemes
The two most common sharding schemes are time-based and topic-based sharding.
Example of time based shards:
- year/month (e.g. txns/2019/01/)
- year/month/day (e.g. txns/2019/01/31)
- year/iso-week (e.g. txns/2019/W10)
- year/iso-week/iso-week-date (e.g. Monday 2017-01-02 → txns/2017/W01/1)
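The time-based layouts listed above can be derived mechanically from a transaction date. The following is a minimal sketch, assuming GNU date (the -d option is not portable to every system). The helper shard_path and its scheme names are hypothetical and chosen only for this illustration.

```shell
#!/bin/sh
# shard_path: hypothetical helper, not part of Tackler.
# Prints the shard directory for a date under one of the time-based schemes.
# Assumes GNU date (uses the -d option).
# Usage: shard_path <month|day|iso-week|iso-week-date> <YYYY-MM-DD>
shard_path() {
    scheme=$1
    txn_date=$2

    case "$scheme" in
        month)         fmt='%Y/%m' ;;
        day)           fmt='%Y/%m/%d' ;;
        # %G is the ISO week-based year, which can differ from %Y
        # around New Year; %V is the ISO week number, %u the ISO weekday.
        iso-week)      fmt='%G/W%V' ;;
        iso-week-date) fmt='%G/W%V/%u' ;;
        *)
            echo "unknown sharding scheme: $scheme" >&2
            return 1
            ;;
    esac

    # -u keeps the calculation independent of the local time zone.
    printf 'txns/%s\n' "$(date -u -d "$txn_date" "+$fmt")"
}
```

For example, shard_path iso-week-date 2017-01-02 prints txns/2017/W01/1. The ISO week schemes use the week-based year (%G) rather than the calendar year, so Sunday 2017-01-01 lands in txns/2016/W52.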
Example of topic based shards by customers:
- txns/Customers/ACME
- txns/Customers/Initech
Tackler doesn’t care whether or how you shard transaction data. However, sharding makes a lot of sense with the Git storage backend and when there is a lot of data. If transactions are generated automatically, it’s recommended to use the single transaction - single file model and to shard the data.
Regardless of the sharding scheme used, it is possible to group transactions with the different group-by operators of the Balance Group report.
Using shards to select subset of transaction data
A subset of transactions can be selected either with Transaction Filters or by using shards.
The major difference is that with Transaction Filters all data is first parsed and only then filtered. With a sharding scheme, the "filtering" happens before the journal files are even parsed. On the other hand, sharding lacks all the fancy filtering options.
File scanning starts from the top-level directory identified by the input.fs.dir setting.
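Because scanning starts at input.fs.dir, pointing that setting at a shard directory limits the report to that shard. The sketch below applies this to the topic-based customer layout. The wrapper report_for_customer is hypothetical. The --config, --input.fs.dir and --input.fs.ext options, the journal.toml file name and the txns/Customers/ layout are the ones used in the examples on this page.

```shell
#!/bin/sh
# report_for_customer: hypothetical wrapper, not part of Tackler.
# Runs tackler against a single customer shard, so that only files below
# txns/Customers/<customer> are scanned and parsed.
# Usage: report_for_customer <customer> [extra tackler options...]
report_for_customer() {
    customer=$1
    # Drop the customer name; "$@" now holds only the extra options.
    shift

    shard_dir="txns/Customers/${customer}"
    if [ ! -d "$shard_dir" ]; then
        echo "no such shard: $shard_dir" >&2
        return 1
    fi

    tackler \
        --config journal.toml \
        --input.fs.dir="$shard_dir" \
        --input.fs.ext="txn" \
        "$@"
}
```

Calling report_for_customer ACME would then report on the ACME shard only. Any further selection inside the shard would still need Transaction Filters.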
From a performance point of view, sharding probably becomes beneficial only after tens or hundreds of thousands of transactions. This is heavily affected by the operating system, filesystem and hardware used. See Performance Testing for further details.
Example of week based sharding
With data sharding it is very straightforward to generate reports from only a selected set of accounting data. For example, with shards based on ISO week, weekly reports can be generated with the following piece of shell script:
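report_year=$1   # e.g. 2019
report_week=$2   # e.g. W10, matching the shard directory name
# Drop the two positional arguments so that "$@" holds only extra tackler options
shift 2

tackler \
    --config journal.toml \
    --input.fs.dir="txns/${report_year}/${report_week}" \
    --input.fs.ext="txn" \
    "$@"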