bees Configuration

The only configuration parameter that must be provided is the hash table size. Other parameters are optional or hardcoded, and the defaults are reasonable in most cases.

Hash Table Sizing

Hash table entries are 16 bytes per data block. The hash table stores the most recently read unique hashes. Once the hash table is full, each new entry added to the table evicts an old entry. This makes the hash table a sliding window over the most recently scanned data from the filesystem.
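The sliding-window behaviour can be illustrated with a toy model. This is only a sketch: the class name `SlidingHashTable` is hypothetical, and the oldest-first eviction shown here is an assumption made for simplicity, not a description of how bees actually lays out its table or chooses which entry to evict.

```python
from collections import OrderedDict


class SlidingHashTable:
    """Toy fixed-capacity table: inserting into a full table evicts the oldest entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # block hash -> block address, oldest first

    def insert(self, block_hash, address):
        if block_hash in self.entries:
            return  # only unique hashes are stored
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # table is full: drop the oldest entry
        self.entries[block_hash] = address

    def lookup(self, block_hash):
        # Returns None once the hash has slid out of the window.
        return self.entries.get(block_hash)


table = SlidingHashTable(capacity=2)
table.insert("hash-a", 0x1000)
table.insert("hash-b", 0x2000)
table.insert("hash-c", 0x3000)  # evicts "hash-a"
```

After the third insert, a duplicate of the block behind `hash-a` would no longer be found, which is why a larger table matches duplicates that are further apart in scan order.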

Here are some numbers to estimate appropriate hash table sizes:

unique data size | hash table size | average dedupe extent size
      1TB        |       4GB       |         4K
      1TB        |       1GB       |        16K
      1TB        |     256MB       |        64K
      1TB        |     128MB       |       128K <- recommended
      1TB        |      16MB       |      1024K
     64TB        |       1GB       |      1024K
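The rows above are consistent with a simple rule of thumb: one 16-byte entry per average-sized dedupe extent of unique data. The helper below is a sketch of that arithmetic, inferred from the table and the entry size stated earlier. The function name `hash_table_bytes` is hypothetical, and the table's "TB", "GB", "MB", and "K" are assumed to be binary units.

```python
ENTRY_BYTES = 16  # size of one hash table entry, as stated above

KiB, MiB, GiB, TiB = 2**10, 2**20, 2**30, 2**40


def hash_table_bytes(unique_data_bytes, avg_extent_bytes):
    """Estimate the table size: one entry per average-sized dedupe extent."""
    entries = unique_data_bytes // avg_extent_bytes
    return entries * ENTRY_BYTES


# Reproduce a few rows of the table.
for data, extent in [(TiB, 4 * KiB), (TiB, 128 * KiB), (64 * TiB, 1024 * KiB)]:
    size = hash_table_bytes(data, extent)
    print(f"{data // TiB}TB unique data, {extent // KiB}K extents: {size // MiB}MB")
```

Read the other way around, a fixed table size determines the smallest duplicate extent that bees can be expected to find reliably for a given amount of unique data.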

Notes:

Factors affecting optimal hash table size

It is difficult to predict the net effect of data layout and access patterns on dedupe effectiveness without performing deep inspection of both the filesystem data and its structure, a task that is as expensive as performing the deduplication itself.

Scanning modes

The --scan-mode option affects how bees iterates over the filesystem, schedules extents for scanning, and tracks progress.

There are now two kinds of scan mode: the legacy subvol scan modes, and the new extent scan mode.

Scan mode can be changed by restarting bees with a different scan mode option.

Extent scan mode:

Subvol scan modes:

The default scan mode is 4, “extent”.

If you are using bees for the first time on a filesystem with many existing snapshots, you should read about snapshot gotchas.

Subvol scan modes

Subvol scan modes are maintained for compatibility with existing installations, but will not be developed further. New installations should use extent scan mode instead.

The quantity of text below detailing the shortcomings of each subvol scan mode should be informative all by itself.

Subvol scan modes work on any kernel version supported by bees. They are the only scan modes usable on kernel 4.14 and earlier.

The difference between the subvol scan modes is the order in which the files from different subvols are fed into the scanner. They all scan files in inode number order, from low to high offset within each inode, the same way that a program like cat would read files (but skipping over old data from earlier btrfs transactions).

If a filesystem has only one subvolume with data in it, then all of the subvol scan modes are equivalent. In this case, there is only one subvolume to scan, so every possible ordering of subvols is the same.

The --workaround-btrfs-send option pauses scanning subvols that are read-only. If the subvol is made read-write (e.g. with btrfs prop set $subvol ro false), or if the --workaround-btrfs-send option is removed, then the scan of that subvol is unpaused and dedupe proceeds normally. Space will only be recovered when the last read-only subvol is deleted.

Subvol scan modes cannot efficiently or accurately calculate an ETA for completion or estimate progress through the data. They simply request “the next new inode” from btrfs, and they are completed when btrfs says there is no next new inode.

Between subvols, there are several scheduling algorithms with different trade-offs:

Scan mode 0, “lockstep”, scans the same inode number in each subvol at close to the same time. This is useful if the subvols are snapshots with a common ancestor, since the same inode number in each subvol will have similar or identical contents. This maximizes the likelihood that all of the references to a snapshot of a file are scanned at close to the same time, improving dedupe hit rate. If the subvols are unrelated (i.e. not snapshots of a single subvol) then this mode does not provide any significant advantage. This mode uses smaller amounts of temporary space for shorter periods of time when most subvols are snapshots. When a new snapshot is created, this mode stops scanning the other subvols and scans only the new snapshot until it reaches the same inode number as the others. This effectively pauses dedupe temporarily, because the data in the new snapshot has already been scanned and deduped in the other snapshots.

Scan mode 1, “independent”, scans the next inode with new data in each subvol. There is no coordination between the subvols, other than round-robin distribution of files from each subvol to each worker thread. This mode makes continuous forward progress in all subvols. When a new snapshot is created, previous subvol scans continue as before, but the worker threads are now divided among one more subvol.

Scan mode 2, “sequential”, scans one subvol at a time, in numerical subvol ID order, processing each subvol completely before proceeding to the next subvol. This avoids spending time scanning short-lived snapshots that will be deleted before they can be fully deduped (e.g. those used for btrfs send). Scanning starts on older subvols that are more likely to be origin subvols for future snapshots, eliminating the need to dedupe future snapshots separately. This mode uses the largest amount of temporary space for the longest time, and typically requires a larger hash table to maintain dedupe hit rate.

Scan mode 3, “recent”, scans the subvols with the highest min_transid value first (i.e. the ones that were most recently completely scanned), then falls back to “independent” mode to break ties. This interrupts long scans of old subvols to give a rapid dedupe response to new data in previously scanned subvols, then returns to the old subvols after the new data is scanned.
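The ordering differences between the first three modes can be shown with a toy model. This is a sketch under simplifying assumptions, not the bees scheduler: each subvol is reduced to a list of inode numbers with new data, worker threads and transaction IDs are ignored, and the function names are hypothetical. Mode 3 is omitted because it depends on min_transid bookkeeping that this model does not represent.

```python
from itertools import zip_longest


def lockstep(subvols):
    """Mode 0 (toy): visit the lowest pending inode number across all subvols first."""
    pending = sorted((inode, sv) for sv, inodes in subvols.items() for inode in inodes)
    return [(sv, inode) for inode, sv in pending]


def independent(subvols):
    """Mode 1 (toy): round-robin, taking the next inode from each subvol in turn."""
    queues = [[(sv, inode) for inode in sorted(inodes)]
              for sv, inodes in sorted(subvols.items())]
    order = []
    for batch in zip_longest(*queues):
        order.extend(item for item in batch if item is not None)
    return order


def sequential(subvols):
    """Mode 2 (toy): finish each subvol completely, in numerical subvol ID order."""
    return [(sv, inode) for sv in sorted(subvols) for inode in sorted(subvols[sv])]


# Hypothetical layout: subvol ID -> inode numbers with new data.
subvols = {256: [257, 258, 300], 257: [257, 258], 258: [400]}
for mode in (lockstep, independent, sequential):
    print(mode.__name__, mode(subvols))
```

In this example, only lockstep places inode 257 of subvol 256 directly next to inode 257 of subvol 257, which is the property that helps when the subvols are snapshots of each other.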

Extent scan mode

Scan mode 4, “extent”, scans the extent tree instead of the subvol trees. Extent scan mode reads each extent once, regardless of the number of reflinks or snapshots. It adapts to the creation of new snapshots immediately, without having to revisit old data.

In the extent scan mode, extents are separated into multiple size tiers to prioritize large extents over small ones. Deduping large extents keeps the metadata update cost low per block saved, resulting in faster dedupe at the start of a scan cycle. This is important for maximizing performance in use cases where bees runs for a limited time, such as during an overnight maintenance window.

Once the larger size tiers are completed, the rate of space recovery drops significantly. It may be desirable to stop bees once the larger size tiers are finished, then start it again some time later after new data has appeared.

Each extent is mapped in physical address order, and all extent references are submitted to the scanner at the same time, resulting in much better cache behavior and dedupe performance compared to the subvol scan modes.

The “extent” scan mode is not usable on kernels before 4.15 because it relies on the LOGICAL_INO_V2 ioctl added in that kernel release. When using bees with an older kernel, only subvol scan modes will work.
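A wrapper script could apply this rule before choosing a scan mode. The sketch below assumes a Linux host whose kernel release string begins with `major.minor`. The function names are hypothetical, and falling back to mode 1 is an arbitrary choice made for illustration, since any subvol scan mode works on older kernels. Vendor kernels with backported features are not accounted for.

```python
import platform
import re


def supports_extent_scan(release):
    """Return True if the kernel release string is 4.15 or newer."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (4, 15)


def pick_scan_mode(release=None):
    """Pick mode 4 ("extent") when available, otherwise a subvol scan mode."""
    release = release or platform.release()  # e.g. "5.15.0-91-generic" on Linux
    # The fallback value 1 ("independent") is an assumption of this sketch.
    return 4 if supports_extent_scan(release) else 1
```

For example, `pick_scan_mode("4.14.336")` selects a subvol scan mode, while `pick_scan_mode("6.1.0")` selects extent scan mode.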

Extents are divided into virtual subvols by size, using reserved btrfs subvol IDs 250..255. The size tier groups are:

Extent scan mode can efficiently calculate dedupe progress within the filesystem and estimate an ETA for completion within each size tier; however, the accuracy of the ETA can be questionable due to the non-uniform distribution of block addresses in a typical user filesystem.

Older versions of bees do not recognize the virtual subvols, so running an old bees version after running a new bees version will reset the “extent” scan mode’s progress in beescrawl.dat to the beginning. This may change in future bees releases, i.e. extent scans will store their checkpoint data somewhere else.

The --workaround-btrfs-send option behaves differently in extent scan mode: dedupe proceeds on all subvols that are read-write, but all subvols that are read-only are excluded from dedupe. Space will only be recovered when the last read-only subvol is deleted.

During btrfs send, duplicate extents in the sent subvol are not removed (the kernel rejects dedupe commands while send is active, and bees currently does not re-issue them after the send is complete). It may be preferable to terminate the bees process while running btrfs send in extent scan mode, and restart bees after the send is complete.

Threads and load management

By default, bees creates one worker thread for each CPU detected. These threads then perform scanning and dedupe operations. The number of worker threads can be set with the --thread-count and --thread-factor options.

If desired, bees can automatically increase or decrease the number of worker threads in response to system load. This reduces impact on the rest of the system by pausing bees when other CPU and IO intensive loads are active on the system, and resumes bees when the other loads are inactive. This is configured with the --loadavg-target and --thread-min options.
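The general idea of load-based throttling can be sketched as a simple feedback step. This is not the controller bees implements, which is not described here: the function `next_thread_count` and its one-thread-at-a-time adjustment are assumptions chosen for illustration. Only the roles of the load average target and the minimum thread count come from the options named above.

```python
def next_thread_count(active, load, loadavg_target, thread_min, thread_max):
    """Toy feedback step: shed a worker when load is above target, add one when below."""
    if load > loadavg_target:
        active -= 1
    elif load < loadavg_target:
        active += 1
    # Never drop below the configured minimum or exceed the configured maximum.
    return max(thread_min, min(thread_max, active))


# Simulated load samples on a host with 8 worker threads and a target of 5.0.
active = 8
for load in (9.0, 7.5, 6.0, 4.0, 2.0):
    active = next_thread_count(active, load, loadavg_target=5.0, thread_min=1, thread_max=8)
    print(f"load {load}: {active} active threads")
```

On a real system the load sample could come from `os.getloadavg()`. A minimum of 0 would let such a controller pause all workers while other loads are active.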

Log verbosity

bees can be made less chatty with the --verbose option.