Conversation

@zatricky commented Aug 25, 2024

Maintenance tasks may starve other, more urgent workloads of system IO. This PR enables cgroup IO resource limits when the maintenance tasks are run via systemd timers.

Notes:

  • This needs more testing, especially on other distributions. I've tested it with balance operations on Fedora 40 (systemd 255.10-3.fc40).
  • This works for systemd, but I'm not sure how best to achieve the same for cron. Wrapping the commands in systemd-run seems redundant, since that temporarily creates a service for each run anyway.
  • I am not sure whether the defaults I have suggested are good. I based them on the limits I feel would be appropriate for spindles, where I expect high IO demand from non-maintenance services and don't mind if the maintenance tasks take a very long time to complete.
  • I am not sure if there is a good way to configure different limits for different disk classes. For example, if you have a RAID1 OS filesystem on SSDs and a large RAID5 backup filesystem on spindles, it would be useful to apply different sets of IO limits to each.
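
For context, the kind of per-device IO properties involved could be applied to one of the maintenance services (for example btrfs-balance.service) with a drop-in roughly like the following; the path, device, and values below are illustrative only, not the defaults proposed by this PR:

# /etc/systemd/system/btrfs-balance.service.d/iolimits.conf (example drop-in)
[Service]
IOAccounting=yes
IOReadBandwidthMax=/dev/dm-0 10M
IOWriteBandwidthMax=/dev/dm-0 10M
IOReadIOPSMax=/dev/dm-0 60
IOWriteIOPSMax=/dev/dm-0 60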

To view these cgroup limits in action outside of a regular service, you can wrap a command with systemd-run. For example, the following is a balance with -musage=30 on a two-disk filesystem on /dev/dm-0 and /dev/dm-1, with per-device IOPS limits of 60 and bandwidth limits of 10 MB/s:

$ systemd-run \
    --property="IOReadBandwidthMax=/dev/dm-0 10M" --property="IOWriteBandwidthMax=/dev/dm-0 10M" \
    --property="IOReadIOPSMax=/dev/dm-0 60" --property="IOWriteIOPSMax=/dev/dm-0 60" \
    --property="IOReadBandwidthMax=/dev/dm-1 10M" --property="IOWriteBandwidthMax=/dev/dm-1 10M" \
    --property="IOReadIOPSMax=/dev/dm-1 60" --property="IOWriteIOPSMax=/dev/dm-1 60" \
    btrfs balance start -musage=30 /
Running as unit: run-r0fa03384626b4245b07857fc38089744.service; invocation ID: 7869d843bd814cb7bc1e5db9d35bc46f
$ cat /sys/fs/cgroup/system.slice/run-r0fa03384626b4245b07857fc38089744.service/io.max
252:0 rbps=10000000 wbps=10000000 riops=60 wiops=max
252:128 rbps=10000000 wbps=10000000 riops=60 wiops=max
$ cat /sys/fs/cgroup/system.slice/run-r0fa03384626b4245b07857fc38089744.service/io.stat
253:6 rbytes=1114112 wbytes=15321464832 rios=68 wios=117555 dbytes=0 dios=0
.... many similar lines here in my system
$ journalctl -u run-r0fa03384626b4245b07857fc38089744.service
Aug 25 15:55:25 <hostname> systemd[1]: Started run-r0fa03384626b4245b07857fc38089744.service - /usr/sbin/btrfs balance start -musage=30 /.
Aug 25 15:55:52 <hostname> btrfs[86830]: Done, had to relocate 3 out of 12787 chunks
Aug 25 15:55:52 <hostname> systemd[1]: run-r0fa03384626b4245b07857fc38089744.service: Deactivated successfully.
Aug 25 15:55:52 <hostname> systemd[1]: run-r0fa03384626b4245b07857fc38089744.service: Consumed 17.403s CPU time.

Further to the above example, lsblk shows the corresponding device numbers 252:0 and 252:128 for dm-0 and dm-1 respectively.
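
These major:minor numbers can be cross-checked against the io.max entries with lsblk's KNAME and MAJ:MIN columns, e.g.:

$ lsblk --nodeps -o KNAME,MAJ:MIN /dev/dm-0 /dev/dm-1
KNAME MAJ:MIN
dm-0  252:0
dm-1  252:128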

@zatricky force-pushed the systemd-cgroup-iolimits branch from b5e4284 to 682684a on August 25, 2024 at 17:27
@zatricky force-pushed the systemd-cgroup-iolimits branch from 682684a to 87a1c72 on August 25, 2024 at 18:55
@zatricky (Author) commented Aug 26, 2024

Limits tested and working for btrfs-scrub as well.

@zatricky (Author) commented Sep 9, 2024

I have tested and confirmed that this also works for btrfs-defrag. I'm not sure whether it applies to btrfs-trim, so perhaps the insertion of the IO limit configs should specifically skip the trim service.

@kdave I'd appreciate any comments on this PR, especially regarding testing and what else should be done to get it ready to merge.

@kdave (Owner) commented Aug 18, 2025

I think the difficult part here is how to do the configuration. Systemd needs the raw device paths; it may be cumbersome for the user to extract them, and the devices could change over time (via device add/remove). Ideally there's a helper that lists the devices of a given mount point before running the command.
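
One possible shape for such a helper, sketched here as a one-liner rather than anything in this PR, is to resolve a mount point to its member devices via btrfs filesystem show (output shown for the two-device filesystem from the example above):

$ btrfs filesystem show / | awk '/ path /{print $NF}'
/dev/dm-0
/dev/dm-1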

Next, where to store the configuration for each filesystem. Generated unit files with the IO limits make more sense, as the sysconfig file does not seem suitable for that beyond a global on/off switch for whether to apply the limits if configured.

For the disk classes I think this needs some user interaction; the class can be guessed from sysfs but should still be confirmed, as there's more to it than HDD/SSD/NVMe. There could be a helper tool to gather the information and create the timer config overrides.
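
For the sysfs guess, the rotational flag is one readily available signal, though it only separates spinning disks from everything else (the device name here is just an example):

$ cat /sys/block/sda/queue/rotational
1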

Regarding cron, I'm not sure it is still in use; in the beginning it was meant to be temporary, as systemd was not available everywhere.

@zatricky (Author) commented Sep 8, 2025

I've been putting some thought into this for a while, but I don't yet have an answer I'm sure of. Below are my current "good enough for now" thoughts:

You are right that the ideal config is not catered for, due to the complexity. My initial thought for making that complexity intuitive would be to put the config into a .json, .toml, or .yaml file. We could specify only the mountpoints and the wanted limits; a refresh script could then figure out all the block device information automatically (a rough sketch follows). Alternatively we could specify only "disk-type" limits that match on rotational vs non-rotational, or perhaps on disk id/model/serial/etc.
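
Purely to illustrate the shape such a config could take (hypothetical file name and keys, nothing implemented in this PR), a TOML variant might look like:

# /etc/btrfsmaintenance/iolimits.toml (hypothetical)
[[filesystem]]
mountpoint = "/"
read_bandwidth_max = "10M"
write_bandwidth_max = "10M"
read_iops_max = 60
write_iops_max = 60
# a "disk-type" variant could instead match on rotational vs non-rotational here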

Perhaps the refresh timer could also trigger at boot, since that is another well-known point at which disk paths change.

The above could work well once the systemd rules are overridden, but then, as you noted, any disks dynamically added or removed will have the wrong limits applied until the refresh runs. I'd consider this an acceptable caveat as long as it is documented.

Please let me know if you like this path or if you have suggestions. :-)
