Description
Description
Particularly in situations where:
- We know that the addition cannot overflow (e.g. when we are using intermediate types that are larger than the inputs)
- We are processing in a loop that can be optimized with AVX2
The manual implementation of div_ceil
, i.e. (a + b + 1) / 2
, allows the compiler to use the vpavgb
instruction. Meanwhile, div_ceil
's implementation prevents the compiler from generating such optimized assembly and results in multiple instructions--in the case I'm using as an example, a sequence of vmovdqu, vpxor, vpand, vpaddb, vpsrlw, vpand, vpaddb, vpsubb
. Not surprisingly, the sequence of 8 instructions is slower than the one single instruction. In the following real-world example function, replacing the usages of (a + b + 1) / 2
with (a + b).div_ceil(2)
results in about a 5% function-wide performance degradation in benchmarks.
#[target_feature(enable = "avx2")]
unsafe fn refine_horizontal_bicubic_u8(
src: *const u8,
dest: *mut u8,
pitch: NonZeroUsize,
width: NonZeroUsize,
height: NonZeroUsize,
bits_per_sample: NonZeroU8,
) {
let pixel_max = (1u16 << bits_per_sample.get()) - 1;
let width_val = width.get();
let pitch_val = pitch.get();
for j in 0..height.get() {
let row_offset = j * pitch_val;
let src_row = unsafe { src.add(row_offset) };
let dest_row = unsafe { dest.add(row_offset) };
// First pixel: linear interpolation
let a = unsafe { *src_row.add(0) } as u16;
let b = unsafe { *src_row.add(1) } as u16;
unsafe { *dest_row.add(0) = ((a + b + 1) / 2) as u8 };
// Handle middle pixels individually - SIMD implementation would be more complex
// and needs careful boundary checking
for i in 1..(width_val - 3) {
let a = unsafe { *src_row.add(i - 1) } as i16;
let b = unsafe { *src_row.add(i) } as i16;
let c = unsafe { *src_row.add(i + 1) } as i16;
let d = unsafe { *src_row.add(i + 2) } as i16;
let result = (-(a + d) + (b + c) * 9 + 8) >> 4;
unsafe {
*dest_row.add(i) = std::cmp::min(pixel_max, std::cmp::max(0, result) as u16) as u8;
}
}
// Second-to-last pixels: linear interpolation
for i in (width_val - 3)..(width_val - 1) {
let a = unsafe { *src_row.add(i) } as u16;
let b = unsafe { *src_row.add(i + 1) } as u16;
unsafe { *dest_row.add(i) = ((a + b + 1) / 2) as u8 };
}
// Last pixel: copy
unsafe { *dest_row.add(width_val - 1) = *src_row.add(width_val - 1) };
}
}
The justification for manual_div_ceil
being "warn" by default as part of the complexity
group is that "It’s simpler, clearer and more readable." This, also, is not something I 100% agree with. You could argue it is preferable because it avoids overflows, but there are also cases where we know an overflow cannot occur. As such, given the above performance issues, I'd favor moving this to another group such as pedantic
and making it "allow" by default.
Version
rustc 1.87.0 (17067e9ac 2025-05-09)
binary: rustc
commit-hash: 17067e9ac6d7ecb70e50f92c1944e545188d2359
commit-date: 2025-05-09
host: x86_64-unknown-linux-gnu
release: 1.87.0
LLVM version: 20.1.1
Additional Labels
@rustbot label +A-category +S-needs-discussion