Skip to content

manual_div_ceil suggestion can result in worse generated code #14944

Open
@shssoichiro

Description

@shssoichiro

Description

Particularly in situations where:

  • We know that the addition cannot overflow (e.g. when we are using intermediate types that are larger than the inputs)
  • We are processing in a loop that can be optimized with AVX2

The manual implementation of div_ceil, i.e. (a + b + 1) / 2, allows the compiler to use the vpavgb instruction. Meanwhile, div_ceil's implementation prevents the compiler from generating such optimized assembly and results in multiple instructions--in the case I'm using as an example, a sequence of vmovdqu, vpxor, vpand, vpaddb, vpsrlw, vpand, vpaddb, vpsubb. Not surprisingly, the sequence of 8 instructions is slower than the one single instruction. In the following real-world example function, replacing the usages of (a + b + 1) / 2 with (a + b).div_ceil(2) results in about a 5% function-wide performance degradation in benchmarks.

#[target_feature(enable = "avx2")]
unsafe fn refine_horizontal_bicubic_u8(
    src: *const u8,
    dest: *mut u8,
    pitch: NonZeroUsize,
    width: NonZeroUsize,
    height: NonZeroUsize,
    bits_per_sample: NonZeroU8,
) {
    let pixel_max = (1u16 << bits_per_sample.get()) - 1;
    let width_val = width.get();
    let pitch_val = pitch.get();

    for j in 0..height.get() {
        let row_offset = j * pitch_val;
        let src_row = unsafe { src.add(row_offset) };
        let dest_row = unsafe { dest.add(row_offset) };

        // First pixel: linear interpolation
        let a = unsafe { *src_row.add(0) } as u16;
        let b = unsafe { *src_row.add(1) } as u16;
        unsafe { *dest_row.add(0) = ((a + b + 1) / 2) as u8 };

        // Handle middle pixels individually - SIMD implementation would be more complex
        // and needs careful boundary checking
        for i in 1..(width_val - 3) {
            let a = unsafe { *src_row.add(i - 1) } as i16;
            let b = unsafe { *src_row.add(i) } as i16;
            let c = unsafe { *src_row.add(i + 1) } as i16;
            let d = unsafe { *src_row.add(i + 2) } as i16;
            let result = (-(a + d) + (b + c) * 9 + 8) >> 4;
            unsafe {
                *dest_row.add(i) = std::cmp::min(pixel_max, std::cmp::max(0, result) as u16) as u8;
            }
        }

        // Second-to-last pixels: linear interpolation
        for i in (width_val - 3)..(width_val - 1) {
            let a = unsafe { *src_row.add(i) } as u16;
            let b = unsafe { *src_row.add(i + 1) } as u16;
            unsafe { *dest_row.add(i) = ((a + b + 1) / 2) as u8 };
        }

        // Last pixel: copy
        unsafe { *dest_row.add(width_val - 1) = *src_row.add(width_val - 1) };
    }
}

The justification for manual_div_ceil being "warn" by default as part of the complexity group is that "It’s simpler, clearer and more readable." This, also, is not something I 100% agree with. You could argue it is preferable because it avoids overflows, but there are also cases where we know an overflow cannot occur. As such, given the above performance issues, I'd favor moving this to another group such as pedantic and making it "allow" by default.

Version

rustc 1.87.0 (17067e9ac 2025-05-09)
binary: rustc
commit-hash: 17067e9ac6d7ecb70e50f92c1944e545188d2359
commit-date: 2025-05-09
host: x86_64-unknown-linux-gnu
release: 1.87.0
LLVM version: 20.1.1

Additional Labels

@rustbot label +A-category +S-needs-discussion

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-categoryArea: Categorization of lintsS-needs-discussionStatus: Needs further discussion before merging or work can be started

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions