⚡️ Speed up function `masks2poly` by 69% in PR #1586 (`tune-mask2polygon`) #1587

codeflash-ai · 2025-09-24T18:41:03Z

⚡️ This pull request contains optimizations for PR #1586

If you approve this dependent PR, these changes will be merged into the original PR branch tune-mask2polygon.

This PR will be automatically closed if the original PR is merged.

📄 69% (0.69x) speedup for `masks2poly` in `inference/core/utils/postprocess.py`

⏱️ Runtime : 18.1 milliseconds → 10.7 milliseconds (best of 234 runs)
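The headline figure can be reproduced from the two runtimes above. The following is a minimal sketch, assuming the percentage is the time saved relative to the optimized runtime; that formula is inferred from the reported numbers (it also matches per-test figures such as 18.4μs -> 8.40μs, 119% faster) and is not stated in the report. The helper name `speedup_percent` is hypothetical.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Percent speedup relative to the optimized runtime (assumed formula)."""
    return (original - optimized) / optimized * 100.0


# Runtimes reported in the PR summary, in milliseconds.
original_ms, optimized_ms = 18.1, 10.7

percent = speedup_percent(original_ms, optimized_ms)
ratio = original_ms / optimized_ms  # how many times faster the new code runs

print(f"{percent:.0f}% faster, {ratio:.2f}x the original speed")
# prints "69% faster, 1.69x the original speed"
```

Read this way, the "0.69x" in the summary is the fractional gain over the optimized runtime; the optimized code runs at roughly 1.69 times the original speed.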

📝 Explanation and details

The optimized code achieves a 69% speedup through three key performance improvements:

1. Faster empty mask detection: Replaces `np.any(m_uint8)` with `np.count_nonzero(m_uint8) == 0`. The profiler shows this reduces the most expensive line from 36.6% to 15.1% of total time. In these benchmarks `count_nonzero` is significantly faster on dense binary arrays, including the common case of empty masks. Note that `count_nonzero` always scans the whole array and does not short-circuit, so the gain comes from a cheaper scan rather than an early exit.

2. Optimized contour selection: Instead of creating a temporary array `np.array([len(x) for x in contours])` and calling `argmax()`, the code uses a simple loop to track the largest contour directly. This eliminates array allocation overhead and is particularly effective when there's only one contour (the common case). The share of profiled time spent in `mask2poly` moves from 54.8% to 56.5%, presumably because the rest of the function became cheaper, while its per-hit time improves.

3. Minor loop optimizations:

- Caches `segments.append` as a local variable to avoid repeated attribute lookups
- Stores `mask.dtype` once to avoid repeated property access
- Streamlines boolean mask handling with direct `astype(np.uint8, copy=False)`
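The contour-selection and append-caching ideas above can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the PR: plain lists stand in for OpenCV contours, and the function names `largest_by_lengths`, `largest_by_loop` and `collect` are hypothetical.

```python
from typing import Iterable, List, Sequence


def largest_by_lengths(contours: Sequence[Sequence]) -> int:
    """Baseline shape: build a temporary list of lengths, then find the max.

    Stands in for np.array([len(x) for x in contours]).argmax().
    """
    lengths = [len(c) for c in contours]
    return lengths.index(max(lengths))


def largest_by_loop(contours: Sequence[Sequence]) -> int:
    """Single pass that tracks the longest contour without a temporary."""
    best_idx, best_len = 0, -1
    for idx, contour in enumerate(contours):
        n = len(contour)
        if n > best_len:  # strict '>' keeps the first of equally long contours
            best_idx, best_len = idx, n
    return best_idx


def collect(items: Iterable) -> List:
    """Append in a loop with the bound method looked up only once."""
    out: List = []
    append = out.append  # cached so the loop skips the attribute lookup
    for item in items:
        append(item)
    return out


# Three stand-in contours with 1, 3 and 3 points.
contours = [[(0, 0)], [(0, 0), (1, 0), (1, 1)], [(2, 2), (3, 3), (4, 4)]]
print(largest_by_loop(contours), largest_by_lengths(contours))
```

The strict comparison matters for equivalence: `argmax` returns the first maximum, so a loop using `>=` would pick a different contour on ties.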

The optimizations are most effective for:

- Empty masks: 84-119% faster (common in many vision tasks)
- Large batches: 93-210% faster on batches of 100-500 masks
- Single contour cases: Avoids unnecessary array operations when only one contour exists

These improvements compound especially well in typical computer vision workflows where many masks are empty or contain simple shapes.
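To make the empty-mask check and the dtype handling described above concrete, here is a small NumPy sketch. It only demonstrates that the two emptiness checks agree and how `astype(..., copy=False)` behaves; it makes no timing claims, and the helper names `is_empty_any` and `is_empty_count` are hypothetical rather than taken from the PR.

```python
import numpy as np


def is_empty_any(mask: np.ndarray) -> bool:
    """Emptiness check in the original style."""
    return not np.any(mask)


def is_empty_count(mask: np.ndarray) -> bool:
    """Emptiness check in the optimized style."""
    return np.count_nonzero(mask) == 0


empty = np.zeros((50, 50), dtype=np.uint8)
single = empty.copy()
single[10, 20] = 255  # one foreground pixel

# Both checks agree on empty and non-empty masks.
print(is_empty_any(empty), is_empty_count(empty))
print(is_empty_any(single), is_empty_count(single))

# copy=False only avoids a copy when the dtype already matches.
same = empty.astype(np.uint8, copy=False)          # returns the input array
bool_mask = np.zeros((5, 5), dtype=bool)
converted = bool_mask.astype(np.uint8, copy=False)  # dtype differs: new array
print(same is empty, np.shares_memory(bool_mask, converted))
```

Because a bool-to-uint8 conversion still allocates, any saving from `copy=False` applies to masks that are already `uint8`.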

✅ Correctness verification report:

| Test | Status |
|------|--------|
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 41 Passed |
| 📊 Tests Coverage | 87.0% |

🌀 Generated Regression Tests and Runtime

from typing import List

import cv2
import numpy as np
# imports
import pytest  # used for our unit tests
from inference.core.utils.postprocess import masks2poly

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_single_square_mask_bool():
    # 5x5 mask with a 3x3 square in the center
    mask = np.zeros((5, 5), dtype=bool)
    mask[1:4, 1:4] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 40.3μs -> 26.0μs (55.2% faster)
    poly = polys[0]

def test_single_square_mask_uint8():
    # Same as above, but uint8 mask
    mask = np.zeros((5, 5), dtype=np.uint8)
    mask[1:4, 1:4] = 255
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 37.5μs -> 23.1μs (62.2% faster)
    poly = polys[0]

def test_multiple_masks():
    # Two masks: one square, one diagonal line
    mask1 = np.zeros((5, 5), dtype=bool)
    mask1[1:4, 1:4] = True
    mask2 = np.zeros((5, 5), dtype=np.uint8)
    np.fill_diagonal(mask2, 255)
    masks = np.stack([mask1, mask2])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 55.9μs -> 31.1μs (79.9% faster)

def test_noncontiguous_mask():
    # Mask with two disconnected squares
    mask = np.zeros((7, 7), dtype=bool)
    mask[1:3, 1:3] = True
    mask[4:6, 4:6] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 40.2μs -> 26.8μs (49.9% faster)
    poly = polys[0]

def test_mask_with_holes():
    # Mask with a hole in the middle
    mask = np.ones((5, 5), dtype=bool)
    mask[2, 2] = False
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 38.1μs -> 24.7μs (54.0% faster)
    poly = polys[0]

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_empty_mask():
    # All zeros mask
    mask = np.zeros((5, 5), dtype=bool)
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 19.0μs -> 10.3μs (84.9% faster)
    poly = polys[0]

def test_all_one_mask():
    # All ones mask
    mask = np.ones((5, 5), dtype=np.uint8) * 255
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 36.5μs -> 23.2μs (57.0% faster)
    poly = polys[0]

def test_minimal_mask():
    # 1x1 mask, single pixel
    mask = np.ones((1, 1), dtype=bool)
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 37.2μs -> 23.8μs (56.5% faster)
    poly = polys[0]

def test_mask_dtype_int32():
    # Mask with int32 dtype
    mask = np.zeros((5, 5), dtype=np.int32)
    mask[2, 2] = 1
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 42.8μs -> 28.8μs (48.5% faster)
    poly = polys[0]

def test_mask_non_contiguous_memory():
    # Mask with non-contiguous memory
    mask = np.ones((5, 5), dtype=bool)[::2]
    # Now mask is 3x5, non-contiguous
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 37.5μs -> 24.2μs (55.0% faster)
    poly = polys[0]

def test_mask_with_single_pixel():
    # Mask with only one pixel set
    mask = np.zeros((5, 5), dtype=bool)
    mask[2, 3] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 37.6μs -> 24.3μs (54.6% faster)
    poly = polys[0]

def test_mask_with_border_pixel():
    # Mask with a pixel at the border
    mask = np.zeros((5, 5), dtype=bool)
    mask[0, 0] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 37.5μs -> 23.9μs (56.9% faster)
    poly = polys[0]

def test_mask_with_large_hole():
    # Large mask with a hole in the center
    mask = np.ones((10, 10), dtype=bool)
    mask[3:7, 3:7] = False
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 38.3μs -> 24.6μs (55.8% faster)
    poly = polys[0]

def test_mask_with_multiple_components():
    # Mask with three disconnected dots
    mask = np.zeros((6, 6), dtype=bool)
    mask[1, 1] = True
    mask[3, 3] = True
    mask[5, 5] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 40.5μs -> 27.5μs (47.3% faster)
    poly = polys[0]

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_mask():
    # Large mask, single filled rectangle
    mask = np.zeros((100, 100), dtype=bool)
    mask[10:90, 10:90] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 45.3μs -> 31.8μs (42.5% faster)
    poly = polys[0]

def test_many_small_masks():
    # 100 masks, each with a single pixel at a different location
    masks = []
    for i in range(100):
        mask = np.zeros((10, 10), dtype=bool)
        mask[i // 10, i % 10] = True
        masks.append(mask)
    masks = np.stack(masks)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 1.26ms -> 625μs (101% faster)
    for i, poly in enumerate(polys):
        pass

def test_large_batch_of_rectangles():
    # 50 masks on a 20x20 grid, each with a square starting at (i, i);
    # the square is clipped for i > 15 and the mask is empty for i >= 20
    masks = []
    for i in range(50):
        mask = np.zeros((20, 20), dtype=bool)
        start = i
        mask[start:start+5, start:start+5] = True
        masks.append(mask)
    masks = np.stack(masks)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 426μs -> 215μs (98.1% faster)
    for i, poly in enumerate(polys):
        # All points should be within the rectangle bounds
        start = i

def test_large_mask_with_multiple_components():
    # Large mask with several disconnected rectangles
    mask = np.zeros((100, 100), dtype=bool)
    mask[10:20, 10:20] = True
    mask[30:40, 30:40] = True
    mask[50:60, 50:60] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 46.7μs -> 33.4μs (40.0% faster)
    poly = polys[0]

def test_large_empty_masks():
    # 100 masks, all empty
    masks = np.zeros((100, 50, 50), dtype=bool)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 482μs -> 248μs (93.8% faster)
    for poly in polys:
        pass

def test_large_mask_with_border_touching():
    # Large mask with a rectangle touching the border
    mask = np.zeros((100, 100), dtype=bool)
    mask[0:50, 0:50] = True
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 44.0μs -> 30.4μs (44.8% faster)
    poly = polys[0]

def test_large_mask_with_hole():
    # Large mask with a hole in the center
    mask = np.ones((200, 200), dtype=bool)
    mask[50:150, 50:150] = False
    masks = np.stack([mask])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 54.2μs -> 42.3μs (27.9% faster)
    poly = polys[0]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import List

# function to test
import cv2
import numpy as np
# imports
import pytest  # used for our unit tests
from inference.core.utils.postprocess import masks2poly

# unit tests

# --- Basic Test Cases ---

def test_single_square_mask_uint8():
    # Test with a single 5x5 mask with a centered 3x3 square (uint8)
    mask = np.zeros((5, 5), dtype=np.uint8)
    mask[1:4, 1:4] = 1
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 40.0μs -> 25.0μs (59.9% faster)

def test_single_square_mask_bool():
    # Test with a single 5x5 mask with a centered 3x3 square (bool)
    mask = np.zeros((5, 5), dtype=bool)
    mask[1:4, 1:4] = True
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 39.8μs -> 25.6μs (55.5% faster)

def test_multiple_masks():
    # Test with two masks: one empty, one with a square
    mask1 = np.zeros((5, 5), dtype=np.uint8)
    mask2 = np.zeros((5, 5), dtype=np.uint8)
    mask2[1:4, 1:4] = 1
    masks = np.stack([mask1, mask2])
    codeflash_output = masks2poly(masks); polys = codeflash_output # 44.5μs -> 25.2μs (76.4% faster)

def test_single_pixel_mask():
    # Test with a mask with a single pixel set
    mask = np.zeros((5, 5), dtype=np.uint8)
    mask[2, 2] = 1
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 38.2μs -> 23.5μs (62.4% faster)

def test_non_contiguous_input():
    # Test with a mask that is non-contiguous in memory
    mask = np.zeros((5, 5), dtype=np.uint8)[::-1]
    mask[1:4, 1:4] = 1
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 38.3μs -> 24.3μs (57.4% faster)

# --- Edge Test Cases ---

def test_empty_mask():
    # Test with a completely empty mask
    mask = np.zeros((5, 5), dtype=np.uint8)
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 18.4μs -> 8.40μs (119% faster)

def test_all_true_mask():
    # Test with a mask where all pixels are True
    mask = np.ones((5, 5), dtype=bool)
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 39.5μs -> 25.6μs (54.3% faster)

def test_mask_with_hole():
    # Test with a mask with a hole in the middle
    mask = np.ones((5, 5), dtype=np.uint8)
    mask[2, 2] = 0
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 36.7μs -> 22.7μs (61.6% faster)

def test_mask_with_multiple_objects():
    # Test with a mask with two separate squares
    mask = np.zeros((5, 5), dtype=np.uint8)
    mask[1:3, 1:3] = 1
    mask[3:5, 3:5] = 1
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 37.4μs -> 23.1μs (62.1% faster)

def test_non_binary_mask():
    # Test with a mask containing values other than 0 and 1
    mask = np.zeros((5, 5), dtype=np.int32)
    mask[1:4, 1:4] = 5
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 44.0μs -> 30.0μs (46.7% faster)

def test_mask_dtype_float():
    # Test with float mask, values between 0 and 1
    mask = np.zeros((5, 5), dtype=np.float32)
    mask[1:4, 1:4] = 0.5
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 44.1μs -> 29.9μs (47.6% faster)

def test_mask_shape_1x1():
    # Test with a mask of shape (1, 1)
    mask = np.ones((1, 1), dtype=np.uint8)
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 36.2μs -> 22.3μs (62.2% faster)

def test_mask_shape_1xN():
    # Test with a mask of shape (1, N)
    mask = np.ones((1, 5), dtype=np.uint8)
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 36.3μs -> 22.7μs (59.8% faster)

def test_mask_shape_Nx1():
    # Test with a mask of shape (5, 1)
    mask = np.ones((5, 1), dtype=np.uint8)
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 36.1μs -> 22.4μs (61.3% faster)

# --- Large Scale Test Cases ---

def test_many_masks():
    # Test with 100 masks, each with a single pixel set
    N = 100
    masks = np.zeros((N, 10, 10), dtype=np.uint8)
    for i in range(N):
        masks[i, i % 10, i // 10] = 1
    codeflash_output = masks2poly(masks); polys = codeflash_output # 1.19ms -> 556μs (114% faster)
    for i in range(N):
        pass

def test_large_mask():
    # Test with a large mask (100x100) with a large filled square
    mask = np.zeros((100, 100), dtype=np.uint8)
    mask[10:90, 10:90] = 1
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 44.6μs -> 29.8μs (49.8% faster)

def test_large_mask_with_hole():
    # Large mask with a hole in the middle
    mask = np.ones((100, 100), dtype=np.uint8)
    mask[40:60, 40:60] = 0
    masks = np.expand_dims(mask, axis=0)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 43.8μs -> 29.3μs (49.7% faster)

def test_large_batch_of_empty_masks():
    # Test with 500 completely empty masks
    N = 500
    masks = np.zeros((N, 10, 10), dtype=np.uint8)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 2.03ms -> 655μs (210% faster)
    for poly in polys:
        pass

def test_large_batch_of_full_masks():
    # Test with 500 full masks
    N = 500
    masks = np.ones((N, 10, 10), dtype=np.uint8)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 5.94ms -> 2.87ms (107% faster)
    for poly in polys:
        pass

def test_performance_on_large_data():
    # Test performance on a batch of 100 masks, each 50x50, filled with
    # per-pixel random noise (each pixel is set with probability 0.5)
    N = 100
    rng = np.random.default_rng(42)
    masks = (rng.integers(0, 2, size=(N, 50, 50)) > 0).astype(np.uint8)
    codeflash_output = masks2poly(masks); polys = codeflash_output # 5.44ms -> 4.67ms (16.4% faster)
    for poly in polys:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-pr1586-2025-09-24T18.40.57` and push.


codeflash-ai bot requested review from PawelPeczek-Roboflow, grzegorz-roboflow, yeldarby, probicheaux and hansent as code owners September 24, 2025 18:41

codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Sep 24, 2025

codeflash-ai bot mentioned this pull request Sep 24, 2025

improve speed and memory usage of masks2poly and masks2multipoly #1586

Merged

1 task

Base automatically changed from tune-mask2polygon to main October 1, 2025 15:53
