NPP performance bottleneck with multiple streams #3330

Open
@awarebayes

Description

System information (version)
  • OpenCV => 4.6
  • Operating System / Platform => Linux x64 with NVIDIA CUDA 11.6
  • Compiler => gcc
Detailed description

Hello! I am working on a pipeline whose steps execute on multiple streams.
It looks roughly like the code below, and it flips every second image.
Everything works fine, but I have noticed a huge performance bottleneck in the code paths that go through NVIDIA's NPP.
For example, in warpAffine the NPP path takes ~20 times longer than your own implementation in WarpDispatcher::call.

Here is my pipeline:

    cv::Mat trans_mat = get_affine_transform(cs, m_pixel_std, m_input_size);
    cv::cuda::warpAffine(
            *input.image,
            m_warped_resized_image,
            trans_mat,
            m_input_size,
            cv::INTER_LINEAR,
            cv::BORDER_CONSTANT,
            cv::Scalar(),
            m_stream
            );

    if (m_flip_images && m_index % 2 == 1)
    {
        cv::cuda::flip(m_warped_resized_image, m_flipped_image, 1, m_stream);
        m_flipped_image.convertTo(m_float_resized_image, CV_32FC3, 1.f, m_stream);
    }
    else
        m_warped_resized_image.convertTo(m_float_resized_image, CV_32FC3, 1.f, m_stream);
    cv::cuda::subtract(m_float_resized_image, cv::Scalar(127.5f, 127.5f, 127.5f),
                       m_float_resized_image, cv::noArray(), -1, m_stream);
    cv::cuda::divide(m_float_resized_image, cv::Scalar(128.0f, 128.0f, 128.0f),
                     m_float_resized_image, 1, -1, m_stream);
    cv::cuda::split(m_float_resized_image, m_chw, m_stream);
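
For checking the output of the last four calls (convertTo, subtract, divide, split), a CPU reference can be handy. The following is a minimal sketch, assuming 8-bit interleaved input as produced by the warp/flip steps; `normalize_to_chw` is a hypothetical helper name and not part of OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for the tail of the pipeline: convert 8-bit interleaved (HWC)
// pixels to float, apply (x - 127.5) / 128, and write planar (CHW) output.
// Hypothetical helper for validating GPU results; not part of OpenCV.
std::vector<float> normalize_to_chw(const std::vector<std::uint8_t>& hwc,
                                    std::size_t width, std::size_t height,
                                    std::size_t channels = 3)
{
    assert(hwc.size() == width * height * channels);
    const std::size_t plane = width * height;  // elements per output channel
    std::vector<float> chw(hwc.size());
    for (std::size_t p = 0; p < plane; ++p) {
        for (std::size_t c = 0; c < channels; ++c) {
            const float value = static_cast<float>(hwc[p * channels + c]);
            chw[c * plane + p] = (value - 127.5f) / 128.0f;
        }
    }
    return chw;
}
```

Downloading the planes in m_chw and comparing them against this result (with a small tolerance) is one way to confirm that switching between the NPP and non-NPP paths does not change the numbers.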

When I make NO changes to your library, the perf output looks like this:

[image: perf output with the unmodified library]

As you can see, cv::cuda::warpAffine is really performance hungry because of NPP.

When I disable NPP, simply by setting this if statement to false, I see a drastic performance improvement; the call to warpAffine no longer even shows up:

[image: perf output with the NPP path disabled]

But cv::cuda::flip still uses NPP and takes far longer than it should. It calls npp::SetStream and cudaStreamSynchronize, which seems unnecessary: why would anyone need a stream synchronization to flip an image?

Here is the profiler output; the second call is to cudaStreamSynchronize.

[image: profiler output for cv::cuda::flip]

Steps to reproduce

Compare the execution speed of cv::cuda::warpAffine with and without NPP on multiple streams. The same applies to cv::cuda::flip.
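
A small timing helper makes the comparison repeatable. This is only a sketch: `average_microseconds` is a hypothetical name, and it measures wall-clock time on the host, so the callable has to block until the GPU work is done.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average wall-clock time per call, in microseconds.
// `work` must block until the operation has really finished; for an
// asynchronous GPU call that means synchronizing inside the callable,
// e.g. (assumption, using the names from the pipeline above):
//   [&] { cv::cuda::warpAffine(/* ... */, m_stream); m_stream.waitForCompletion(); }
template <typename F>
double average_microseconds(F&& work, std::size_t iterations, std::size_t warmup = 10)
{
    assert(iterations > 0);
    for (std::size_t i = 0; i < warmup; ++i) {
        work();  // warm-up runs keep one-time setup cost out of the measurement
    }
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

To reproduce the multi-stream case, the same helper would presumably be run from several threads at once, each with its own cv::cuda::Stream, once with the NPP branch enabled and once with it disabled.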

Possible solution

Do not use NPP; write a custom kernel instead, like the one in warpAffine_gpu.
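
To illustrate how little work a flip kernel has to do, here is a CPU sketch of the index mapping for a horizontal flip (flip code 1, as used in the pipeline above). `flip_horizontal` is a hypothetical name, and this is not the OpenCV implementation; a GPU version would replace the two outer loops with one thread per output pixel and launch on the caller's stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Horizontal flip of an interleaved image (mirror around the vertical axis).
// Each output pixel (x, y) reads input pixel (width - 1 - x, y). In a GPU
// kernel, x and y would come from the thread and block indices instead.
std::vector<std::uint8_t> flip_horizontal(const std::vector<std::uint8_t>& src,
                                          std::size_t width, std::size_t height,
                                          std::size_t channels = 3)
{
    assert(src.size() == width * height * channels);
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t src_x = width - 1 - x;  // mirrored column
            for (std::size_t c = 0; c < channels; ++c) {
                dst[(y * width + x) * channels + c] =
                    src[(y * width + src_x) * channels + c];
            }
        }
    }
    return dst;
}
```

Nothing in this mapping depends on other pixels or on global state, so a kernel built this way should not need a stream synchronization.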

Issue submission checklist
  • [*] I report the issue, it's not a question
  • [*] I checked the problem with documentation, FAQ, open issues,
    forum.opencv.org, Stack Overflow, etc and have not found any solution
  • [*] I updated to the latest OpenCV version and the issue is still there
  • [*] There is reproducer code and related data files: videos, images, onnx, etc
