NPP performance bottleneck with multiple streams

##### System information (version)


- OpenCV => 4.6
- Operating System / Platform => Linux x64 with Nvidia Cuda 11.6
- Compiler => gcc 

##### Detailed description



Hello! I am working on a pipeline in which multiple streams execute the processing steps.
The per-image code looks like the snippet below; it flips every second image in the pipeline.
Everything works, but I have noticed a huge performance bottleneck in the code paths that go through NVIDIA's NPP.
For example, `cv::cuda::warpAffine` takes ~20 times longer through NPP than through your own implementation, which is reached via `WarpDispatcher::call`.

Here is my pipeline:

```cpp
    // Warp the input image to m_input_size on this object's stream.
    cv::Mat trans_mat = get_affine_transform(cs, m_pixel_std, m_input_size);
    cv::cuda::warpAffine(
            *input.image,
            m_warped_resized_image,
            trans_mat,
            m_input_size,
            cv::INTER_LINEAR,
            cv::BORDER_CONSTANT,
            cv::Scalar(),
            m_stream
            );

    // Flip every second image horizontally (flipCode 1), then convert to float.
    if (m_flip_images && m_index % 2 == 1)
    {
        cv::cuda::flip(m_warped_resized_image, m_flipped_image, 1, m_stream);
        m_flipped_image.convertTo(m_float_resized_image, CV_32FC3, 1.f, m_stream);
    }
    else
        m_warped_resized_image.convertTo(m_float_resized_image, CV_32FC3, 1.f, m_stream);

    // Normalize in place: (x - 127.5) / 128, then split into separate channel planes.
    cv::cuda::subtract(m_float_resized_image, cv::Scalar(127.5f, 127.5f, 127.5f),
                       m_float_resized_image, cv::noArray(), -1, m_stream);
    cv::cuda::divide(m_float_resized_image, cv::Scalar(128.0f, 128.0f, 128.0f),
                     m_float_resized_image, 1, -1, m_stream);
    cv::cuda::split(m_float_resized_image, m_chw, m_stream);
```

When I make no changes to your library, the perf output looks like this:

<img width="1111" alt="image" src="https://user-images.githubusercontent.com/42784580/185755151-e8c1ad63-d810-4643-bc94-812595d87eca.png">

As you can see, `cv::cuda::warpAffine` is very expensive because of NPP.

When I disable NPP, simply by setting [this if statement](https://github.com/opencv/opencv_contrib/blob/4.x/modules/cudawarping/src/warp.cpp#L247) to false, I see a drastic performance improvement. The call to warpAffine no longer even shows up in the profile:

<img width="1128" alt="image" src="https://user-images.githubusercontent.com/42784580/185754296-c0078771-d979-457c-9d2d-01f8cd178494.png">

However, `cv::cuda::flip` still uses NPP and takes far longer than it should. It calls npp::SetStream and `cudaStreamSynchronize`, which should not be necessary just to flip an image.

Here is the profiler output; the second call is to `cudaStreamSynchronize`.

<img width="1108" alt="image" src="https://user-images.githubusercontent.com/42784580/185755400-31fa017d-29cb-4d60-be4e-73e22836dbe6.png">

##### Steps to reproduce

Compare the execution speed of `cv::cuda::warpAffine` with and without NPP when running on multiple streams. The same goes for `cv::cuda::flip`.
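
A minimal sketch of how the two builds could be timed, assuming nothing beyond the C++ standard library. The helper name `mean_call_time_ms` and its parameters are hypothetical and not part of OpenCV; it only averages the wall-clock time of whatever callable it is given.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper (not part of OpenCV): mean wall-clock time per call in
// milliseconds, measured after a number of untimed warm-up calls.
// `op` must block until its work has finished; for work enqueued on a CUDA
// stream that means waiting on the stream inside `op`.
inline double mean_call_time_ms(const std::function<void()>& op,
                                std::size_t warmup_iterations,
                                std::size_t timed_iterations)
{
    // Warm-up: the first calls often include one-time initialization costs.
    for (std::size_t i = 0; i < warmup_iterations; ++i)
        op();

    if (timed_iterations == 0)
        return 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < timed_iterations; ++i)
        op();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(timed_iterations);
}
```

For this issue, `op` would be a lambda that calls `cv::cuda::warpAffine` (or `cv::cuda::flip`) on a `cv::cuda::Stream` and then calls `waitForCompletion()` on that stream, since the calls are asynchronous. Running it from several threads, each with its own stream, once against the stock build and once with the NPP branch disabled, should show the difference described above.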



##### Possible solution

Do not use NPP here; write a custom kernel instead, like the one in `warpAffine_gpu`.
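
To illustrate how little work a custom flip kernel needs to do per pixel, here is a CPU reference sketch of the horizontal flip (`flipCode` 1). It is not OpenCV code: the function name `flip_horizontal_reference` is hypothetical, and it assumes interleaved pixels with `channels` bytes per pixel, a row pitch of `step` bytes, and non-overlapping `src` and `dst` buffers. Each `(y, x)` iteration corresponds to what one GPU thread would do, with no stream synchronization involved.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical CPU reference (not OpenCV code) for a horizontal flip.
// Assumptions: interleaved pixels, `channels` bytes per pixel, `step` bytes
// per row (step >= width * channels), and src/dst do not overlap.
inline void flip_horizontal_reference(const std::uint8_t* src, std::uint8_t* dst,
                                      int width, int height, int channels,
                                      std::size_t step)
{
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* src_row = src + static_cast<std::size_t>(y) * step;
        std::uint8_t* dst_row = dst + static_cast<std::size_t>(y) * step;
        for (int x = 0; x < width; ++x) {
            // Column mirrored about the vertical axis.
            const int mirrored_x = width - 1 - x;
            for (int c = 0; c < channels; ++c)
                dst_row[x * channels + c] = src_row[mirrored_x * channels + c];
        }
    }
}
```

A reference like this could also be used to check the output of a custom kernel against the current NPP-based `cv::cuda::flip`.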

##### Issue submission checklist

 - [x] I report the issue, it's not a question

 - [x] I checked the problem with documentation, FAQ, open issues,
       forum.opencv.org, Stack Overflow, etc and have not found any solution

 - [x] I updated to the latest OpenCV version and the issue is still there

 - [x] There is reproducer code and related data files: videos, images, onnx, etc

NPP performance bottleneck with multiple streams #3330
