Description
System information (version)
- OpenCV => 4.6
- Operating System / Platform => Linux x64 with NVIDIA CUDA 11.6
- Compiler => gcc
Detailed description
Hello! I am working on a pipeline whose steps execute on multiple streams.
It looks roughly like the code below and flips every second image in the pipeline.
Everything works fine; however, I have noticed a huge performance bottleneck when dealing with NVIDIA's NPP.
For example, warpAffine takes ~20 times longer through NPP than through your own implementation with WarpDispatcher::call.
Here is my pipeline:
    cv::Mat trans_mat = get_affine_transform(cs, m_pixel_std, m_input_size);
    cv::cuda::warpAffine(
        *input.image,
        m_warped_resized_image,
        trans_mat,
        m_input_size,
        cv::INTER_LINEAR,
        cv::BORDER_CONSTANT,
        cv::Scalar(),
        m_stream
    );
    if (m_flip_images && m_index % 2 == 1)
    {
        cv::cuda::flip(m_warped_resized_image, m_flipped_image, 1, m_stream);
        m_flipped_image.convertTo(m_float_resized_image, CV_32FC3, 1.f, m_stream);
    }
    else
        m_warped_resized_image.convertTo(m_float_resized_image, CV_32FC3, 1.f, m_stream);
    cv::cuda::subtract(m_float_resized_image, cv::Scalar(127.5f, 127.5f, 127.5f),
                       m_float_resized_image, cv::noArray(), -1, m_stream);
    cv::cuda::divide(m_float_resized_image, cv::Scalar(128.0f, 128.0f, 128.0f),
                     m_float_resized_image, 1, -1, m_stream);
    cv::cuda::split(m_float_resized_image, m_chw, m_stream);
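For reference, the steps after the warp (optional horizontal flip, conversion to float, subtracting 127.5, dividing by 128, and splitting into planes) boil down to simple per-pixel arithmetic. The following is a minimal CPU sketch of that arithmetic, not OpenCV code. `preprocess_reference` and its parameters are hypothetical names, and it assumes a tightly packed 8-bit, 3-channel interleaved image.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the steps that follow warpAffine.
// Input: interleaved 8-bit image, 3 channels, rows tightly packed (assumption).
// Output: three float planes, each holding width * height values.
std::vector<std::vector<float>> preprocess_reference(
    const std::vector<std::uint8_t>& hwc, int width, int height, bool flip_horizontal)
{
    const int channels = 3;
    assert(width > 0 && height > 0);
    const std::size_t pixels = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    assert(hwc.size() == pixels * channels);

    std::vector<std::vector<float>> planes(channels, std::vector<float>(pixels));

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // A positive flip code (1 in the pipeline) mirrors around the vertical axis.
            const int src_x = flip_horizontal ? (width - 1 - x) : x;
            const std::size_t src_pixel =
                static_cast<std::size_t>(y) * static_cast<std::size_t>(width) + static_cast<std::size_t>(src_x);
            const std::size_t dst_pixel =
                static_cast<std::size_t>(y) * static_cast<std::size_t>(width) + static_cast<std::size_t>(x);
            for (int c = 0; c < channels; ++c) {
                // Convert to float, subtract 127.5, divide by 128, write to plane c.
                const float v = static_cast<float>(hwc[src_pixel * channels + static_cast<std::size_t>(c)]);
                planes[static_cast<std::size_t>(c)][dst_pixel] = (v - 127.5f) / 128.0f;
            }
        }
    }
    return planes;
}
```

A reference like this is only useful for checking the GPU output on small inputs; it says nothing about performance.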
When I make NO changes to your library, the perf output shows that cudaWarpAffine is really performance hungry because of NPP.
When I disable NPP, simply by setting this if statement to false, I see a drastic performance improvement: you can't even see the call to warpAffine in the profile anymore.
But cv::cuda::flip still uses NPP and takes way longer than it should. It calls npp::SetStream and cudaStreamSynchronize, which seems unnecessary; why would anyone need this to flip an image?
Here is the profiler output; the second call is to cudaStreamSynchronize.
Steps to reproduce
Compare the execution speed of cv::cuda::warpAffine with and without NPP on multiple streams. The same goes for cv::cuda::flip.
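One way to make that comparison repeatable is a small timing helper. This is a sketch only: `average_ms`, `enqueue`, and `sync` are hypothetical names, and the helper measures host wall-clock time, so it captures synchronization overhead as well as kernel time.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock time per call, in milliseconds.
// `enqueue` submits the work; `sync` blocks until all submitted work has finished.
template <typename Enqueue, typename Sync>
double average_ms(Enqueue enqueue, Sync sync, int iterations, int warmup = 3)
{
    assert(iterations > 0 && warmup >= 0);

    // Warm-up runs keep one-time initialization out of the measurement.
    for (int i = 0; i < warmup; ++i) enqueue();
    sync();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) enqueue();
    sync();  // include completion of queued asynchronous work
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

In the pipeline above, `enqueue` would presumably wrap the `cv::cuda::warpAffine` (or `cv::cuda::flip`) call and `sync` would call `m_stream.waitForCompletion()`. Run it once with the NPP path enabled and once with it disabled.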
Possible solution
Do not use NPP; write a custom kernel instead, like the one in warpAffine_gpu.
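To illustrate how little work a custom flip kernel would need per thread, here is the index math written as plain C++ so it can be checked on the CPU. This is a hedged sketch, not OpenCV's implementation: `flip_pixel` and `flip_reference` are hypothetical names, `step` is assumed to be the row pitch in bytes, and the flip code is assumed to follow the `cv::flip` convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// What a single GPU thread would do for output pixel (x, y) in a hypothetical
// custom flip kernel. `step` is the row pitch in bytes, `pixel_size` the bytes per pixel.
inline void flip_pixel(const std::uint8_t* src, std::uint8_t* dst,
                       int x, int y, int width, int height,
                       std::size_t step, std::size_t pixel_size, int flip_code)
{
    // Assumed convention: 0 = around the x-axis, > 0 = around the y-axis, < 0 = both.
    const int src_x = (flip_code != 0) ? (width - 1 - x) : x;
    const int src_y = (flip_code <= 0) ? (height - 1 - y) : y;
    const std::uint8_t* s = src + static_cast<std::size_t>(src_y) * step
                                + static_cast<std::size_t>(src_x) * pixel_size;
    std::uint8_t* d = dst + static_cast<std::size_t>(y) * step
                          + static_cast<std::size_t>(x) * pixel_size;
    for (std::size_t b = 0; b < pixel_size; ++b) d[b] = s[b];
}

// CPU stand-in for the kernel launch: one call per output pixel.
inline void flip_reference(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst,
                           int width, int height, std::size_t step,
                           std::size_t pixel_size, int flip_code)
{
    assert(width > 0 && height > 0);
    assert(step >= static_cast<std::size_t>(width) * pixel_size);
    assert(src.size() >= step * static_cast<std::size_t>(height));
    assert(dst.size() >= step * static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            flip_pixel(src.data(), dst.data(), x, y, width, height, step, pixel_size, flip_code);
}
```

On the GPU, the two loops would be replaced by the thread grid, and the kernel would be launched on the caller's stream, so no global stream switch or extra synchronization would be needed.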
Issue submission checklist
- [*] I report the issue, it's not a question
- [*] I checked the problem with documentation, FAQ, open issues,
forum.opencv.org, Stack Overflow, etc and have not found any solution
- [*] I updated to the latest OpenCV version and the issue is still there
- [*] There is reproducer code and related data files: videos, images, onnx, etc