-
Notifications
You must be signed in to change notification settings - Fork 7
cmake: use -flto=auto compiler flag when supported, rework fast-math disablement #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Use -flto=auto compiler flag when supported, this silence this GCC warning: > lto-wrapper: warning: using serial compilation of # LTRANS jobs This also greatly speeds-up the linkage time as it enables LTO multithreading in GCC (either by using Make jobserver if detected, either by detecting CPU cores). Also always set LTO if enabled.
Some compilers may enable fast-math by default (example: ICC). Some contractions are still safe and can be enabled.
166155c
to
867022a
Compare
867022a
to
160640b
Compare
867022a
to
629da09
Compare
2abc19b
to
27c13ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any numbers on performance impact of the strict math flags? Reproducible floating-point operations are of somewhat niche interest and have limited chance to be reliably achieved between differing compilers and architectures, so I think that would be a bad default if it costs more than a few percent of speed.
CMakeLists.txt
Outdated
@@ -88,12 +98,8 @@ if (MSVC) | |||
endif() | |||
|
|||
if (USE_LTO) | |||
set_cxx_flag("/GL" RELEASE) | |||
set_cxx_flag("/GL" RELWITHDEBINFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of these is to have different flags per configuration in a multi-configuration build, so that you can test a Release build with LTO and a Debug build without it from the same IDE instance.
It's OK to remove this for the GCC-style options since it gets in the way of flag tests, but there is no need to remove it for MSVC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this looked overly verbose to me but I didn't know MSVC required such explicit boilerplate.
Now reverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to push, now pushed. 😆
It looks very niche to me to care about fast math in an offline conversion tool, the option can still be enabled for those wanting to feature it as a lib. In fact it's even safer for those building against crnlib to not have the fast math being enabled for their whole program in their back. Speed difference between fast math or not is negligible (see below). While reproducibility is fragile when fast math is disabled, it makes it impossible when it is enabled, I don't see the benefit of it. The time difference may be exaggerated by slower image decoding (like slower PNG decoding). BenchmarksEnvironmentTest machine: 16 cores, 32 threads AMD Zen 2 CPU, DDR4 RAM with enough space to fit the whole benchmark corpus in cache. ProtocolImages are put in memory cache before the runs and during the run images are read from memory cache and written to 32 crunch executions are done in parallel (one crunch per CPU thread, one image per crunch execution), and each crunch execution uses 15 threads (because it uses About 10 runs are done sequentially for every configuration, the 5 best ones are kept (to filter out runs that may be disturbed by random unrelated tasks spawning in the background). # Print number of images to convert.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) | wc -l
# Preload image files into memory.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite -0 -I{} -P"$(nproc)" dd if={} of=/dev/null bs=1M status=none
# Convert images files.
time find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite --halt 0 -0 -I{} -P"$(nproc)" crunch -quiet -file {} -fileformat crn -out /dev/null I only added ResultsCorpus: UnvanquishedAssets Files: 2656 images; 153 jpg, 2486 png, 17 tga. Fast math times:
Strict math times:
Average time difference per image: 0.002 s Corpus: InterstellarOasis Files: 5160 images; 3097 jpg, 2063 png, 0 tga. Fast math times:
Strict math times:
Average time difference per image: 0.0005 s Corpus: Xonotic Files: 10041 images; 2351 jpg, 101 png, 7589 tga. Fast math times:
Strict math times:
Average time difference per image: 0.0008 s |
27c13ac
to
9174b64
Compare
# By default, GCC uses -std=gnu* and then enables -ffp-contract=fast even if -ffast-math is not enabled. | ||
set_cxx_flag("-ffp-contract=off") | ||
# See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html | ||
# GCC fast contractions (-ffp-contract=fast) should be safe, but aren't on arm64 with GCC 12. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "aren't safe"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have been more verbose at the time because I forgot the exact issue.
I guess it meant it doesn't produce the same result as compiling without any fast stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the purpose of my patch is to maximize the chance the files are reproducible, I probably noticed that on ARM such option broke the reproducibility. It's probably a similar problem than using x87 instead of SSE on x86, maybe some ARM fused operations break IEEE compliance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure, since I mentioned a specific GCC version, that was the result of me testing that specific compiler on the said hardware, and I was testing the reproducibility of converted files. GCC 12 is the Debian Bookworm GCC, and I use Debian Bookworm on my Arm boards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's unobvious so it should be explained in the comment.
As we've discussed in the past, I don't think floating point reproducibility is a good goal -- the language and compilers don't make any attempt to provide such guarantees. Platforms using x87 floating point are an easy example where you're not going to achieve it. But for GCC/Clang those options to disable fast math are good in any case, since fast math makes the software too unreliable.
@@ -95,6 +95,9 @@ if (MSVC) | |||
# and https://devblogs.microsoft.com/cppblog/the-fpcontract-flag-and-changes-to-fp-modes-in-vs2022/ | |||
# By default, MSVC doesn't enable the /fp:fast option. | |||
set_cxx_flag("/fp:fast") | |||
else() | |||
# Precise model (/fp:precise) should do safe contractions, but we should not trust that (see below). | |||
set_cxx_flag("/fp:strict") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/fp:precise
would be a better default for MSVC. The documentation says /fp:strict
is mostly needed if you want floating point exceptions. I did a little test with 57 PNG files and got 246.162s
(self-reported) with master and 259.846s
with this branch.
cmake: use -flto=auto compiler flag when supported
Use -flto=auto compiler flag when supported, this silence this
GCC warning:
This also greatly speeds-up the linkage time as it enables LTO
multithreading in GCC (either by using Make jobserver if detected,
either by detecting CPU cores).
Also always set LTO if enabled.
Similar to:
cmake: rework the fast-math enablement and force the disablement
Some compilers may enable fast-math by default (example: ICC).
Some contractions are still safe and can be enabled.
cmake: fix typo