Skip to content

Conversation

illwieckz
Copy link
Member

@illwieckz illwieckz commented Mar 5, 2025

cmake: use -flto=auto compiler flag when supported

Use -flto=auto compiler flag when supported, this silence this
GCC warning:

lto-wrapper: warning: using serial compilation of # LTRANS jobs

This also greatly speeds-up the linkage time as it enables LTO
multithreading in GCC (either by using Make jobserver if detected,
either by detecting CPU cores).

Also always set LTO if enabled.

Similar to:

cmake: rework the fast-math enablement and force the disablement

Some compilers may enable fast-math by default (example: ICC).
Some contractions are still safe and can be enabled.

cmake: fix typo

Use -flto=auto compiler flag when supported, this silence this
GCC warning:

> lto-wrapper: warning: using serial compilation of # LTRANS jobs

This also greatly speeds-up the linkage time as it enables LTO
multithreading in GCC (either by using Make jobserver if detected,
either by detecting CPU cores).

Also always set LTO if enabled.
Some compilers may enable fast-math by default (example: ICC).
Some contractions are still safe and can be enabled.
@illwieckz illwieckz changed the title Illwieckz/cmake cmake: use -flto=auto compiler flag when supported Mar 17, 2025
@illwieckz illwieckz changed the title cmake: use -flto=auto compiler flag when supported cmake: use -flto=auto compiler flag when supported, rework fast-math disablement Mar 17, 2025
@illwieckz illwieckz force-pushed the illwieckz/cmake branch 3 times, most recently from 2abc19b to 27c13ac Compare March 17, 2025 17:45
Copy link
Member

@slipher slipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any numbers on performance impact of the strict math flags? Reproducible floating-point operations are of somewhat niche interest and have limited chance to be reliably achieved between differing compilers and architectures, so I think that would be a bad default if it costs more than a few percent of speed.

CMakeLists.txt Outdated
@@ -88,12 +98,8 @@ if (MSVC)
endif()

if (USE_LTO)
set_cxx_flag("/GL" RELEASE)
set_cxx_flag("/GL" RELWITHDEBINFO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of these is to have different flags per configuration in a multi-configuration build, so that you can test a Release build with LTO and a Debug build without it from the same IDE instance.

It's OK to remove this for the GCC-style options since it gets in the way of flag tests, but there is no need to remove it for MSVC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looked overly verbose to me but I didn't know MSVC required such explicit boilerplate.
Now reverted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to push, now pushed. 😆

@illwieckz
Copy link
Member Author

Any numbers on performance impact of the strict math flags? Reproducible floating-point operations are of somewhat niche interest and have limited chance to be reliably achieved between differing compilers and architectures, so I think that would be a bad default if it costs more than a few percent of speed.

It looks very niche to me to care about fast math in an offline conversion tool, the option can still be enabled for those wanting to feature it as a lib. In fact it's even safer for those building against crnlib to not have the fast math being enabled for their whole program in their back.

Speed difference between fast math or not is negligible (see below).

While reproducibility is fragile when fast math is disabled, it makes it impossible when it is enabled, I don't see the benefit of it.

The time difference may be exaggerated by slower image decoding (like slower PNG decoding).


Benchmarks

Environment

Test machine: 16 cores, 32 threads AMD Zen 2 CPU, DDR4 RAM with enough space to fit the whole benchmark corpus in cache.

Protocol

Images are put in memory cache before the runs and during the run images are read from memory cache and written to /dev/null. Crunch is a release build built with LTO enabled.

32 crunch executions are done in parallel (one crunch per CPU thread, one image per crunch execution), and each crunch execution uses 15 threads (because it uses max((nproc - 1), 15) by default).

About 10 runs are done sequentially for every configuration, the 5 best ones are kept (to filter out runs that may be disturbed by random unrelated tasks spawning in the background).

# Print number of images to convert.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) | wc -l

# Preload image files into memory.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite -0 -I{} -P"$(nproc)" dd if={} of=/dev/null bs=1M status=none

# Convert images files.
time find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite --halt 0 -0 -I{} -P"$(nproc)" crunch -quiet -file {} -fileformat crn -out /dev/null

I only added --halt 0 to skip a Xonotic jpg that is not accepted by crunch (I will report), no image is skipped in other corpus.

Results

Corpus: UnvanquishedAssets

Files: 2656 images; 153 jpg, 2486 png, 17 tga.

Fast math times:

real	2m40,024s
real	2m40,504s
real	2m40,575s
real	2m40,610s
real	2m40,692s

Strict math times:

real	2m46,021s
real	2m46,026s
real	2m46,105s
real	2m46,517s
real	2m46,680s

Average time difference per image: 0.002 s

Corpus: InterstellarOasis

Files: 5160 images; 3097 jpg, 2063 png, 0 tga.

Fast math times:

real	1m20,441s
real	1m20,481s
real	1m20,557s
real	1m20,617s
real	1m20,531s

Strict math times:

real	1m23,275s
real	1m23,369s
real	1m23,084s
real	1m23,050s
real	1m22,908s

Average time difference per image: 0.0005 s

Corpus: Xonotic

Files: 10041 images; 2351 jpg, 101 png, 7589 tga.

Fast math times:

real	5m48,586s
real	5m51,403s
real	5m51,649s
real	5m51,784s
real	5m51,985s

Strict math times:

real	5m57,966s
real	5m58,812s
real	5m59.601s
real	6m0,113s
real	6m0,327s

Average time difference per image: 0.0008 s

# By default, GCC uses -std=gnu* and then enables -ffp-contract=fast even if -ffast-math is not enabled.
set_cxx_flag("-ffp-contract=off")
# See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
# GCC fast contractions (-ffp-contract=fast) should be safe, but aren't on arm64 with GCC 12.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "aren't safe"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have been more verbose at the time because I forgot the exact issue.

I guess it meant it doesn't produce the same result as compiling without any fast stuff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the purpose of my patch is to maximize the chance the files are reproducible, I probably noticed that on ARM such option broke the reproducibility. It's probably a similar problem than using x87 instead of SSE on x86, maybe some ARM fused operations break IEEE compliance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure, since I mentioned a specific GCC version, that was the result of me testing that specific compiler on the said hardware, and I was testing the reproducibility of converted files. GCC 12 is the Debian Bookworm GCC, and I use Debian Bookworm on my Arm boards.

Copy link
Member

@slipher slipher Jul 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's unobvious so it should be explained in the comment.

As we've discussed in the past, I don't think floating point reproducibility is a good goal -- the language and compilers don't make any attempt to provide such guarantees. Platforms using x87 floating point are an easy example where you're not going to achieve it. But for GCC/Clang those options to disable fast math are good in any case, since fast math makes the software too unreliable.

@@ -95,6 +95,9 @@ if (MSVC)
# and https://devblogs.microsoft.com/cppblog/the-fpcontract-flag-and-changes-to-fp-modes-in-vs2022/
# By default, MSVC doesn't enable the /fp:fast option.
set_cxx_flag("/fp:fast")
else()
# Precise model (/fp:precise) should do safe contractions, but we should not trust that (see below).
set_cxx_flag("/fp:strict")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/fp:precise would be a better default for MSVC. The documentation says /fp:strict is mostly needed if you want floating point exceptions. I did a little test with 57 PNG files and got 246.162s (self-reported) with master and 259.846s with this branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants