feat: add support for big-endian systems and big-/little-endian interop #36

mhx · 2025-08-06T21:26:12Z

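Before this change, fsst did not work on big-endian systems. Compression inflated the input and a compress/decompress round trip did not reproduce the original data: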

$ ./binary /usr/share/dict/words words.fsst
Compressed 2486824 bytes into 3239956 bytes ==> 130%
$ ./binary -d words.fsst words.dec
Decompressed 3239953 bytes into 2486824 bytes ==> 76%
$ head /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
$ head words.dec

A
v
wo
Ao
Aoc
Aoc
Aot

With this change, it works correctly on big-endian systems and delivers the exact same result as on little-endian systems. Furthermore, the symbol tables produced by `fsst_export()` will always use little-endian version headers, and `fsst_import()` will always expect little-endian version headers, regardless of which system the code is running on. This enables symbol table exchange between big- and little-endian systems, as the remainder of the symbol table is byte-order-agnostic.
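
To illustrate what a fixed little-endian header means in practice, here is a minimal sketch of endian-independent serialization. It is not the code from this PR: `store_le64` and `load_le64` are hypothetical helper names, and the 64-bit field width is an assumption made for the example. Because the bytes are produced with shifts rather than by copying the in-memory representation, the result is identical on any host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; not part of the fsst API.

// Write a 64-bit value as 8 bytes, least significant byte first.
// Shifts operate on the numeric value, so host byte order is irrelevant.
inline void store_le64(unsigned char* out, std::uint64_t v) {
    assert(out != nullptr);
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = static_cast<unsigned char>(v >> (8 * i));
    }
}

// Read 8 bytes written by store_le64 back into a 64-bit value.
inline std::uint64_t load_le64(const unsigned char* in) {
    assert(in != nullptr);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        v |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    }
    return v;
}
```

An exporter would call something like `store_le64` on the header field and an importer would call `load_le64`, so a table written on one architecture can be read on the other.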

The change is fully backwards-compatible.

On little-endian systems, the code should behave exactly as before. On big-endian systems, the numeric 64-bit value of a symbol will be swapped as needed and will always be stored as little-endian. There is certainly some overhead in doing this, but it is much better than not being able to use fsst at all on big-endian systems.
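
The "swap as needed" idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this PR: the names `host_is_little_endian`, `byteswap64`, and `load_symbol_le` are hypothetical, and real code would likely use a compile-time check or a compiler intrinsic instead of the portable versions shown here. On a little-endian host the load is a plain copy; on a big-endian host one byte swap is added.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not part of the fsst API.

// Runtime probe: true if the lowest-addressed byte holds the low-order bits.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the byte order of a 64-bit value in three swap stages.
inline std::uint64_t byteswap64(std::uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

// Load 8 bytes so that the first byte ends up in the least significant
// bits on every host, i.e. the bytes are interpreted as little-endian.
inline std::uint64_t load_symbol_le(const unsigned char* p) {
    assert(p != nullptr);
    std::uint64_t v = 0;
    std::memcpy(&v, p, sizeof v);
    return host_is_little_endian() ? v : byteswap64(v);
}
```

With this convention, the bytes `'a'`, `'b'` followed by six zero bytes yield the numeric value `0x6261` on both kinds of host, which is what lets the rest of the code treat symbol values uniformly.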


mhx mentioned this pull request Aug 6, 2025

Mixed-endian decoding? #35

Open

mhx mentioned this pull request Aug 19, 2025

Endless loop in buildSymbolTable on 32-bit ARM / gcc #37

Open
