@mhx mhx commented Aug 6, 2025

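Before this change, fsst did not work on big-endian systems. Compression inflated the input to 130% of its size, and the round trip did not reproduce the original data: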

```
$ ./binary /usr/share/dict/words words.fsst
Compressed 2486824 bytes into 3239956 bytes ==> 130%
$ ./binary -d words.fsst words.dec
Decompressed 3239953 bytes into 2486824 bytes ==> 76%
$ head /usr/share/dict/words
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
$ head words.dec

A
v
wo
Ao
Aoc
Aoc
Aot
```

With this change, it works correctly on big-endian systems and delivers
the exact same result as on little-endian systems. Furthermore, the
symbol tables produced by `fsst_export()` will always use little-endian
version headers, and `fsst_import()` will always expect little-endian
version headers, regardless of which system the code is running on. This
enables symbol table exchange between big- and little-endian systems, as
the remainder of the symbol table is byte-order-agnostic.
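
To illustrate what a fixed little-endian header means in practice, here is a minimal sketch of writing and reading a 64-bit value byte by byte, so the on-disk layout does not depend on the host. The helper names `store_le64` and `load_le64` are hypothetical and are not taken from the fsst sources; the actual patch may be implemented differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of fsst): write a 64-bit value with the
// least significant byte first, independent of host byte order.
inline void store_le64(unsigned char* out, std::uint64_t v) {
    for (std::size_t i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>(v >> (8 * i));
}

// Hypothetical helper (not part of fsst): read a 64-bit value that was
// stored least significant byte first, independent of host byte order.
inline std::uint64_t load_le64(const unsigned char* in) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    return v;
}
```

Because both helpers only use shifts on the numeric value, a header written on a big-endian machine yields the same bytes as one written on a little-endian machine.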

The change is fully backwards-compatible.

On little-endian systems, the code should behave exactly as before. On
big-endian systems, the numeric 64-bit value of a symbol will be swapped
as needed and will always be stored as little-endian. There is certainly
some overhead in doing this, but it is much better than not being able
to use fsst at all on big-endian systems.
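
As a rough sketch of the "swapped as needed" idea, the following loads 8 bytes into a 64-bit value and swaps only on big-endian hosts, so the first byte of the string always ends up in the low 8 bits. All names here (`host_is_little_endian`, `bswap64`, `load_symbol`) are hypothetical; the actual patch may use compiler builtins or compile-time detection instead.

```cpp
#include <cstdint>
#include <cstring>

// Runtime probe of the host byte order (hypothetical helper).
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Portable 64-bit byte swap: swap bytes, then 16-bit pairs, then halves.
inline std::uint64_t bswap64(std::uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) | ((v >> 8) & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

// Load 8 bytes so that p[0] lands in the low 8 bits on every host.
// This is a no-op swap on little-endian systems.
inline std::uint64_t load_symbol(const unsigned char* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return host_is_little_endian() ? v : bswap64(v);
}
```

On little-endian hosts the branch reduces to a plain load, which matches the expectation that behavior there is unchanged; the swap cost is paid only on big-endian hosts.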