From 75644ef200cbcddb6a6db746add1c3d4b8930c39 Mon Sep 17 00:00:00 2001 From: Tatu Saloranta Date: Tue, 26 Aug 2025 17:26:44 -0700 Subject: [PATCH 1/2] Fix #28: correct explanation of 32-bit `float` encoding --- smile-specification.md | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/smile-specification.md b/smile-specification.md index d780644..c205720 100644 --- a/smile-specification.md +++ b/smile-specification.md @@ -12,13 +12,15 @@ This page covers current data format specification; which is planned to eventual ### Version history -* Current: 1.0.6 (18-Apr-2025) - * Previous: 1.0.5 (26-Jan-2022) +* Current: 1.0.7 (26-Aug-2025) + * Previous: 1.0.6 (18-Apr-2025) * First official: 1.0.0 (12-Sep-2010) * First draft: (24-Jun-2010) ### Update history +* 2025-08-26: Corrected explanation and example of 32-bit `float` encoding + * Version 1.0.6 -> 1.0.7 * 2025-05-18: Explained existing (implement by Jackson codec) but previously undocumented requirement to skip Shared String/Key Name references ending with `0xFE` / `0xFF` bytes. * Version 1.0.5 -> 1.0.6 * 2022-02-20: Clarify handling of "unused" bits (see issue #17) primarily regarding encoding of floating-point numbers, but more generally for all unused bits. @@ -117,10 +119,24 @@ Some general notes on tokens: * Length indicators are done using VInts (for binary data, ("big") integer/decimal values) * All length indicators define _actual_ length of data; not possibly encoded length (in case of "safe" encoding, encoded data is longer, and that length can be calculated from payload data length) * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0` -* Floating point values (IEEE 32 and 64-bit) are encoded using fixed-length big-endian encoding (7 bits used to avoid use of reserved bytes like `0xFF`): - * Data is "right-aligned", meaning padding is prepended to the first byte (and its MSB). - * For example, the 32-bit float `29.9510 is` encoded as `0x26 0x37 0x3E 0x0F 0x04.` We get to this encoding by taking the IEEE 764 32-bit binary representation of the number 29.9510, (1) writing the least-significant 7 bits, (2) right-shifting 7 bits, and repeating the process until encoding the entire bit-string (5 times for a 32-bit float). As a result, `0x26` = `29.9510 & 0x7F`, `0x37` = `(29.9510 >> 7) & 0x7F`, `0x3E` = `(29.9510 >> 14) & 0x7F`, `0x0F` = `(29.9510 >> 21) & 0x7F`, and `0x04 = (29.9510 >> 28) & 0x7F`. - * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0` +* Floating point values (IEEE 32 and 64-bit) are encoded using fixed-length big-endian (MSB first) encoding (7 bits used to avoid use of reserved bytes like `0xFF`): + * Data is "right-aligned", meaning padding is prepended to the first byte (and specifically as MSB). + * For example, the 32-bit float `29.9510` is encoded as `0x04 0x0F 0x3E 0x37 0x26`. We get to this encoding by taking the IEEE 764 32-bit binary representation of the number `29.9510` -- `0x41ef9ba6` -- and: + 1. writing the most-significant 4 bits (with 4-bit MSB padding), + 2. followed by next-MSB 7 bits, and + 3. repeating the process until encoding the entire bit-string (5 times for a 32-bit float). + * As a result we get: + 1. `0x04' = '(29.9510 >> 28) & 0x7F` + 2. `0x0F` = `(29.9510 >> 21) & 0x7F` + 3. `0x3E` = `(29.9510 >> 14) & 0x7F` + 4. `0x37` = `(29.9510 >> 7) & 0x7F` + 5. `0x26` = `29.9510 & 0x7F` + * "Unused" bits (4 most-significant) in the first encoded byte should be handled as per earlier general note: left as `0` + * 64-bit `double` is handled similarly, so with sample value `-29.9510` we get: + * raw 64-bits: `0xc03df374bc6a7efa` + * encoded in 10 data bytes (1 bit in first, 9 x 7 bits) + * `0x01 0x40 0x1e 0x7c 0x6e 0x4b 0x63 0x29 0x7d 0x7a` + * Actual encoded value also has "type prefix byte -- `0x28` for `float` and `0x29` for `double` * "Big" decimal/integer values use "safe" binary encoding * "Safe" binary encoding simply uses 7 LSB (sign bit, MSB, is left as 0). * The last encoded byte contains 1 - 7 bits: if less than 7, data is "right-aligned", contained in Least-Significant Bits; there will be 0-6 MSB padding bits. From 1062abfd211e789314c056e056630f8f6dc07501 Mon Sep 17 00:00:00 2001 From: Tatu Saloranta Date: Tue, 26 Aug 2025 17:29:16 -0700 Subject: [PATCH 2/2] ... --- smile-specification.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/smile-specification.md b/smile-specification.md index c205720..2b250a2 100644 --- a/smile-specification.md +++ b/smile-specification.md @@ -124,7 +124,7 @@ Some general notes on tokens: * For example, the 32-bit float `29.9510` is encoded as `0x04 0x0F 0x3E 0x37 0x26`. We get to this encoding by taking the IEEE 764 32-bit binary representation of the number `29.9510` -- `0x41ef9ba6` -- and: 1. writing the most-significant 4 bits (with 4-bit MSB padding), 2. followed by next-MSB 7 bits, and - 3. repeating the process until encoding the entire bit-string (5 times for a 32-bit float). + 3. repeating the process until encoding the entire bit-string (5 bytes for a 32-bit float). * As a result we get: 1. `0x04' = '(29.9510 >> 28) & 0x7F` 2. `0x0F` = `(29.9510 >> 21) & 0x7F` @@ -134,8 +134,8 @@ Some general notes on tokens: * "Unused" bits (4 most-significant) in the first encoded byte should be handled as per earlier general note: left as `0` * 64-bit `double` is handled similarly, so with sample value `-29.9510` we get: * raw 64-bits: `0xc03df374bc6a7efa` - * encoded in 10 data bytes (1 bit in first, 9 x 7 bits) - * `0x01 0x40 0x1e 0x7c 0x6e 0x4b 0x63 0x29 0x7d 0x7a` + * encode in 10 data bytes (1 bit in first, 9 x 7 bits), MSB first + * resulting bytes: `0x01 0x40 0x1e 0x7c 0x6e 0x4b 0x63 0x29 0x7d 0x7a` * Actual encoded value also has "type prefix byte -- `0x28` for `float` and `0x29` for `double` * "Big" decimal/integer values use "safe" binary encoding * "Safe" binary encoding simply uses 7 LSB (sign bit, MSB, is left as 0).