Clarify string descriptions #1064

arp242 · 2025-06-08T15:59:29Z

This came out of seeing what (if anything) we want to merge out of #875; it makes the following copy editing changes:

- Clearly list allowed codepoints at the start of every string type.
- "Unicode" on its own doesn't necessarily mean anything; UTF-16 or UCS-2 is "Unicode". Perhaps a bit pedantic, but "UTF-8" or "codepoints" are "more correct".

  Similarly, a "character" or "Unicode character" is quite a tricky thing to define. Multiple codepoints can be one "character". Most of the time "codepoint" is really what's intended.
- Don't link to some random page for base64 decode. Guess we could link to Wikipedia, but seems a bit redundant to me.
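
To illustrate the "character" versus "codepoint" point, here is a minimal Python sketch (not part of the PR; the variable names are made up for illustration). It shows one perceived character written as two codepoints, and the same codepoint sequence producing different bytes under UTF-8 and UTF-16:

```python
import unicodedata

# "é" written as a base letter plus a combining accent: one perceived
# character, but two codepoints.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

codepoints_decomposed = [f"U+{ord(c):04X}" for c in decomposed]
codepoints_composed = [f"U+{ord(c):04X}" for c in composed]

# Both encodings are "Unicode", yet the bytes differ.
utf8_bytes = composed.encode("utf-8")
utf16_bytes = composed.encode("utf-16-le")

print(codepoints_decomposed, codepoints_composed)
print(utf8_bytes, utf16_bytes)
```

Counting codepoints is well defined in both cases; counting "characters" depends on what the reader takes a character to be.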

Fixes #875

ChristianSi

There's a bit of confusion regarding codepoint sequences (= TOML strings) vs. byte sequences (a whole TOML file is a byte sequence, but arbitrary byte sequences cannot be directly embedded in TOML strings without escaping or encoding). Other than that, it looks good.
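
As a rough illustration of the "encoding" route for binary data, the following Python sketch (an assumption-laden example, not taken from the spec; `payload` is a hypothetical key name) converts arbitrary bytes to Base64 text so that the TOML string holds only ASCII codepoints:

```python
import base64

# Arbitrary binary data; 0xFF and 0x80 alone are not valid UTF-8.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# Base64 maps the bytes to a string made only of ASCII codepoints.
b64_text = base64.b64encode(raw).decode("ascii")

# The text can then sit inside an ordinary TOML basic string.
# "payload" is a hypothetical key used only for this example.
toml_line = f'payload = "{b64_text}"'

# The application, not the TOML parser, decodes it back to bytes.
restored = base64.b64decode(b64_text)

print(toml_line)
```

The conversion back to bytes happens outside TOML, in whatever program reads the value.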

ChristianSi · 2025-06-10T17:58:38Z

toml.md


 There are four ways to express strings: basic, multi-line basic, literal, and
-multi-line literal. All strings must contain only Unicode characters.
+multi-line literal. All strings must be encoded as UTF-8.


No, the old language is better. We had already discussed this in #875, and the wording "All strings must contain only Unicode characters" was the result, though changing it to "All strings must contain only (Unicode) codepoints" would be better. The new wording, however, suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole _TOML document_ on disk is encoded as UTF-8, nothing more, nothing less.

ChristianSi · 2025-06-10T18:04:06Z

toml.md

-byte sequences. For binary data, avoid using these escape codes. Instead,
-external binary-to-text encoding strategies, like hexadecimal sequences or
-[Base64](https://www.base64decode.org/), are recommended for converting between
+All TOML strings are UTF-8 encoded, _not_ byte sequences. For binary data, avoid


We should change "Unicode characters" to "codepoints" or "Unicode codepoints", but other than that, the old wording is better. A UTF-8 encoded string _is_ a byte sequence, since an encoding (UTF-8, UTF-16, etc.) converts a sequence of codepoints into a sequence of bytes. But conceptually, a TOML string is a sequence of codepoints, not a byte sequence, and we must be careful not to confuse the two. The encoding into bytes happens when the TOML file is written to disk (or into a byte array or whatever); the decoding happens when it's read from disk.
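
The distinction can be sketched in Python, whose `str` type is a codepoint sequence and whose `bytes` type is a byte sequence. This is only an illustration under that assumption; `is_valid_utf8` is a hypothetical helper, not something from the spec or any TOML library:

```python
# A string is a sequence of codepoints (escapes used to avoid ambiguity).
text = "na\u00efve \u2603"
codepoints = [ord(c) for c in text]

# Encoding happens when the text is written out as bytes...
encoded = text.encode("utf-8")

# ...and decoding happens when the bytes are read back in.
decoded = encoded.decode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The codepoint count and the byte count differ for non-ASCII text.
print(len(text), len(encoded))
# Not every byte sequence corresponds to a codepoint sequence.
print(is_valid_utf8(encoded), is_valid_utf8(b"\xff\xfe"))
```

The round trip preserves the codepoints, while arbitrary bytes such as `b"\xff\xfe"` have no UTF-8 decoding at all, which is why they cannot appear directly in a TOML document.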

ChristianSi · 2025-06-20T08:23:32Z

@arp242, @pradyunsg What do you think?

ChristianSi · 2025-07-19T13:27:17Z

@arp242 How do you want to proceed here? I think my suggestions for improvement make sense, don't you agree?

If we have trouble finding a unified version here, we could also consider just closing this. The current version is basically already fine and there is no urgent need to change anything.

This is now the last issue blocking TOML 1.1, and there is certainly no good reason to allow such a minor matter, which doesn't even change anything in the language, to block the next release, which is already so much overdue!

pradyunsg · 2025-07-20T13:06:26Z

I also prefer the existing language over the suggested one in this PR, in multiple spots. I don't think we need to block 1.1 on this, if there isn't agreement between us. 😅

ChristianSi · 2025-07-29T08:32:43Z

OK, if no one suggests another course of action, I'll close this in a week or so. For 1.1 the current language should be fine, and if it turns out to be helpful, we can always revise it for 1.2 or later.

epage approved these changes Jun 9, 2025

arp242 mentioned this pull request Jun 9, 2025

Clarify integer size and float precision levels #1058

Merged

ChristianSi requested changes Jun 10, 2025
