Clarify string descriptions #1064
This came out of seeing what (if anything) we want to merge out of #875; it makes the following copy-editing changes:
- Clearly list allowed codepoints at the start of every string type.
- "Unicode" on its own doesn't necessarily mean anything; UTF-16 or UCS-2 is "Unicode" too. Perhaps a bit pedantic, but "UTF-8" or "codepoints" are "more correct".
- Similarly, a "character" or "Unicode character" is quite a tricky thing to define. Multiple codepoints can be one "character". Most of the time "codepoint" is really what's intended.
- Don't link to some random page for Base64 decoding. I guess we could link to Wikipedia, but that seems a bit redundant to me.

Fixes #875
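To illustrate the "character" versus "codepoint" point above, here is a minimal Python sketch (not part of the PR; the sample string is chosen purely for illustration). It shows a single visible "character" made of two codepoints, whose byte length depends on the encoding used.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one "character",
# but stored as two codepoints.
s = "e\u0301"

# Python's str is a sequence of codepoints, so len() counts codepoints.
codepoints = [f"U+{ord(c):04X}" for c in s]

# The number of bytes depends entirely on the chosen encoding.
utf8_len = len(s.encode("utf-8"))
utf16_len = len(s.encode("utf-16-le"))

# NFC normalization composes the pair into the single codepoint U+00E9.
nfc = unicodedata.normalize("NFC", s)

print(codepoints, utf8_len, utf16_len, len(nfc))
```

The same visible text therefore has different lengths depending on whether one counts "characters", codepoints, or encoded bytes, which is why "codepoint" is the less ambiguous term.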
There's a bit of confusion regarding codepoint sequences (= TOML strings) vs. byte sequences (a whole TOML file is a byte sequence, but arbitrary byte sequences cannot be directly embedded in TOML strings without escaping or encoding). Other than that, it looks good.
  There are four ways to express strings: basic, multi-line basic, literal, and
- multi-line literal. All strings must contain only Unicode characters.
+ multi-line literal. All strings must be encoded as UTF-8.
No, the old language is better. We had already discussed this in #875, and the wording "All strings must contain only Unicode characters" was the result, though changing it to "All strings must contain only (Unicode) codepoints" would be better. The new wording, however, suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole *TOML document* on disk is encoded as UTF-8, nothing more, nothing less.
- byte sequences. For binary data, avoid using these escape codes. Instead,
- external binary-to-text encoding strategies, like hexadecimal sequences or
- [Base64](https://www.base64decode.org/), are recommended for converting between
+ All TOML strings are UTF-8 encoded, _not_ byte sequences. For binary data, avoid
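As a rough illustration of the binary-to-text strategies mentioned in this hunk, the following Python sketch (not part of the PR) encodes arbitrary bytes as Base64 and as hexadecimal before placing them in a TOML basic string. The key name `blob` and the payload are hypothetical.

```python
import base64

# Arbitrary binary data; 0xFF and 0x80 on their own are not valid UTF-8,
# so these bytes cannot appear verbatim in a TOML document.
payload = bytes([0x00, 0xFF, 0x10, 0x80])

# Base64 maps the bytes to plain ASCII text, which is safe in a TOML string.
encoded = base64.b64encode(payload).decode("ascii")
toml_line = f'blob = "{encoded}"'  # 'blob' is a hypothetical key name

# Hexadecimal is the other strategy named in the spec text.
hex_encoded = payload.hex()

# The consumer of the TOML value reverses the encoding to recover the bytes.
decoded = base64.b64decode(encoded)

print(toml_line)
print(hex_encoded)
```

Either way, the TOML string itself only ever holds text; interpreting it as Base64 or hex is a convention between the writer and the reader of the file.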
We should change "Unicode characters" to "codepoints" or "Unicode codepoints", but other than that, the old wording is better. A UTF-8 encoded string *is* a byte sequence, since an encoding (UTF-8, UTF-16, etc.) converts a sequence of codepoints into a sequence of bytes. But conceptually, a TOML string is a sequence of codepoints, not a byte sequence, and we must be careful not to confuse the two. The encoding into bytes happens when the TOML file is written to disk (or into a byte array or whatever); the decoding happens when it's read from disk.
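The distinction drawn in this comment can be sketched in Python, whose `str` type is likewise a sequence of codepoints (this is an illustration only, not part of the PR; the sample text is arbitrary).

```python
# A string value as a parser would hold it in memory: a sequence of codepoints.
text = "\u00e9\u2603"  # U+00E9 (e with acute) and U+2603 (snowman)

# Encoding happens when the document is written out as bytes.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# Decoding happens when the bytes are read back in.
round_tripped = as_utf8.decode("utf-8")

# Not every byte sequence is valid UTF-8, so not every byte sequence
# corresponds to a codepoint sequence.
try:
    b"\xff".decode("utf-8")
    invalid_rejected = False
except UnicodeDecodeError:
    invalid_rejected = True

print(len(text), len(as_utf8), len(as_utf16), invalid_rejected)
```

The same two codepoints become five bytes in UTF-8 and four in UTF-16, while the string itself is unchanged, which is the sense in which a string is "not a byte sequence".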
@arp242, @pradyunsg What do you think?
@arp242 How do you want to proceed here? I think my suggestions for improvement make sense, don't you agree? If we have trouble finding a unified version here, we could also consider just closing this. The current version is basically already fine and there is no urgent need to change anything. This is now the last issue blocking TOML 1.1, and there is certainly no good reason to allow such a minor matter, which doesn't even change anything in the language, to block the next release, which is already so long overdue!
I also prefer the existing language over the suggested one in this PR, in multiple spots. I don't think we need to block 1.1 on this if there isn't agreement between us. 😅
OK, if no one suggests another course of action, I'll close this in a week or so. For 1.1 the current language should be fine, and if it turns out to be helpful, we can always revise it for 1.2 or later.