Clarify string descriptions #1064
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -260,12 +260,11 @@ The above TOML maps to the following JSON. | |
## String | ||
|
||
There are four ways to express strings: basic, multi-line basic, literal, and | ||
multi-line literal. All strings must contain only Unicode characters. | ||
multi-line literal. All strings must be encoded as UTF-8. | ||
|
||
**Basic strings** are surrounded by quotation marks (`"`). Any Unicode character | ||
may be used except those that must be escaped: quotation mark, backslash, and | ||
the control characters other than tab (U+0000 to U+0008, U+000A to U+001F, | ||
U+007F). | ||
**Basic strings** are surrounded by quotation marks (`"`). Any codepoint may be | ||
used except those that must be escaped: quotation mark, backslash, and the | ||
control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F). | ||
|
||
```toml | ||
str = "I'm a string. \"You can quote me\". Name\tJos\xE9\nLocation\tSF." | ||
|
@@ -282,19 +281,18 @@ For convenience, some popular characters have a compact escape sequence. | |
\e - escape (U+001B) | ||
\" - quote (U+0022) | ||
\\ - backslash (U+005C) | ||
\xHH - unicode (U+00HH) | ||
\uHHHH - unicode (U+HHHH) | ||
\UHHHHHHHH - unicode (U+HHHHHHHH) | ||
\xHH - codepoint (U+00HH) | ||
\uHHHH - codepoint (U+HHHH) | ||
\UHHHHHHHH - codepoint (U+HHHHHHHH) | ||
``` | ||
|
||
Any Unicode character may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH` | ||
Any codepoint may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH` | ||
forms. The escape codes must be Unicode | ||
[scalar values](https://unicode.org/glossary/#unicode_scalar_value). | ||
|
||
Keep in mind that all TOML strings are sequences of Unicode characters, _not_ | ||
byte sequences. For binary data, avoid using these escape codes. Instead, | ||
external binary-to-text encoding strategies, like hexadecimal sequences or | ||
[Base64](https://www.base64decode.org/), are recommended for converting between | ||
All TOML strings are UTF-8 encoded, _not_ byte sequences. For binary data, avoid | ||
We should change "Unicode characters" to "codepoints" or "Unicode codepoints", but other than that, the old wording is better. A UTF-8 encoded string _is_ a byte sequence, since an encoding (UTF-8, UTF-16, etc.) converts a sequence of codepoints into a sequence of bytes. But conceptually, a TOML string is a sequence of codepoints, not a byte sequence, and we must be careful not to confuse the two. The encoding into bytes happens when the TOML file is written to disk (or into a byte array or whatever); the decoding happens when it's read from disk. |
||
using these escape codes. Instead, external binary-to-text encoding strategies, | ||
like hexadecimal sequences or base64, are recommended for converting between | ||
bytes and strings. | ||
|
||
All other escape sequences not listed above are reserved; if they are used, TOML | ||
|
@@ -307,6 +305,11 @@ like to break up a very long string into multiple lines. TOML makes this easy. | |
side and allow newlines. A newline immediately following the opening delimiter | ||
will be trimmed. All other whitespace and newline characters remain intact. | ||
|
||
Any codepoint may be used except those that must be escaped: backslash and the | ||
control characters other than tab, line feed, and carriage return (U+0000 to | ||
U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns (U+000D) are | ||
only allowed as part of a newline sequence. | ||
|
||
```toml | ||
str1 = """ | ||
Roses are red | ||
|
@@ -349,11 +352,6 @@ str3 = """\ | |
""" | ||
``` | ||
|
||
Any Unicode character may be used except those that must be escaped: backslash | ||
and the control characters other than tab, line feed, and carriage return | ||
(U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns | ||
(U+000D) are only allowed as part of a newline sequence. | ||
|
||
You can write a quotation mark, or two adjacent quotation marks, anywhere inside | ||
a multi-line basic string. They can also be written just inside the delimiters. | ||
|
||
|
@@ -371,8 +369,10 @@ If you're a frequent specifier of Windows paths or regular expressions, then | |
having to escape backslashes quickly becomes tedious and error-prone. To help, | ||
TOML supports literal strings which do not allow escaping at all. | ||
|
||
**Literal strings** are surrounded by single quotes. Like basic strings, they | ||
must appear on a single line: | ||
**Literal strings** are surrounded by single quotes and don't support `\` | ||
escapes. Any codepoint may be used except for control characters other than tab. | ||
|
||
Like basic strings, they must appear on a single line: | ||
|
||
```toml | ||
# What you see is what you get. | ||
|
@@ -383,11 +383,13 @@ regex = '<\i\c*\s*>' | |
``` | ||
|
||
Since there is no escaping, there is no way to write a single quote inside a | ||
literal string enclosed by single quotes. Luckily, TOML supports a multi-line | ||
version of literal strings that solves this problem. | ||
literal string enclosed by single quotes. TOML supports a multi-line version of | ||
literal strings that solves this problem. | ||
|
||
**Multi-line literal strings** are surrounded by three single quotes on each | ||
side and allow newlines. Like literal strings, there is no escaping whatsoever. | ||
side and allow newlines. Like literal strings, there are no `\` escapes. Any | ||
codepoint may be used except for control characters other than tab. | ||
|
||
A newline immediately following the opening delimiter will be trimmed. TOML | ||
parsers must normalize newlines in the same manner as multi-line basic strings. | ||
|
||
|
@@ -417,8 +419,6 @@ apos15 = "Here are fifteen apostrophes: '''''''''''''''" | |
str = ''''That,' she said, 'is still pointless.'''' | ||
``` | ||
|
||
Control characters other than tab are not permitted in a literal string. | ||
|
||
## Integer | ||
|
||
Integers are whole numbers. Positive numbers may be prefixed with a plus sign. | ||
|
No, the old language is better. We had already discussed this in #875, and the wording "All strings must contain only Unicode characters" was the result, though changing it to "All strings must contain only (Unicode) codepoints" would be better. The new wording, however, suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole _TOML document_ on disk is encoded as UTF-8, nothing more and nothing less.
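
To make the distinction raised in the review comments concrete, here is a minimal Python sketch. It is not part of the PR. The sample value and the helper name `is_scalar_value` are assumptions chosen for illustration. It shows that a string is conceptually a sequence of codepoints, that bytes only appear once the text is encoded, and that binary data is better carried through a binary-to-text encoding such as Base64.

```python
import base64

# Hypothetical sample: the value a parser would produce for the TOML text "Jos\xE9".
s = "Jos\u00e9"

# Conceptually, the string is a sequence of codepoints.
codepoints = [f"U+{ord(ch):04X}" for ch in s]

# Bytes only exist after encoding, e.g. when the whole document is written to disk.
encoded = s.encode("utf-8")


def is_scalar_value(cp: int) -> bool:
    """Return True if cp is a Unicode scalar value (any codepoint except surrogates)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# Binary data: use a binary-to-text encoding instead of escape codes.
raw = bytes([0x00, 0xFF, 0x10])
as_text = base64.b64encode(raw).decode("ascii")

print(codepoints)            # one entry per codepoint
print(len(s), len(encoded))  # codepoint count vs. UTF-8 byte count
print(as_text)               # text that is safe to store in a TOML string
```

The codepoint count and the byte count differ here because `é` (U+00E9) takes two bytes in UTF-8, which is why describing a TOML string as "UTF-8 encoded" conflates the value with its on-disk representation.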