50 changes: 25 additions & 25 deletions toml.md
@@ -260,12 +260,11 @@ The above TOML maps to the following JSON.
## String

There are four ways to express strings: basic, multi-line basic, literal, and
multi-line literal. All strings must contain only Unicode characters.
multi-line literal. All strings must be encoded as UTF-8.
Review comment by @ChristianSi (Contributor), Jun 10, 2025:

No, the old language is better. We had already discussed this in #875, and the wording "All strings must contain only Unicode characters" was the result, though changing it to "All strings must contain only (Unicode) codepoints" would be better. The new wording, however, suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole _TOML document_ on disk is encoded as UTF-8, nothing more, nothing less.


**Basic strings** are surrounded by quotation marks (`"`). Any Unicode character
may be used except those that must be escaped: quotation mark, backslash, and
the control characters other than tab (U+0000 to U+0008, U+000A to U+001F,
U+007F).
**Basic strings** are surrounded by quotation marks (`"`). Any codepoint may be
used except those that must be escaped: quotation mark, backslash, and the
control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F).
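
The rule above can be sketched as a small predicate. This is only an illustration in Python, not part of the proposed spec text; the function name `must_escape_in_basic_string` is hypothetical.

```python
def must_escape_in_basic_string(ch: str) -> bool:
    """Return True if `ch` may not appear unescaped in a single-line basic string.

    Hypothetical helper mirroring the wording above: quotation mark, backslash,
    and control characters other than tab must be escaped.
    """
    if ch in ('"', "\\"):
        return True
    cp = ord(ch)
    # Control characters other than tab: U+0000-U+0008, U+000A-U+001F, U+007F.
    return cp <= 0x08 or 0x0A <= cp <= 0x1F or cp == 0x7F
```

Under this sketch, tab (U+0009), the apostrophe, and non-ASCII codepoints such as U+00E9 are all allowed as-is.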

```toml
str = "I'm a string. \"You can quote me\". Name\tJos\xE9\nLocation\tSF."
@@ -282,19 +281,18 @@ For convenience, some popular characters have a compact escape sequence.
\e - escape (U+001B)
\" - quote (U+0022)
\\ - backslash (U+005C)
\xHH - unicode (U+00HH)
\uHHHH - unicode (U+HHHH)
\UHHHHHHHH - unicode (U+HHHHHHHH)
\xHH - codepoint (U+00HH)
\uHHHH - codepoint (U+HHHH)
\UHHHHHHHH - codepoint (U+HHHHHHHH)
```

Any Unicode character may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH`
Any codepoint may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH`
forms. The escape codes must be Unicode
[scalar values](https://unicode.org/glossary/#unicode_scalar_value).
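
As a rough illustration of the scalar-value requirement, the check below assumes the usual Unicode definition: any codepoint from U+0000 to U+10FFFF except the surrogate range U+D800 to U+DFFF. The function name is hypothetical.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """Return True if `cp` is a Unicode scalar value.

    Assumes the standard definition: the codepoint range U+0000..U+10FFFF
    minus the surrogate codepoints U+D800..U+DFFF.
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
```

With this definition, an escape such as `\uD800` would be rejected even though it fits the `\uHHHH` form syntactically.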

Keep in mind that all TOML strings are sequences of Unicode characters, _not_
byte sequences. For binary data, avoid using these escape codes. Instead,
external binary-to-text encoding strategies, like hexadecimal sequences or
[Base64](https://www.base64decode.org/), are recommended for converting between
All TOML strings are UTF-8 encoded, _not_ byte sequences. For binary data, avoid
Review comment by @ChristianSi (Contributor), Jun 10, 2025:

We should change "Unicode characters" to "codepoints" or "Unicode codepoints", but other than that, the old wording is better. A UTF-8 encoded string _is_ a byte sequence, since an encoding (UTF-8, UTF-16, etc.) converts a sequence of codepoints into a sequence of bytes. But conceptually, a TOML string is a sequence of codepoints, not a byte sequence, and we must be careful not to confuse the two. The encoding into bytes happens when the TOML file is written to disk (or into a byte array or whatever); the decoding happens when it's read from disk.

using these escape codes. Instead, external binary-to-text encoding strategies,
like hexadecimal sequences or base64, are recommended for converting between
bytes and strings.
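
A minimal Python sketch of the distinction and of the Base64 recommendation follows. The key name `blob` and the sample payload are made up for illustration.

```python
import base64

# A string is a sequence of codepoints; its UTF-8 encoding is a separate byte sequence.
text = "Jos\u00e9"
codepoint_count = len(text)              # counts codepoints
byte_count = len(text.encode("utf-8"))   # counts bytes; U+00E9 takes two in UTF-8

# Arbitrary binary data (not valid UTF-8), carried in a TOML string via Base64.
payload = bytes([0x00, 0xFF, 0x10, 0x80])
encoded = base64.b64encode(payload).decode("ascii")
toml_line = f'blob = "{encoded}"'        # hypothetical key name

# The consumer reverses the encoding after reading the string value.
decoded = base64.b64decode(encoded)
```

The Base64 alphabet contains no quotation marks, backslashes, or control characters, so the encoded value needs no escaping inside a basic string.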

All other escape sequences not listed above are reserved; if they are used, TOML
@@ -307,6 +305,11 @@ like to break up a very long string into multiple lines. TOML makes this easy.
side and allow newlines. A newline immediately following the opening delimiter
will be trimmed. All other whitespace and newline characters remain intact.

Any codepoint may be used except those that must be escaped: backslash and the
control characters other than tab, line feed, and carriage return (U+0000 to
U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns (U+000D) are
only allowed as part of a newline sequence.

```toml
str1 = """
Roses are red
@@ -349,11 +352,6 @@ str3 = """\
"""
```

Any Unicode character may be used except those that must be escaped: backslash
and the control characters other than tab, line feed, and carriage return
(U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns
(U+000D) are only allowed as part of a newline sequence.

You can write a quotation mark, or two adjacent quotation marks, anywhere inside
a multi-line basic string. They can also be written just inside the delimiters.

@@ -371,8 +369,10 @@ If you're a frequent specifier of Windows paths or regular expressions, then
having to escape backslashes quickly becomes tedious and error-prone. To help,
TOML supports literal strings which do not allow escaping at all.

**Literal strings** are surrounded by single quotes. Like basic strings, they
must appear on a single line:
**Literal strings** are surrounded by single quotes and don't support `\`
escapes. Any codepoint may be used except for control characters other than tab.

Like basic strings, they must appear on a single line:

```toml
# What you see is what you get.
@@ -383,11 +383,13 @@ regex = '<\i\c*\s*>'
```

Since there is no escaping, there is no way to write a single quote inside a
literal string enclosed by single quotes. Luckily, TOML supports a multi-line
version of literal strings that solves this problem.
literal string enclosed by single quotes. TOML supports a multi-line version of
literal strings that solves this problem.

**Multi-line literal strings** are surrounded by three single quotes on each
side and allow newlines. Like literal strings, there is no escaping whatsoever.
side and allow newlines. Like literal strings, there are no `\` escapes. Any
codepoint may be used except for control characters other than tab.

A newline immediately following the opening delimiter will be trimmed. TOML
parsers must normalize newlines in the same manner as multi-line basic strings.

@@ -417,8 +419,6 @@ apos15 = "Here are fifteen apostrophes: '''''''''''''''"
str = ''''That,' she said, 'is still pointless.''''
```

Control characters other than tab are not permitted in a literal string.

## Integer

Integers are whole numbers. Positive numbers may be prefixed with a plus sign.