Redefine char and unicodeChar for correct Unicode usage #71
Conversation
mbtaylor commented Aug 7, 2025
- char primitives are now UTF-8-encoded bytes,
- unicodeChar primitives are now UTF-16-encoded byte pairs for BMP characters (non-BMP is excluded)
- unicodeChar is deprecated
- a little bit of surrounding text is rephrased to match Unicode concepts
- there is some non-normative text explaining the implications of UTF-8 being a variable-width encoding
- references to UCS-2 are removed
This is the PR that I threatened on the Apps mailing list on 17 July, and follows on from the discussion on that thread and from @msdemlei's presentation in College Park. It tackles issues #55 and #69, covering both getting rid of references to UCS-2 and redefining the char and unicodeChar types.
rra
left a comment
In the world of Unicode RFCs, the standards are fairly careful to always use the term "octet" for the individual 8-bit storage unit of an encoding. This text uses "byte" throughout. I think that's a reasonable choice these days, given that all the computing architectures that used non-8-bit bytes are very obsolete, but it might be worth aligning the terminology on octet just to avoid any confusion for readers who are moving back and forth between the RFC world and the IVOA standard world and might wonder if there's some difference between byte and octet.
I'm not sure where to put this in the document, but it feels worthwhile to add a fairly explicit warning for char that if a value is truncated to fit a length restriction on the column, it may be shorter than the number of octets given in arraysize, and therefore implementations cannot use length == arraysize as a flag to detect possibly truncated values.
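As a minimal Python sketch of that failure mode (the truncate_utf8 helper and the arraysize value are illustrative, not from the spec): truncating on UTF-8 character boundaries can leave a value shorter than arraysize.

```python
# Truncate a UTF-8 value to at most max_bytes without splitting a
# multi-byte sequence; an incomplete final sequence is dropped.
# Because value.encode() always yields valid UTF-8, errors="ignore"
# only discards that incomplete tail.
def truncate_utf8(value: str, max_bytes: int) -> bytes:
    cut = value.encode("utf-8")[:max_bytes]
    return cut.decode("utf-8", errors="ignore").encode("utf-8")

print(truncate_utf8("naïve", 3))  # b'na' -- only 2 bytes for arraysize="3"
```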
Quoted from VOTable.tex:

> For this type the primitive size of two bytes corresponds to a 2-byte UTF-16 {\em code unit}. Only characters in the Unicode Basic Multilingual Plane, which all have 2-byte representations, are permitted for this datatype, so that the primitive count matches the character count.
I know that you were trying to drop all the UCS-2 references, but I think you've essentially defined UCS-2 here without using that term explicitly. Maybe it would be a bit easier for implementers to understand if the text says that this is UCS-2? As far as I understand it, UCS-2 is exactly UTF-16 with all planes other than the Unicode Basic Multilingual Plane banned so that every character is exactly two octets.
Should there be a recommendation here about what implementations should do if given unicodeChar that is actually in UTF-16 and therefore contains surrogate pairs? I know that we would like to leave unicodeChar behind us, but we discovered that current PyVO generates unicodeChar fields when given a CSV table to upload, so we may be living with it for a while and I bet implementations will encounter people uploading higher plane characters in the wild.
I'm happy to revert the language to UCS-2 (which is indeed identical to BMP-only UTF-16) if people think that's more comprehensible.
It may be that there are unicodeChar fields containing higher-plane characters out there somewhere, but that would have been illegal for earlier versions of VOTable and would be illegal with this version too. Given that, I feel like software is within its rights to do whatever it likes... but in practice it's likely to be UTF-16 so using a UTF-16 decoder would probably do the right thing even in the presence of illegal data, so maybe it's worth recommending that.
> I'm happy to revert the language to UCS-2 (which is indeed identical to BMP-only UTF-16) if people think that's more comprehensible.
I didn't express that well -- I like the change and I think it's great to connect this to UTF-16 because UTF-16 encoders are readily available but UCS-2 encoders may be a bit rarer. I was just thinking that it might be good to also note that UTF-16 with this restriction is just UCS-2.
> It may be that there are unicodeChar fields containing higher-plane characters out there somewhere, but that would have been illegal for earlier versions of VOTable and would be illegal with this version too. Given that, I feel like software is within its rights to do whatever it likes... but in practice it's likely to be UTF-16 so using a UTF-16 decoder would probably do the right thing even in the presence of illegal data, so maybe it's worth recommending that.
I like the idea of saying explicitly that you can use a UTF-16 decoder and accept technically invalid VOTables that contain surrogate pairs if you want to be generous in what you accept and can handle UTF-16, but when creating a VOTable, you must not include surrogate pairs.
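A Python sketch of that policy, with hypothetical helper names: strict on write (BMP only), generous on read (a plain UTF-16 decode that tolerates surrogate pairs).

```python
# Writer side: refuse non-BMP characters, as the spec requires.
def encode_unicode_char(value: str) -> bytes:
    if any(ord(c) > 0xFFFF for c in value):
        raise ValueError("non-BMP character not permitted in unicodeChar")
    return value.encode("utf-16-be")  # 2 bytes per character, no BOM

# Reader side: a standard UTF-16 decoder also handles technically
# invalid VOTables that contain surrogate pairs.
def decode_unicode_char(raw: bytes) -> str:
    return raw.decode("utf-16-be")
```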
VOTable.tex (outdated diff):
> the 2-byte big-endian UTF-16 encoding of a Unicode character from the Basic Multilingual Plane.
Here too, I think this is just another way of saying UCS-2.
I've been trying to read a variety of sources on UCS-2, all non-authoritative, and have not been able to get a completely clear answer to the question of whether there are any valid UCS-2 code points that would be interpreted differently in UTF-16. This is roughly, but not precisely, equivalent to asking whether […]
> Note that the primitive size of one byte refers to a single UTF-8-encoded byte, not to a single character. Since UTF-8 is a variable-width encoding, a character may require multiple bytes, and for arrays the string length (length in characters) and primitive count (length in bytes) will in general differ. 7-bit ASCII characters are however all encoded as a single byte in UTF-8, so in the case of ASCII characters, which were required for this datatype in earlier VOTable versions, the primitive and character count are equal.
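The distinction the quoted text draws, in a short Python illustration (the sample string is arbitrary):

```python
s = "Herschel-δ"               # 10 characters
encoded = s.encode("utf-8")
print(len(s))                  # 10 -- character count (string length)
print(len(encoded))            # 11 -- primitive count (δ takes two bytes)
```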
Perhaps I'm overlooking it, and it's already there, but I think it might be worth an explicit statement in the text that clarifies that a bare char without a length (a one-octet string) is limited to being able to store an ASCII character.
It's hard to find a formal definition of UCS-2 now because the UCS stuff is basically deprecated, but my understanding is that it is a direct mapping of the Unicode code points to a two-octet number per code point (and comes in either a big-endian or a little-endian variant). UTF-16 is defined to be exactly that mapping (see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G31699) except for surrogate pairs, and the code space for surrogate codes used in surrogate pairs is reserved for that purpose exclusively in https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G24089.
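A quick Python check of that equivalence (the ucs2_be helper is hypothetical): for BMP code points outside the surrogate range, a direct two-octet mapping and a UTF-16 encoder agree byte for byte.

```python
import struct

def ucs2_be(code_point: int) -> bytes:
    # Direct big-endian two-octet mapping, i.e. UCS-2.
    assert code_point <= 0xFFFF and not 0xD800 <= code_point <= 0xDFFF
    return struct.pack(">H", code_point)

assert ucs2_be(ord("é")) == "é".encode("utf-16-be")  # b'\x00\xe9' both ways
```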
Following comments from Russ Allberry and Gregory D-F, make a couple of adjustments to the description of the (now deprecated) unicodeChar type:
- note that the required encoding is just a rebadged UCS-2
- explicitly allow readers to treat it as standard UTF-16
I have to confess to working from Wikipedia rather than the RFCs for my Unicode information; the Wikipedia pages on e.g. UTF-8 and UTF-16 use the term "byte" throughout with no mention of octets. I feel like if it's good enough for Wikipedia it's probably good enough here, and given that the rest of the VOTable document uses the term byte throughout as well, I think confusion will be minimised by leaving the terminology as is. But if majority opinion is against me I'll change it.
I've tried to address this in c9859ed.
rra
left a comment
This version looks great to me.
With apologies for being late to the party, I have a few thoughts on this. I do think that this specification is clear and unambiguous. Usually that would be enough for me, but...

Clearly for deserializing BINARY and BINARY2 we need to know the number of bytes per value. This is also probably true for 2D char arrays in the TABLEDATA serialization, but there it is less clear to me since XML parsing and string storage techniques vary.

My main concern is for providers, for whom one of the most common VOTable errors we see is […]. A safe […].

For consumers, the practical issues are smaller and more about possible confusion.
On Mon, Oct 27, 2025 at 05:36:42PM -0700, Tom wrote:

> Clearly for deserializing BINARY and BINARY2 we need to know the
> number of bytes per value. This is also probably true for 2D
> `char` arrays in the TABLEDATA serialization, but there it is less
> clear to me since xml parsing and string storage techniques vary

Uh. Right. nD char arrays in Tabledata could be a killer. The only way we can define that is to say "You need to UTF-8 encode the payload before splitting it". Yikes.
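For concreteness, a Python sketch of what encode-before-splitting implies (the helper and the 4-byte width are illustrative; padding of a short final slice is omitted): a slice boundary can fall inside a multi-byte sequence, so the slices only make sense when decoded jointly, not one string at a time.

```python
# Split the UTF-8-encoded payload of a cell into fixed-width byte
# slices, as for a char array with a fixed leading dimension.
def split_char_array(cell_text: str, width: int) -> list[bytes]:
    payload = cell_text.encode("utf-8")
    return [payload[i:i + width] for i in range(0, len(payload), width)]

# é (0xC3 0xA9) straddles the boundary between the two slices:
print(split_char_array("abcédef", 4))  # [b'abc\xc3', b'\xa9def']
```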
That's right. I've included the following text in Section 2.2:

> […]

Do you think this needs spelling out more explicitly? I've presented it here from a read rather than write point of view; it should maybe be written "To encode or decode a multi-dimensional char array...". Is more rewording required?
@tomdonaldson, thanks for your thoughts.
If it was just TABLEDATA then you could (and probably should) count the number of characters rather than the number of bytes when looking at 2-d char arrays. But since for BINARY/2 it really has to be by (UTF-8) byte, I'm proposing it works the same way for TABLEDATA. Otherwise, the meaning of arraysize would differ between the two serializations.
Fortunately, this isn't true: if you write code that's correct for VOTable 1.6 (the system proposed by this PR) it will work correctly for all legal instances of earlier VOTable versions. The reason is that in VOTable versions to date it is only allowed to have 7-bit ASCII characters in char values, for which the byte count and the character count are identical.
That is true; if you're in a language where there is no facility for converting between a character string and a UTF-8 byte array then implementation is going to be tricky. Is this likely to be the case for any language we're interested in? The other points you make generally relate to difficulties in working out how many characters are required for a given arraysize declaration or vice versa. It's true that's hard and/or inefficient in the proposed scheme. My feeling is that, at least for scalar strings (1-d char arrays), the best thing in most cases for providers will now be to use variable-length arrays rather than a fixed arraysize, which is, as you say, of questionable utility. When parsing these things I'd consider just reading them in as variable-length strings rather than fixed-length ones even in the case of fixed arraysize, though there may be efficiency considerations for that depending on the platform. Admittedly you can't dodge a fixed arraysize dimension when working with 2-d char arrays, but that is a fairly unusual data type (and might possibly be addressed one day by xtype="strings"). One exception to the always-use-variable-arraysize rule of thumb could be (when authoring a VOTable) to use a fixed arraysize for strings known to be ASCII-like, for instance ISO-8601 dates. I should note that @fxpineau made a suggestion relevant to this, to do with using the […]
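A sketch of that rule of thumb in Python (the helper name is hypothetical; the NULL padding follows the BINARY termination convention discussed below): a fixed arraysize is unproblematic when the content is guaranteed ASCII, such as ISO-8601 dates.

```python
def encode_fixed_ascii(value: str, arraysize: int) -> bytes:
    encoded = value.encode("ascii")   # raises if non-ASCII slips in
    if len(encoded) > arraysize:
        raise ValueError("value longer than arraysize")
    return encoded.ljust(arraysize, b"\x00")  # NULL-pad to the fixed length

print(encode_fixed_ascii("2025-08-07", 20))  # 10 date bytes + 10 NULL bytes
```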
VOTable.tex (outdated diff):
Removed:
> and a character string may therefore be terminated by an ASCII NULL [0x00]

Added:
> indicates a Unicode string composed of UTF-8 encoded text. A string may be terminated by a NULL code point
With the previous wording, it was explicit that the NULL termination applied to BINARY/BINARY2 (not TABLEDATA). Is it worth clarifying that here?
VOTable.tex (outdated diff):
> by this standard, but readers MAY treat such characters normally if encountered, for instance by using a UTF-16 decoder on BINARY data, though note in this case the arraysize may no longer match the character count. Note this datatype is {\bf deprecated} from VOTable 1.6.
For emphasis, would it be useful to move this deprecation note to the beginning of the section?
tomdonaldson
left a comment
Thanks @mbtaylor for the responses and for this PR in general. I am now comfortable that this approach is a very reasonable way to handle the subtle complexities of UTF-8 values. In any case, I can't think of another approach that would be cleaner for readers or writers. I'm particularly compelled by the point that for 1D char arrays, writers are not required or even encouraged to use a fixed arraysize, and that readers other than validators are similarly not required to use the arraysize.
I'm comfortable with what's written as is, but left a couple minor suggestions for your consideration.
Another question occurred to me but also seems minor. That is, PARAM values use the same datatypes as FIELDs, but we don't explicitly mention that their value is essentially a quoted string matching a TABLEDATA serialization. In the unlikely event that one used multidimensional strings in a PARAM, is that worth clarifying in the value description?
There are a couple of adjustments to the text clarifying the intention of the Unicode-related changes.
Thanks @tomdonaldson, I've made those adjustments as suggested. Incidentally, a multi-dimensional char array in a PARAM […]