
Parsing JSON without allocating #353

Open
@samuelcolvin

Description


Currently, built-in JSON parsing works something like this:

  • parse the JSON bytes and create a JsonValue - effectively a recursive enum of vec, map, int, float, string, bool, null (see the sketch after this list)
  • pass the JsonValue to the validator
  • validator iterates over the JsonValue to run validation and thereby create python types
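For illustration, here's a minimal Rust sketch of what such a recursive value type looks like (this is not pydantic-core's exact definition, just the shape of the problem):

// every array, object and string below is a separate heap allocation
enum JsonValue {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<JsonValue>),
    Object(Vec<(String, JsonValue)>),
}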

This approach is not optimal for a number of reasons:

  • allocating the vecs, maps & strings is slow
  • iterating over the resultant vecs and maps is slow
  • we do all the JSON parsing before starting validation - in some scenarios we could stop JSON parsing early if we know we had the wrong type
  • we do all the work of checking that JSON strings are valid and creating the UTF-8 string, even when we will later want the raw bytes - e.g. for dates, datetimes, ints and floats
  • (limitation of serde, not so much the high level approach) serde throws away the line and column number for valid JSON - you can only get the file position if the JSON itself is invalid; we'd like to keep the line number so we can include it in validation errors (see the sketch after this list)
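To illustrate the last point, a quick sketch using the serde_json crate: once parsing succeeds, every container and string has been allocated, and there is no way to map a value back to its position in the input - line/column information only exists on the error type.

use serde_json::Value;

fn demo(bytes: &[u8]) {
    match serde_json::from_slice::<Value>(bytes) {
        // the whole document has been parsed and allocated before we can
        // inspect any of it, and all position information is gone
        Ok(value) => println!("parsed: {value}"),
        // line/column are only available when the JSON itself is invalid
        Err(err) => println!("error at {}:{}", err.line(), err.column()),
    }
}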

A better approach would be to validate the JSON as we parse it. I don't think this will be possible using serde, hence we'll need a new JSON parsing library to accomplish this.

I have an experimental JSON parsing library that I want to finish and open source which can hopefully accomplish this.

The interface will be something like this (note I'm using pseudo-Python here since it's easier to read, but obviously it's written in Rust):

# pseudo-Python: `Result` stands in for Rust's Result type
from __future__ import annotations

from typing import Generator, Generic, TypeVar

T = TypeVar("T")


class Result(Generic[T]):
    ...


class JsonIterator:
    def file_location(self) -> tuple[int, int]:
        ...

    def get_next_any(self) -> Result[None | bool | int | float | str | list | dict]:
        ...

    def get_next_array(self) -> Result[Generator[JsonIterator, None, None]]:
        ...

    def get_next_object(self) -> Result[Generator[tuple[str, JsonIterator], None, None]]:
        ...

    def get_next_bool(self) -> Result[bool]:
        ...

    def get_next_int(self) -> Result[int]:
        ...

    def get_next_str(self) -> Result[str]:
        ...

    def get_next_str_bytes(self) -> Result[bytes]:
        ...

    def get_next_none(self) -> Result[None]:
        ...

    # etc.


def parse_json(json: bytes) -> JsonIterator:
    ...

We would then implement our Input trait on JsonIterator.
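As a rough sketch of what that could look like (the names below are hypothetical and much simplified; the real Input trait in pydantic-core has many more methods and a richer error type):

struct JsonIterator; // stands in for the streaming parser sketched above

struct ValError {
    line: usize,
    column: usize,
}

impl JsonIterator {
    fn file_location(&self) -> (usize, usize) {
        (1, 1) // placeholder
    }

    fn get_next_int(&mut self) -> Result<i64, ()> {
        Ok(0) // placeholder: would read digits straight from the byte buffer
    }
}

// the validator asks the input for exactly the type it needs
trait Input {
    fn validate_int(&mut self) -> Result<i64, ValError>;
}

impl Input for JsonIterator {
    fn validate_int(&mut self) -> Result<i64, ValError> {
        // no intermediate JsonValue: delegate straight to the parser, and on
        // failure report where in the file we were
        self.get_next_int().map_err(|_| {
            let (line, column) = self.file_location();
            ValError { line, column }
        })
    }
}

The point of the sketch is that validation drives the parser directly, so nothing is allocated for values the validator never asks about, and file position is available at the moment an error is raised.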

There are a few things to iron out, like whether we need to finish parsing the current object in order to include "input_value" in exceptions (which would mostly remove the "stop parsing on error" advantage), but overall I think this could significantly speed up direct JSON parsing.

Feedback welcome. @davidhewitt thought you might have something interesting to say on this.

(To be clear, this will not be included in Pydantic V2, but might be the big feature of 2.1)
