
Parsing JSON without allocating #353

Open
@samuelcolvin

Description


Currently, built-in JSON parsing works something like this:

  • parse the JSON bytes and create a JsonValue - effectively a recursive enum of vec, map, int, float, string, bool, null (see the sketch after this list)
  • pass the JsonValue to the validator
  • validator iterates over the JsonValue to run validation and thereby create python types
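For illustration, here's a minimal Rust sketch of what such a recursive value type looks like (this is not pydantic-core's exact definition, just the shape of the problem):

// every array, object and string below is a separate heap allocation
enum JsonValue {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<JsonValue>),
    Object(Vec<(String, JsonValue)>),
}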

This approach is not optimal for a number of reasons:

  • allocating the vecs, maps & strings is slow
  • iterating over the resultant vecs and maps is slow
  • we do all the JSON parsing before starting validation - in some scenarios we could stop JSON parsing early if we know we had the wrong type
  • we do all the work of checking that JSON strings are valid and creating the UTF-8 string, even when we will later want the raw bytes - e.g. for dates, datetimes, ints and floats
  • (limitation of serde, not so much the high level approach) serde throws away the line and column number for valid JSON - you can only get the file position if the JSON itself is invalid; we'd like to keep the line number so we can include it in validation errors (see the sketch after this list)
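To illustrate the last point, a quick sketch using the serde_json crate: once parsing succeeds, every container and string has been allocated, and there is no way to map a value back to its position in the input - line/column information only exists on the error type.

use serde_json::Value;

fn demo(bytes: &[u8]) {
    match serde_json::from_slice::<Value>(bytes) {
        // the whole document has been parsed and allocated before we can
        // inspect any of it, and all position information is gone
        Ok(value) => println!("parsed: {value}"),
        // line/column are only available when the JSON itself is invalid
        Err(err) => println!("error at {}:{}", err.line(), err.column()),
    }
}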

A better approach would be to validate the JSON as we parse it. I don't think this will be possible using serde, hence we'll need a new JSON parsing library to accomplish this.

I have an experimental JSON parsing library that I want to finish and open source which can hopefully accomplish this.

The interface will be something like this (note I'm using pseudo-Python here since it's easier to read, but obviously it's written in Rust):

# pseudo-Python: `Result` stands in for Rust's Result type
from __future__ import annotations

from typing import Generator, Generic, TypeVar

T = TypeVar("T")


class Result(Generic[T]):
    ...


class JsonIterator:
    def file_location(self) -> tuple[int, int]:
        ...

    def get_next_any(self) -> Result[None | bool | int | float | str | list | dict]:
        ...

    def get_next_array(self) -> Result[Generator[JsonIterator, None, None]]:
        ...

    def get_next_object(self) -> Result[Generator[tuple[str, JsonIterator], None, None]]:
        ...

    def get_next_bool(self) -> Result[bool]:
        ...

    def get_next_int(self) -> Result[int]:
        ...

    def get_next_str(self) -> Result[str]:
        ...

    def get_next_str_bytes(self) -> Result[bytes]:
        ...

    def get_next_none(self) -> Result[None]:
        ...

    # etc.


def parse_json(json: bytes) -> JsonIterator:
    ...

We would then implement our Input trait on JsonIterator.
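As a rough sketch of what that could look like (the names below are hypothetical and much simplified; the real Input trait in pydantic-core has many more methods and a richer error type):

struct JsonIterator; // stands in for the streaming parser sketched above

struct ValError {
    line: usize,
    column: usize,
}

impl JsonIterator {
    fn file_location(&self) -> (usize, usize) {
        (1, 1) // placeholder
    }

    fn get_next_int(&mut self) -> Result<i64, ()> {
        Ok(0) // placeholder: would read digits straight from the byte buffer
    }
}

// the validator asks the input for exactly the type it needs
trait Input {
    fn validate_int(&mut self) -> Result<i64, ValError>;
}

impl Input for JsonIterator {
    fn validate_int(&mut self) -> Result<i64, ValError> {
        // no intermediate JsonValue: delegate straight to the parser, and on
        // failure report where in the file we were
        self.get_next_int().map_err(|_| {
            let (line, column) = self.file_location();
            ValError { line, column }
        })
    }
}

The point of the sketch is that validation drives the parser directly, so nothing is allocated for values the validator never asks about, and file position is available at the moment an error is raised.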

There are a few things to iron out, like whether we need to finish parsing the current object in order to include "input_value" in exceptions (which would mostly remove the "stop parsing on error" advantage), but overall I think this could significantly speed up direct JSON parsing.

Feedback welcome. @davidhewitt thought you might have something interesting to say on this.

(To be clear, this will not be included in Pydantic V2, but might be the big feature of 2.1)
