
Persistence subsystem refactoring #4133


Draft: wants to merge 2 commits into main

Conversation

@leventov (Contributor) commented Mar 17, 2025

📝 Summary

Implements the first stage of persistence subsystem refactoring, as discussed at length in #3176.

🔍 Description of Changes

This PR contains only the textual detailed design and implementation plan so far. This is the first version of the design and plan that looks consistent to me (no internal contradictions) and complete (sufficient for implementation): no high-level hand-waving descriptions, just specific implementation details and algorithms.

The hardest part was getting the inter-process locking and the "new snapshot creation" + "entry writing" machinery correct, especially in the presence of non-cooperative synchronization systems like Dropbox. I iterated through a lot of designs internally and kept finding race conditions in them until I arrived at the one described in the document. I don't claim it's optimal (likely not), but at least I'm relatively confident that it's free of concurrency bugs.

I'll add the actual code implementation of this detailed plan soon (I have already started writing it with Claude Code), and I expect it to take an order of magnitude less time than it took to create the plan itself (more than 1 month), unless new concurrency or other problems are found in the process.

I suggest that reviewers (@mscolnick and @dmadisetti) also reason and review at the level of the text description, because the code is going to be even bigger in volume than this description.

📋 Checklist

  • I have read the contributor guidelines.
  • For large changes, or changes that affect the public API: this change was discussed or approved through an issue, on Discord, or the community discussions (Please provide a link if applicable).
  • I have added tests for the changes made.
  • I have run the code and verified that it works as expected.

📜 Reviewers

@mscolnick @dmadisetti

vercel bot commented Mar 17, 2025

The latest updates on your projects. Learn more about Vercel for Git.

Name | Status | Updated (UTC)
marimo-docs | ✅ Ready | Mar 17, 2025 7:10pm
marimo-storybook | ✅ Ready | Mar 17, 2025 7:10pm



Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the same format as below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@dmadisetti (Collaborator) commented:

Thanks @leventov

It's going to take me a second to properly parse through all 16k words, but I'll try to get to it sometime later today.
If you have a guide to reading that would be appreciated.


After a quick read, here are some initial thoughts (feel free to chime in if it seems like I misunderstood something):

Persister

I can put in the "Persister" (I think in our previous discussions I called it an Adapter, but maybe we can compromise somewhere. Store? Storage? Interface?). I've been eager to rig this up with Redis, and also to try directly leveraging Nix's caching by manually creating .nar files.

Just a note: Loader vs BasePersistenceLoader has already been decoupled to some extent, but to disambiguate, loader might be better named Converter (though I will keep it loader for now).

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional

# CacheType, Cache, BaseConverter, CellImpl, and cell_runner come from
# marimo internals and are assumed to be importable here.


@dataclass
class EntryKey:
    execution_hash: str
    hash: str
    cache_type: CacheType
    path: Optional[Path]
# "created" is particular to the proposed "Persister/Adapter" method


class Store(ABC):
    @abstractmethod
    def get(
        self,
        entry: EntryKey,  # feel free to build an execution key from this
        **kwargs: Any,  # whatever other args might be needed
    ) -> Optional[bytes]:
        """
        Retrieve an entry from the persistence subsystem.
        """

    @abstractmethod
    def put(
        self,
        cache: Cache,
        converter: BaseConverter,
        **kwargs: Any,
    ) -> None:
        """
        Store EntryMetadata on self if needed.
        """
        blob: bytes = converter.to_blob(cache)
        # Do whatever with the blob, potentially switching on the loader.
        ...

    @abstractmethod
    def post_execution_hook(
        self,
        cell: CellImpl,
        runner: cell_runner.Runner,
        run_result: cell_runner.RunResult,
    ) -> None:
        # Do something like cleanup here
        ...

    @abstractmethod
    def __del__(self) -> None:
        """
        __del__ over on_exit.
        Suggested, since the store should be a singleton, only deleted on
        shutdown. Motivation: no current, easy shutdown hook.
        """
        ...
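For concreteness, a Redis-backed implementation of this interface might look roughly like the following (a minimal sketch assuming the redis-py package; the RedisStore name and key layout are illustrative, not part of the proposal):

import redis  # assumes the redis-py package is installed

class RedisStore(Store):
    def __init__(self, client: Optional["redis.Redis"] = None) -> None:
        self.client = client or redis.Redis()  # defaults to localhost:6379

    def get(self, entry: EntryKey, **kwargs: Any) -> Optional[bytes]:
        # Any key scheme that uniquely identifies an entry works.
        return self.client.get(f"marimo:{entry.cache_type}:{entry.hash}")

    def put(self, cache: Cache, converter: BaseConverter, **kwargs: Any) -> None:
        blob: bytes = converter.to_blob(cache)
        self.client.set(f"marimo:{cache.cache_type}:{cache.hash}", blob)

    def post_execution_hook(
        self,
        cell: CellImpl,
        runner: cell_runner.Runner,
        run_result: cell_runner.RunResult,
    ) -> None:
        pass  # Redis can handle eviction itself (e.g. LRU), so no cleanup here

    def __del__(self) -> None:
        self.client.close()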

Cache

I'm not sure what's gained by renaming Cache -> Entry. Also, in your description, this removes a fair amount of the effort and care required for managing State and UI objects in the original Cache spec:

class Cache:
    defs: dict[Name, Any]
    hash: str
    stateful_refs: set[str]
    cache_type: CacheType
    hit: bool

What I will do is expose

execution_hash: str

on the cache object (which is computed prior to the content-hash variables and should be a substitute for the base seed of your execution key, into which you might want to bake the file contents / path).
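That is, the Cache spec above would grow to something like this (a sketch; the field's placement is illustrative):

class Cache:
    defs: dict[Name, Any]
    hash: str
    execution_hash: str  # new: computed prior to the content-hash variables
    stateful_refs: set[str]
    cache_type: CacheType
    hit: bool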

App

The default behaviour, where object_id_mode is not explicitly passed to FsPersister (or FsPersister is not explicitly passed to App()), can be controlled via .marimo.toml (the so-called user configuration). So, an option like fs_persister.default_object_id_mode can be added to .marimo.toml. Extending the SQLite metaphor, this would be analogous to SQLite's compile-time options.

I don't think this requires serialization changes so much as just a user (./config/marimo.toml) change.
This can still be controlled on a per-notebook level with PEP 723 (see #3794).

@leventov (Contributor, Author) commented Mar 18, 2025

I can put in the "Persister" (I think our previous discussions I called it an Adapter, but maybe we can compromise somewhere. Store? Storage? Interface?). I've been eager to rig this up with Redis, and also try directly leverage Nix's caching by manually creating .nar files.

Store sounds fine to me, with FsPersistentStore and LegacyPersistentStore as the names of the two impls that were called FsPersister and LegacyPersister in the doc.

@dataclass
class EntryKey:
    execution_hash: str
    hash: str
    cache_type: CacheType
    path: Optional[Path]

Execution hash: is this the equivalent of block_id (block hash) in the doc? It seems ambiguous as a field name because, as you pointed out (see also my next comment in that thread), this should be the hash of the block code only, not a tree hash. Also, for mo.persistent_cache(name="foo"), this name should be the block_id (as should the names of named cells, when #3054 is implemented).

created_at is also important.

If these changes are OK with you, I'll add EntryKey in the code that I'm writing at the moment. Or did you intend to add it on your own, before my PR?

path: Optional[Path] -- is this coming from the save_path argument of mo.persistent_cache()?
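To make the suggestion concrete, here is how the key might look if both points land (a sketch; the field names are tentative pending your reply):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class EntryKey:
    block_id: str          # renamed from execution_hash: hash of the block code only
    hash: str
    cache_type: CacheType  # as in the sketch above
    path: Optional[Path]   # presumably from save_path of mo.persistent_cache()
    created_at: datetime   # added per the point above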

    @abstractmethod
    def put(
        self,
        cache: Cache,
        converter: BaseConverter,
        **kwargs: Any,
    ) -> None:

As part of my wanderings in the design space (software abstraction and interface space, in this case), I considered providing an argument like converter (an object) to [Persister|Store].get() and put(). I then stripped that out because the conversion logic is going to be too intertwined with FsPersistentStore.put() logic: it will create a custom Pickler subclass with an overridden Pickler.persistent_id() whose logic depends on the object_id_mode... A nice and clean converter abstraction that would work equally well for the legacy store and the new store will not survive. Luckily, I don't think it's important, because we aren't going to have many of them (or any beyond pickle and json) and I don't think users will be eager to add their own. That's why in the design that I proposed, the conversion/serialization format is specified via ContentSpec = Mapping[ContentKind, ContentFormat], where ContentFormat = Literal["pickle", "json", "auto"]; so, the conversion format is specified as "pickle" rather than as a converter object.
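Spelled out (ContentKind's actual definition is in the doc; I alias it to str here just for illustration):

from typing import Literal, Mapping

ContentKind = str  # stand-in for illustration; the doc defines the real kinds
ContentFormat = Literal["pickle", "json", "auto"]
ContentSpec = Mapping[ContentKind, ContentFormat]

# e.g., choosing formats per content kind instead of passing converter objects
# (the kind names below are illustrative):
spec: ContentSpec = {"outputs": "pickle", "metadata": "json"}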

post_execution_hook

Not sure what you mean by this. Why should the logic be different from what's proposed in the doc? In the current proposal, cleanup may be triggered right from put(). Is your concern here that it might make put() not "real-time", so that even if the cleanup trigger conditions are ripe, you would like the imminent cleanup to be delayed until after post_execution_hook() and executed in the background? I think it's possible, and I can do this, but I want to clarify that this is what you meant.

def __del__(self)

https://stackoverflow.com/a/1481512/ argues against __del__; any objections to using https://docs.python.org/3/library/atexit.html instead?
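I.e., something along these lines instead of __del__ (a minimal sketch; close() is a hypothetical method name):

import atexit

class FsPersistentStore:
    def __init__(self) -> None:
        # The store is a singleton, so this is registered once per process.
        atexit.register(self.close)

    def close(self) -> None:
        # Flush pending writes, release inter-process locks, etc.
        ...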

I'm not sure what's gained by renaming Cache -> Entry. Also, in your description, this removes a fair amount of the effort and care required for managing State and UI objects in the original Cache spec:

The main reason why I want to rename is that "Cache" is just a confusing name for this abstraction: "cache" is universally understood as the store as a whole, not a single entry within it. It's also a misnomer because a "cache" is something ephemeral (even if on-disk), while FsPersistentStore can be used not just as an "ephemeral cache" but also as a "persistent execution store". This by itself may delude LLM coding agents (though less so humans), but what will definitely confuse them more is the conceptual clash if there is a mash-up of "entries" and "caches". LLM coding agents are very sensitive to such inconsistencies. Otherwise, I should make the naming throughout the refactored persistence subsystem completely consistent with "cache" (including cache_log/, cache_snapshots/, the CacheMetadata class, etc.). Don't you think that would be way too confusing and misnomer-y for a persistence subsystem?

Having said that, we cannot actually remove or even rename the Cache class itself, because instances of this class are pickled by PickleLoader and we need at least to be able to keep unpickling those already stored in __marimo__/cache, or perhaps even keep pickling them as Cache objects after the refactoring so that __marimo__/cache doesn't store a mash-up of pickled Cache and Entry objects. So, Cache will survive, but I intend to confine it to the internals of Loader implementations only (PickleLoader and perhaps JsonLoader), and strip it as a concept and name from everywhere else.

in your description, this removes a fair amount of effort and care required for managing State and UI objects

If my description does this, it's my ignorance and I should change the description. I didn't intend to affect the existing functionality. I thought I had already handled this by putting CacheType in Entry.extra and storing stateful_refs in MemoryLoader along with Entry objects. I will try to understand why that's not enough; specific pointers appreciated.

I don't think this requires serialization changes so much as just a user (./config/marimo.toml) change.
This can still be controlled on a per-notebook level with PEP 723 (see #3794).

All [fs_persister.defaults] and [fs_persister.default_cleanup_strategy] configs exactly mirror the constructor arguments of FsPersistentStore. So, on the notebook level, the "config" is just importing and constructing FsPersistentStore explicitly with the needed arguments and then passing it into the App() constructor. [fs_persister] is meant to be an additional layer of "global defaults" that apply to all notebooks (apps) within the scope of that .marimo.toml, to alleviate the need to repeat configuration in every notebook. Does this sound right, or would you change this?
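For illustration, the per-notebook override described above might look like this (a sketch: the object_id_mode value and the App(store=...) hook are assumptions, not settled API):

from marimo import App

store = FsPersistentStore(
    object_id_mode="auto",  # illustrative value; mirrors fs_persister.default_object_id_mode
    # cleanup kwargs would mirror [fs_persister.default_cleanup_strategy]
)
app = App(store=store)  # assumed keyword argument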

@dmadisetti (Collaborator) commented:

I got Store working with Redis yesterday, and was working on a remote Nix cache and tests, but I'll push what I have up; I think having something concrete to review will go a long way.


In the current proposal, cleanup may be triggered right from put()

I'm not sure it should be at the abstract level.

atexit

Maybe; we'd have to look over marimo's teardown. It's not system exit, it's runtime teardown when this would need to trigger.

configs exactly mirror the constructor arguments to FsPersistentStore

You'll see the pattern I use for Redis; this isn't a problem with config alone. I think it makes sense to not change serialization much. Top-level function changes will potentially make notebooks more dangerous, especially considering a push to statically load apps for edit mode.

@dmadisetti (Collaborator) commented:

@leventov do you think you could build on #4147 (after merge) in a way that doesn't disrupt current functionality? Ideally starting with a proof of concept that can be expanded with successive PRs?

I'm still not 100% clear on your implementation plans, and your doc doesn't explain why it's an improvement (even if it's self-evident to you: what are the limitations of the current mechanism, and how does your proposal address those concerns? Cleanup and concurrency are fair, but solutions like Redis with LRU turned on already address those). I've quickly read through your doc several times, and it needs a bit more organization and concision.

Happy to put work where I think there is synergy (like #4147)

@leventov (Contributor, Author) commented:

@dmadisetti I already rebased the code on top of the dm/cache-store branch, although it doesn't "integrate" yet because there are interface mismatches, described here. App.save_key is another important outstanding thing.

Cleanup and concurrency are fair, but solutions like a Redis with LRU turn on already address those

Focusing on Redis contradicts the desideratum in the first message of #3176:

Agnostic towards the way cache is shared between machines and/or backed up: Git/Github, git-annex, CI systems cache directories (such as Github Actions's cache action), Marimo Cloud, Dropbox, rsync, backup systems, etc.

Redis specifically is not durable (and if there is a durable mode, I don't trust it: I don't trust Redis in general and avoid it whenever possible), and it's an extra stateful runtime that has to be kept around, whereas it seems many serverless approaches can work (or piggy-back on other server things that the user already runs, such as a Git server).

Another big point is deduplication. Entry/object separation is needed basically to enable deduplication. For the persistent execution use case, deduplication should be a big deal because it's expected that there will be lots of cells with differing inputs or code (and hence different hashes) whose output stays the same.
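To illustrate the mechanics (a hypothetical sketch, not the actual on-disk layout): multiple entries can point to one content-addressed object, so identical outputs are stored once:

import hashlib

objects: dict[str, bytes] = {}  # content-addressed blobs
entries: dict[str, str] = {}    # execution key -> object id

def put(execution_key: str, blob: bytes) -> None:
    object_id = hashlib.sha256(blob).hexdigest()
    objects.setdefault(object_id, blob)  # identical blobs stored only once
    entries[execution_key] = object_id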

Entries (execution keys, more specifically) are also designed to be extensible in the future to accommodate versions; I discuss them in #3176.

What I'm implementing is not a "new" design; it's exactly the design that you (seemingly) agreed with at the end of the #3176 discussion, already with concessions from my perspective. The doc committed in this PR is just an elaboration of that design, because what seemed "simple" at the hand-wavy level and could be described in a couple of sentences turned out to require a lot of careful adjustment of certain details about locking, operation sequencing, formats, etc.
