Skip to content

Conversation

michitux
Copy link
Contributor

@michitux michitux commented Aug 5, 2025

Jira URL

https://jira.xwiki.org/browse/XCOMMONS-3250

Changes

Description

  • Introduce a blob storage API with implementations for file system and Amazon S3.

Clarifications

  • This is an initial draft, the S3 version is probably missing some essential configuration options for third-party providers.
  • The code is currently untested and also missing unit tests.
  • Before this is merged/released, I think we should migrate at least one use case in XWiki to this new blob storage to see if the API actually works.
  • This PR exists to provide visibility into the ongoing work and to get feedback on the API.

Screenshots & Video

No UI changes.

Executed Tests

None so far.

Expected merging strategy

  • Prefers squash: Yes
  • Backport on branches:
    • None

@michitux michitux self-assigned this Aug 6, 2025
* A Blob is a piece of data stored in a BlobStore.
*
* @version $Id$
* @since 17.7.0RC1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not going to be the case, but anyway, hard to be sure what to put yet.

* @throws BlobStoreException if the InputStream cannot be read or the blob cannot be written, for example because
* its name is invalid.
*/
void writeFromStream(InputStream inputStream, WriteCondition... conditions) throws BlobStoreException;
Copy link
Member

@tmortagne tmortagne Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think there are cases when getOutputStream would not be enough or that passing the inputstream to write would allow some optimizations ? Otherwise, I don't see much point of exposing this API, even as a helper IOUtils#copy should be good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure either if we need it. S3 has special support for efficiently uploading large input streams in the async API (that the current implementation isn't using) so we could avoid that output stream with manual chunking. So from that point of view, it would actually be better to get rid of getOutputStream.

I forgot adding a comment, but I already thought that we should get rid of one of the two methods, depending on which one we actually use.

Copy link
Member

@tmortagne tmortagne Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, getOutputStream is more generic, and I think we need it for some use cases where it might not be easy to provide an InputStream for what we need to write. But if you have a case where passing an InputStream is better for performance in some implementation, then +1 to keep both (and probably recommend using #writeFromStream, when possible, in the javadoc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I've added a comment to reconsider the method.

* @since 17.7.0RC1
*/
@Unstable
public interface BlobStore
Copy link
Member

@tmortagne tmortagne Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A #moveBlob(BlobPath source, BlobPath target) could be interesting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, though for S3 this will be just copy + delete from what I can see. For the filesystem backend it probably makes sense to have a better implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for S3 this will be just copy + delete from what I can see

I figured it was why it was not there, but even for S3 maybe there is something we can do to make this operation as atomic as possible. Anyway, it's good to have it since it's a basic operation in many stores.

}

/**
* Create a BlobPath by splitting a slash-delimited string.
Copy link
Member

@tmortagne tmortagne Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this path is to be used as a generic syntax, we probably need to support any path (for example, the current parsing/serializing code does not seem to have any way to escape /). But I guess you have that in mind already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not probably not clear enough, but the idea wasn't that this API would be used with user-supplied paths. The idea was really that this is an internal API that is used with already "safe" paths and just replaces using raw filesystem access.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, right now it's not internal at all (it's a public API). I also think that it could be interesting to have a generic syntax to allow us to serialize the BlobPath in case we need to refer to it somewhere else (the database, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant internal in the sense that it shouldn't be exposed to users/user-controlled strings in the same way as regular file access.

In a very first version, I used Path instead of the BlobPath class, but that was terrible from a readability point of view as you had no idea if a Path was an actual native filesystem path or a virtual path that is rooted at the base directory of the store. The idea of BlobPath is to offer something in between a simple string and a full Path, something that is easier to use than just concatenating strings to, e.g., append something to the path, or get the parent "directory". I would also want to try actually using it a bit before we finalize that design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what the current code is missing is a validation that segments of the BlobPath don't contain / or \. We should also forbid . and .. as segments and maybe some other sanity checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added very basic validation for segments for now. I'll revisit this when trying to use the API to see how useful the methods are in actual use.

*/
@Role
@Unstable
public interface BlobStoreManager
Copy link
Member

@tmortagne tmortagne Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we would need a way to dispose/close a BlobStore, when we are sure it won't be needed anymore (for example when the component which needs it has been unloaded), in case it holds some resources (memory, Cache to dispose, some open connection, etc.).

Copy link
Contributor Author

@michitux michitux Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, though at the moment it's not really needed. The only important thing regarding disposing/closing right now is to close all input streams that are returned from an S3Blob in a timely manner as each of them will hold an open HTTP connection and thus blocks one connection of the connection pool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I've added very generic dispose support in DefaultBlobStoreManager to dispose all blob stores that implement Disposable when the blob store manager is disposed.

Copy link

socket-security bot commented Oct 8, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedsoftware.amazon.awssdk/​s3@​2.33.161100100100100
Addedsoftware.amazon.awssdk/​apache-client@​2.33.196100100100100

View full report

…n object store

* First draft of module structure, API and filesystem implementation.
…n object store

* Extended the blog store API a bit.
* Added an S3 implementation, the initial draft was by Claude Sonnet 4.
  Untested.
…n object store

* Fix since versions and component metadata.
* Remove duplicate instance caching from S3 blob store manager.
…n object store

* Add TODOs to the S3BlobStore implementation.
…n object store

* Update module versions after rebase on master.
…n object store

* Introduce AbstractBlob to reduce code duplication between Blob
  implementations.
* Remove redundant copy/delete methods in Blob.
* Add a comment to reconsider writeFromStream.
* Add path segment validation to BlobPath.
* Rename the raw path to canonical and directly call "hashCode()" on it.
* Move the S3 blob store configuration to its own component.
* Add the possibility that a blob store can be disposed.
…n object store

* Allow empty BlobPath objects and make getParent easier to use.
* Add a dedicated method to append a suffix to the last segment of a
  BlobPath.
* Add support for moving blob in and between stores.
* Fix copying blobs in S3 between different stores.
* Implement moving blobs with the file system and S3.
* Update the S3 library version.
…n object store

* Add directory-related APIs to BlobStore.
* Simplify the stream implementation of the S3BlobStore.
* Add filter-stream output/input for blobs.
* Let getStream() only throw BlobStoreException for consistency
* Rebase version to 17.9.0-SNAPSHOT
…n object store

* Add support for deleting blobs with a prefix.
…n object store

* Fix S3BlobStore#isEmptyDirectory returning the opposite result.
…n object store

* Add support for migrating between stores.
…n object store

* Add some tests for the blob store API.
* Add extensive tests for the filesystem blob store.
* Add equals and hashCode to all blob stores.
* Some fixes and a lot of bulletproofing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants