opencast · LukasKalbertodt · Mar 20, 2025 · May 22, 2025 · May 22, 2025 · May 22, 2025
diff --git a/docs/common/acl.md b/docs/common/acl.md
@@ -0,0 +1,26 @@
+---
+sidebar_position: 1
+---
+
+# ACL (access control list)
+
+ACLs control access to Opencast entities.
+An ACL is simply a list of `role` + `action` pairs.
+An entry gives users that have that particular `role` the permission to perform the specified `action` on the entity.
+Both `role` and `action` have the [type `Label`](./types).<sup>(1?)</sup>
+
+There are two special actions recognized by Opencast.
+Other actions can be used for custom purposes by external applications.
+- `read`: generally, gives read access to an entity
+- `write`: generally, gives write access to an entity (changing or deleting it)
+
+*Impl note*: `read` and `write` roles should likely be stored in a way that allows for fast filtering, e.g. in a `read_roles` DB column that has a DB index.
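A minimal Python sketch of how such a list of `role` + `action` pairs could be evaluated. The names `AclEntry`, `is_allowed` and `roles_for`, as well as the example roles and the `annotate` action, are hypothetical and not part of this specification. `roles_for` only illustrates the kind of per-action role set that a `read_roles` column could hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    """One ACL entry: users holding `role` may perform `action`."""
    role: str
    action: str


def is_allowed(acl: list[AclEntry], user_roles: set[str], action: str) -> bool:
    """Return True if any of the user's roles is granted `action`."""
    return any(e.action == action and e.role in user_roles for e in acl)


def roles_for(acl: list[AclEntry], action: str) -> set[str]:
    """Collect all roles granted `action`, e.g. to fill a `read_roles` column."""
    return {e.role for e in acl if e.action == action}


acl = [
    AclEntry("ROLE_STUDENT", "read"),
    AclEntry("ROLE_TEACHER", "read"),
    AclEntry("ROLE_TEACHER", "write"),
    AclEntry("ROLE_TEACHER", "annotate"),  # custom action for an external app
]
```

Precomputing `roles_for(acl, "read")` per entity is what would make filtering by role a simple indexed lookup instead of a scan over all ACL entries.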
+
+
+---
+
+:::danger[Open questions]
+
+- (1?) Is it fine to restrict roles and actions like that? Or can we restrict it even more?
+
+:::
diff --git a/docs/common/index.md b/docs/common/index.md
@@ -0,0 +1,57 @@
+---
+sidebar_position: 3
+---
+
+# Common specifications
+
+## Data storage
+
+The single source of truth for everything is the database (DB) plus files on the file system¹ referenced by the DB.
+Every piece of information is only stored in one place in the DB.
+
+Only a handful of files are stored on the file system:
+- Binary and/or large files like video, audio, images, ...
+- Files that need to be delivered in a specific format anyway (VTT subtitles, ...)
+
+:::info[Differences from current OC model]
+In particular, textual metadata, ACLs, cutting information or anything like that is _not_ stored on the file system!
+(Some APIs might still accept or produce this information in non-JSON exchange formats.)
+:::
+
+The database never references files by absolute path or URL.
+At most, it stores a path relative to the configured `storage.dir`, but potentially in an even more implicit way.
+
+
+(¹) File system = local file system, or NFS, or S3 storage or potentially others.
+
+### Derived data storage
+
+For different purposes, it might be useful to store the same data again in a different form.
+For example, using a search index for full text search.
+(Note however: whenever possible and useful, use DB indices built into the DBMS.)
+
+These derived data sources can be slightly behind the DB (e.g. due to indexing times), which is acceptable.
+However, it is crucially important that data only ever flows from the DB into other data stores, _never_ the other way around.
+Deleting all derived data stores must never result in data loss as they can always be regenerated from the DB.
+Rebuilding derived data stores must always produce the same result, regardless of what the derived store previously contained.
+Opencast should do its best to keep the derived data stores in sync in a timely manner.
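The one-way data flow can be illustrated with a small Python sketch. It assumes a toy in-memory "DB" and a toy word index; `rebuild_search_index` and the row layout are hypothetical stand-ins, not an Opencast API.

```python
def rebuild_search_index(db_events: dict[str, dict]) -> dict[str, set[str]]:
    """Build a word -> event IDs index purely from DB rows.

    The function takes no previous index as input, so its result cannot
    depend on what the derived store contained before.
    """
    index: dict[str, set[str]] = {}
    for event_id, row in db_events.items():
        for word in row["title"].lower().split():
            index.setdefault(word, set()).add(event_id)
    return index


db = {"e1": {"title": "Banana bread"}, "e2": {"title": "Banana split"}}
stale_index = {"apple": {"e9"}}   # whatever the derived store held before
index = rebuild_search_index(db)  # the stale contents are simply discarded
```

Because the rebuild reads only from the DB, deleting the index loses nothing, and running the rebuild twice yields the same index.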
+
+
+## Promised properties
+
+This data model promises certain properties about certain fields/data, for example: "there is a non-empty title", "this is an array of strings" or "the duration matches the duration of all tracks".
+
+- It's Opencast's responsibility to ensure these properties. Whenever an entity is added or changed, these properties need to be maintained, usually by rejecting the change request (e.g. 4xx response in API).
+- If an entity does not have these properties, this should be considered a bug in Opencast and should be fixed ASAP.
+  - We should never find ourselves in a situation where external apps (e.g. LMS plugins) need to work around a broken property of Opencast.
+- The same goes for legacy events, which might be broken in the new model. They cannot be kept as is, they need to be changed/migrated to exhibit these properties.
+- The implementation should try, wherever possible, to make broken events impossible to represent. As a simple example, the title field in the DB should be `non null`.
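As a rough Python sketch of "make broken events impossible to represent": validation lives in the constructor, so an invalid event can never be built, and the API layer turns the failure into a 4xx status. The `Event` fields, `InvalidEntity` and `create_event` are hypothetical, and `str.strip()` is only an approximation of the `NonBlankString` rule.

```python
from dataclasses import dataclass


class InvalidEntity(ValueError):
    """Raised when a change would break a promised property."""


@dataclass(frozen=True)
class Event:
    title: str
    duration_ms: int

    def __post_init__(self) -> None:
        # Promised property: there is a non-empty title.
        if not isinstance(self.title, str) or not self.title.strip():
            raise InvalidEntity("title must not be blank")
        # Promised property: a duration is never negative (no `-1` sentinel).
        if self.duration_ms < 0:
            raise InvalidEntity("duration must not be negative")


def create_event(payload: dict) -> int:
    """Return a hypothetical HTTP status: 201 if accepted, 400 if rejected."""
    try:
        Event(title=payload["title"], duration_ms=payload["duration_ms"])
    except (KeyError, InvalidEntity):
        return 400
    return 201  # persisting the event is omitted in this sketch
```

The same checks would ideally be mirrored by DB constraints, so that code paths bypassing this class cannot store a broken event either.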
+
+
+## Well defined API response
+
+Opencast's API should have a well-defined, typed response that is derived from code in a [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) fashion.
+Specifically, the API documentation for e.g. `GET /event/{id}` needs to specify what kind of JSON object will be returned by the API.
+This could be done via a [JSON Schema](https://json-schema.org/) or via GraphQL or other means.
+Someone interested in using the API should know _exactly_ what response to expect, without sending a single test request to the API.
+It is important that the response specification is generated from the same code that is used for the actual API response serialization, to ensure they are always in sync.
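One possible shape of "schema and serialization from the same code", sketched in Python with only the standard library. `EventResponse`, its fields and the tiny `json_schema` helper are assumptions for illustration; a real implementation would more likely rely on an established schema generator.

```python
import json
from dataclasses import asdict, dataclass, fields

# Mapping from Python field types to JSON Schema type names (sketch only).
JSON_TYPES = {str: "string", int: "integer", bool: "boolean"}


@dataclass
class EventResponse:
    id: str
    title: str
    duration_ms: int


def json_schema(cls) -> dict:
    """Derive a very small JSON Schema from the dataclass definition."""
    props = {f.name: {"type": JSON_TYPES[f.type]} for f in fields(cls)}
    return {"type": "object", "properties": props, "required": list(props)}


def serialize(response: EventResponse) -> str:
    """Serialize using the same dataclass the schema is derived from."""
    return json.dumps(asdict(response))
```

Adding or renaming a field in `EventResponse` changes both the published schema and the emitted JSON at once, so the two cannot drift apart.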
diff --git a/docs/common/types.md b/docs/common/types.md
@@ -0,0 +1,53 @@
+---
+sidebar_position: 2
+---
+
+# Common types
+
+These are types used throughout the rest of this specification and defined here once to avoid repetition.
+
+- `string`: a valid UTF-8 string. While being processed in code, it might be in a different encoding temporarily, but in the public interface of Opencast, these are always valid UTF-8.
+- `NonBlankString`: A string that is not "blank", meaning it is not empty and does not consist only of [Unicode `White_Space`](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt).
+- `NonBlankAsciiString`: A `NonBlankString` that is also restricted to only using ASCII characters.
+- `Label`: a `NonBlankAsciiString` that only consists of letters, numbers or `-._~!*:@,;`. This means a label is URL-safe except for use in the domain part.<sup>(2?)</sup>
+- `ID`: a `Label` that cannot be changed after being created.
+- `Username`: TODO define rules for usernames
+- `LangCode`: specifies a language and optionally a region, e.g. `en` or `en-US`. Based on the [IETF BCP 47 language tag specification](https://www.rfc-editor.org/info/rfc5646): a two letter language code, optionally followed by a hyphen and a two letter region tag.
+- `int8`, `int16`, `int32`, `int64`: signed integers of specific bit width.
+- `uint8`, `uint16`, `uint32`, `uint64`: unsigned integers of specific bit width.<sup>(1?)</sup>
+- `Milliseconds`: a `uint64` representing a duration or a video timestamp in milliseconds (ms). Impl note: whenever possible, in code, this should be a custom type and not just `int`.
+- `DateTime`: a date + time with timezone, i.e. a specific moment in a specific timezone.
+- `Timestamp`: a specific moment in time, without time zone (e.g. always stored as UTC).
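The string types above can be checked with a few lines of Python. This is a sketch, not a reference implementation: the function names are hypothetical, and the casing enforced for `LangCode` (lowercase language, uppercase region) is an assumption taken from the `en` / `en-US` examples.

```python
import re

# Code points with the Unicode `White_Space` property (see PropList.txt).
WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# ASCII letters, digits and the special characters listed for `Label`.
LABEL_RE = re.compile(r"[A-Za-z0-9\-._~!*:@,;]+")

# Assumption: two lowercase letters, optionally `-` plus two uppercase letters.
LANG_CODE_RE = re.compile(r"[a-z]{2}(-[A-Z]{2})?")


def is_non_blank_string(s: str) -> bool:
    """Not empty and not made up solely of `White_Space` characters."""
    return any(ord(c) not in WHITE_SPACE for c in s)


def is_non_blank_ascii_string(s: str) -> bool:
    return s.isascii() and is_non_blank_string(s)


def is_label(s: str) -> bool:
    # `fullmatch` rejects any character outside the allowed set.
    return LABEL_RE.fullmatch(s) is not None


def is_lang_code(s: str) -> bool:
    return LANG_CODE_RE.fullmatch(s) is not None
```

Every string accepted by `is_label` is also a `NonBlankAsciiString`, since the allowed character set contains neither whitespace nor non-ASCII characters and the pattern requires at least one character.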
+
+The type notation in this specification loosely follows TypeScript syntax:
+
+- `T?`: denotes an optional type, i.e. `bool?` means the field could be either `true`, `false` or undefined. All fields without `?` are _required_ / `non null`.
+- `T[]`: array of type `T`.
+- `[T, U, ...]`: a tuple of values.
+- `"foo" | "bar"`: one of the listed constant values.
+
+## JSON serialization
+
+For most types, the JSON serialization is the obvious one, but there are some minor yet important details.
+- `bool` as `bool`
+- `string` and all "string with extra requirements" (e.g. `Label`, `ID`, `NonBlankAsciiString`) as string
+- Integers as number.
+  - Note on 64 bit integers: In JavaScript, there is only one `number` type, which is a 64 bit floating point number (`double`, `f64`).
+  Those can only exactly represent integers up to 2<sup>53</sup>.
+  While JSON is closely related to JS, the format itself is allowed to exceed `f64` precision and may in fact encode arbitrary precision numbers.
+  Opencast should serialize a 64 bit integer as exact integer into JSON and *not* rounded like an `f64`.
+  Rounding might happen in the frontend, but the API should emit the exact integer value.
+- Arrays as arrays
+- Tuples as arrays
+- `Map<string, string>` is serialized as object
+- `DateTime`: as ISO 8601-compatible formatted string. The ISO standard actually allows a number of different formats by omitting parts of the string. Opencast shall format all date times as either `YYYY-MM-DDTHH:mm:ss.sssZ` or `YYYY-MM-DDTHH:mm:ssZ`, i.e. only the sub-second part is optional. The parts of this format string are best described in [the ECMAScript specification](https://tc39.es/ecma262/multipage/numbers-and-dates.html#sec-date-time-string-format) (which, again, is a subset of ISO 8601). One thing of note: `Z` can be either a literal `Z` or a timezone offset like `+02:00`.
+- `Timestamp`: like `DateTime` but always in UTC, so always ending with literal `Z`.
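Two of these details, sketched in Python under stated assumptions: the field name `size_bytes` and the helper `format_timestamp` are hypothetical, and the helper always emits the millisecond part, which is one of the two allowed forms.

```python
import json
from datetime import datetime, timedelta, timezone

# 64 bit integers: Python's `json` module writes ints exactly, never as f64.
UINT64_MAX = 2**64 - 1
payload = json.dumps({"size_bytes": UINT64_MAX})

# By contrast, a round trip through a 64 bit float loses precision:
# 2**53 + 1 is not representable and is rounded to 2**53.
rounded = int(float(2**53 + 1))


def format_timestamp(dt: datetime) -> str:
    """Format an aware datetime as `YYYY-MM-DDTHH:mm:ss.sssZ` in UTC."""
    if dt.tzinfo is None:
        raise ValueError("naive datetimes do not identify a moment in time")
    utc = dt.astimezone(timezone.utc)
    millis = utc.microsecond // 1000  # truncate to millisecond precision
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"


# 14:30:45 at UTC+02:00 is the same moment as 12:30:45 UTC.
local = datetime(2025, 3, 20, 14, 30, 45, tzinfo=timezone(timedelta(hours=2)))
example = format_timestamp(local)
```

Clients written in JavaScript need their own strategy for integers above 2<sup>53</sup>; the point here is only that the server side emits the exact value.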
+
+---
+
+:::danger[Open questions]
+
+- (1?) Java famously has no/bad support for unsigned integers. Decide how to deal with that: do we just give up one bit or do we require proper unsigned usage via `Integer.*Unsigned` methods? Either way: these values must never be negative!
+- (2?) Maybe disallow more of these special characters?
+
+:::
diff --git a/docs/event/acl.md b/docs/event/acl.md
@@ -0,0 +1,12 @@
+---
+sidebar_position: 3
+---
+
+# ACL
+
+See [the common ACL specifications](../common/acl).
+
+- `read`: allows a user to read all metadata, the ACL and all non-internal assets (their metadata and the asset files themselves).
+- `write`: allows a user to change any editable metadata, change the ACL, change anything about assets (delete, change, add). <sup>TODO: what about internal assets?</sup>
+
+TODO: specify how `listed` works.
diff --git a/docs/event/index.md b/docs/event/index.md
@@ -0,0 +1,26 @@
+---
+sidebar_position: 4
+---
+
+# Event
+
+An event<sup>(1?)</sup> is the core entity of Opencast, representing a piece of multimedia content.
+An event consists of:
+- [Metadata](./metadata)
+- [ACL](./acl)
+- [Assets](./assets)
+
+As described [here](../common#data-storage), almost all of this data is stored in the DB.
+Only the actual asset files are stored on the file system (the metadata about assets is still stored in the DB).
+
+
+---
+
+:::danger[Open questions]
+
+- (1?) Potentially very controversial: rename "event"/"episode" to "video"?
+  - Intuitively, most people call it "video"
+  - "Event" is a very generic term and can mean many other things, "episode" implies being part of a series.
+  - Yes, there can be two _video files_, but we already have a name for that: video stream. So I don't see a confusion risk here. I don't see any problems with calling a thing a video even if it contains two video streams.
+  - New name in API would make clear that data model has changed.
+:::
diff --git a/docs/important-differences.md b/docs/important-differences.md
@@ -0,0 +1,79 @@
+---
+sidebar_position: 2
+---
+
+# Important differences from the current model
+
+This page lists a number of major ways in which this specification differs from the Opencast status quo.
+
+
+## No snapshot system anymore
+
+The old system of creating snapshots and using hardlinks on the file system is no more.
+Whether and how we want to version parts of an entity's data is still an open question (see [Open Questions](./open-questions)).
+
+
+## No publications
+
+There is no "engage", "external API", OAI-PMH or any other internal _publication_ anymore.
+There might still be a place for external publications in the sense of interacting with another system like YouTube.
+These would require some asynchronous data synchronization.
+But hardly anyone is using that, so while reading this specification just think: there are no publications at all.
+The term does not exist anymore.
+
+Instead, the DB, file system and all APIs have the same view of the world.
+If an event with title "Banana" exists in Opencast, then it exists _everywhere_, i.e. in the DB, on the file system, and in all APIs¹.
+
+This also includes modifications and deletions.
+There is no staging area for changes anymore: all metadata and ACL changes to Opencast entities (event, series, ...) are instantly reflected in all APIs¹.
+Changing metadata and ACLs does not require running a workflow anymore.
+APIs for modifying this data promise that once they return 2xx, the change has been finalized to the database (the single source of truth).
+
+A small number of Opencast users might like the two-stage metadata changing.
+_If_ it is really desired, this "feature" can be implemented on top of the core Opencast, e.g. in the Admin UI (but disabled by default).
+
+(¹) A small delay to update the search index is fine.
+
+### Long running operations
+
+Of course, there are some modifications or operations that cannot be done immediately, e.g. encoding a video or generating subtitles.
+APIs starting these operations are _async_, i.e. they return 2xx to just state the operation has been started, but don't wait for the operation to finish.
+But even with these operations, there is still only one view of the world.
+For example, say a subtitle generation for an event was started: until the moment that operation finishes, the event has its previous subtitles (e.g. none) and that's reflected in all APIs.
+
+An event is visible in APIs immediately after ingesting.
+Of course, while the video is not encoded yet, there are no URLs to video tracks yet.
+The API should represent that fact in a way that makes it easy for external apps to check if a video is still processing.
+
+Sometimes, long running operations need to be run on metadata changes, e.g. to generate thumbnails with metadata in them (aside: this is usually not a great idea).
+This can still be done, with the difference that the DB/API immediately reflects the changed metadata, while the thumbnail needs to catch up.
+Again: the DB is the single source of truth.
+Everything derived from it (e.g. search index, thumbnails, ...) needs to catch up.
+
+As an aside, we should treat fewer operations as "long running" and thus offer synchronous APIs for them.
+Cutting subtitles, generating thumbnails in different sizes, and more are things that can be easily done in tens of milliseconds.
+
+## Storage format & API format
+
+### Independence
+
+How Opencast stores data should be independent of how Opencast exposes data in its API.
+Just because the API format is JSON, does not mean that Opencast should store everything as JSON in the DB or on the file system.
+
+Further, the structure of classes in Opencast code or the format in the search service should also not leak into the API.
+The structure of the API response should be selected purely based on good API design and not on internals.
+Not leaking internals makes it easier to change these internals without breaking the API.
+(The rewrite of the search service from Solr to ElasticSearch demonstrates how badly this can fail: the very widely used search API changed a lot.)
+
+The implementation should do everything to ensure this separation.
+For example, by having separate `record` definitions which are *only* used for API serialization.
+This also makes it a lot harder to accidentally change the API.
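A small Python sketch of this separation, using dataclasses in place of Java `record`s. `EventRow`, `EventApiRecord`, `to_api` and all field names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class EventRow:
    """Internal storage model; free to change along with the DB layout."""
    pk: int
    ext_id: str
    title: str
    search_boost: float  # internal detail that must not leak into the API


@dataclass(frozen=True)
class EventApiRecord:
    """Used *only* for API serialization; changing it changes the API."""
    id: str
    title: str


def to_api(row: EventRow) -> EventApiRecord:
    """Explicit mapping: only the listed fields can reach the API."""
    return EventApiRecord(id=row.ext_id, title=row.title)


api_json = asdict(to_api(EventRow(pk=1, ext_id="e1", title="Banana", search_boost=2.0)))
```

Renaming `search_boost` or adding a column to `EventRow` leaves `api_json` untouched, whereas any edit to `EventApiRecord` is visibly an API change.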
+
+### Unified response for all entities
+
+An event in the API should always be represented with the same JSON response, regardless of whether it was fetched by ID, or returned from a full text search, or as the entry of a series.
+Previously, this differed depending on whether it was loaded from the search index or the database or elsewhere.
+
+Ideally, there shouldn't be a separate `search` endpoint at all; the search feature should instead be part of the external event API.
+As an API user, I don't care what indices or data structures Opencast uses to give me the data.
+And now that we use ElasticSearch/OpenSearch, there is no reason why any node couldn't perform that search.
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,49 @@
+---
+sidebar_position: 1
+title: Introduction
+---
+
+# Opencast Data Model
+
+This document specifies the _future_ data model of Opencast.
+The data model describes everything that is stored, what types and requirements certain data has, how it is represented in the API, how data can be changed, and more.
+
+:::warning
+This specification does *not* describe the current state of Opencast!
+Also, it is a work in progress and is currently being developed and discussed in the community.
+:::
+
+Readers familiar with Opencast should ignore their prior knowledge while reading this, and treat this as a specification for a completely new software.
+Do not read any existing OC behavior into this specification unless it is explicitly mentioned.
+Also read the special [Important Differences](./important-differences) page, which explains where this data model differs in significant ways from the current Opencast.
+
+
+## Goals
+
+There are multiple reasons we are proposing this new data model:
+- Improve robustness of Opencast by having a stricter and well defined data model. Be clear about what is allowed and what isn't, and catch invalid data as early as possible.
+- Simplify development of external applications: currently, the API responses are grossly underspecified and it is unclear what properties apps can expect from Opencast (e.g. do I need to deal with duration = -1?).
+- Improve robustness by clearly specifying the source of truth for data and reducing the number of places/APIs that store/return data.
+- Enable immediate modification of metadata (e.g. changing a video's title) without running a workflow.
+- Improve performance by changing how data is stored.
+
+The goal behind this very specification is to allow for easy discussion in the community, and eventually to have a written specification.
+
+This specification is written mainly as if it was talking to API users, i.e. developers of external apps who want to integrate with Opencast.
+I think this is a useful choice to define the "public interface" of Opencast.
+The document does contain quite a bit of implementation notes, too, which just define how things should be handled inside Opencast.
+
+## Contributing to this specification
+
+Discussing every single detail in the community beforehand is not viable and not necessary.
+Instead, the idea is that there is one main person working on this spec, writing most of the text, therefore proposing parts of the model.
+These proposals are discussed in regular meetings and on GitHub.
+See [the `opencast/data-model`](https://github.com/opencast/data-model) repository, and in particular the pull requests and discussions tabs.
+
+## Backwards compatibility and breaking changes
+
+It is very clear that we need to be able to migrate existing data to the new model.
+We also don't want to change every single piece without good reason, in order to keep the overall change manageable.
+The new model was designed with that in mind.
+That said, this document (especially its initial version) does contain incompatibilities and breaking changes, and does not yet consider every single use case.
+I expect these use cases to be discussed during the community review of this.
diff --git a/docs/open-questions.md b/docs/open-questions.md
@@ -0,0 +1,18 @@
+# Open questions
+
+- Should all data be versioned?
+  - It adds complexity, but having access to old data is nice.
+  - Storage wise, keeping old metadata does not cost much.
+  - Via the `internal` asset system, we can already kind of version assets.
+  - Get rid of the current asset manager/snapshot system to avoid hardlinks.
+- What do we generally think about size limitations for various fields?
+  - Abuse protection: this is just to prevent abuse, DoS, slowdowns and the like. Limit `description` to 2<sup>16</sup> bytes, limit `title`, `license`, ... to 1024 bytes. I think these limits make sense and should prevent OC suffering from bad payloads.
+  - Semantic limits: for example, for `license`, we could say "it should just be an identifier for a license, so limit to 64 bytes". This is a lot more tricky as one has to really think of the intended use case and runs the risk of making use cases impossible.
+
+
+## TODO
+
+- Metadata can be changed when a workflow is running or an event is scheduled
+  - Mhhh small problem: some workflows might depend on metadata, e.g. when creating images with metadata in them. So maybe workflows can declare dependencies to metadata?
+  - So maybe we cannot do this now, this feature we can still add in a second step. When we rework the workflow system 😈
+- Explain how snapshots are removed