Discuss: remaining topics #3
base: discussions/rest/base
Changes from all commits
3a555db
6969894
d2a6f6d
240e651
1f8d245
e23974f
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| --- | ||
| sidebar_position: 1 | ||
| --- | ||
|
|
||
| # ACL (access control list) | ||
|
|
||
| ACLs control access to Opencast entities. | ||
| An ACL is simply a list of `role` + `action` pairs. | ||
| An entry gives users that have that particular `role` the permission to perform the specified `action` on the entity. | ||
| Both `role` and `action` have the [type `Label`](./types).<sup>(1?)</sup> | ||
|
@KatrinIhler wrote:
|
||
|
|
||
| There are two special actions recognized by Opencast. | ||
| Other actions can be used for custom purposes by external applications. | ||
| - `read`: generally, gives read access to an entity | ||
| - `write`: generally, gives write access to an entity (changing or deleting it) | ||
|
Comment on lines +14 to +15
@KatrinIhler said:
|
||
|
|
||
| *Impl note*: `read` and `write` roles should likely be stored in a way that allows for fast filtering, e.g. in a `read_roles` DB column that has a DB index. | ||
|
|
||
|
|
||
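A minimal sketch of how such a list of `role` + `action` pairs could be evaluated. The names `AclEntry`, `Acl`, `isAllowed` and the example roles are hypothetical and not part of the specification; this only illustrates the "any matching entry grants the action" reading of the text above.

```typescript
// Hypothetical shapes: the spec only says an ACL is a list of role + action pairs.
type AclEntry = { role: string; action: string };
type Acl = AclEntry[];

// Returns true if at least one of the user's roles is granted `action` by the ACL.
function isAllowed(acl: Acl, userRoles: string[], action: string): boolean {
  return acl.some(
    (entry) => entry.action === action && userRoles.indexOf(entry.role) !== -1
  );
}

// Example ACL with made-up role names.
const acl: Acl = [
  { role: "ROLE_STUDENT", action: "read" },
  { role: "ROLE_LECTURER", action: "read" },
  { role: "ROLE_LECTURER", action: "write" },
];

const studentCanRead = isAllowed(acl, ["ROLE_STUDENT"], "read");
const studentCanWrite = isAllowed(acl, ["ROLE_STUDENT"], "write");
```

Custom actions used by external applications would work the same way, since the check does not treat `read` or `write` specially.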
| --- | ||
|
|
||
| :::danger[Open questions] | ||
|
|
||
| - (1?) Is it fine to restrict roles and actions like that? Or can we restrict it even more? | ||
|
Relevant discussion for user names and thus user roles: https://github.com/orgs/opencast/discussions/6539. |
||
|
|
||
| ::: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| --- | ||
| sidebar_position: 3 | ||
| --- | ||
|
|
||
| # Common specifications | ||
|
|
||
| ## Data storage | ||
|
|
||
| The single source of truth for everything is the database (DB) plus files on the file system¹ referenced by the DB. | ||
| Every piece of information is only stored in one place in the DB. | ||
|
|
||
| Only a handful of files are stored on the file system: | ||
| - Binary and/or large files like video, audio, images, ... | ||
| - Files that need to be delivered in a specific format anyway (VTT subtitles, ...) | ||
|
|
||
| :::info[Differences from current OC model] | ||
| In particular, textual metadata, ACLs, cutting information or anything like that is _not_ stored on the file system! | ||
LukasKalbertodt marked this conversation as resolved.
|
||
| (Some APIs might still accept or produce this information in non-JSON exchange formats.) | ||
| ::: | ||
|
|
||
| The database never references files by absolute path or URL. | ||
LukasKalbertodt marked this conversation as resolved.
|
||
| At most, it stores a path relative to the configured `storage.dir`, but potentially in an even more implicit way. | ||
|
|
||
|
|
||
| (¹) File system = local file system, or NFS, or S3 storage or potentially others. | ||
|
If we are so bold, maybe make S3 more like the default location 🙈? Do we still differentiate between asset manager and delivery for locations? If we mix the two, this can be difficult for configuring serving files. If we just put everything in one |
||
|
|
||
| ### Derived data storage | ||
|
|
||
| For different purposes, it might be useful to store the same data again in a different form. | ||
| For example, using a search index for full text search. | ||
| (Note however: whenever possible and useful, use DB indices built into the DBMS.) | ||
|
|
||
| These derived data sources can be slightly behind the DB (e.g. due to indexing times), which is acceptable. | ||
| However, it is crucially important that data only ever flows from the DB into other data stores, _never_ the other way around. | ||
| Deleting all derived data stores must never result in data loss as they can always be regenerated from the DB. | ||
| Rebuilding derived data stores must always produce the same result, regardless of what the derived store previously contained. | ||
| Opencast should do its best to keep the derived data stores in sync in a timely manner. | ||
|
|
||
|
|
||
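The one-way data flow described above can be sketched as a pure rebuild function. This is an illustration only: `EventRow` and `rebuildIndex` are hypothetical names, and a plain object stands in for a real search index. The point is that the function takes only DB rows as input, so its result cannot depend on what the derived store contained before.

```typescript
// Hypothetical stand-in for rows read from the DB (the single source of truth).
type EventRow = { id: string; title: string };

// Rebuilds a word -> event IDs index purely from DB rows.
// Nothing from a previous index is read, so rebuilding is deterministic.
function rebuildIndex(rows: EventRow[]): Record<string, string[]> {
  const index: Record<string, string[]> = Object.create(null);
  // Sort by ID first so the result does not depend on the row order.
  const sorted = rows.slice().sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  for (const row of sorted) {
    for (const word of row.title.toLowerCase().split(/\s+/)) {
      if (word === "") continue;
      let ids = index[word];
      if (ids === undefined) {
        ids = [];
        index[word] = ids;
      }
      if (ids.indexOf(row.id) === -1) ids.push(row.id);
    }
  }
  return index;
}

const rows: EventRow[] = [
  { id: "e2", title: "Banana bread" },
  { id: "e1", title: "Banana" },
];
const rebuilt = rebuildIndex(rows);
```

Dropping the index and calling `rebuildIndex` again yields the same content, which is the property the text asks for.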
| ## Promised properties | ||
|
|
||
| This data model promises certain properties about certain fields/data, for example: "there is a non-empty title", "this is an array of strings" or "the duration matches the duration of all tracks". | ||
|
I strongly disagree with that statement. Not all tracks in a "media package / events / whatever this is now called" should be required to have the same duration. In practice, not only is this strictly not true (e.g. stopping could be executed at slightly different times resulting in a difference of milliseconds to seconds), but we have use cases in Opencast where various tracks of different lengths are uploaded (e.g. concat partially recorded videos, video conference recordings, intro- outro if you don't use themes, ...). I don't see a problem with the way this is done right now where every track has its own duration. For Tobira, you will instrument the player with a specific track. Then just use the duration of that track. |
||
|
|
||
| - It's Opencast's responsibility to ensure these properties. Whenever an entity is added or changed, these properties need to be maintained, usually by rejecting the change request (e.g. 4xx response in API). | ||
| - If an entity does not have these properties, this should be considered a bug in Opencast and should be fixed ASAP. | ||
| - We should never find ourselves in the situation where external apps (e.g. LMS plugins) need to work around a broken property of Opencast. | ||
| - The same goes for legacy events, which might be broken in the new model. They cannot be kept as is; they need to be changed/migrated to exhibit these properties. | ||
| - The implementation should try, wherever possible, to make broken events impossible to represent. As a simple example, the title field in the DB should be `non null`. | ||
|
|
||
|
|
||
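As a sketch of "rejecting the change request", the check for the "non-empty title" property could look like the following. `ValidationResult` and `validateTitle` are hypothetical names, and `trim()` uses JavaScript's notion of whitespace, which is only an approximation of the Unicode `White_Space` property used by `NonBlankString`.

```typescript
// Hypothetical result type: either the validated value or a 4xx-style rejection.
type ValidationResult =
  | { ok: true; title: string }
  | { ok: false; status: 400; error: string };

// Checks the promised property "there is a non-empty title" before anything is stored.
function validateTitle(input: unknown): ValidationResult {
  if (typeof input !== "string") {
    return { ok: false, status: 400, error: "title must be a string" };
  }
  // Approximation: JS whitespace is close to, but not identical to, Unicode White_Space.
  if (input.trim() === "") {
    return { ok: false, status: 400, error: "title must not be blank" };
  }
  return { ok: true, title: input };
}

const accepted = validateTitle("Banana");
const rejected = validateTitle("   ");
```

A `non null` constraint on the DB column, as mentioned above, would then act as a second line of defence behind this check.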
| ## Well defined API response | ||
|
|
||
| Opencast's API should have a well defined/typed response that is derived from code in a [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) fashion. | ||
| Specifically, the API documentation for e.g. `GET /event/{id}` needs to specify what kind of JSON object will be returned by the API. | ||
| This could be done via a [JSON Schema](https://json-schema.org/) or via GraphQL or other means. | ||
| Someone interested in using the API should know _exactly_ what response to expect, without sending a single test request to the API. | ||
| It is important that the response specification is generated from the same code that is used for the actual API response serialization, to ensure they are always in sync. | ||
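One possible way to keep the specification and the serialization in sync is to derive both from a single field description. The sketch below is hypothetical throughout (`FieldSpec`, `eventFields`, `toJsonSchema`, `serialize` and the field names are made up) and only illustrates the DRY idea; the emitted object uses the standard JSON Schema keywords `type`, `properties`, `required` and `additionalProperties`.

```typescript
// Hypothetical single description of the API response fields.
type FieldSpec = { name: string; type: "string" | "integer" };

const eventFields: FieldSpec[] = [
  { name: "id", type: "string" },
  { name: "title", type: "string" },
  { name: "durationMs", type: "integer" },
];

// Derives a JSON Schema document from the field description.
function toJsonSchema(fields: FieldSpec[]) {
  const properties: Record<string, { type: string }> = {};
  for (const field of fields) {
    properties[field.name] = { type: field.type };
  }
  return {
    type: "object",
    properties,
    required: fields.map((field) => field.name),
    additionalProperties: false,
  };
}

// Serializes using the same description, so undocumented fields are never emitted.
function serialize(fields: FieldSpec[], source: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const field of fields) {
    out[field.name] = source[field.name];
  }
  return out;
}

const schema = toJsonSchema(eventFields);
const body = serialize(eventFields, { id: "e1", title: "Banana", durationMs: 1000, storagePath: "x/y" });
```

Adding a field to `eventFields` changes the schema and the response together, which is the property the paragraph above asks for.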
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| --- | ||
| sidebar_position: 2 | ||
| --- | ||
|
|
||
| # Common types | ||
|
|
||
| These are types used throughout the rest of this specification and defined here once to avoid repetition. | ||
|
|
||
| - `string`: a valid UTF-8 string. While being processed in code, it might be in a different encoding temporarily, but in the public interface of Opencast, these are always valid UTF-8. | ||
| - `NonBlankString`: A string that is not "blank", meaning it is not empty and does not consist only of [Unicode `White_Space`](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt). | ||
| - `NonBlankAsciiString`: A `NonBlankString` that is also restricted to only using ASCII characters. | ||
| - `Label`: a `NonBlankAsciiString` that only consists of letters, numbers or `-._~!*:@,;`. This means a label is URL-safe except for use in the domain part.<sup>(2?)</sup> | ||
| - `ID`: a `Label` that cannot be changed after being created. | ||
| - `Username`: TODO define rules for usernames | ||
|
See discussion here: https://github.com/orgs/opencast/discussions/6539 |
||
| - `LangCode`: specifies a language and optionally a region, e.g. `en` or `en-US`. Based on the [IETF BCP 47 language tag specification](https://www.rfc-editor.org/info/rfc5646): a two letter language code, optionally followed by a hyphen and a two letter region tag. | ||
LukasKalbertodt marked this conversation as resolved.
|
||
| - `int8`, `int16`, `int32`, `int64`: signed integers of specific bit width. | ||
| - `uint8`, `uint16`, `uint32`, `uint64`: unsigned integers of specific bit width.<sup>(1?)</sup> | ||
| - `Milliseconds`: a `uint64` representing a duration or a video timestamp in milliseconds (ms). Impl note: whenever possible, in code, this should be a custom type and not just `int`. | ||
| - `DateTime`: a date + time with timezone, i.e. a specific moment in a specific timezone. | ||
| - `Timestamp`: a specific moment in time, without time zone (e.g. always stored as UTC). | ||
|
|
||
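The `Label` and `LangCode` rules above can be expressed as regular expressions. This is a sketch under assumptions: "letters" is read as ASCII letters (since a `Label` is a `NonBlankAsciiString`), and `LangCode` is assumed to use a lowercase language code and an uppercase region tag, which the text does not state explicitly. The function names are hypothetical.

```typescript
// Label: ASCII letters, digits or one of -._~!*:@,; (at least one character).
const LABEL_PATTERN = /^[A-Za-z0-9\-._~!*:@,;]+$/;

// LangCode: two-letter language, optionally a hyphen and a two-letter region.
// Assumption: lowercase language and uppercase region, as in "en-US".
const LANG_CODE_PATTERN = /^[a-z]{2}(-[A-Z]{2})?$/;

function isLabel(value: string): boolean {
  return LABEL_PATTERN.test(value);
}

function isLangCode(value: string): boolean {
  return LANG_CODE_PATTERN.test(value);
}

const labelChecks = [isLabel("ROLE_ADMIN"), isLabel("a b"), isLabel("")];
const langChecks = [isLangCode("en"), isLangCode("en-US"), isLangCode("eng")];
```

Because the allowed character set contains no whitespace, a string that matches `LABEL_PATTERN` is automatically non-blank.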
| Generally, this basically uses TypeScript syntax: | ||
|
|
||
| - `T?`: denotes an optional type, i.e. `bool?` means the field could be either `true`, `false` or undefined. All fields without `?` are _required_ / `non null`. | ||
| - `T[]`: array of type `T`. | ||
| - `[T, U, ...]`: a tuple of values. | ||
| - `"foo" | "bar"`: one of the listed constant values. | ||
|
|
||
| ## JSON serialization | ||
|
|
||
| For most types, the JSON serialization is the obvious one, but there are some minor but important details. | ||
| - `bool` as `bool` | ||
| - `string` and all "string with extra requirements" (e.g. `Label`, `ID`, `NonBlankAsciiString`) as string | ||
| - Integers as number. | ||
| - Note on 64 bit integers: In JavaScript, there is only one `number` type, which is a 64 bit floating point number (`double`, `f64`). | ||
| Those can only exactly represent integers up to 2<sup>53</sup>. | ||
| While JSON is closely related to JS, the format itself is allowed to exceed `f64` precision and may in fact encode arbitrary precision numbers. | ||
| Opencast should serialize a 64 bit integer as exact integer into JSON and *not* rounded like an `f64`. | ||
| Rounding might happen in the frontend, but the API should emit the exact integer value. | ||
| - Arrays as arrays | ||
| - Tuples as arrays | ||
| - `Map<string, string>` is serialized as object | ||
| - `DateTime`: as ISO 8601-compatible formatted string. The ISO standard actually allows a number of different formats by omitting parts of the string. Opencast shall format all date times as either `YYYY-MM-DDTHH:mm:ss.sssZ` or `YYYY-MM-DDTHH:mm:ssZ`, i.e. only the sub-second part is optional. The parts of this format string are best described in [the ECMAScript specification](https://tc39.es/ecma262/multipage/numbers-and-dates.html#sec-date-time-string-format) (which, again, is a subset of ISO 8601). One thing of note: `Z` could either be a literal `Z` or a timezone offset like `+02`. | ||
|
Why allow variability? Maybe always require sub-seconds and UTC?
Mh I guess the subsecond thing is fair, the only reasons for the possible omission I can come up with could also be used to argue for the possible omission of seconds. The timezone is different tho: yes, when you just want to communicate an instant, always UTC is a good idea. But sometimes you want to communicate the timezone as well, so that the frontend could for example show "this lecture starts at 10am local time (which is 2pm your time)". For that to work, the frontend needs to know what the local time is. Of course we can argue whether that is ever important, but to me it feels semantically correct to store stuff like bibliographical date with timezone. Sure, technical timestamps like |
||
| - `Timestamp`: like `DateTime` but always in UTC, so always ending with literal `Z`. | ||
|
|
||
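Two of the details above can be checked directly in a JavaScript/TypeScript client. The first part shows why 64 bit integers lose precision once they are parsed into an `f64`. The second part is a sketch of a format check for `DateTime` and `Timestamp` strings; the pattern and function names are hypothetical, and the pattern assumes offsets of the form `+HH` or `+HH:mm`, which goes slightly beyond the single `+02` example in the text.

```typescript
// 2^53: above this, an f64 can no longer represent every integer exactly.
const limit: number = Math.pow(2, 53);

// Adding 1 is lost to rounding, so two different integers compare equal.
const precisionLost: boolean = limit + 1 === limit;

// A JavaScript client parsing an exact JSON integer silently rounds it.
const parsed: number = JSON.parse("9007199254740993");

// Assumed pattern: date, time, optional milliseconds, then "Z" or an offset.
const DATE_TIME_PATTERN =
  /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{3})?(Z|[+-]\d{2}(:\d{2})?)$/;

function isDateTimeString(value: string): boolean {
  return DATE_TIME_PATTERN.test(value);
}

// A Timestamp is a DateTime string that always ends with a literal "Z".
function isTimestampString(value: string): boolean {
  return isDateTimeString(value) && value.charAt(value.length - 1) === "Z";
}

// Date.prototype.toISOString always emits UTC with milliseconds.
const epoch: string = new Date(0).toISOString();
```

The rounding in `parsed` happens on the client side, which is why the text asks the API to emit the exact integer and leave any rounding to the consumer.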
| --- | ||
|
|
||
| :::danger[Open questions] | ||
|
|
||
| - (1?) Java famously has no/bad support for unsigned integers. Decide how to deal with that: do we just give up one bit or do we require proper unsigned usage via `Integer.*Unsigned` methods? Either way: these values must never be negative! | ||
|
Just use validation annotations? So yes give up one bit. |
||
| - (2?) Maybe disallow more of these special characters? | ||
|
|
||
| ::: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| --- | ||
| sidebar_position: 3 | ||
| --- | ||
|
|
||
| # ACL | ||
|
|
||
| See [the common ACL specifications](../common/acl). | ||
|
|
||
| - `read`: allows a user to read all metadata, the ACL and all non-internal assets (their metadata and the asset files themselves). | ||
|
What is considered internal / non-internal?
That is explained in the event asset doc. Or what are you asking exactly?
I was reading this before the assets doc. But still, (a) I'm not convinced of the internal / non-internal concept and (b) the main question for me is if I, as a SysAd, can decide this. |
||
| - `write`: allows a user to change any editable metadata, change the ACL, change anything about assets (delete, change, add). <sup>TODO: what about internal assets?</sup> | ||
|
The Opencast SysAd should configure what assets are allowed to be manipulated. We don't want people to shoot themselves in the foot and steer users to allowed manipulations of assets.
I agree with @mtneug here. We currently have lots of restrictions (via roles) on what users are allowed to do themselves, changing acls for example is not possible for users. Currently, Opencast needs workflows to change assets, not sure how this would work when I'm trying to forget the word 'publication'. I would expect it to happen orderly and in a way that users can not (involuntarily) destroy their events. This should respect roles like |
||
|
|
||
| TODO: specify how `listed` works. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| --- | ||
| sidebar_position: 4 | ||
| --- | ||
|
|
||
| # Event | ||
|
|
||
| An event<sup>(1?)</sup> is the core entity of Opencast, representing a piece of multimedia content. | ||
| An event consists of: | ||
| - [Metadata](./metadata) | ||
| - [ACL](./acl) | ||
| - [Assets](./assets) | ||
|
|
||
| As described [here](../common#data-storage), almost all of this data is stored in the DB. | ||
| Only the actual asset files are stored on the file system (the metadata about assets is still stored in the DB). | ||
|
|
||
|
|
||
| --- | ||
|
|
||
| :::danger[Open questions] | ||
|
|
||
| - (1?) Potentially very controversial: rename "event"/"episode" to "video"? | ||
|
Please don't. Video does not quite capture what a media package entails. Also, we can have audio-only media packages. However, I agree to consolidate. Personally, I would either call it media package everywhere or call it media package internally and event on the outside. IMO we can drop episode.
@KatrinIhler wrote:
Re Katrin: Mh I don't see any confusion really. There are (video) tracks and (video) streams already, but those are separate terms. I agree with avoiding this mess of having three different names, but I think renaming to "video" (i.e. a big change) might help with that. Us just deciding "it's called recording now" might not reprogram people's minds. But the point about audio-only makes sense I guess... mh. Throwing out some other options:
Mhh dunno. I still don't really like "event" and I think going with something new would be nice to cleanse our brains. But yeah, so there is no alternative that really convinces me. I personally wouldn't mind "video", even if we allow audio-only I think.
I'm not a fan of "event" either (too open to interpretation, it could be a party or a button click). But "video" doesn't really fit either (too narrow, could be "audio"). There is probably no perfect match. Personally, I could imagine "recording" as an alternative (although one could argue that a live stream is not a recording...). |
||
| - Intuitively, most people call it "video" | ||
| - "Event" is a very generic term and can mean many other things, "episode" implies being part of a series. | ||
| - Yes, there can be two _video files_, but we already have a name for that: video stream. So I don't see a confusion risk here. I don't see any problems with calling a thing a video even if it contains two video streams. | ||
| - New name in API would make clear that data model has changed. | ||
| ::: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| --- | ||
| sidebar_position: 2 | ||
| --- | ||
|
|
||
| # Important differences from the current model | ||
|
|
||
| This page lists a number of major ways in which this specification differs from the Opencast status quo. | ||
|
|
||
|
|
||
| ## No snapshot system anymore | ||
|
|
||
| The old system of creating snapshots and using hardlinks on the file system is no more. | ||
| Whether and how we want to version parts of an entity's data is still an open question (see [Open Questions](./open-questions)). | ||
|
|
||
|
|
||
| ## No publications | ||
|
|
||
| There is no "engage", "external API", OAI-PMH or any other internal _publication_ anymore. | ||
| There might still be a place for external publications in the sense of interacting with another system like YouTube. | ||
| These would require some asynchronous data synchronization. | ||
| But hardly anyone is using that, so while reading this specification just think: there are no publications at all. | ||
| The term does not exist anymore. | ||
|
|
||
| Instead, the DB, file system and all APIs have the same view of the world. | ||
| If an event with title "Banana" exists in Opencast, then it exists _everywhere_, i.e. in the DB, on the file system, and in all APIs¹. | ||
|
|
||
| This also includes modifications and deletions. | ||
| There is no staging area for changes anymore: all metadata and ACL changes to Opencast entities (event, series, ...) are instantly reflected in all APIs¹. | ||
| Changing metadata and ACLs does not require running a workflow anymore. | ||
| APIs for modifying this data promise that once they return 2xx, the change has been finalized to the database (the single source of truth). | ||
|
I'm not sure if this can be done for everything without having long running requests. Why not allow 202 as response code in certain cases? |
||
|
|
||
| A small number of Opencast users might like the two-stage metadata changing. | ||
| _If_ it is really desired, this "feature" can be implemented on top of the core Opencast, e.g. in the Admin UI (but disabled by default). | ||
|
This means that every UI needs to implement this instead of implementing this once in Opencast. IMO this doesn't make sense and we should decide if we want this in general or not.
Fair, but since this is a very niche feature I would expect only the admin UI to potentially implement this. I can promise you that we won't add it to Tobira and I doubt any LMS plugin will add it. But yeah, I am absolutely fine with also just deciding "nope, we don't want that". |
||
|
|
||
| (¹) A small delay to update the search index is fine. | ||
|
|
||
| ### Long running operations | ||
|
|
||
| Of course, there are some modifications or operations that cannot be done immediately, e.g. encoding a video or generating subtitles. | ||
| APIs starting these operations are _async_, i.e. they return 2xx to just state the operation has been started, but don't wait for the operation to finish. | ||
| But even with these operations, there is still only one view of the world. | ||
| For example, say a subtitle generation for an event was started: until the moment that operation finishes, the event has its previous subtitles (e.g. none) and that's reflected in all APIs. | ||
|
The transactional model should be very clearly specified. Otherwise I see races everywhere. |
||
|
|
||
| An event is visible in APIs immediately after ingesting. | ||
| Of course, while the video is not encoded yet, there are no URLs to video tracks yet. | ||
| The API should represent that fact in a way that makes it easy for external apps to check if a video is still processing. | ||
|
|
||
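One way the API could make the "still processing" state easy to check is a tagged union in the response, so that track URLs simply cannot appear before encoding has finished. Everything in this sketch is hypothetical (`MediaState`, `EventResponse`, the `status` values and the example URL); the specification does not define the response shape here.

```typescript
// Hypothetical response shape: track URLs only exist in the "ready" state.
type MediaState =
  | { status: "processing" }
  | { status: "ready"; tracks: { uri: string }[] };

type EventResponse = { id: string; title: string; media: MediaState };

// External apps can check a single field to see whether encoding is still running.
function isProcessing(response: EventResponse): boolean {
  return response.media.status === "processing";
}

// Returns the track URLs, or an empty list while the event is still processing.
function trackUris(response: EventResponse): string[] {
  return response.media.status === "ready"
    ? response.media.tracks.map((track) => track.uri)
    : [];
}

// The event is visible right after ingest, just without tracks.
const ingested: EventResponse = { id: "e1", title: "Banana", media: { status: "processing" } };
const encoded: EventResponse = {
  id: "e1",
  title: "Banana",
  media: { status: "ready", tracks: [{ uri: "https://example.org/e1/video.mp4" }] },
};
```

With this shape, a client that forgets to check the state gets a type error instead of a missing URL at runtime.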
| Sometimes, long running operations need to be run on metadata changes, e.g. to generate thumbnails with metadata in them (aside: this is usually not a great idea). | ||
| This can still be done, with the difference that the DB/API immediately reflects the changed metadata, while the thumbnail needs to catch up. | ||
| Again: the DB is the single source of truth. | ||
| Everything derived from it (e.g. search index, thumbnails, ...) needs to catch up. | ||
|
|
||
| As an aside, we should treat fewer operations as "long running" and thus offer synchronous APIs for them. | ||
| Cutting subtitles, generating thumbnails in different sizes, and more are things that can be easily done in tens of milliseconds. | ||
|
But you don't want your servers to get overwhelmed. A scheduling system has its delays, but you can make sure that the workload is handled over time and servers don't get overwhelmed with parallel operations.
HTTP server frameworks do have this as well: they also balance stuff and make sure there are not too many operations at the same time. And that operations finish in a somewhat "fair" way. But sure, this is not high priority, don't need to change this here. It's "just an aside" anyway :P |
||
|
|
||
| ## Storage format & API format | ||
|
|
||
| ### Independence | ||
|
|
||
| How Opencast stores data should be independent of how Opencast exposes data in its API. | ||
| Just because the API format is JSON, does not mean that Opencast should store everything as JSON in the DB or on the file system. | ||
|
|
||
| Further, the structure of classes in Opencast code or the format in the search service should also not leak into the API. | ||
| The structure of the API response should be selected purely based on good API design and not on internals. | ||
| Not leaking internals makes it easier to change these internals without breaking the API. | ||
| (The rewrite of the search service from Solr to ElasticSearch demonstrates how badly this can fail: the very widely used search API changed a lot.) | ||
|
|
||
| The implementation should do everything to ensure this separation. | ||
| For example, by having separate `record` definitions which are *only* used for API serialization. | ||
| This also makes it a lot harder to accidentally change the API. | ||
|
|
||
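The separation described above can be sketched with two distinct types and one explicit mapping between them. `StoredEvent`, `ApiEvent`, `toApiEvent` and all field names are hypothetical; in Opencast's Java code the API-only type would presumably be a `record`, as the text suggests.

```typescript
// Hypothetical internal representation: free to change with the storage layer.
interface StoredEvent {
  id: string;
  title: string;
  durationMs: number;
  storagePath: string; // internal detail, must not leak into the API
}

// Hypothetical API-only shape: used exclusively for response serialization.
interface ApiEvent {
  id: string;
  title: string;
  duration: number;
}

// The single place where internal data is translated into the public format.
function toApiEvent(stored: StoredEvent): ApiEvent {
  return { id: stored.id, title: stored.title, duration: stored.durationMs };
}

const stored: StoredEvent = {
  id: "e1",
  title: "Banana",
  durationMs: 90000,
  storagePath: "events/e1",
};
const apiBody: string = JSON.stringify(toApiEvent(stored));
```

Renaming or restructuring `StoredEvent` then only requires touching `toApiEvent`, while any change to `ApiEvent` is an explicit, visible API change.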
| ### Unified response for all entities | ||
|
|
||
| An event in the API should always be represented with the same JSON response, regardless of whether it was fetched by ID, or returned from a full text search, or as the entry of a series. | ||
| Previously, this differed depending on whether it was loaded from the search index or the database or elsewhere. | ||
|
|
||
| Ideally, there shouldn't be a separate `search` endpoint anyway, but rather have the search feature be part of the external event API. | ||
| As an API user, I don't care what indices or data structures Opencast uses to give me the data. | ||
| And now that we use ElasticSearch/OpenSearch, there is no reason why any node couldn't perform that search. | ||
|
The external API actually uses OpenSearch and you can also search, but it's different from the search endpoint. In general, I agree. But I also see valid use cases for having (internal) APIs that surface data differently e.g. for Tobira or the Admin UI. However, this should probably be an exception if there is an actual reason.
Comment on lines +77 to +79
@KatrinIhler said:
Mhhh. We need to talk about the technical details (where I lack lots of knowledge), but I would certainly prefer if the API is designed just with "nice API" in mind and not leak internal implementation details like services. |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| --- | ||
| sidebar_position: 1 | ||
| title: Introduction | ||
| --- | ||
|
|
||
| # Opencast Data Model | ||
|
|
||
| This document specifies the _future_ data model of Opencast. | ||
| The data model describes everything that is stored, what types and requirements certain data has, how it is represented in the API, how data can be changed, and more. | ||
|
|
||
| :::warning | ||
| This specification does *not* describe the current state of Opencast! | ||
| Also, it is a work in progress and is currently being developed and discussed in the community. | ||
| ::: | ||
|
|
||
| Readers familiar with Opencast should ignore their prior knowledge while reading this, and treat this as a specification for a completely new piece of software. | ||
| Do not interpret any existing OC behavior into this specification, if it isn't explicitly mentioned. | ||
| Also read the special [Important Differences](./important-differences) page, which explains where this data model differs in significant ways from the current Opencast. | ||
|
|
||
|
|
||
| ## Goals | ||
|
|
||
| There are multiple reasons we are proposing this new data model: | ||
| - Improve robustness of Opencast by having a stricter and well defined data model. Be clear about what is allowed and what isn't, and catch invalid data as early as possible. | ||
| - Simplify development of external applications: currently, the API responses are grossly underspecified and it is unclear what properties apps can expect from Opencast (e.g. do I need to deal with duration = -1?). | ||
| - Improve robustness by clearly specifying the source of truth for data and reducing the number of places/APIs that store/return data. | ||
| - Enable immediate modification of metadata (e.g. changing a video's title) without running a workflow. | ||
| - Improve performance by changing how data is stored. | ||
|
|
||
| The goal behind this very specification is to allow for easy discussion in the community, and eventually to have a written specification. | ||
|
|
||
| This specification is written mainly as if it was talking to API users, i.e. developers of external apps who want to integrate with Opencast. | ||
| I think this is a useful choice to define the "public interface" of Opencast. | ||
| The document does contain quite a bit of implementation notes, too, which just define how things should be handled inside Opencast. | ||
|
|
||
| ## Contributing to this specification | ||
|
|
||
| Discussing every single detail in the community beforehand is not viable and not necessary. | ||
| Instead, the idea is that there is one main person working on this spec, writing most of the text, therefore proposing parts of the model. | ||
| These proposals are discussed in regular meetings and on GitHub. | ||
| See [the `opencast/data-model`](https://github.com/opencast/data-model) repository, and in particular the pull requests and discussions tabs. | ||
|
|
||
| ## Backwards compatibility and breaking changes | ||
|
|
||
| It is very clear that we need to be able to migrate existing data to the new model. | ||
| We also don't want to change every single piece without good reason, in order to keep the overall change manageable. | ||
| The new model was designed with that in mind. | ||
| That said, this document (especially its initial version) does contain incompatibilities and breaking changes, and does not yet consider every single use case. | ||
| I expect these use cases to be discussed during the community review of this document. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| # Open questions | ||
|
|
||
| - Should all data be versioned? | ||
| - It adds complexity, but having access to old data is nice. | ||
| - Storage-wise, keeping old metadata does not cost much. | ||
| - Via the `internal` asset system, we can already kind of version assets. | ||
| - Get rid of the current asset manager/snapshot system to avoid hardlinks. | ||
|
Member
Getting rid of hardlinks can be nice, but this also means we need to keep track of things. We already do that for the asset manager in S3 (we don't do that for the distribution S3). Also this would make it easier for S3 to become the default ;)
Member
Author
@KatrinIhler wrote:
Member
Author
Re Katrin: I am not suggesting at all to actually store the bytes of two identical files twice. If we do not use hard links, we would deduplicate the data in a different way. Apart from event duplication, the only reason we even have duplicate files right now is how the versioning works. And thus we rely on hard links to not waste space. But we should discuss this topic in the next meeting. |
||
| - What do we generally think about size limitations for various fields? | ||
|
Member
Generally +1. I guess many fields are already limited by the DB. So making this explicit is good.
Member
Author
@KatrinIhler wrote:
|
||
| - Abuse protection: this is just to prevent abuse, DoS, slowdowns and the like. Limit `description` to 2<sup>16</sup> bytes, limit `title`, `license`, ... to 1024 bytes. I think these limits make sense and should prevent OC from suffering from bad payloads. | ||
| - Semantic limits: for example, for `license`, we could say "it should just be an identifier for a license, so limit to 64 bytes". This is a lot more tricky, as one has to really think about the intended use case and runs the risk of making use cases impossible. | ||
|
Member
Yes, in this example for one use case we actually used URLs for licenses (but we stored them as extended metadata). So not too strict would be good 😅. |
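
To make the "abuse protection" option concrete, here is a minimal sketch of such a check. The limits for `description`, `title` and `license` are the ones proposed above; the `MAX_BYTES` table, the `oversized_fields` helper, and the choice to measure the UTF-8 encoding are assumptions made for illustration, not something the spec defines.

```python
# Hypothetical abuse-protection limits, in bytes.
# Assumption: sizes are measured on the UTF-8 encoding of the value.
MAX_BYTES = {
    "description": 2**16,
    "title": 1024,
    "license": 1024,
}


def oversized_fields(metadata: dict[str, str]) -> list[str]:
    """Return the names of all fields whose encoded value exceeds its limit.

    Fields without an entry in MAX_BYTES are not checked.
    """
    return [
        name
        for name, value in metadata.items()
        if name in MAX_BYTES and len(value.encode("utf-8")) > MAX_BYTES[name]
    ]
```

Note that a byte limit is not a character limit: a title made of 513 two-byte characters would already exceed 1024 bytes under this assumption.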
||
|
|
||
|
|
||
| ## TODO | ||
|
|
||
| - Metadata can be changed when a workflow is running or an event is scheduled | ||
| - Mhhh small problem: some workflows might depend on metadata, e.g. when creating images with metadata in them. So maybe workflows can declare dependencies on metadata? | ||
|
Member
At least as Opencast is currently set up, this can be a problem with the includes / conditionals. |
||
| - So maybe we cannot do this now; we can still add this feature in a second step, when we rework the workflow system 😈 | ||
| - Explain how snapshots are removed | ||
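
The idea of workflows declaring dependencies on metadata could look roughly like the sketch below. Everything here is hypothetical: `RunningWorkflow`, `metadata_dependencies` and `blocking_workflows` are names invented for illustration, and nothing like this exists in Opencast's workflow system today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningWorkflow:
    """Hypothetical view of a running workflow and the metadata it reads."""

    workflow_id: str
    # Names of metadata fields this workflow declares a dependency on.
    metadata_dependencies: frozenset[str] = frozenset()


def blocking_workflows(
    changed_fields: set[str], running: list[RunningWorkflow]
) -> list[str]:
    """Return IDs of running workflows that depend on any changed field.

    An empty result would mean the metadata change can be applied
    immediately, even though workflows are running.
    """
    return [
        wf.workflow_id
        for wf in running
        if wf.metadata_dependencies & changed_fields
    ]
```

Under this sketch, changing `description` while a workflow that only declares a dependency on `title` is running would not be blocked, whereas changing `title` would be.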
XACML also has `effect` such as `deny` or `permit`, but I don't think anything other than `permit` is actually possible. However, I'm not sure, and maybe people use `deny` rules.

Yeah, I'm aware, but deny rules are not supported by the search index or something like that, so my strong assumption was that no one is using them, since not all parts of Opencast support them.