-
Notifications
You must be signed in to change notification settings - Fork 1
2024 05 06
:::info Next meeting: 2024-05-21 12:30 CET (12:30pm) :::
Netlify has accepted our open source plan request! Astro site is deployed! It is deployed at: https://openpodcastapi.netlify.app/ and https://openpodcastapi.org
Last meeting notes: https://pad.funkwhale.audio/kIRwEOYDRNqTA4np6vbBVg#
| Test case | Comments | Test case OK? ✖✔ |
|---|---|---|
|
rss feed without episode guids (old) IDs need to be generated once |
||
|
server doesn’t know of new sync-ID, and phone doesn’t know of old one Old phone dies after listening, publisher changes GUID, new phone gets new GUID from feed and then synchronises with sync server. |
We need a mechanism to deduplicate episdes. | |
|
new client connects to server Client doesn't yet have sync-IDs of the episodes it finds in the RSS feed. |
We need a matching mechanism. If GUID is podcast-unique it's easy. But we need a fall-back/priority list (helpful across all scenarios). | |
|
Same feed has identical enclosure URL as an existing episode Podcaster goes on holiday and republishes existing episode (although often they'd record a new intro). |
||
Episode deduplication in Podverse: at beginning of parsing, compare to already known episodes; a) hide items that are no longer in the feed & b) add items that are new in the feed; after a while, clean up (delete) episodes. Episode matching:
- guid (coverage seems to have improved a lot over time), icw identifier of the podcast
- enclosure URL as fallback (which is problematic as different feeds can point to same episode)
Approach: deletion of episode from feed seen as author wants episode to be deleted from the internet. Exception: clips - linked episodes are kept as 'hidden' rows.
Podverse has been cleaning the dataset and dropping 'Title' from episode data. However, it could still act as 'second class' citizen.
Enclosure URL might be a better candidate.
https://pad.funkwhale.audio/oCfs5kJ6QTu02d_oVHW7DA#Fetch-hash:
- GUID
- Enclosure URL
- Episode URL
However, this wouldn't consider Hans-Peter's comments about title & date being helpful.
Also: guideline or hard spec? Former; we only need to specify which data is mandatory for episode matching/deduplication.
Is there a difference between 'episode matching' (assigning a sync-ID at client level) and deduplication?
Waterfall options:
- If step 1 (GUID) fails, check for step 2 etc
- If step 1 (GUID) fails, consider a new episode & continue, then at any later time let client use the deduplication mechanism and -endpoint to deduplicate
When & where is this waterfall run?
- At client side when refreshing feed.
- At server side when receiving 'new episode' from client.
- GUID
- Enclosure URL
- Matching of at least 2 out of 3 relevant fields:
- Publishing Date
- Episode link
- Title
If there's no match, go to the next step. If there's more than 1 match, take only those matches to the next stage.
If you fall through the entire waterfall, you assume duplicates, which then need to be flagged as such (see deduplication end-point).
We haven't reviewed all edge-cases; table is incomplete. We accept, however, that there will be edge-cases (with data loss) and get back to the edge-cases (completing the table) when we discuss deduplication endpoint. We can focus on the 99%, for episode matching based on waterfall.
Current state of data fields: https://pad.funkwhale.audio/s/88C5eXrRq
TODO: criticially examine which of the fields need a timestamp. Think of naming scheme for timestamp fields & JSON structure for endpoints.
Collate the information:
- Which fields need timestamps + proposal: https://pad.funkwhale.audio/s/6mWuDexgz#Data-timestamps
- Actions (calls) under the episode endpoint: https://pad.funkwhale.audio/oCfs5kJ6QTu02d_oVHW7DA#
- Deduplication endpoint already necessary? (Not a high priority; versioning & authentication to be done first).