This repository contains a pipeline for generating SHACLs, UMLs and Sempyro Pydantic classes from LinkML schema definitions, specifically designed for DCAT and related semantic web vocabularies.
.
├── metadata_automation/ # Code for this repository
│ ├── sempyro/templates/ # Custom Jinja templates for Pydantic generation
│ │ └── ...
│ └── ...
├── linkml-definitions/ # LinkML YAML schema definitions, organized by namespace
│ ├── dcat/
│ │ ├── dcat_resource.yaml
│ │ └── dcat_dataset.yaml
│ └── ...
├── inputs/ # Additional inputs besides the LinkML definitions
├── outputs/
│ ├── sempyro_classes/ # Generated Python files with Sempyro Pydantic classes
│ │ ├── dcat/
│ │ │ ├── dcat_resource.py
│ │ │ └── dcat_dataset.py
│ │ └── ...
│ ├── shacl_shapes/ # Generated Turtle files with SHACL shapes
│ └── ...
├── gen_sempyro.py # Generator script with custom import definitions
└── README.md
python 0_gen_linkml.py
The output directory is currently hardcoded to './temp-linkml'.
gen-shacl --include-annotations ./linkml-definitions/dcat/dcat_dataset.yaml > ./outputs/shacl_shapes/dcat_dataset.ttl
If the key and/or value of an entry under 'annotations' in a class or property/slot contains ':', it will be parsed as a URI.
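As a rough illustration of the rule above, the sketch below flags which annotation keys or values would be treated as URIs. It only mirrors the ':' check described here. The function name `treated_as_uri` and the sample values are hypothetical, and this is not the generator's actual code.

```python
def treated_as_uri(text: str) -> bool:
    """Mirror the rule described above: any ':' triggers URI parsing."""
    return ":" in text


# Hypothetical annotation values, for illustration only.
examples = ["dcat:Dataset", "https://example.org/x", "mandatory", "Note: free text"]
for value in examples:
    print(f"{value!r}: {treated_as_uri(value)}")
```

Under this rule, free text that happens to contain a colon (the last value) would also be parsed as a URI, so it is worth keeping colons out of plain-text annotations.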
The SHACLs are currently generated with all properties 'inline', which matches HealthDCAT-AP. The previous Health-RI v2 SHACLs had the properties separately.
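The gen-shacl command above handles a single schema. A minimal sketch of how the same invocation could be built for any schema file is shown below. The helper `shacl_command` is hypothetical, and it assumes each Turtle file is named after its schema and written to ./outputs/shacl_shapes.

```python
from pathlib import PurePosixPath


def shacl_command(schema: str, out_dir: str = "./outputs/shacl_shapes") -> tuple[list[str], str]:
    """Return the gen-shacl argv and the Turtle file its stdout should be written to."""
    # Assumption: the output file reuses the schema's base name with a .ttl suffix.
    target = PurePosixPath(out_dir) / (PurePosixPath(schema).stem + ".ttl")
    return ["gen-shacl", "--include-annotations", schema], str(target)


argv, target = shacl_command("./linkml-definitions/dcat/dcat_dataset.yaml")
print(" ".join(argv), ">", target)
```

The returned argv could then be passed to `subprocess.run(argv, stdout=fh, check=True)` with `fh` opened on the target path, looping over every YAML file under ./linkml-definitions/.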
gen-plantuml ./linkml-definitions/dcat/dcat_dataset.yaml --classes DCATDataset --classes DCATResource --directory ./tmp --classes FOAFAgent --classes DCATVCard
Run the generation script to convert LinkML definitions to Sempyro Pydantic classes:
python gen_sempyro.py
This will:
- Read LinkML YAML files from ./linkml-definitions/
- Add a link to ../rdf_model where necessary. The RDFModel class is only relevant for Sempyro, not for the SHACLs or UML.
- Add validation logic from ./inputs/sempyro/validation_logic.yaml to the relevant classes.
- Apply custom Jinja templates from ./templates/sempyro/
- Generate Python classes in ./sempyro_classes/
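The first and last steps above imply a mapping from each schema file to a generated module. The sketch below shows one way that mapping could work, assuming the namespace folder (for example dcat/) is mirrored in the output directory as in the directory tree. The helper `output_path` is hypothetical and is not taken from gen_sempyro.py.

```python
from pathlib import PurePosixPath


def output_path(schema: str,
                definitions_dir: str = "linkml-definitions",
                classes_dir: str = "sempyro_classes") -> str:
    """Map a schema path to the matching generated module path, keeping the namespace folder."""
    # Assumption: outputs mirror the namespace layout of the definitions.
    relative = PurePosixPath(schema).relative_to(definitions_dir)
    return str(PurePosixPath(classes_dir) / relative.with_suffix(".py"))


print(output_path("linkml-definitions/dcat/dcat_dataset.yaml"))
```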
The Pydantic generation uses adapted Jinja templates located in ./templates/sempyro/. These templates are necessary to:
- Handle Sempyro-specific class generation
- Customize output formatting
- Issue: LinkML's Pydantic generator doesn't handle the meaning property correctly for enums
- Workaround: We misuse the description property to generate proper enum values
- Example:

    Status:
      permissible_values:
        Completed:
          meaning: ADMSStatus.Completed
          description: ADMSStatus.Completed  # Used for actual enum value
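To make the workaround above concrete, the sketch below renders enum source text in which each member's value is taken from description rather than meaning. It is only an illustration: the real rendering happens in the Jinja templates, the helper `render_enum` is hypothetical, and the exact shape of the generated class is an assumption.

```python
def render_enum(name: str, permissible_values: dict) -> str:
    """Render Python enum source, taking each member's value from 'description'."""
    lines = [f"class {name}(Enum):"]
    for member, spec in permissible_values.items():
        # The workaround: 'description' carries the value, 'meaning' is ignored.
        lines.append(f"    {member} = {spec['description']}")
    return "\n".join(lines)


# Same content as the YAML example, expressed as a plain dict.
status = {
    "Completed": {
        "meaning": "ADMSStatus.Completed",
        "description": "ADMSStatus.Completed",
    }
}
print(render_enum("Status", status))
```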
- LinkML generation: Find a way to keep the single source of truth as clean as possible.
- LinkML generation: Integrate DCAT-AP and HealthDCAT-AP, either through defining it in the Single source of truth, or directly in LinkML.
- Sempyro generation: Per slot, swap 'range' with 'annotations/sempyro_range' so the right types are defined in the Sempyro classes.
- SHACL & Sempyro: Fix enums so they are compatible with the SHACLs and Sempyro
- Sempyro: Agree on a workflow to update Sempyro based on the Single source of truth.
- UML generation: Implement UML generation
- Generate CKAN properties (https://github.com/ckan/ckanext-dcat/tree/master/ckanext/dcat)
- Generate Discovery service mappings (https://github.com/GenomicDataInfrastructure/gdi-userportal-dataset-discovery-service)
- Generate HTML tables for Bikeshed
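The per-slot range swap listed above for Sempyro generation has not been implemented yet. One possible approach is sketched below, assuming slots are available as plain dictionaries and annotations are simple key/value pairs. The helper `apply_sempyro_range` and the sample slot are hypothetical.

```python
import copy


def apply_sempyro_range(slot: dict) -> dict:
    """Return a copy of the slot with 'range' replaced by annotations['sempyro_range'], if set."""
    result = copy.deepcopy(slot)
    override = result.get("annotations", {}).get("sempyro_range")
    if override is not None:
        result["range"] = override
    return result


# Hypothetical slot definition, for illustration only.
slot = {"range": "string", "annotations": {"sempyro_range": "LiteralField"}}
print(apply_sempyro_range(slot)["range"])
```

Working on a copy keeps the original definition untouched, so the same schema can still be used for SHACL and UML generation.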
- Single Source of Truth Integration
  - Investigate generating LinkML YAMLs from our canonical data models
  - Establish automated pipeline from source models to LinkML definitions
- SHACL Validation
  - Verify that SHACLs generated from these LinkML schemas match our requirements
  - Test round-trip compatibility: LinkML → SHACL → validation
- HealthDCAT-AP Integration
  - Convert existing HealthDCAT-AP SHACL constraints to LinkML format
  - Generate Sempyro classes directly from HealthDCAT-AP specifications
The gen_sempyro.py script defines custom imports to ensure generated classes have the correct Sempyro dependencies:
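    from linkml.generators.pydanticgen.template import Import, Imports, ObjectImport

    # Imports needed by the generated dcat_resource module
    imports_dcat_resource = (
        Imports()
        + Import(module="sempyro", objects=[ObjectImport(name="LiteralField"), ObjectImport(name="RDFModel")])
        + Import(module="sempyro.foaf", objects=[ObjectImport(name="Agent")])
        # ... more imports
    )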
Templates in ./templates/sempyro/ override default LinkML behavior to:
- Use Sempyro base classes instead of standard Pydantic
- Include RDF-specific annotations
- Handle semantic web type mappings
When adding new LinkML definitions:
- Place YAML files in appropriate namespace folders under ./linkml-definitions/
- Update gen_sempyro.py with any new import requirements
- Test generation and verify output in ./sempyro_classes/
- Document any new issues or workarounds in this README
- Generate CKAN fields (https://github.com/ckan/ckanext-dcat/tree/master/ckanext/dcat) from the LinkML definitions
- Generate OpenAPI/Discovery service specification from the LinkML definitions