Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune uses LLM embeddings to map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.
(Formerly: INDEX – the Intelligent Data Steward Toolbox)
- LLM Embeddings: Uses state-of-the-art language models to capture semantic similarity.
- Intelligent Mapping: Improves over traditional string matching with context-aware comparisons.
- Extensible: Designed for integration into modern data harmonization pipelines.
Run the frontend client, API, vector database, and local embedding model using the local docker-compose file:
docker-compose -f docker-compose.local.yaml up
You can access the frontend at localhost:4200.
The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and needs, you can choose from the following options:
- Importing from OLS (Pre-integrated):
This is the most straightforward method. The API is integrated with the Ontology Lookup Service (OLS), allowing you to import any ontology available in their catalog.
curl -X 'PUT' \
  '{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
  -H 'accept: application/json'
- terminology_id (required): The ID of the ontology you want to import (e.g., hp, efo, chebi).
- vectorizer_model (optional): The vectorizer model used to generate embeddings, passed as the model query parameter.
- Example:
curl -X 'PUT' \
  '{api_url}/imports/terminology?terminology_id=hp' \
  -H 'accept: application/json'
- Importing SNOMED CT:
- SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with terminology_id=snomed, but provides a cleaner interface.
curl -X 'PUT' \
  '{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
  -H 'accept: application/json'
- vectorizer_model (optional): The vectorizer model used to generate embeddings, passed as the model query parameter.
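The two import endpoints above differ only in their path and query parameters. As a minimal sketch, assuming Python's standard library and a placeholder base URL (`http://localhost:8000` is hypothetical, not a documented default), the request could be assembled as follows. The helper names `build_import_url` and `build_import_request` are illustrative and not part of the Kitsune API.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_import_url(api_url: str, terminology_id: Optional[str] = None,
                     model: Optional[str] = None) -> str:
    """Return the import URL; terminology_id=None selects the SNOMED CT shortcut."""
    params = {}
    if terminology_id is None:
        path = "/imports/terminology/snomed"
    else:
        path = "/imports/terminology"
        params["terminology_id"] = terminology_id
    if model is not None:
        # The optional vectorizer model travels in the `model` query parameter.
        params["model"] = model
    query = f"?{urlencode(params)}" if params else ""
    return f"{api_url.rstrip('/')}{path}{query}"


def build_import_request(url: str) -> Request:
    """Build (but do not send) the PUT request used by the curl examples."""
    return Request(url, method="PUT", headers={"accept": "application/json"})


# Hypothetical base URL; replace it with your own {api_url}.
request = build_import_request(build_import_url("http://localhost:8000", "hp"))
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the request; the sketch stops short of that so it can be inspected without a running API.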
- Importing Your Own Ontology (JSONL Files):
For full flexibility, you can upload your own ontology using .jsonl (JSON Lines) files. This allows you to import:
- Terminologies (namespaces)
- Concepts (terms within the terminology)
- Mappings (links between embeddings and existing concepts)
⚠️ The objects should be imported in the following order:
- "Terminology"
- "Concepts"
- "Mappings"
curl -X 'PUT' \
  '{api_url}/imports/jsonl?object_type={object_type}' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@{your_file}.jsonl'
- object_type (required): One of terminology, concept, or mapping.
- file (required): The .jsonl file to be uploaded (multipart/form-data).
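Because the import order matters, it can help to plan the uploads before sending anything. The sketch below only computes the ordered list of `(object_type, url, path)` tuples; it does not perform the multipart upload itself. The function name `plan_uploads`, the base URL, and the file names are hypothetical.

```python
from typing import Dict, List, Tuple

# Order required by the API: terminologies first, then concepts, then mappings.
IMPORT_ORDER = ("terminology", "concept", "mapping")


def plan_uploads(api_url: str, files: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (object_type, url, path) tuples sorted into the required import order."""
    unknown = set(files) - set(IMPORT_ORDER)
    if unknown:
        raise ValueError(f"unsupported object_type(s): {sorted(unknown)}")
    base = api_url.rstrip("/")
    return [
        (object_type, f"{base}/imports/jsonl?object_type={object_type}", files[object_type])
        for object_type in IMPORT_ORDER
        if object_type in files
    ]


# Hypothetical file names, deliberately listed out of order.
plan = plan_uploads(
    "http://localhost:8000",
    {
        "mapping": "mappings.jsonl",
        "terminology": "terminology.jsonl",
        "concept": "concepts.jsonl",
    },
)
for object_type, url, path in plan:
    print(f"PUT {url}  <- {path}")
```

Each tuple corresponds to one invocation of the curl command above, run in the order returned.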
Each line in your .jsonl file must represent a single object with the following structure (pretty-printed and annotated here for readability):
{
"class": "<Terminology | Concept | Mapping>",
"id": "<uuid>",
"properties": { ... },
"references": { ... }, // for Concept and Mapping
"vectors": { ... } // optional, for Mapping only
}
- class, referring to the corresponding Terminology, Concept, or Mapping collection.
- id, a unique id per object, generated as a UUID.
- properties, a dictionary containing the properties of the object.
In addition, an object can contain the following if applicable:
- references, a dictionary specifying references between objects of different collections by their id. Not applicable to the Terminology collection.
- vectors, a dictionary containing the sentence embedding. Only applicable to the Mapping collection.
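The structural rules above can be checked before uploading a file. The following is a rough sketch rather than the validation the API itself performs; the function name `validate_record` is hypothetical, and the strictness of each check (for example, treating references as mandatory for Concept and Mapping) is an assumption drawn from the descriptions in this document.

```python
import uuid
from typing import Any, Dict, List

VALID_CLASSES = {"Terminology", "Concept", "Mapping"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one JSONL object (empty if none)."""
    problems = []
    cls = record.get("class")
    if cls not in VALID_CLASSES:
        problems.append(f"unknown class: {cls!r}")
    try:
        uuid.UUID(str(record.get("id")))
    except ValueError:
        problems.append("id is not a valid UUID")
    if not isinstance(record.get("properties"), dict):
        problems.append("properties must be a dictionary")
    # Assumption: references are required for Concept and Mapping.
    if cls in {"Concept", "Mapping"} and not isinstance(record.get("references"), dict):
        problems.append(f"{cls} needs a references dictionary")
    if cls == "Terminology" and "references" in record:
        problems.append("references is not applicable to Terminology")
    if cls != "Mapping" and "vectors" in record:
        problems.append("vectors is only applicable to Mapping")
    return problems


terminology = {
    "class": "Terminology",
    "id": "6c7b7146-5895-5097-a84e-df41b520c936",
    "properties": {"name": "OHDSI"},
}
print(validate_record(terminology))
```

Running such a check over every line of a file before upload surfaces malformed objects early, instead of partway through an import.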
Terminology has one attribute in its properties, name, which is the name of the terminology being imported.
{
"class": "Terminology",
"id": "6c7b7146-5895-5097-a84e-df41b520c936",
"properties": {
"name": "OHDSI"
}
}
Concept has two attributes in its properties, conceptID and prefLabel, which are the concept entry ID within the terminology and the preferred label for the entry, respectively.
A concept object also contains a reference attribute, hasTerminology, pointing to the terminology it belongs to.
{
"class": "Concept",
"id": "818fc18f-77ff-5889-9a23-51d1e85c368e",
"properties": {
"conceptID": "37523947",
"prefLabel": "Body Fat Percentage"
},
"references": {
"hasTerminology": "6c7b7146-5895-5097-a84e-df41b520c936"
}
}
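Terminology and Concept objects like the two shown above can be generated programmatically and written one object per line. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the UUID namespace is chosen only for illustration, so the ids it produces will differ from the ids in the examples, whose generation scheme is not documented here.

```python
import json
import uuid
from typing import Dict, List

# Assumption: any stable namespace works for deterministic ids; this one is illustrative.
NAMESPACE = uuid.NAMESPACE_URL


def make_terminology(name: str) -> Dict:
    """Build a Terminology object with a deterministic id derived from its name."""
    return {
        "class": "Terminology",
        "id": str(uuid.uuid5(NAMESPACE, name)),
        "properties": {"name": name},
    }


def make_concept(terminology: Dict, concept_id: str, pref_label: str) -> Dict:
    """Build a Concept object that references its terminology by id."""
    name = terminology["properties"]["name"]
    return {
        "class": "Concept",
        "id": str(uuid.uuid5(NAMESPACE, f"{name}:{concept_id}")),
        "properties": {"conceptID": concept_id, "prefLabel": pref_label},
        "references": {"hasTerminology": terminology["id"]},
    }


def to_jsonl(records: List[Dict]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


terminology = make_terminology("OHDSI")
concept = make_concept(terminology, "37523947", "Body Fat Percentage")
print(to_jsonl([terminology]), end="")
print(to_jsonl([concept]), end="")
```

Deterministic ids make it possible to regenerate the concept file later and still point at the same terminology object; random ids from `uuid.uuid4` would work as well, provided the references are kept consistent.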
A mapping object may or may not contain vectors, and the structure of the JSONL file changes accordingly. The structure also depends on whether you use Weaviate vectorizers.
Regardless, a mapping object always contains a reference attribute, hasConcept, pointing to the concept it belongs to.
A mapping object that does not use Weaviate vectorizers has two attributes in its properties, text and hasSentenceEmbedder, which are the description of its corresponding concept and the vectorizer model used to embed the description, respectively.
A pre-computed vector can be stored in the vectors dictionary under the key default.
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage",
"hasSentenceEmbedder": "nomic-embed-text"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
},
"vectors": {
"default": [0.1, 0.2, 0.3]
}
}
The vectors do not have to be pre-computed; if they are not supplied, they are computed during the import process. The structure of the JSONL file without vectors is shown below.
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage",
"hasSentenceEmbedder": "nomic-embed-text"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
}
}
Weaviate vectorizers use named vectors and compute the embeddings during the import process, eliminating the need for the hasSentenceEmbedder and vectors attributes.
{
"class": "Mapping",
"id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
"properties": {
"text": "Body Fat Percentage"
},
"references": {
"hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
}
}
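The three Mapping variants described above differ only in which optional fields are present. As a sketch, a single hypothetical helper (`make_mapping` is not part of Kitsune) could produce all of them; the rule that a pre-computed vector must be accompanied by an embedder name is an assumption based on the examples, not a documented constraint.

```python
import json
from typing import Dict, List, Optional


def make_mapping(mapping_id: str, text: str, concept_id: str,
                 embedder: Optional[str] = None,
                 vector: Optional[List[float]] = None) -> Dict:
    """Build a Mapping object.

    embedder=None models the Weaviate-vectorizer case (no hasSentenceEmbedder);
    vector=None leaves the embedding to be computed during import.
    """
    if vector is not None and embedder is None:
        # Assumption: a pre-computed vector should name the model that produced it.
        raise ValueError("a pre-computed vector requires an embedder name")
    properties = {"text": text}
    if embedder is not None:
        properties["hasSentenceEmbedder"] = embedder
    record = {
        "class": "Mapping",
        "id": mapping_id,
        "properties": properties,
        "references": {"hasConcept": concept_id},
    }
    if vector is not None:
        record["vectors"] = {"default": vector}
    return record


MAPPING_ID = "1b1b7a6e-9000-58ef-8f62-034b4795854a"
CONCEPT_ID = "818fc18f-77ff-5889-9a23-51d1e85c368e"

with_vector = make_mapping(MAPPING_ID, "Body Fat Percentage", CONCEPT_ID,
                           embedder="nomic-embed-text", vector=[0.1, 0.2, 0.3])
without_vector = make_mapping(MAPPING_ID, "Body Fat Percentage", CONCEPT_ID,
                              embedder="nomic-embed-text")
weaviate_style = make_mapping(MAPPING_ID, "Body Fat Percentage", CONCEPT_ID)

# Each object occupies exactly one line in the resulting .jsonl file.
for record in (with_vector, without_vector, weaviate_style):
    print(json.dumps(record))
```

The three objects share an id here only to mirror the examples; a real file would contain one of the variants per mapping, each with its own id.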