Kitsune

Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.

(Formerly: INDEX – the Intelligent Data Steward Toolbox)

Features

LLM Embeddings: Uses state-of-the-art language models to capture semantic similarity.
Intelligent Mapping: Improves over traditional string matching with context-aware comparisons.
Extensible: Designed for integration into modern data harmonization pipelines.

Installation

Run the frontend client, API, vector database, and local embedding model using the local docker-compose file:

docker-compose -f docker-compose.local.yaml up

You can access the frontend at localhost:4200.

Ontology Import via API

The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and needs, you can choose from the following options:

Importing from OLS (Pre-integrated):

This is the most straightforward method. The API is integrated with the Ontology Lookup Service (OLS), allowing you to import any ontology available in their catalog.
```
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
-H 'accept: application/json'
```
- terminology_id (required): The ID of the ontology you want to import (e.g., hp, efo, chebi).
- vectorizer_model (optional): The vectorizer model used to generate embeddings, passed as the model query parameter.
- Example:
```
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id=hp' \
-H 'accept: application/json'
```
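The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the function name `build_terminology_import` and the base URL `http://localhost:8000` are placeholders chosen for illustration, not values defined by Kitsune. It only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_terminology_import(api_url: str, terminology_id: str,
                             model: Optional[str] = None) -> Request:
    """Build (but do not send) the PUT request for an OLS terminology import."""
    params = {"terminology_id": terminology_id}
    if model is not None:
        # The optional vectorizer model travels in the `model` query parameter.
        params["model"] = model
    url = f"{api_url.rstrip('/')}/imports/terminology?{urlencode(params)}"
    return Request(url, method="PUT", headers={"accept": "application/json"})


# Placeholder base URL (assumption); replace it with your own {api_url}.
request = build_terminology_import("http://localhost:8000", "hp")
print(request.get_method(), request.full_url)
```

To actually trigger the import, call `urllib.request.urlopen(request)` against a running API instance.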
Importing SNOMED CT:
- SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with terminology_id=snomed, but provides a cleaner interface.
```
curl -X 'PUT' \
'{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
-H 'accept: application/json'
```
- vectorizer_model (optional): The vectorizer model used to generate embeddings, passed as the model query parameter.

Importing Your Own Ontology (JSONL Files):

For full flexibility, you can upload your own ontology using .jsonl (JSON Lines) files. This allows you to import:
- Terminologies (namespaces)
- Concepts (terms within the terminology)
- Mappings (links between embeddings and existing concepts)
⚠️ The objects should be imported in the following order:
1. "Terminology"
2. "Concepts"
3. "Mappings"
```
curl -X 'PUT' \
'{api_url}/imports/jsonl?object_type={object_type}' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@{your_file}.jsonl'
```
- object_type (required): One of terminology, concept, or mapping.
- file (required): The .jsonl file to upload (multipart/form-data).
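The upload and its required ordering can be sketched in Python as follows. This is an illustration using only the standard library, not Kitsune's own client: the helper names `sort_for_import` and `build_jsonl_upload` and the base URL are hypothetical, and the multipart body is assembled by hand to mirror the curl command above.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request

# Import order stated above: terminology first, then concepts, then mappings.
IMPORT_ORDER = ("terminology", "concept", "mapping")


def sort_for_import(object_types):
    """Return the given object types in the required import order."""
    return sorted(object_types, key=IMPORT_ORDER.index)


def build_jsonl_upload(api_url, object_type, filename, payload):
    """Build (but do not send) a multipart PUT request for one .jsonl file."""
    if object_type not in IMPORT_ORDER:
        raise ValueError(f"unknown object_type: {object_type!r}")
    boundary = uuid.uuid4().hex
    # Single form field named "file", matching -F 'file=@...' in the curl call.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    query = urlencode({"object_type": object_type})
    url = f"{api_url.rstrip('/')}/imports/jsonl?{query}"
    headers = {
        "accept": "application/json",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return Request(url, data=body, method="PUT", headers=headers)


print(sort_for_import(["mapping", "terminology", "concept"]))
```

Each request returned by `build_jsonl_upload` can be sent with `urllib.request.urlopen`, one object type at a time in the order given by `sort_for_import`.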

JSONL File Structure

Each line in your .jsonl file must represent a single object with the following structure:

{
  "class": "<Terminology | Concept | Mapping>",
  "id": "<uuid>",
  "properties": { ... },
  "references": { ... },         // for Concept and Mapping
  "vectors": { ... }             // optional, for Mapping only
}

class: the collection the object belongs to (Terminology, Concept, or Mapping).
id: a unique ID per object, generated as a UUID.
properties: a dictionary containing the properties of the object.

In addition, an object can contain the following if applicable:

references: a dictionary specifying references between objects of different collections by their id. Not applicable to the Terminology collection.
vectors: a dictionary containing the sentence embedding. Only applicable to the Mapping collection.
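Before uploading, it can help to check each line against the rules above. The following is a minimal sketch of such a check; `validate_record` is a hypothetical helper, and it encodes only the constraints described in this section, so the API's own validation may be stricter or differ in detail.

```python
import json
import uuid

ALLOWED_CLASSES = {"Terminology", "Concept", "Mapping"}


def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["line is not a JSON object"]

    problems = []
    cls = obj.get("class")
    if cls not in ALLOWED_CLASSES:
        problems.append(f"unexpected class: {cls!r}")
    try:
        uuid.UUID(str(obj.get("id")))
    except ValueError:
        problems.append("id is not a UUID")
    if not isinstance(obj.get("properties"), dict):
        problems.append("properties must be a dictionary")
    # references: not applicable to Terminology; vectors: Mapping only.
    if "references" in obj and cls == "Terminology":
        problems.append("references is not applicable to Terminology")
    if "vectors" in obj and cls != "Mapping":
        problems.append("vectors is only applicable to Mapping")
    return problems
```

Running it over every line of a file and printing the line number alongside any reported problems gives a quick pre-flight check.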

Example JSONL Structures

Terminology

Terminology has one attribute in its properties called name referring to the name of the terminology being imported.

{
    "class": "Terminology",
    "id": "6c7b7146-5895-5097-a84e-df41b520c936",
    "properties": {
        "name": "OHDSI"
    }
}

Concept

Concept has two attributes in its properties called conceptID and prefLabel referring to the concept entry ID within the terminology and preferred label for the entry, respectively.

A concept object also contains a reference attribute, hasTerminology, pointing to the terminology it belongs to.

{
    "class": "Concept",
    "id": "818fc18f-77ff-5889-9a23-51d1e85c368e",
    "properties": {
        "conceptID": "37523947",
        "prefLabel": "Body Fat Percentage"
    },
    "references": {
        "hasTerminology": "6c7b7146-5895-5097-a84e-df41b520c936"
    }
}

Mapping

A mapping object may or may not contain vectors, and the structure of the JSONL file changes accordingly. The structure also depends on whether you use Weaviate vectorizers.

Regardless, a mapping object always contains a reference attribute, hasConcept, pointing to the concept it belongs to.

Mapping Object without Utilizing Weaviate Vectorizers

A mapping object that does not use Weaviate vectorizers has two attributes in its properties, text and hasSentenceEmbedder, referring to the description of its corresponding concept and the vectorizer model used to embed the description, respectively.

A pre-computed vector can be stored in the vectors dictionary under the key default.

{
    "class": "Mapping",
    "id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
    "properties": {
        "text": "Body Fat Percentage",
        "hasSentenceEmbedder": "nomic-embed-text"
    },
    "references": {
        "hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
    },
    "vectors": {
        "default": [0.1, 0.2, 0.3]
    }
}

Vectors do not have to be pre-computed; if they are not supplied, they are computed during the import process. The structure of the JSONL file without vectors is shown below.

{
    "class": "Mapping",
    "id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
    "properties": {
        "text": "Body Fat Percentage",
        "hasSentenceEmbedder": "nomic-embed-text"
    },
    "references": {
        "hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
    }
}

Mapping Object Utilizing Weaviate Vectorizers

Weaviate vectorizers use named vectors and compute the embeddings during the import process, eliminating the need for the hasSentenceEmbedder and vectors attributes.

{
    "class": "Mapping",
    "id": "1b1b7a6e-9000-58ef-8f62-034b4795854a",
    "properties": {
        "text": "Body Fat Percentage"
    },
    "references": {
        "hasConcept": "818fc18f-77ff-5889-9a23-51d1e85c368e"
    }
}
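The three record types are linked only through their ids, so it is convenient to generate them together. The sketch below builds Terminology, Concept, and Mapping records in the vectorizer-based form shown above (text only, no vectors). It is an illustration under assumptions: `make_records`, `to_jsonl`, and the `NAMESPACE` value are hypothetical, and deriving ids with `uuid.uuid5` is just one way to obtain stable UUIDs, not necessarily how the ids in the examples were produced.

```python
import json
import uuid

# Hypothetical namespace for deterministic ids; not taken from Kitsune itself.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/my-ontology")


def make_records(terminology_name, concepts):
    """Build linked Terminology, Concept, and Mapping records.

    `concepts` is a list of (conceptID, prefLabel) pairs.
    """
    term_id = str(uuid.uuid5(NAMESPACE, terminology_name))
    terminology = [{
        "class": "Terminology",
        "id": term_id,
        "properties": {"name": terminology_name},
    }]
    concept_records, mapping_records = [], []
    for concept_id, label in concepts:
        cid = str(uuid.uuid5(NAMESPACE, f"{terminology_name}:{concept_id}"))
        concept_records.append({
            "class": "Concept",
            "id": cid,
            "properties": {"conceptID": concept_id, "prefLabel": label},
            "references": {"hasTerminology": term_id},
        })
        mapping_records.append({
            "class": "Mapping",
            "id": str(uuid.uuid5(NAMESPACE, f"{cid}:{label}")),
            "properties": {"text": label},
            "references": {"hasConcept": cid},
        })
    return terminology, concept_records, mapping_records


def to_jsonl(records):
    """Serialize records as JSON Lines: one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


terminology, concept_list, mapping_list = make_records(
    "OHDSI", [("37523947", "Body Fat Percentage")]
)
print(to_jsonl(terminology) + to_jsonl(concept_list) + to_jsonl(mapping_list), end="")
```

Writing each of the three lists to its own .jsonl file keeps them ready for upload in the required order: terminology, then concepts, then mappings.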
