Feature/entity codes #563

Open · wants to merge 7 commits into master
174 changes: 167 additions & 7 deletions docs/110-key-terms-and-features/100-entity-codes.md
@@ -1,6 +1,6 @@
---
layout: cluedin
title: Entity codes
title: Entity codes (Identifiers)
parent: Key terms and features
nav_order: 10
has_children: false
@@ -12,7 +12,9 @@ tags: ["development","entities","entity-codes"]
- TOC
{:toc}

A code is a way to instruct CluedIn to know what a completely unique reference to a clue is. If two clues have identical codes, they will be merged during processing.
A **code (identifier)** is a mechanism that CluedIn uses to define the **uniqueness** of a golden record.

During processing, if two clues share the **same code**, they are **merged** into a single golden record. This ensures that data from different sources is unified under a consistent, unique identifier.

**Example**

@@ -26,13 +28,13 @@ To find all the codes that uniquely represent a golden record in the system, go

The codes are divided into two sections:

- [Origin code](#entity-origin-code) – also referred to as the entity origin code. This is the primary unique identifier of a golden record in CluedIn.
- [Origin code](#entity-origin-code) – also referred to as the entity origin code. This is the **primary unique identifier** of a golden record in CluedIn.

- [Codes](#codes) – also referred to as entity codes. This section contains all codes associated with a golden record.

For more information, see the **Codes** section in our [Review mapping](/integration/review-mapping#codes) article.

## Entity origin code
## Entity origin code (Primary Identifier)

An entity origin code is a primary unique identifier of a record in CluedIn. The required details for producing the entity origin code are established when the mapping for a data set is created. To find these details, go to the **Map** tab of the data set and select **Edit mapping**. On the **Map entity** tab, you'll find the **Entity Origin** section, which contains the required details for producing the entity origin code.

@@ -42,7 +44,151 @@ The entity origin code is made up from the entity type (1), [origin](/key-terms-

![codes-3.png](../../assets/images/key-terms-and-features/codes-3.png)
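
As a rough sketch, the three parts combine into a single string similar to the `/Generic#CluedIn:<GUID>` example shown in the FAQ below. The exact separators used here are an assumption based on that example, not a guaranteed internal format:

```
def entity_origin_code(entity_type: str, origin: str, value: str) -> str:
    # Hypothetical composition of entity type, origin, and value,
    # modelled on the "/Generic#CluedIn:<GUID>" pattern from the FAQ.
    return f"/{entity_type}#{origin}:{value}"

print(entity_origin_code("Contact", "CRM", "12345"))  # /Contact#CRM:12345
```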

## Entity codes
### What happens if the value of an origin code is empty?

Because the primary identifier is required, if the attribute you picked in your mapping has an empty value, those values will fall back to a **hash code** that attempts to represent uniqueness.

Note that even though the hash code is a reasonable fallback, you need to consider whether it is viable for your source of data. For data sets with many blank values or very incomplete records, it can lead to unwanted merges.

Example:

```
{
firstName: "Robert",
lastName: "Smith"
}
```

will produce the hash code `e7c4d00573302d3b1432fd14d89e5dd0dc68a0ea`.

Bear in mind that hash codes are case-sensitive.

Given the same properties with all values in lower case:

```
{
firstName: "robert",
lastName: "smith"
}
```

it will produce the hash code `479b8ebe1612297996532b9abeeb9feee4ed4569`.
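
To illustrate why the two digests above differ, here is a minimal sketch that hashes the concatenated field values with SHA-1. The concatenation scheme is an assumption for demonstration purposes, so the digests it prints will not necessarily match the ones documented above:

```
import hashlib

def fallback_code(record: dict) -> str:
    # Concatenate the raw field values and hash them. Because the input
    # bytes differ between "Robert" and "robert", the digests differ too.
    payload = "|".join(str(v) for v in record.values())
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

print(fallback_code({"firstName": "Robert", "lastName": "Smith"}))
print(fallback_code({"firstName": "robert", "lastName": "smith"}))
```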

### Auto-generated uses a hash code

In the mapping, if you pick the `auto-generated` option, it will use the same hash code as documented above.

**When not to use auto-generated?**

As we have seen, `auto-generated` will **change the value** of the **code** when the **record changes**.

This means you should **avoid using auto-generated when you edit a data set**. In CluedIn, we offer the possibility to manipulate the source data, which is a great option as it can lead to much faster and better results in your golden records.

However, if you use the **auto-generated** option to identify the golden record, each time you change a value, a different code will be generated.

For example, let's say you add a rule that capitalizes firstName and lastName.

In our previous example, we had 2 records:

```
[{
firstName: "Robert",
lastName: "Smith"
}, {
firstName: "robert",
lastName: "smith"
}]
```
If we add a rule to capitalize firstName and lastName, it means the records will become:

```
[{
firstName: "Robert",
lastName: "Smith"
}, {
firstName: "Robert",
lastName: "Smith"
}]
```

This means that, if you use `auto-generated`, both records will now use `e7c4d00573302d3b1432fd14d89e5dd0dc68a0ea` as their code, so they will **merge**.

So, what's the catch?

**If you had ALREADY processed this data, it can lead to duplication**

If in the example above, you have:

- 1. Upload the following JSON:


```
[{
firstName: "Robert",
lastName: "Smith"
}, {
firstName: "robert",
lastName: "smith"
}]
```

- 2. Map the data with `auto-generated` as the primary key
- 3. Process the data
- 4. Switch to edit mode for the data set
- 5. Apply changes such as "Capitalize" on firstName and lastName
- 6. Re-process the data

You will end up with 3 golden records, because you have "changed" the origin code of the golden record that had lower-case values.

In the first processing run (step 3), you sent 2 codes:

- 1. `e7c4d00573302d3b1432fd14d89e5dd0dc68a0ea`, the hash code for the capitalized values
- 2. `479b8ebe1612297996532b9abeeb9feee4ed4569`, the hash code for the lower-case values

In the second processing run (step 6), you sent the same code twice:

- 3. `e7c4d00573302d3b1432fd14d89e5dd0dc68a0ea`, the hash code for the capitalized values
- 4. `e7c4d00573302d3b1432fd14d89e5dd0dc68a0ea`, an identical hash code for the record that previously had lower-case values, because it has now been capitalized

The records sent in 1, 3, and 4 will merge together.
The record sent in 2 will remain on its own.

{:.important}
If you want to edit your records in the source, make sure not to use **auto-generated**.
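
The sketch below models the two processing runs with the same illustrative hashing assumption as above (it is not CluedIn code). It shows that the lower-case record is assigned one code in the first run and a different, already-existing code in the second run, which is what leaves its original golden record behind:

```
import hashlib

def code(record):
    # Illustrative fallback hash only; not CluedIn's actual algorithm.
    return hashlib.sha1("|".join(record.values()).encode("utf-8")).hexdigest()

before = [
    {"firstName": "Robert", "lastName": "Smith"},
    {"firstName": "robert", "lastName": "smith"},
]
after = [{k: v.capitalize() for k, v in r.items()} for r in before]

print({code(r) for r in before})  # two distinct codes sent in the first run
print({code(r) for r in after})   # a single code, sent twice, in the second run
# The code produced for the lower-case record in the first run is never
# sent again, so the golden record it created is left behind on its own.
```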

### Don't want to use `auto-generated`? Use a composite code

You can use a concatenation of different attributes to create uniqueness, generally referred to as an `MDM Code`.

An MDM code can be a good avenue; for a customer, it could combine multiple attributes such as:

- firstName
- lastName
- line 1
- city
- country
- date of birth

This could provide uniqueness. If you go the `MDM Code` route, please make sure you normalize the values, either by creating a computed column for your data set or by adding a bit of glue code in the **advanced mapping** section. Our CluedIn experts can assist you.

Normalizing the MDM code is important because it avoids the scenario above, where editing values changes the entity origin code and can lead to undesired effects.
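
A minimal sketch of what such normalization glue could look like, assuming illustrative attribute names and rules (trimming, lower-casing, and accent stripping); the real computed column or advanced-mapping code will depend on your source:

```
import unicodedata

def normalize(value: str) -> str:
    # Trim, lower-case, and strip accents so that cosmetic edits such as
    # casing or stray spaces do not change the resulting code.
    value = unicodedata.normalize("NFKD", value.strip().lower())
    return "".join(c for c in value if not unicodedata.combining(c))

def mdm_code(record: dict) -> str:
    fields = ("firstName", "lastName", "line1", "city", "country", "dateOfBirth")
    return "|".join(normalize(record.get(f, "")) for f in fields)

print(mdm_code({
    "firstName": " Robert ", "lastName": "SMITH", "line1": "1 Main St",
    "city": "Copenhagen", "country": "DK", "dateOfBirth": "1980-01-01",
}))  # robert|smith|1 main st|copenhagen|dk|1980-01-01
```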


### I have no way to define uniqueness. What should I do?

Use a generated GUID via advanced mapping or a pre-process rule.

However, **each time you process**, it will create **duplicates**, so use this approach for **one-time-only** ingestion.
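
For example, a hypothetical pre-process step could stamp each incoming row with a random GUID (the `generatedId` column name is an assumption):

```
import uuid

rows = [
    {"firstName": "Robert", "lastName": "Smith"},
    {"firstName": "robert", "lastName": "smith"},
]
for row in rows:
    # A random (version 4) GUID is new on every run, which is exactly why
    # re-processing the same file would mint new codes and create duplicates.
    row["generatedId"] = str(uuid.uuid4())
```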

### I do not have uniqueness and need to process multiple times

If you find yourself in this case, the only way is to modify the source of data to set up some kind of uniqueness. For a SQL table, you can add a unique identifier for each row.

If there are no properties that define uniqueness, a composite code is not good enough because of too many blanks, and you need to process the source multiple times, then your only option is to fix the issue at the source level.

## Entity codes (Identifiers)

An entity code is an additional identifier that uniquely represents a record in CluedIn. The required details for producing the entity codes are established when the mapping for a data set is created. To find these details, go to the **Map** tab of the data set and select **Edit mapping**. On the **Map entity** tab, you'll find the **Codes** section, which contains the required details for producing the entity codes.

@@ -56,13 +202,23 @@ In the **Entity Codes** section, you can instruct CluedIn to produce additional

- **Strict edge codes** – codes that are built from the entity type, data source group ID/data source ID/data set ID, and the value from the column that was selected for producing the entity origin code.

### What happens if the value of a code is empty?

The **value** will be **ignored** and no code will be added. A code is not a required element, and using a hash code as a code would be unnecessary because you have already defined what "uniqueness" means with the entity origin code.

### I have no codes defined, is it bad?

No, this happens regularly, generally when the source records cannot be trusted or are unknown.

When in doubt, it is better not to add an extra code and to rely on Deduplication Projects to find duplicates.

## FAQ

**How to make sure that the codes will blend across different data sources?**

Since a code will only merge with another code if they are identical, how can you merge records across different systems if the origin is different? One of the ways to achieve it is through the GUID.

If a record has an adentifier that is a GUID/UUID, you can set the origin as CluedIn because no matter the system, the identifier should be unique. However, this is not applicable if you are using deterministic GUIDS. If you're wondering whether you use deterministic GUIDs, conducting preliminary analysis on the data can help. Check if many GUIDs overlap in a certain sequence, such as the first chunk of the GUID being replicated many times. This is a strong indicator that you are using deterministic GUIDs. Random GUIDs are so unique that the chance of them being the same is close to impossible.
If a record has an identifier that is a GUID/UUID, you can set the origin as CluedIn because no matter the system, the identifier should be unique. However, this is not applicable if you are using deterministic GUIDS. If you're wondering whether you use deterministic GUIDs, conducting preliminary analysis on the data can help. Check if many GUIDs overlap in a certain sequence, such as the first chunk of the GUID being replicated many times. This is a strong indicator that you are using deterministic GUIDs. Random GUIDs are so unique that the chance of them being the same is close to impossible.
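
One way to run that preliminary analysis is to count how often GUID prefixes repeat in your identifier column. A minimal sketch, where the eight-character prefix length and the threshold are arbitrary choices:

```
from collections import Counter

def suspicious_prefixes(guids, prefix_len=8, threshold=5):
    # Deterministic GUIDs often repeat their leading characters; random
    # (version 4) GUIDs almost never do in a reasonably sized sample.
    counts = Counter(g.lower()[:prefix_len] for g in guids)
    return {prefix: n for prefix, n in counts.items() if n >= threshold}

# Example: feed in the identifier column from your source extract.
# print(suspicious_prefixes(guid_values))
```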

You could even determine that the entity type can be generic as well. You will have to craft these special entity codes in your clues (for example, something like `/Generic#CluedIn:<GUID>`). You will need to make sure your edges support the same mechanism. In doing this, you are instructing CluedIn that no matter the entity type, no matter the origin of the data, this record can be uniquely identified by just the GUID.

@@ -72,4 +228,8 @@ Often you will find that you need to merge or link records across systems that d

**What if an identifier is not ready for producing a code?**

Sometimes identifiers for codes are not ready to be made into a unique entity origin code. For example, your data might include default or fallback values when a real value is not present. Imagine you have an EmployeeId column, and when a value is missing, placeholders like "NONE", "", or "N/A" are used. These are not valid identifiers for the EmployeeId. However, the important aspect is that you cannot handle all permutations of these placeholders upfront. Therefore, you should create codes with the intention that these values are unique. You can fix and clean up such values later.
Sometimes identifiers for codes are not ready to be made into a unique entity origin code. For example, your data might include default or fallback values when a real value is not present. Imagine you have an EmployeeId column, and when a value is missing, placeholders like "NONE", "", or "N/A" are used. These are not valid identifiers for the EmployeeId. However, the important aspect is that you cannot handle all permutations of these placeholders upfront. Therefore, you should create codes with the intention that these values are unique. You can fix and clean up such values later.

## Related articles

[Origin](/key-terms-and-features/origin)
81 changes: 79 additions & 2 deletions docs/110-key-terms-and-features/110-vocabularies.md
@@ -1,6 +1,6 @@
---
layout: cluedin
title: Vocabularies
title: Vocabularies (Schema)
parent: Key terms and features
nav_order: 11
has_children: false
@@ -14,7 +14,84 @@ tags: ["development","vocabularies"]

A vocabulary is a framework that defines how metadata is stored and organized within the system. A well-defined vocabulary is essential for maintaining data consistency, accuracy, and usability across an organization. By providing a standardized framework, a vocabulary contributes to effective data integration, improved decision making, and streamlined operations. It ensures that all stakeholders are working with consistent and reliable master data definitions and structures. For more information, see [Vocabulary](/management/data-catalog/vocabulary).

The primary purpose of vocabulary is to hold vocabulary keys. Vocabulary keys are your way to be able to describe properties that are coming in from data sources. To maintain proper data lineage, it is recommended that the Key Name is set to the exact same value of the property name coming in from the source system. For example, if you are integrating data from a CRM system and one of the properties is named "Column_12", then even though it is tempting to change the Key Name, we would recommend that you maintain that as the Key Name and that you can set the Display Name as something different for aesthetic reasons. For more information, see [Vocabulary keys](/management/data-catalog/vocabulary-keys).
The **primary purpose** of a vocabulary is to **hold vocabulary keys**. Vocabulary keys are the way you describe properties coming in from data sources.

A vocabulary is composed of multiple vocabulary groups and multiple vocabulary keys.


### Why is the prefix of a vocabulary so important?

The prefix of a vocabulary is part of what we call the **full vocabulary key**. This is the name of the attribute stored in our databases, and it is assigned automatically when you add a vocabulary key to a given vocabulary.

Changing the _prefix_ of an existing vocabulary could affect many records and would require a re-processing of the golden records (triggered automatically when you change the name in the UI).

Let's take the prefix `contact_person` and vocabulary keys called `firstName` and `lastName`.

If you rename `contact_person` to `contactPerson`, then _all the golden records_ using `contact_person.firstName` and `contact_person.lastName` will need to be changed. Be mindful that applying those changes to millions of golden records may take some time.

### One vocabulary per source to keep the lineage

Even if it is tempting to re-use a vocabulary across multiple sources, you should keep the vocabulary **close to your sources**.

This gives you better flexibility because your clean projects, rules, and deduplication projects know exactly where each vocabulary key is coming from.

To start using what we call a "Shared Vocabulary", you need to use the feature of mapping a key to another vocabulary key.

For example, let's take a field called "Email" from a source called "CRM".

You should use a vocabulary key called `CRM.contact.email`.

`CRM.contact` would be the prefix of a vocabulary, probably called "CRM Contact", and `email` would be a child vocabulary key of "CRM Contact", producing a full key of `CRM.contact.email`.

If you now add a source ERP, you would use a prefix such as `ERP.contact`, which would be the prefix of a vocabulary, probably called "ERP Contact", and `email` would be a child vocabulary key of "ERP Contact", producing a full key of `ERP.contact.email`.

Those two keys (`CRM.contact.email` and `ERP.contact.email`) represent the **same meaning**, so in your golden record you would want a single **shared vocabulary key** called `contact.email`.

This is possible by mapping those two keys to the shared vocabulary key `contact.email`.

Bear in mind that this means the data will be **flowing towards the shared vocabulary key**: the values of `CRM.contact.email` and `ERP.contact.email` will now all be located in `contact.email`.

For example:

| Source Type | Vocabulary | Vocabulary Key | Maps to |
|--|--|--|--|
| CRM | CRM.contact | email | contact.email |
| ERP | ERP.contact | email | contact.email |

By applying this principle, you keep your lineage while also gaining flexibility and agility: you can map to the shared vocabulary key only when you feel the data is ready. Maybe you want to clean it first, or maybe you want to keep it separate.
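
Conceptually, the mapping behaves like a key-to-key lookup applied during processing. The sketch below is not the CluedIn API; it only illustrates how values recorded under source-specific keys end up under the shared key:

```
MAPPINGS = {
    "CRM.contact.email": "contact.email",
    "ERP.contact.email": "contact.email",
}

def apply_mappings(properties: dict) -> dict:
    # Values recorded under a source-specific key flow towards the shared
    # vocabulary key they are mapped to; unmapped keys are kept as-is.
    merged = {}
    for key, value in properties.items():
        merged[MAPPINGS.get(key, key)] = value
    return merged

print(apply_mappings({"CRM.contact.email": "robert.smith@example.com"}))
# {'contact.email': 'robert.smith@example.com'}
```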

### What is the difference between Entity Type and Vocabulary?

When you map your data in CluedIn, you can have a one-to-one relationship between an entity type and a vocabulary. However, a vocabulary can be shared among different entity types or may represent only a partial aspect of the golden record.

For example:

- Only one entity type can be assigned to a golden record.
- Multiple vocabularies can be used for a given golden record.

By separating the notions of entity type and vocabulary, we decouple the "value" aspect of the records from the "modeling" aspect, which gives you more flexibility when modeling and lets the model evolve with your use cases.

## Vocabulary Groups

A vocabulary group is an optional grouping of vocabulary keys. It helps organize vocabulary keys into logical groups.

For example, you might have a "Social" group for the vocabulary "Contact".

The social group would have:

- LinkedIn Profile
- X.com username
- Website
- Blog

It is purely aesthetic and has no influence on your records.

If you do not provide any grouping for your vocabulary keys, they will be located under a group called `Ungrouped Keys`.

## Vocabulary Keys (attribute)

To maintain proper data lineage, it is recommended that the Key Name be set to exactly the same value as the property name coming in from the source system.


**Core vocabularies**
