Representing Dataset Consumers #89

Open
wants to merge 3 commits into
base: master
14 changes: 14 additions & 0 deletions docs/api/datahub-apis.md
@@ -72,6 +72,20 @@ Here's an overview of what each API can do.
| Add Owner to a Dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md#add-owners) | ✅ [[Guide]](/docs/api/tutorials/owners.md#add-owners) | ✅ |
| Remove Owner from a Dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md#remove-owners) | ✅ [[Guide]](/docs/api/tutorials/owners.md#remove-owners) | ✅ |
| Add Lineage | ✅ [[Guide]](/docs/api/tutorials/lineage.md) | ✅ [[Guide]](/docs/api/tutorials/lineage.md#add-lineage) | ✅ |

# Representing Dataset Consumers

DataHub provides several ways to represent dataset consumers, that is, the teams, jobs, or processes that read a specific dataset. The main methods are:

- **DataFlow and DataJob**: Use `DataFlow` to model a collection of data processing jobs and `DataJob` to represent individual jobs or tasks that consume data. This is particularly useful for processes like cron jobs or Airflow tasks.

- **Ownership and Tags**: Assign consumers as owners using the `ownership` aspect, and use tags to annotate datasets with consumer information.

- **Lineage Visualization**: Utilize lineage to visually represent the flow of data from producers to consumers, helping to understand data dependencies.

- **Custom Metadata**: When standard methods are insufficient, leverage custom metadata aspects to represent consumers by defining custom fields or properties.

Including examples for each method can enhance understanding and utility.
| Add Column Level (Fine Grained) Lineage | 🚫 | ✅ [[Guide]](/docs/api/tutorials/lineage.md#add-column-level-lineage) | ✅ |
| Add Documentation (Description) to a Column of a Dataset | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-column) | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-column) | ✅ |
| Add Documentation (Description) to a Dataset | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-dataset) | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-dataset) | ✅ |
14 changes: 14 additions & 0 deletions docs/modeling/metadata-model.md
@@ -74,6 +74,20 @@ By moving to this format, evolving the Metadata Model becomes much easier. Addin
to the YAML configuration, instead of creating new Snapshot / Aspect files.


## Representing Dataset Consumers

DataHub offers several methods for representing dataset consumers:

- **DataFlow and DataJob**: Use `DataFlow` to model a collection of data processing jobs and `DataJob` to represent individual jobs or tasks that consume data from datasets. This is particularly useful for processes like cron jobs or Airflow tasks.

- **Ownership and Tags**: Utilize the `ownership` aspect to denote teams or individuals as dataset consumers. Tags can also be used to annotate datasets with consumer information.

- **Lineage Visualization**: Leverage lineage to visually represent the flow of data from producers to consumers, helping to understand data dependencies and usage.

- **Custom Metadata**: When the standard methods do not suffice, define custom metadata aspects with the fields or properties needed to capture consumer information.

Including examples for each method can enhance understanding and utility.
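
As one illustration of the DataFlow/DataJob and lineage methods listed above, the following minimal Python sketch records which datasets each job reads and then inverts that mapping to list the consumers of each dataset. It uses plain data structures rather than the DataHub SDK. The `ConsumerJob` class, the `consumers_by_dataset` helper, and all URN values are hypothetical names chosen for illustration. The `inputDatasets` and `outputDatasets` keys are modeled on the `dataJobInputOutput` aspect.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConsumerJob:
    """Hypothetical stand-in for a DataJob and the datasets it reads."""

    job_urn: str
    input_datasets: List[str] = field(default_factory=list)

    def to_input_output(self) -> dict:
        # Payload shape loosely modeled on the dataJobInputOutput aspect.
        return {"inputDatasets": list(self.input_datasets), "outputDatasets": []}


def consumers_by_dataset(jobs: List[ConsumerJob]) -> Dict[str, List[str]]:
    """Invert job -> inputs into dataset -> consuming jobs."""
    consumers: Dict[str, List[str]] = defaultdict(list)
    for job in jobs:
        for dataset_urn in job.input_datasets:
            consumers[dataset_urn].append(job.job_urn)
    return dict(consumers)


# Illustrative URNs only; these entities are assumptions, not real assets.
ORDERS = "urn:li:dataset:(urn:li:dataPlatform:hive,sales.orders,PROD)"
jobs = [
    ConsumerJob(
        "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),load_orders)",
        [ORDERS],
    ),
    ConsumerJob(
        "urn:li:dataJob:(urn:li:dataFlow:(airflow,ml_features,prod),build_features)",
        [ORDERS],
    ),
]

for dataset_urn, job_urns in consumers_by_dataset(jobs).items():
    print(dataset_urn, "is consumed by", len(job_urns), "job(s)")
```

Modeling consumers as jobs with declared inputs means the consumer list never has to be maintained by hand: it can be derived from the same input/output information that drives lineage.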

## Exploring DataHub's Metadata Model

To explore the current DataHub metadata model, you can inspect this high-level picture that shows the different entities and the edges that represent the relationships between them.
8 changes: 4 additions & 4 deletions docs/what-is-datahub/datahub-concepts.md
@@ -125,14 +125,14 @@ A collection of Charts for visualization. Dashboards can have tags, owners, link

### Data Job

An executable job that processes data assets, where "processing" implies consuming data, producing data, or both.
An executable job that processes data assets, where "processing" implies consuming data, producing data, or both. A Data Job can represent a dataset consumer, such as a cron job or an Airflow task that reads from datasets.
In orchestration systems, this is sometimes referred to as an individual "Task" within a "DAG". Examples include an Airflow Task.

> - [Developer Guides: Data Job](/docs/generated/metamodel/entities/dataJob.md)

### Data Flow

An executable collection of Data Jobs with dependencies among them, or a DAG.
An executable collection of Data Jobs with dependencies among them, or a DAG. A Data Flow can model a process that consumes data from datasets, grouping the consuming Data Jobs into one structured unit.
Sometimes referred to as a "Pipeline". Examples include an Airflow DAG.

> - [Developer Guides: Data Flow](/docs/generated/metamodel/entities/dataFlow.md)
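
The containment relationship between a Data Flow and its Data Jobs is visible in the URNs themselves: a Data Job URN embeds the URN of the Data Flow it belongs to. The sketch below builds both with plain string formatting. The helper names and the `airflow`, `daily_report`, and `load_orders` values are illustrative assumptions; in practice the URN helpers in the DataHub Python SDK would normally be used instead of hand-built strings.

```python
def data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "prod") -> str:
    """Hypothetical helper: URN for a Data Flow (a pipeline or DAG)."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def data_job_urn(flow_urn: str, job_id: str) -> str:
    """Hypothetical helper: URN for a Data Job (a task) inside a Data Flow."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


flow = data_flow_urn("airflow", "daily_report")
job = data_job_urn(flow, "load_orders")

print(flow)  # urn:li:dataFlow:(airflow,daily_report,prod)
print(job)   # urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),load_orders)
```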
@@ -152,7 +152,7 @@ Glossary Term Group is similar to a folder, containing Terms and even other Term

### Tag

Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary.
Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary. Tags can also be used to annotate datasets with consumer information, providing insights into how datasets are used.

> - [Feature Guides: About DataHub Tags](/docs/tags.md)
> - [Developer Guides: Tags](/docs/generated/metamodel/entities/tag.md)
@@ -166,7 +166,7 @@ Domains are curated, top-level folders or categories where related assets can be

### Owner

Owner refers to the users or groups that has ownership rights over entities. For example, owner can be acceessed to dataset or a column or a dataset.
Owner refers to the users or groups that have ownership rights over entities. Ownership can also be used to denote teams or individuals as dataset consumers, indicating their responsibility for using the dataset. For example, an owner can be assigned to a dataset or to a column of a dataset.

> - [Getting Started : Adding Owners On Datasets/Columns](/docs/api/tutorials/owners.md#add-owners)
