Emitting Metadata for Containers and Datasets #86

Open · wants to merge 3 commits into base: master
1 change: 1 addition & 0 deletions docs/api/datahub-apis.md
@@ -19,6 +19,7 @@ We offer an SDK for both Python and Java that provide full functionality when it
- Define a lineage between data entities
- Executing bulk operations (e.g. adding tags to multiple datasets)
- Creating custom metadata entities
- Emitting metadata for datasets and containers using `MetadataChangeProposalWrapper`

Learn more about the SDKs:

91 changes: 91 additions & 0 deletions docs/architecture/metadata-ingestion.md
@@ -25,6 +25,97 @@ DataHub ships with a Python based [metadata-ingestion system](../../metadata-ingestion

As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCPs) at the point of origin.
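For the Kafka path, the same SDK provides a `DatahubKafkaEmitter` alongside the REST emitter. Below is a minimal sketch; the broker and schema-registry addresses are placeholders, and the URN is illustrative:

```python
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Placeholder connection details; point these at your Kafka broker and schema registry.
emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "localhost:9092",
                "schema_registry_url": "http://localhost:8081",
            }
        }
    )
)

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:your_platform,your_dataset,PROD)",
    aspect=DatasetPropertiesClass(description="Emitted via Kafka"),
)

# The Kafka emitter is asynchronous: emit_mcp enqueues, flush() blocks until delivery.
emitter.emit_mcp(mcp, callback=lambda err, msg: print(err or "delivered"))
emitter.flush()
```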

### Emitting Metadata for Containers and Datasets

To emit metadata for both datasets and their containers using the DataHub Python SDK, follow these steps:

1. **Install the DataHub Python SDK**:
```bash
pip install acryl-datahub
```

2. **Initialize `DatahubRestEmitter`**: Set up the emitter with your DataHub GMS endpoint.

3. **Create and Emit Metadata for the Dataset**:
- Use `MetadataChangeProposalWrapper` with `DatasetPropertiesClass` to set dataset properties.

4. **Create and Emit Metadata for the Container**:
- Use `MetadataChangeProposalWrapper` with `ContainerPropertiesClass` to set container properties.

5. **Link Dataset to Container** (Optional):
- Use `ContainerClass` to link the dataset to its container and emit the relationship.

Example code snippet:

```python
# MetadataChangeProposalWrapper lives in datahub.emitter.mcp, not in schema_classes
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    ContainerClass,
    ContainerPropertiesClass,
    DatasetPropertiesClass,
)

# Initialize the DataHub REST emitter
gms_endpoint = "http://localhost:8080" # Replace with your DataHub GMS endpoint
emitter = DatahubRestEmitter(gms_endpoint)

# Define the dataset URN
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:your_platform,your_dataset,PROD)"

# Define the container URN (container URNs take a single opaque id, not a platform/env tuple)
container_urn = "urn:li:container:your_container_id"

# Create the DatasetProperties aspect
dataset_properties = DatasetPropertiesClass(
description="This is a sample dataset",
customProperties={"key": "value"}
)

# Create the MetadataChangeProposal for the dataset
dataset_mcp = MetadataChangeProposalWrapper(
entityType="dataset",
entityUrn=dataset_urn,
changeType=ChangeTypeClass.UPSERT,
aspect=dataset_properties
)

# Emit the MetadataChangeProposal for the dataset
emitter.emit_mcp(dataset_mcp)

# Create the ContainerProperties aspect
container_properties = ContainerPropertiesClass(
name="Your Container Name",
description="Description of your container",
customProperties={"key": "value"}
)

# Create the MetadataChangeProposal for the container
container_mcp = MetadataChangeProposalWrapper(
entityType="container",
entityUrn=container_urn,
changeType=ChangeTypeClass.UPSERT,
aspect=container_properties
)

# Emit the MetadataChangeProposal for the container
emitter.emit_mcp(container_mcp)

# Optionally, link the dataset to the container
link_mcp = MetadataChangeProposalWrapper(
entityType="dataset",
entityUrn=dataset_urn,
changeType=ChangeTypeClass.UPSERT,
aspect=ContainerClass(container=container_urn)
)

# Emit the link between the dataset and the container
emitter.emit_mcp(link_mcp)
```

This approach ensures that both the dataset and its container are represented in DataHub and that the dataset is linked to its container.
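To confirm that the link was written, you can read the `container` aspect back. A short sketch, assuming the SDK's `DataHubGraph` client and the placeholder URNs from above:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import ContainerClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Fetch the dataset's container aspect; returns None if the dataset is not linked.
container_aspect = graph.get_aspect(
    entity_urn="urn:li:dataset:(urn:li:dataPlatform:your_platform,your_dataset,PROD)",
    aspect_type=ContainerClass,
)
print(container_aspect.container if container_aspect else "not linked")
```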

## Internal Components

### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job)
2 changes: 2 additions & 0 deletions docs/what/mxe.md
@@ -23,6 +23,8 @@ MCPs may be emitted by clients of DataHub's low-level ingestion APIs (e.g. ingestion sources)
during the process of metadata ingestion. The DataHub Python API exposes an interface for
easily sending MCPs into DataHub.

To emit metadata for both datasets and containers, use the `MetadataChangeProposalWrapper` class in the DataHub Python SDK: create and emit a separate `MetadataChangeProposalWrapper` for each entity so that both the dataset and its container are represented in DataHub, and link a dataset to its container with the `ContainerClass` aspect.
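As a minimal sketch (endpoint and URNs are placeholders), one wrapper per entity, emitted over REST:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ContainerClass, ContainerPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS endpoint

# One MCP for the container entity itself...
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:container:your_container_id",
        aspect=ContainerPropertiesClass(name="Your Container Name"),
    )
)
# ...and one on the dataset, linking it to that container.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:your_platform,your_dataset,PROD)",
        aspect=ContainerClass(container="urn:li:container:your_container_id"),
    )
)
```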

The default Kafka topic name for MCPs is `MetadataChangeProposal_v1`.

### Consumption