From 6bb559d33a150bde8a798fc27deac481a4f9aa5a Mon Sep 17 00:00:00 2001
From: runllm
Date: Fri, 18 Apr 2025 15:18:24 +0000
Subject: [PATCH 1/3] Update metadata-ingestion.md

---
 docs/architecture/metadata-ingestion.md | 91 +++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/docs/architecture/metadata-ingestion.md b/docs/architecture/metadata-ingestion.md
index abf8fc24d1385..821035918af08 100644
--- a/docs/architecture/metadata-ingestion.md
+++ b/docs/architecture/metadata-ingestion.md
@@ -25,6 +25,97 @@ DataHub ships with a Python based [metadata-ingestion system](../../metadata-ing

As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] for you to integrate into your systems to emit metadata changes (MCP-s) at the point of origin.

### Emitting Metadata for Containers and Datasets

To emit metadata for both datasets and their containers using the DataHub Python SDK, follow these steps:

1. **Install the DataHub Python SDK**:
   ```bash
   pip install acryl-datahub
   ```

2. **Initialize the Emitter**: Set up the `DatahubRestEmitter` with your DataHub GMS endpoint.

3. **Create and Emit Metadata for the Dataset**:
   - Use `MetadataChangeProposalWrapper` with `DatasetPropertiesClass` to set dataset properties.

4. **Create and Emit Metadata for the Container**:
   - Use `MetadataChangeProposalWrapper` with `ContainerPropertiesClass` to set container properties.

5. **Link Dataset to Container** (Optional):
   - Use the `ContainerClass` aspect to link the dataset to its container and emit the relationship.
Example code snippet:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    ContainerClass,
    ContainerPropertiesClass,
    DatasetPropertiesClass,
)

# Initialize the DataHub REST emitter
gms_endpoint = "http://localhost:8080"  # Replace with your DataHub GMS endpoint
emitter = DatahubRestEmitter(gms_endpoint)

# Define the dataset URN
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:your_platform,your_dataset,PROD)"

# Define the container URN. Unlike dataset URNs, container URNs are a
# flat identifier (typically a GUID), not a tuple.
container_urn = "urn:li:container:your_container_id"

# Create the DatasetProperties aspect
dataset_properties = DatasetPropertiesClass(
    description="This is a sample dataset",
    customProperties={"key": "value"},
)

# Create the MetadataChangeProposal for the dataset
dataset_mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    entityUrn=dataset_urn,
    changeType=ChangeTypeClass.UPSERT,
    aspect=dataset_properties,
)

# Emit the MetadataChangeProposal for the dataset
emitter.emit_mcp(dataset_mcp)

# Create the ContainerProperties aspect
container_properties = ContainerPropertiesClass(
    name="Your Container Name",
    description="Description of your container",
    customProperties={"key": "value"},
)

# Create the MetadataChangeProposal for the container
container_mcp = MetadataChangeProposalWrapper(
    entityType="container",
    entityUrn=container_urn,
    changeType=ChangeTypeClass.UPSERT,
    aspect=container_properties,
)

# Emit the MetadataChangeProposal for the container
emitter.emit_mcp(container_mcp)

# Optionally, link the dataset to the container by writing the
# "container" aspect on the dataset entity
link_mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    entityUrn=dataset_urn,
    changeType=ChangeTypeClass.UPSERT,
    aspect=ContainerClass(container=container_urn),
)

# Emit the link between the dataset and the container
emitter.emit_mcp(link_mcp)
```

This
approach ensures that both the dataset and its container are represented in DataHub, and the dataset is linked to its container.

## Internal Components

### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job)

From d8be5164f868c143bb2928451d7328ffb06c674d Mon Sep 17 00:00:00 2001
From: runllm
Date: Fri, 18 Apr 2025 15:18:25 +0000
Subject: [PATCH 2/3] Update mxe.md

---
 docs/what/mxe.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/what/mxe.md b/docs/what/mxe.md
index 25294e04ea3d9..89131b1931069 100644
--- a/docs/what/mxe.md
+++ b/docs/what/mxe.md
@@ -23,6 +23,8 @@ MCPs may be emitted by clients of DataHub's low-level ingestion APIs (e.g. inges

during the process of metadata ingestion. The DataHub Python API exposes an interface for easily sending MCPs into DataHub.

To emit metadata for both datasets and containers, you can use the `MetadataChangeProposalWrapper` class in the DataHub Python SDK. This involves creating and emitting separate `MetadataChangeProposalWrapper` instances for each entity, ensuring that both the dataset and its container are properly represented in DataHub. You can also link a dataset to its container using the `ContainerClass` aspect.

The default Kafka topic name for MCPs is `MetadataChangeProposal_v1`.

### Consumption

From e5cb3e23c4cb996baf06bcd81f1b5d16fa8d9388 Mon Sep 17 00:00:00 2001
From: runllm
Date: Fri, 18 Apr 2025 15:18:26 +0000
Subject: [PATCH 3/3] Update datahub-apis.md

---
 docs/api/datahub-apis.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/api/datahub-apis.md b/docs/api/datahub-apis.md
index c46aacde3a0cb..e38c145eeb082 100644
--- a/docs/api/datahub-apis.md
+++ b/docs/api/datahub-apis.md
@@ -19,6 +19,7 @@ We offer an SDK for both Python and Java that provide full functionality when it

- Define a lineage between data entities
- Executing bulk operations - e.g.
adding tags to multiple datasets
- Creating custom metadata entities
- Emitting metadata for datasets and containers using `MetadataChangeProposalWrapper`

Learn more about the SDKs: