
Commit 477c7d3

[SDP] Sink et al.
1 parent 93c3e1c commit 477c7d3

10 files changed: +185 -93 lines changed


docs/declarative-pipelines/DataflowGraph.md

Lines changed: 2 additions & 1 deletion
@@ -8,6 +8,7 @@
 
 * <span id="flows"> [Flow](Flow.md)s
 * <span id="tables"> [Table](Table.md)s
+* <span id="sinks"> [Sink](Sink.md)s
 * <span id="views"> [View](View.md)s
 
 `DataflowGraph` is created when:

@@ -26,7 +27,7 @@ output: Map[TableIdentifier, Output]
 
 Learn more in the [Scala Language Specification]({{ scala.spec }}/05-classes-and-objects.html#lazy).
 
-`output` is a collection of unique `Output`s ([Table](Table.md)s) by their `TableIdentifier`.
+`output` is a collection of unique `Output`s ([tables](#tables) and [sinks](#sinks)) by their `TableIdentifier`.
 
 ---
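
The diff above does not show how `output` is assembled. A minimal, hypothetical sketch of a map of unique outputs (tables and sinks) keyed by `TableIdentifier`, with simplified stand-in types (not the actual Spark source):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Hypothetical, simplified model: both tables and sinks are Outputs with an identifier.
trait Output { def identifier: TableIdentifier }
final case class Table(identifier: TableIdentifier) extends Output
final case class Sink(identifier: TableIdentifier) extends Output

class DataflowGraph(tables: Seq[Table], sinks: Seq[Sink]) {
  // Unique Outputs (tables and sinks) by their TableIdentifier,
  // computed lazily on first access.
  lazy val output: Map[TableIdentifier, Output] =
    (tables ++ sinks).map(o => o.identifier -> o).toMap
}
```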

docs/declarative-pipelines/GraphElementRegistry.md

Lines changed: 5 additions & 4 deletions
@@ -4,21 +4,22 @@
 
 ## Contract
 
-### register_dataset { #register_dataset }
+### Register Output { #register_output }
 
 ```py
-register_dataset(
+register_output(
   self,
-  dataset: Dataset,
+  output: Output,
 ) -> None
 ```
 
 See:
 
-* [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_dataset)
+* [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_output)
 
 Used when:
 
+* [create_sink](./index.md#create_sink) is used
 * [@create_streaming_table](./index.md#create_streaming_table), [@table](./index.md#table), [@materialized_view](./index.md#materialized_view), [@temporary_view](./index.md#temporary_view) decorators are used
 
 ### register_flow { #register_flow }

docs/declarative-pipelines/GraphRegistrationContext.md

Lines changed: 56 additions & 21 deletions
@@ -24,17 +24,29 @@ Eventually, `GraphRegistrationContext` [becomes a DataflowGraph](#toDataflowGrap
 toDataflowGraph: DataflowGraph
 ```
 
-`toDataflowGraph` creates a new [DataflowGraph](DataflowGraph.md) with the [tables](#tables), [views](#views), and [flows](#flows) fully-qualified, resolved, and de-duplicated.
+`toDataflowGraph` creates a new [DataflowGraph](DataflowGraph.md) with the [tables](#tables), [views](#views), [sinks](#sinks) and [flows](#flows) fully-qualified, resolved, and de-duplicated.
 
 ??? note "AnalysisException"
-    `toDataflowGraph` reports an `AnalysisException` for a `GraphRegistrationContext` with no [tables](#tables) and no `PersistedView`s (in the [views](#views) registry).
+    `toDataflowGraph` reports an `AnalysisException` when this `GraphRegistrationContext` is [empty](#isPipelineEmpty).
 
 ---
 
 `toDataflowGraph` is used when:
 
 * `PipelinesHandler` is requested to [start a pipeline run](PipelinesHandler.md#startRun)
 
+### isPipelineEmpty { #isPipelineEmpty }
+
+```scala
+isPipelineEmpty: Boolean
+```
+
+`isPipelineEmpty` is `true` when this pipeline (this `GraphRegistrationContext`) is empty, i.e., all of the following are met:
+
+1. No [tables](#tables) registered
+1. No [PersistedView](PersistedView.md)s registered (among the [views](#views))
+1. No [sinks](#sinks) registered
+
 ### assertNoDuplicates { #assertNoDuplicates }
 
 ```scala
@@ -65,48 +77,71 @@ Flow [flow_name] was found in multiple datasets: [dataset_names]
 
 `GraphRegistrationContext` creates an empty registry of [Table](Table.md)s when [created](#creating-instance).
 
-A new [Table](Table.md) is added when [registerTable](#registerTable).
+A new [Table](Table.md) is added when `GraphRegistrationContext` is requested to [register a table](#registerTable).
 
 ## Views { #views }
 
 `GraphRegistrationContext` creates an empty registry of [View](View.md)s when [created](#creating-instance).
 
+## Sinks { #sinks }
+
+`GraphRegistrationContext` creates an empty registry of [Sink](Sink.md)s when [created](#creating-instance).
+
 ## Flows { #flows }
 
 `GraphRegistrationContext` creates an empty registry of [UnresolvedFlow](UnresolvedFlow.md)s when [created](#creating-instance).
 
-## Register Table { #registerTable }
+## Register Flow { #registerFlow }
 
 ```scala
-registerTable(
-  tableDef: Table): Unit
+registerFlow(
+  flowDef: UnresolvedFlow): Unit
 ```
 
-`registerTable` adds the given [Table](Table.md) to the [tables](#tables) registry.
+`registerFlow` adds the given [UnresolvedFlow](UnresolvedFlow.md) to the [flows](#flows) registry.
 
 ---
 
-`registerTable` is used when:
+`registerFlow` is used when:
 
-* `PipelinesHandler` is requested to [define a dataset](PipelinesHandler.md#defineDataset)
+* `PipelinesHandler` is requested to [define a flow](PipelinesHandler.md#defineFlow)
+* `SqlGraphRegistrationContext` is requested to [process the following logical commands](SqlGraphRegistrationContext.md#processSqlQuery):
+    * [CREATE FLOW ... AS INSERT INTO ... BY NAME](../logical-operators/CreateFlowCommand.md)
+    * [CREATE MATERIALIZED VIEW ... AS](../logical-operators/CreateMaterializedViewAsSelect.md)
+    * [CREATE STREAMING TABLE ... AS](../logical-operators/CreateStreamingTableAsSelect.md)
+    * [CREATE TEMPORARY VIEW](../logical-operators/CreateViewCommand.md)
+    * [CREATE VIEW](../logical-operators/CreateView.md)
 
-## Register Flow { #registerFlow }
+## Register Sink { #registerSink }
 
 ```scala
-registerFlow(
-  flowDef: UnresolvedFlow): Unit
+registerSink(
+  sinkDef: Sink): Unit
 ```
 
-`registerFlow` adds the given [UnresolvedFlow](UnresolvedFlow.md) to the [flows](#flows) registry.
+`registerSink` adds the given [Sink](Sink.md) to the [sinks](#sinks) registry.
 
 ---
 
-`registerFlow` is used when:
+`registerSink` is used when:
 
-* `PipelinesHandler` is requested to [define a flow](PipelinesHandler.md#defineFlow)
-* `SqlGraphRegistrationContext` is requested to [handle the following logical commands](SqlGraphRegistrationContext.md#processSqlQuery):
-    * [CreateFlowCommand](SqlGraphRegistrationContext.md#CreateFlowCommand)
-    * [CreateMaterializedViewAsSelect](SqlGraphRegistrationContext.md#CreateMaterializedViewAsSelect)
-    * [CreateView](SqlGraphRegistrationContext.md#CreateView)
-    * [CreateStreamingTableAsSelect](SqlGraphRegistrationContext.md#CreateStreamingTableAsSelect)
-    * [CreateViewCommand](SqlGraphRegistrationContext.md#CreateViewCommand)
+* `PipelinesHandler` is requested to [define an output](PipelinesHandler.md#defineOutput)
+
+## Register Table { #registerTable }
+
+```scala
+registerTable(
+  tableDef: Table): Unit
+```
+
+`registerTable` adds the given [Table](Table.md) to the [tables](#tables) registry.
+
+---
+
+`registerTable` is used when:
+
+* `PipelinesHandler` is requested to [define an output](PipelinesHandler.md#defineOutput)
+* `SqlGraphRegistrationContext` is requested to [process the following logical commands](SqlGraphRegistrationContext.md#processSqlQuery):
+    * [CREATE MATERIALIZED VIEW ... AS](../logical-operators/CreateMaterializedViewAsSelect.md)
+    * [CREATE STREAMING TABLE ... AS](../logical-operators/CreateStreamingTableAsSelect.md)
+    * [CREATE STREAMING TABLE](../logical-operators/CreateStreamingTable.md)
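
Taken together, the registries and the `register*` methods above form a small mutable-registry class. A minimal, hypothetical sketch with simplified stand-in types (not the actual Spark source):

```scala
import scala.collection.mutable

// Simplified stand-ins for the real pipeline types.
final case class Table(name: String)
sealed trait View
final case class PersistedView(name: String) extends View
final case class TemporaryView(name: String) extends View
final case class Sink(name: String)
final case class UnresolvedFlow(name: String)

class GraphRegistrationContext {
  // Empty registries, created with the instance.
  val tables = mutable.ListBuffer.empty[Table]
  val views  = mutable.ListBuffer.empty[View]
  val sinks  = mutable.ListBuffer.empty[Sink]
  val flows  = mutable.ListBuffer.empty[UnresolvedFlow]

  def registerTable(tableDef: Table): Unit = tables += tableDef
  def registerFlow(flowDef: UnresolvedFlow): Unit = flows += flowDef
  def registerSink(sinkDef: Sink): Unit = sinks += sinkDef

  // Empty pipeline: no tables, no persisted views, and no sinks (cf. isPipelineEmpty).
  def isPipelineEmpty: Boolean =
    tables.isEmpty &&
      !views.exists(_.isInstanceOf[PersistedView]) &&
      sinks.isEmpty
}
```

Registering a single table, persisted view, or sink is enough to make `isPipelineEmpty` return `false`.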

docs/declarative-pipelines/PersistedView.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# PersistedView
+
+`PersistedView` is...FIXME

docs/declarative-pipelines/PipelinesHandler.md

Lines changed: 35 additions & 23 deletions
@@ -26,16 +26,17 @@ handlePipelinesCommand(
 |-----------------|-------------|-----------|
 | `CREATE_DATAFLOW_GRAPH` | [Creates a new dataflow graph](#CREATE_DATAFLOW_GRAPH) | [pyspark.pipelines.spark_connect_pipeline](spark_connect_pipeline.md#create_dataflow_graph) |
 | `DROP_DATAFLOW_GRAPH` | [Drops a pipeline](#DROP_DATAFLOW_GRAPH) ||
-| `DEFINE_DATASET` | [Defines a dataset](#DEFINE_DATASET) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_dataset) |
+| `DEFINE_OUTPUT` | [Defines an output](#DEFINE_OUTPUT) (a table, a materialized view, a temporary view or a sink) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_output) |
 | `DEFINE_FLOW` | [Defines a flow](#DEFINE_FLOW) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_flow) |
 | `START_RUN` | [Starts a pipeline run](#START_RUN) | [pyspark.pipelines.spark_connect_pipeline](spark_connect_pipeline.md#start_run) |
 | `DEFINE_SQL_GRAPH_ELEMENTS` | [DEFINE_SQL_GRAPH_ELEMENTS](#DEFINE_SQL_GRAPH_ELEMENTS) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_sql) |
 
-`handlePipelinesCommand` reports an `UnsupportedOperationException` for incorrect commands:
+??? warning "UnsupportedOperationException"
+    `handlePipelinesCommand` reports an `UnsupportedOperationException` for incorrect commands:
 
-```text
-[other] not supported
-```
+    ```text
+    [other] not supported
+    ```
 
 ---
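
The table above amounts to a single dispatch on the command type. A hypothetical sketch of such a dispatch, with a simplified command model (not the actual Spark Connect handler):

```scala
// Hypothetical, simplified model of the pipeline commands and their dispatch.
sealed trait PipelineCommand
case object CreateDataflowGraph extends PipelineCommand
case object DropDataflowGraph extends PipelineCommand
final case class DefineOutput(name: String) extends PipelineCommand
final case class DefineFlow(name: String) extends PipelineCommand
case object StartRun extends PipelineCommand
final case class DefineSqlGraphElements(sql: String) extends PipelineCommand
final case class Other(name: String) extends PipelineCommand

def handlePipelinesCommand(cmd: PipelineCommand): Unit = cmd match {
  case CreateDataflowGraph       => println("create dataflow graph")
  case DropDataflowGraph         => println("drop dataflow graph")
  case DefineOutput(name)        => println(s"define output: $name")
  case DefineFlow(name)          => println(s"define flow: $name")
  case StartRun                  => println("start run")
  case DefineSqlGraphElements(_) => println("define SQL graph elements")
  case Other(other) =>
    // Mirrors the UnsupportedOperationException reported for unrecognized commands.
    throw new UnsupportedOperationException(s"$other not supported")
}
```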

@@ -51,15 +52,15 @@ handlePipelinesCommand(
 
 [handlePipelinesCommand](#handlePipelinesCommand)...FIXME
 
-### <span id="DefineDataset"> DEFINE_DATASET { #DEFINE_DATASET }
+### <span id="DefineOutput"> DEFINE_OUTPUT { #DEFINE_OUTPUT }
 
 [handlePipelinesCommand](#handlePipelinesCommand) prints out the following INFO message to the logs:
 
 ```text
-Define pipelines dataset cmd received: [cmd]
+Define pipelines output cmd received: [cmd]
 ```
 
-`handlePipelinesCommand` [defines a dataset](#defineDataset).
+`handlePipelinesCommand` [defines an output](#defineOutput) and responds with a resolved dataset (with a catalog and a database when specified).
 
 ### <span id="DefineFlow"> DEFINE_FLOW { #DEFINE_FLOW }
 
@@ -94,8 +95,8 @@ startRun(
   sessionHolder: SessionHolder): Unit
 ```
 
-??? note "`START_RUN` Pipeline Command"
-    `startRun` is used when `PipelinesHandler` is requested to handle [proto.PipelineCommand.CommandTypeCase.START_RUN](#START_RUN) command.
+??? note "START_RUN Pipeline Command"
+    `startRun` is used to handle the [START_RUN](#START_RUN) pipeline command.
 
 `startRun` finds the [GraphRegistrationContext](GraphRegistrationContext.md) by `dataflowGraphId` in the [DataflowGraphRegistry](DataflowGraphRegistry.md) (in the given `SessionHolder`).
 
@@ -113,6 +114,9 @@ createDataflowGraph(
   spark: SparkSession): String
 ```
 
+??? note "CREATE_DATAFLOW_GRAPH Pipeline Command"
+    `createDataflowGraph` is used to handle the [CREATE_DATAFLOW_GRAPH](#CREATE_DATAFLOW_GRAPH) pipeline command.
+
 `createDataflowGraph` gets the catalog (from the given `CreateDataflowGraph` if defined in the [pipeline specification file](index.md#pipeline-specification-file)) or prints out the following INFO message to the logs and uses the current catalog instead.
 
 ```text
@@ -127,40 +131,48 @@ No default database was supplied. Falling back to the current database: [current
 
 In the end, `createDataflowGraph` [creates a dataflow graph](DataflowGraphRegistry.md#createDataflowGraph) (in the session's [DataflowGraphRegistry](DataflowGraphRegistry.md)).
 
-## defineSqlGraphElements { #defineSqlGraphElements }
+## Define SQL Datasets { #defineSqlGraphElements }
 
 ```scala
 defineSqlGraphElements(
   cmd: proto.PipelineCommand.DefineSqlGraphElements,
   session: SparkSession): Unit
 ```
 
+??? note "DEFINE_SQL_GRAPH_ELEMENTS Pipeline Command"
+    `defineSqlGraphElements` is used to handle the [DEFINE_SQL_GRAPH_ELEMENTS](#DEFINE_SQL_GRAPH_ELEMENTS) pipeline command.
+
 `defineSqlGraphElements` [looks up the GraphRegistrationContext for the dataflow graph ID](DataflowGraphRegistry.md#getDataflowGraphOrThrow) (from the given `DefineSqlGraphElements` command and in the given `SessionHolder`).
 
 `defineSqlGraphElements` creates a new [SqlGraphRegistrationContext](SqlGraphRegistrationContext.md) (for the `GraphRegistrationContext`) to [process the SQL definition file](SqlGraphRegistrationContext.md#processSqlFile).
 
-## Define Dataset (Table or View) { #defineDataset }
+## Define Output { #defineOutput }
 
 ```scala
-defineDataset(
-  dataset: proto.PipelineCommand.DefineDataset,
-  sparkSession: SparkSession): Unit
+defineOutput(
+  output: proto.PipelineCommand.DefineOutput,
+  sessionHolder: SessionHolder): TableIdentifier
 ```
 
-`defineDataset` looks up the [GraphRegistrationContext](DataflowGraphRegistry.md#getDataflowGraphOrThrow) for the given `dataset` (or throws a `SparkException` if not found).
+??? note "DEFINE_OUTPUT Pipeline Command"
+    `defineOutput` is used to handle the [DEFINE_OUTPUT](#DEFINE_OUTPUT) pipeline command.
+
+`defineOutput` looks up the [GraphRegistrationContext](DataflowGraphRegistry.md#getDataflowGraphOrThrow) for the dataflow graph ID of the given `output` (or throws a `SparkException` if not found).
 
-`defineDataset` branches off based on the `dataset` type:
+`defineOutput` branches off based on the `output` type:
 
 | Dataset Type | Action |
 |--------------|--------|
 | `MATERIALIZED_VIEW` or `TABLE` | [Registers a table](GraphRegistrationContext.md#registerTable) |
 | `TEMPORARY_VIEW` | [Registers a view](GraphRegistrationContext.md#registerView) |
+| `SINK` | [Registers a sink](GraphRegistrationContext.md#registerSink) |
 
-For unknown types, `defineDataset` reports an `IllegalArgumentException`:
+??? warning "IllegalArgumentException"
+    For unknown types, `defineOutput` reports an `IllegalArgumentException`:
 
-```text
-Unknown dataset type: [type]
-```
+    ```text
+    Unknown output type: [type]
+    ```
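
A hypothetical sketch of the branching described above, over a simplified model of the output types (not the actual Spark source):

```scala
// Hypothetical, simplified model of the output types and the dispatch above.
sealed trait OutputType
case object MaterializedView extends OutputType
case object Table extends OutputType
case object TemporaryView extends OutputType
case object Sink extends OutputType
final case class Unknown(name: String) extends OutputType

def defineOutput(outputType: OutputType): Unit = outputType match {
  case MaterializedView | Table => println("register table") // GraphRegistrationContext.registerTable
  case TemporaryView            => println("register view")  // GraphRegistrationContext.registerView
  case Sink                     => println("register sink")  // GraphRegistrationContext.registerSink
  case Unknown(name) =>
    throw new IllegalArgumentException(s"Unknown output type: $name")
}
```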

 ## Define Flow { #defineFlow }
 
@@ -172,7 +184,7 @@ defineFlow(
 ```
 
 ??? note "DEFINE_FLOW Pipeline Command"
-    `defineFlow` is used to handle [DEFINE_FLOW](#DEFINE_FLOW).
+    `defineFlow` is used to handle the [DEFINE_FLOW](#DEFINE_FLOW) pipeline command.
 
 `defineFlow` looks up the [GraphRegistrationContext](DataflowGraphRegistry.md#getDataflowGraphOrThrow) for the given `flow` (or throws a `SparkException` if not found).
 
@@ -185,7 +197,7 @@ defineFlow(
 
 `defineFlow` [creates a flow identifier](GraphIdentifierManager.md#parseTableIdentifier) (for the `flow` name).
 
-??? note "AnalysisException"
+??? warning "AnalysisException"
     `defineFlow` reports an `AnalysisException` if the given `flow` is not an implicit flow, but is defined with a multi-part identifier.
 
 In the end, `defineFlow` [registers a flow](GraphRegistrationContext.md#registerFlow) (with a proper [FlowFunction](FlowAnalysis.md#createFlowFunctionFromLogicalPlan)).
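
The implicit-flow check above can be illustrated with a small, hypothetical sketch (helper name and message are illustrative; the real code reports an `AnalysisException`):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Hypothetical sketch: a user-defined (non-implicit) flow must use a
// single-part name; multi-part identifiers are rejected.
def validateFlowIdentifier(identifier: TableIdentifier, isImplicitFlow: Boolean): Unit = {
  val isMultiPart = identifier.catalog.isDefined || identifier.database.isDefined
  require(isImplicitFlow || !isMultiPart,
    s"Flow ${identifier.unquotedString} must not be defined with a multi-part identifier")
}
```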

docs/declarative-pipelines/Sink.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Sink
+
+`Sink` is an [extension](#contract) of the [GraphElement](GraphElement.md) and [Output](Output.md) abstractions for [pipeline sinks](#implementations) that can define their [write format](#format) and [options](#options).
+
+## Contract
+
+### Format { #format }
+
+```scala
+format: String
+```
+
+Used when:
+
+* `PipelinesHandler` is requested to [define a sink (output)](PipelinesHandler.md#defineOutput)
+* `SinkWrite` is requested to [start a stream](SinkWrite.md#startStream)
+
+### Options { #options }
+
+```scala
+options: Map[String, String]
+```
+
+Used when:
+
+* `PipelinesHandler` is requested to [define a sink (output)](PipelinesHandler.md#defineOutput)
+* `SinkWrite` is requested to [start a stream](SinkWrite.md#startStream)
+
+## Implementations
+
+* [SinkImpl](SinkImpl.md)

docs/declarative-pipelines/SinkImpl.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# SinkImpl
+
+`SinkImpl` is a [Sink](Sink.md).
+
+## Creating Instance
+
+`SinkImpl` takes the following to be created:
+
+* <span id="identifier"> [TableIdentifier](GraphElement.md#identifier)
+* <span id="format"> [Format](Sink.md#format)
+* <span id="options"> [Options](Sink.md#options)
+* <span id="origin"> [QueryOrigin](GraphElement.md#origin)
+
+`SinkImpl` is created when:
+
+* `PipelinesHandler` is requested to [defineOutput](PipelinesHandler.md#defineOutput)
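
A hypothetical, simplified model of the `Sink` contract and its `SinkImpl` implementation (stand-in `QueryOrigin`, `GraphElement`, and `Output` types; not the actual Spark source):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Hypothetical stand-ins for the real pipeline abstractions.
final case class QueryOrigin()  // placeholder for the real QueryOrigin
trait GraphElement { def identifier: TableIdentifier; def origin: QueryOrigin }
trait Output

// Simplified Sink contract: a write format plus writer options.
trait Sink extends GraphElement with Output {
  def format: String
  def options: Map[String, String]
}

// Simplified SinkImpl with the four constructor arguments listed above.
final case class SinkImpl(
    identifier: TableIdentifier,
    format: String,
    options: Map[String, String],
    origin: QueryOrigin) extends Sink

// Example: the kind of value defineOutput could conceptually build for a SINK output.
val eventsSink = SinkImpl(
  identifier = TableIdentifier("events_sink"),
  format = "kafka",
  options = Map("kafka.bootstrap.servers" -> "localhost:9092", "topic" -> "events"),
  origin = QueryOrigin())
```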

docs/declarative-pipelines/SinkWrite.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# SinkWrite
+
+`SinkWrite` is...FIXME
