3 changes: 2 additions & 1 deletion docs/.nav.yml
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@ nav:
- Hooks:
- Benchmark runtimes: guides/examples/benchmark_node_runtime.md
- Adopting Ordeq:
- Coming from Kedro: guides/kedro.md
- Coming from Kedro: guides/adopting/kedro.md
- Coming from Dagster: guides/adopting/dagster.md
- Integrations:
- Docker: guides/integrations/docker.md
- Marimo: guides/integrations/marimo.md
Expand Down
164 changes: 164 additions & 0 deletions docs/guides/adopting/dagster.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
# Coming from Dagster

This guide is for users familiar with Dagster who want to get started with Ordeq.

- Dagster is a full data orchestrator, while Ordeq focuses on streamlining data engineering tasks with a simpler, more flexible approach.
  This means Dagster ships additional built-in features, such as schedules, run tracking and sensors, that Ordeq deliberately omits.
  Instead, Ordeq integrates seamlessly with existing scheduling and monitoring tools, such as Dagster, Airflow and Kubeflow, allowing you to leverage your current infrastructure.
  Writing code in Ordeq is typically more straightforward and requires less boilerplate than concept-heavy Dagster, which reduces the learning curve and accelerates development.
  This difference also shows in the dependency footprint: Dagster pulls in 49 dependencies at the time of writing, while Ordeq has none.

- Ordeq tries to leverage native Python features as much as possible, while Dagster often requires using its own abstractions.
This means that in Ordeq you can use standard Python libraries and tools without needing to adapt them to Dagster's framework.
This results in more readable and maintainable code, as you can rely on familiar Python constructs.

## Assets vs IOs

- Dagster: function-based assets with [four different decorators](https://docs.dagster.io/guides/build/assets/defining-assets).
- Ordeq: a single `@node` decorator combined with IOs. IOs are classes, with attributes to hold state.
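To make the contrast concrete, here is a minimal sketch. The `JSONFile` class and `deduplicate` function below are illustrative, not part of Ordeq's API: an IO is an ordinary class that owns its state, and a node body is a plain Python function that Ordeq would wire to IOs via `@node`.

```python
import json
from dataclasses import dataclass
from pathlib import Path


# Hypothetical IO: an ordinary class whose attributes hold its state.
@dataclass
class JSONFile:
    path: Path

    def load(self) -> list:
        return json.loads(self.path.read_text())

    def save(self, data: list) -> None:
        self.path.write_text(json.dumps(data))


# A node body is a plain Python function; in Ordeq it would be wrapped
# with something like `@node(inputs=raw, outputs=clean)`.
def deduplicate(records: list[dict]) -> list[dict]:
    seen: set = set()
    out: list[dict] = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            out.append(record)
    return out
```

Because both pieces are plain Python, they can be imported, tested and reused without any framework context.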

## Example project

For this guide we use the "project_ml" example project from the Dagster repository, and show how the same functionality can be implemented in Ordeq.
The original Dagster example can be found [here](https://github.com/dagster-io/dagster/tree/master/examples/docs_projects/project_ml).

```text title="Dagster project structure"
.
├── pyproject.toml
├── src
│ └── project_ml
│ ├── __init__.py
│ ├── definitions.py
│ └── defs
│ ├── __init__.py
│ ├── asset_checks.py
│ ├── assets
│ │ ├── __init__.py
│ │ ├── data_assets.py
│ │ ├── model_assets.py
│ │ └── prediction_assets.py
│ ├── constants.py
│ ├── jobs.py
│ ├── resources.py
│ ├── schedules.py
│ ├── sensors.py
│ ├── types.py
│ └── utils.py
└── tests
├── __init__.py
├── conftest.py
├── test_data_assets.py
├── test_full_pipeline.py
└── test_model.py
```

```text title="Ordeq project structure"
.
├── data
│ └── 01_raw
├── pyproject.toml
├── src
│ └── project_ml
│ ├── __init__.py
│ ├── __main__.py
│ ├── catalog.py
│ ├── config
│ │ ├── __init__.py
│ │ ├── batch_prediction_config.py
│ │ ├── deployment_config.py
│ │ ├── model_config.py
│ │ ├── model_evaluation_config.py
│ │ └── real_time_prediction_config.py
│ ├── data
│ │ ├── __init__.py
│ │ ├── data_preprocessing.py
│ │ └── raw_data_loading.py
│ ├── deploy
│ │ ├── __init__.py
│ │ ├── deploy_model.py
│ │ └── predict.py
│ └── model
│ ├── __init__.py
│ ├── cnn_architecture.py
│ ├── digit_classifier.py
│ ├── model_evaluation.py
│ └── train_model.py
└── tests
```

Deviations from the original example:

- the model selection step is excluded for simplicity
- the data quality checks are left out of scope (see [Asset checks](#asset-checks) below)

## Context

Dagster uses a "context" object that has to be passed around as a quasi-global variable.
Ordeq avoids this pattern by letting nodes request IOs directly.
For example, metadata in Ordeq is just another IO containing the metadata, instead of a dedicated `context.add_output_metadata` call.
See [parametrizing nodes] for more details.
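As a sketch (the function name and fields are illustrative): instead of calling `context.add_output_metadata`, a node can simply return the metadata as an additional output, which Ordeq would bind to a metadata IO.

```python
def evaluate_model(
    predictions: list[int], labels: list[int]
) -> tuple[float, dict]:
    """Return the metric plus a metadata dict as a second output."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    accuracy = correct / len(labels)
    # Metadata is plain data flowing through the pipeline, not a side
    # effect on a global context object.
    metadata = {"n_samples": len(labels), "accuracy": accuracy}
    return accuracy, metadata
```

Since metadata is an ordinary return value, it is trivially unit-testable and visible in the pipeline graph.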

## Logging

In Dagster you would write:

```python
context.log.info(
f"User requested deployment of custom model: {config.custom_model_name}"
)
```

In most cases in Ordeq this becomes native Python logging:

```python
import logging

logger = logging.getLogger(__name__)

# (...)

logger.info(
f"User requested deployment of custom model: {config.custom_model_name}"
)
```

Only when you need advanced structured logging features would you use Ordeq's `Logger` IO.

## Configuration

Dagster requires users to subclass its own `dg.Config` objects, whereas in Ordeq you can use
native Python types for configuration: constants, files, dataclasses, Pydantic models, or whatever you prefer.
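For example, a training configuration can be a plain frozen dataclass (the fields below are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    epochs: int = 10
    learning_rate: float = 1e-3
    batch_size: int = 64


# Instantiate with overrides; no framework base class needed.
config = TrainingConfig(epochs=5)
```

Being frozen, the config is hashable and safe to share between nodes; any Python type with the same properties works equally well.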

## Attributes

```python
@dg.asset(
description="Evaluate model performance on test set",
group_name="model_pipeline",
required_resource_keys={"model_storage"},
deps=["digit_classifier"],
...
)
```

```python
@node(..., description="Evaluate model performance on test set")
```

The other attributes are not required: `group_name` is inferred from the module name, and `required_resource_keys` and `deps` are derived from the node inputs and outputs.

## Cloud integration

The Dagster example implements dedicated [resources](https://github.com/dagster-io/dagster/blob/master/examples/docs_projects/project_ml/src/project_ml/defs/resources.py) for S3 and the local file system.

In Ordeq you can use the same code for both local and S3 storage by leveraging the existing IOs
(see [Storage IOs] for more details).
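A minimal sketch of the idea (the `TextIO` class is hypothetical; Ordeq's actual storage IOs are documented under [Storage IOs]): one IO parameterized by a URI, so swapping a local path for an `s3://` URI changes the backend, not the pipeline code.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TextIO:
    """Hypothetical IO: one class for local and remote storage."""

    uri: str

    def load(self) -> str:
        if self.uri.startswith("s3://"):
            # A real implementation would dispatch to an S3 client here;
            # this sketch only implements the local backend.
            raise NotImplementedError("S3 backend omitted in this sketch")
        return Path(self.uri).read_text()

    def save(self, text: str) -> None:
        if self.uri.startswith("s3://"):
            raise NotImplementedError("S3 backend omitted in this sketch")
        Path(self.uri).write_text(text)
```

Nodes only see `load` and `save`, so moving a pipeline from local disk to S3 is a catalog change, not a code change.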

## Asset checks

Dagster has built-in [asset checks](https://github.com/dagster-io/dagster/blob/master/examples/docs_projects/project_ml/src/project_ml/defs/asset_checks.py).
In Ordeq you can implement similar functionality with nodes that validate data and raise an exception if a check fails.

We keep asset checks out of scope for this guide.
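Although out of scope here, the pattern is simple enough to sketch: a validation node is a plain function that raises when a check fails and passes the data through otherwise (the check below is illustrative).

```python
def check_class_balance(
    labels: list[int], min_fraction: float = 0.05
) -> list[int]:
    """Fail the pipeline if any class is severely under-represented."""
    total = len(labels)
    for cls in set(labels):
        fraction = labels.count(cls) / total
        if fraction < min_fraction:
            raise ValueError(
                f"class {cls} covers only {fraction:.1%} of the data"
            )
    # Returning the data lets downstream nodes depend on the check,
    # so the pipeline graph enforces that validation runs first.
    return labels
```
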
File renamed without changes.
10 changes: 10 additions & 0 deletions examples/project-ml/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# Machine learning with PyTorch

This project is based on an example from Dagster, to demonstrate what code with similar functionality looks like in both frameworks.
See also the original [pipeline guide](https://docs.dagster.io/examples/ml) and the [Dagster implementation](https://github.com/dagster-io/dagster/tree/master/examples/docs_projects/project_ml).

Simply run the entire project using the following `uv` command:

```shell
uv run src/project_ml
```
Empty file.
Empty file.
Empty file.
129 changes: 129 additions & 0 deletions examples/project-ml/pipeline_diagram.mermaid
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
graph TB
subgraph legend["Legend"]
direction TB
L0@{shape: rounded, label: "Node"}
L2@{shape: subroutine, label: "View"}
L00@{shape: rect, label: "IO"}
L01@{shape: rect, label: "Literal"}
L02@{shape: rect, label: "MatplotlibFigure"}
L03@{shape: rect, label: "Pickle"}
end

project_ml.data.raw_data_loading:raw_mnist_test_data --> project_ml.data.data_preprocessing:processed_mnist_test_data
project_ml.data.raw_data_loading:raw_mnist_train_data --> project_ml.data.data_preprocessing:processed_mnist_train_data
IO0 --> project_ml.data.data_preprocessing:processed_mnist_train_data
IO1 --> project_ml.data.data_preprocessing:processed_mnist_train_data
IO2 --> project_ml.data.raw_data_loading:raw_mnist_test_data
project_ml.data.raw_data_loading:transform --> project_ml.data.raw_data_loading:raw_mnist_test_data
IO3 --> project_ml.data.raw_data_loading:raw_mnist_train_data
project_ml.data.raw_data_loading:transform --> project_ml.data.raw_data_loading:raw_mnist_train_data
IO4 --> project_ml.data.raw_data_loading:transform
project_ml.deploy.predict:dummy_images --> project_ml.deploy.predict:batch_digit_predictions
IO5 --> project_ml.deploy.predict:batch_digit_predictions
IO6 --> project_ml.deploy.predict:batch_digit_predictions
project_ml.deploy.predict:inference_device --> project_ml.deploy.predict:batch_digit_predictions
project_ml.deploy.predict:batch_digit_predictions --> IO7
project_ml.deploy.predict:batch_digit_predictions --> IO8
project_ml.deploy.predict:dummy_batch --> project_ml.deploy.predict:digit_predictions
IO5 --> project_ml.deploy.predict:digit_predictions
IO9 --> project_ml.deploy.predict:digit_predictions
project_ml.deploy.predict:inference_device --> project_ml.deploy.predict:digit_predictions
project_ml.deploy.predict:digit_predictions --> IO10
project_ml.deploy.predict:digit_predictions --> IO11
IO9 --> project_ml.deploy.predict:dummy_batch
IO6 --> project_ml.deploy.predict:dummy_images
IO6 --> project_ml.deploy.predict:inference_device
IO5 --> project_ml.model.model_evaluation:model_evaluation
project_ml.model.model_evaluation:test_loader --> project_ml.model.model_evaluation:model_evaluation
project_ml.model.train_model:training_device --> project_ml.model.model_evaluation:model_evaluation
project_ml.model.model_evaluation:model_evaluation --> IO12
project_ml.model.model_evaluation:model_evaluation --> IO13
project_ml.model.model_evaluation:model_evaluation --> IO14
project_ml.data.data_preprocessing:processed_mnist_test_data --> project_ml.model.model_evaluation:test_loader
IO15 --> project_ml.model.model_evaluation:test_loader
IO16 --> project_ml.model.train_model:optimizer
project_ml.model.train_model:untrained_model --> project_ml.model.train_model:optimizer
IO16 --> project_ml.model.train_model:scheduler
project_ml.model.train_model:optimizer --> project_ml.model.train_model:scheduler
project_ml.data.data_preprocessing:processed_mnist_train_data --> project_ml.model.train_model:train_loader
IO16 --> project_ml.model.train_model:train_loader
project_ml.model.train_model:train_loader --> project_ml.model.train_model:train_model
project_ml.model.train_model:val_loader --> project_ml.model.train_model:train_model
IO16 --> project_ml.model.train_model:train_model
project_ml.model.train_model:training_device --> project_ml.model.train_model:train_model
project_ml.model.train_model:untrained_model --> project_ml.model.train_model:train_model
project_ml.model.train_model:optimizer --> project_ml.model.train_model:train_model
project_ml.model.train_model:scheduler --> project_ml.model.train_model:train_model
project_ml.model.train_model:train_model --> IO5
project_ml.model.train_model:train_model --> IO17
IO18 --> project_ml.model.train_model:untrained_model
project_ml.data.data_preprocessing:processed_mnist_train_data --> project_ml.model.train_model:val_loader
IO16 --> project_ml.model.train_model:val_loader

subgraph s0["project_ml.data.data_preprocessing"]
direction TB
project_ml.data.data_preprocessing:processed_mnist_test_data@{shape: subroutine, label: "processed_mnist_test_data"}
project_ml.data.data_preprocessing:processed_mnist_train_data@{shape: subroutine, label: "processed_mnist_train_data"}
end
subgraph s1["project_ml.data.raw_data_loading"]
direction TB
project_ml.data.raw_data_loading:raw_mnist_test_data@{shape: subroutine, label: "raw_mnist_test_data"}
project_ml.data.raw_data_loading:raw_mnist_train_data@{shape: subroutine, label: "raw_mnist_train_data"}
project_ml.data.raw_data_loading:transform@{shape: subroutine, label: "transform"}
end
subgraph s2["project_ml.deploy.predict"]
direction TB
project_ml.deploy.predict:batch_digit_predictions@{shape: rounded, label: "batch_digit_predictions"}
project_ml.deploy.predict:digit_predictions@{shape: rounded, label: "digit_predictions"}
project_ml.deploy.predict:dummy_batch@{shape: subroutine, label: "dummy_batch"}
project_ml.deploy.predict:dummy_images@{shape: subroutine, label: "dummy_images"}
project_ml.deploy.predict:inference_device@{shape: subroutine, label: "inference_device"}
end
subgraph s3["project_ml.model.model_evaluation"]
direction TB
project_ml.model.model_evaluation:model_evaluation@{shape: rounded, label: "model_evaluation"}
project_ml.model.model_evaluation:test_loader@{shape: subroutine, label: "test_loader"}
end
subgraph s4["project_ml.model.train_model"]
direction TB
project_ml.model.train_model:optimizer@{shape: subroutine, label: "optimizer"}
project_ml.model.train_model:scheduler@{shape: subroutine, label: "scheduler"}
project_ml.model.train_model:train_loader@{shape: subroutine, label: "train_loader"}
project_ml.model.train_model:train_model@{shape: rounded, label: "train_model"}
project_ml.model.train_model:training_device@{shape: subroutine, label: "training_device"}
project_ml.model.train_model:untrained_model@{shape: subroutine, label: "untrained_model"}
project_ml.model.train_model:val_loader@{shape: subroutine, label: "val_loader"}
end
IO0@{shape: rect, label: "validation_split"}
IO1@{shape: rect, label: "random_seed"}
IO10@{shape: rect, label: "real_time_predictions"}
IO11@{shape: rect, label: "real_time_prediction_metadata"}
IO12@{shape: rect, label: "model_evaluation_result"}
IO13@{shape: rect, label: "confusion_matrix"}
IO14@{shape: rect, label: "model_evaluation_metadata"}
IO15@{shape: rect, label: "model_evaluation_config"}
IO16@{shape: rect, label: "training_config"}
IO17@{shape: rect, label: "training_metadata"}
IO18@{shape: rect, label: "model_config"}
IO2@{shape: rect, label: "test_dataset"}
IO3@{shape: rect, label: "train_dataset"}
IO4@{shape: rect, label: "mnist_moments"}
IO5@{shape: rect, label: "production_model"}
IO6@{shape: rect, label: "batch_prediction_config"}
IO7@{shape: rect, label: "batch_predictions"}
IO8@{shape: rect, label: "batch_prediction_metadata"}
IO9@{shape: rect, label: "real_time_prediction_config"}

class L0,project_ml.deploy.predict:batch_digit_predictions,project_ml.deploy.predict:digit_predictions,project_ml.model.model_evaluation:model_evaluation,project_ml.model.train_model:train_model node
class L2,project_ml.data.data_preprocessing:processed_mnist_test_data,project_ml.data.data_preprocessing:processed_mnist_train_data,project_ml.data.raw_data_loading:raw_mnist_test_data,project_ml.data.raw_data_loading:raw_mnist_train_data,project_ml.data.raw_data_loading:transform,project_ml.deploy.predict:dummy_batch,project_ml.deploy.predict:dummy_images,project_ml.deploy.predict:inference_device,project_ml.model.model_evaluation:test_loader,project_ml.model.train_model:optimizer,project_ml.model.train_model:scheduler,project_ml.model.train_model:train_loader,project_ml.model.train_model:training_device,project_ml.model.train_model:untrained_model,project_ml.model.train_model:val_loader view
class L00,IO10,IO11,IO12,IO14,IO17,IO7,IO8 io0
class L01,IO0,IO1,IO15,IO16,IO18,IO2,IO3,IO4,IO6,IO9 io1
class L02,IO13 io2
class L03,IO5 io3
classDef node fill:#008AD7,color:#FFF
classDef io fill:#FFD43B
classDef view fill:#00C853,color:#FFF
classDef io0 fill:#66c2a5
classDef io1 fill:#fc8d62
classDef io2 fill:#8da0cb
classDef io3 fill:#e78ac3
18 changes: 18 additions & 0 deletions examples/project-ml/pyproject.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
[project]
name = "project_ml"
version = "0.1.0"
description = "Example ML project using Ordeq"
requires-python = ">=3.10"
dependencies = [
"ordeq",
"ordeq-viz",
"ordeq-matplotlib",
"numpy>=2.0.2",
"scikit-learn>=1.6.1",
"seaborn>=0.13.2",
"torch>=2.8.0",
"torchvision",
]

[tool.ruff.lint]
extend-ignore = ["G004"] # f-strings in logging, coming from the upstream example code
Empty file.
21 changes: 21 additions & 0 deletions examples/project-ml/src/project_ml/__main__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
import logging
from pathlib import Path

from ordeq_viz import viz

from project_ml import catalog, data, deploy, model

ROOT_PATH = Path(__file__).parent.parent.parent

logging.basicConfig(level=logging.INFO)

if __name__ == "__main__":
pipeline = {data, model, deploy}
viz(
*pipeline,
catalog,
fmt="mermaid",
output=ROOT_PATH / "pipeline_diagram.mermaid",
subgraphs=True,
)
# run(*pipeline)