data-sleuth

An Ethereum blockchain data engineering toolkit for extracting, transforming, and loading EVM blockchain data from various sources into analytics-ready tables in PostgreSQL.

The toolkit focuses on:

  • Multi-source data ingestion from blockchain APIs (Etherscan, DeFiLlama, CoinGecko, etc.)
  • Rate-limited API client management with robust error handling
  • Pipeline orchestration using modern data tooling (dlt, dbt)
  • PostgreSQL-based data warehousing with staging transformations

Project Structure

data-sleuth/
├── src/data_sleuth/          # Core Python package
│   ├── config/              # Settings and configuration
│   ├── core/                # Base classes and utilities
│   ├── datasource/          # API client implementations
│   ├── decoder/             # Event decoding logic
│   ├── dataloader/          # Pipeline management
│   └── utils/               # Helper functions
├── dbt_subprojects/         # dbt transformation projects
├── scripts/                 # Data loading and utility scripts
├── resource/                # ABI files and configuration
└── tests/                   # Test suite

Architecture

The project follows a modular, layered architecture designed for scalability and maintainability:

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                        │
├─────────────────────────────────────────────────────────────┤
│  Scripts & CLI Tools                                        │
│  • decoder.py - decoding utilities                          │
│  • load_crvusd.py - data loading script                     │
│  • run_dbt.sh - dbt pipeline execution                      │
├─────────────────────────────────────────────────────────────┤
│                    Core Components                          │
├─────────────────────────────────────────────────────────────┤
│  • Decoder - Smart contract event/transaction parsing       │
│  • PipelineManager - dlt pipeline orchestration             │
│  • Rate Limiter - API call management                       │
├─────────────────────────────────────────────────────────────┤
│                   Data Sources                              │
├─────────────────────────────────────────────────────────────┤
│  • EtherscanClient - Blockchain data extraction             │
│  • DeFiLlamaClient - DeFi protocol data                     │
│  • BaseAPIClient - Extensible API framework                 │
├─────────────────────────────────────────────────────────────┤
│                  Infrastructure                             │
├─────────────────────────────────────────────────────────────┤
│  • PostgreSQL - Data warehouse                              │
│  • dlt - Data loading & transformation                      │
│  • dbt - Data modeling & testing                            │
└─────────────────────────────────────────────────────────────┘

Key Design Patterns:

  • Strategy Pattern - Multiple decoding strategies for different types of data
  • Factory Pattern - Dynamic strategy creation based on requirements
  • Base Class Inheritance - Consistent API client interfaces
  • Configuration Management - Centralized settings with environment variable support
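
A minimal sketch of how these patterns could fit together for decoding; the class and function names below (DecodingStrategy, EventLogStrategy, get_strategy) are illustrative, not the package's actual API:

from abc import ABC, abstractmethod


class DecodingStrategy(ABC):
    """Common interface every decoding strategy implements."""

    @abstractmethod
    def decode(self, raw: dict) -> dict: ...


class EventLogStrategy(DecodingStrategy):
    def decode(self, raw: dict) -> dict:
        # Parse indexed topics and the data field against an ABI.
        return {"topic0": raw.get("topics", [None])[0], "data": raw.get("data")}


class TransactionStrategy(DecodingStrategy):
    def decode(self, raw: dict) -> dict:
        # Decode calldata into a function name and arguments.
        return {"input": raw.get("input")}


def get_strategy(kind: str) -> DecodingStrategy:
    """Factory: choose a strategy based on the kind of data being decoded."""
    strategies = {"log": EventLogStrategy, "transaction": TransactionStrategy}
    return strategies[kind]()


decoded = get_strategy("log").decode({"topics": ["0x..."], "data": "0x"})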

Data Sources

Blockchain APIs:

  • Etherscan - Contract logs, transactions, ABI retrieval
  • DeFiLlama - Coin metadata, yield pools, protocol revenue, ...
  • CoinGecko - Token price data and market information
  • ... (more to come)
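
Each client builds on a shared base class; below is a rough sketch of that pattern, querying Etherscan's public getLogs endpoint (the real BaseAPIClient and EtherscanClient signatures may differ, and ETHERSCAN_API_KEY is an assumed variable name):

import os
import requests


class BaseAPIClient:
    """Shared request plumbing for all data-source clients (sketch)."""

    def __init__(self, base_url: str, api_key: str | None = None):
        self.base_url = base_url
        self.api_key = api_key

    def get(self, params: dict) -> dict:
        response = requests.get(self.base_url, params=params, timeout=30)
        response.raise_for_status()
        return response.json()


class EtherscanClient(BaseAPIClient):
    """Fetches contract event logs via Etherscan's getLogs action."""

    def __init__(self, api_key: str):
        super().__init__("https://api.etherscan.io/api", api_key)

    def get_logs(self, address: str, from_block: int, to_block: int) -> list[dict]:
        params = {
            "module": "logs",
            "action": "getLogs",
            "address": address,
            "fromBlock": from_block,
            "toBlock": to_block,
            "apikey": self.api_key,
        }
        return self.get(params).get("result", [])


client = EtherscanClient(os.environ["ETHERSCAN_API_KEY"])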

Extracted Data Types:

  • Smart contract event logs with indexed and non-indexed parameters
  • Transaction metadata (gas usage, timestamps, block numbers)
  • Contract ABIs for decoding
  • DeFi protocol metrics and historical data

Rate Limiting & Error Handling:

  • Configurable rate limits per API endpoint
  • Backoff for failed requests
  • Error logging and retry mechanisms
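
As a rough illustration of the retry behavior (not the package's actual implementation), a failed request can be retried with exponential backoff like this:

import time
import requests


def get_with_retry(url: str, params: dict, max_retries: int = 5, base_delay: float = 1.0) -> dict:
    """Retry a GET request with exponential backoff between attempts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            delay = base_delay * 2 ** attempt
            print(f"Request failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")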

Data Pipeline

The data pipeline follows a modern ELT (Extract, Load, Transform) approach:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Extract   │───▶│    Load     │───▶│  Transform  │───▶│   Analyze   │
│             │    │             │    │             │    │             │
│ • API Calls │    │ • dlt       │    │ • dbt       │    │ • PostgreSQL│
│ • Rate      │    │ • PostgreSQL│    │ • Staging   │    │ • Analytics │
│   Limiting  │    │ • Raw Data  │    │ • Models    │    │ • Dashboards│
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Data Flow:

  1. Raw Data Ingestion - Contract logs, transactions, metadata, ...
  2. Event/Transaction Decoding - ABI-based parsing with multiple strategies
  3. Staging Transformations - Data cleaning, type casting
  4. Analytics Tables - Business-ready models
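
For orientation, loading raw data into PostgreSQL with dlt looks roughly like the sketch below; the pipeline, dataset, and table names are placeholders rather than the ones the project's scripts actually use:

import dlt

# Placeholder names; connection credentials come from dlt's config/secrets.
pipeline = dlt.pipeline(
    pipeline_name="data_sleuth",
    destination="postgres",
    dataset_name="raw",
)

raw_logs = [
    {"block_number": 19000000, "topics": ["0x..."], "data": "0x"},
]

# dlt infers the schema and loads the rows into Postgres.
load_info = pipeline.run(raw_logs, table_name="contract_logs")
print(load_info)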

Data Transformation

dbt Models Structure:

staging/
├── decoded_logs.sql          # Event log parsing and indexing
├── schema.yml                # Model documentation and tests
└── sources.yml               # Source table definitions

Key Transformations:

  • Event Decoding - Convert hex topics to readable event names
  • Data Type Casting - Timestamp conversion, numeric formatting
  • Indexing - Optimized database indexes for common query patterns
  • Data Validation - Schema tests and data quality checks
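
The hex-to-name mapping works because a log's first topic is the keccak-256 hash of the canonical event signature. A quick way to build that lookup in Python (using eth_utils as an assumed dependency; this is not the dbt logic itself):

from eth_utils import keccak

# topic0 of a log is keccak256 of the canonical event signature.
signatures = [
    "Transfer(address,address,uint256)",
    "Approval(address,address,uint256)",
]

topic_to_name = {
    "0x" + keccak(text=sig).hex(): sig.split("(")[0] for sig in signatures
}

# e.g. 0xddf252ad... -> "Transfer"
print(topic_to_name)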

Getting Started

Prerequisites:

  • Python 3.11+
  • PostgreSQL 15+
  • Docker & Docker Compose

Quick Start:

# Clone repository
git clone https://github.com/newgnart/data-sleuth.git
cd data-sleuth

# Install dependencies
uv sync

# Start PostgreSQL
docker-compose up -d

# Set environment variables (copy the example and fill in your values)
cp .env.example .env

# Run a data pipeline, for example:
python scripts/stables/load_crvusd.py

# Execute dbt transformations
cd dbt_subprojects/staging
dbt run

License

[License information to be added]
