feat: Add `SqlStorageClient` based on `sqlalchemy` v2+ #1339

Mantisus · 2025-08-01T21:21:45Z

Description

Add SQLStorageClient which can accept a database connection string or a pre-configured AsyncEngine, or creates a default crawlee.db database in Configuration.storage_dir.

Issues

Closes: Add support for SQLite storage client #307

Copilot

Pull Request Overview

This PR implements a new SQL-based storage client (SQLStorageClient) that provides persistent data storage using SQLAlchemy v2+ for datasets, key-value stores, and request queues.

Key changes:

Adds SQLStorageClient with support for connection strings, pre-configured engines, or default SQLite database
Implements SQL-based clients for all three storage types with database schema management and transaction handling
Updates storage model configurations to support SQLAlchemy ORM mapping with from_attributes=True

Reviewed Changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/crawlee/storage_clients/_sql/`	New SQL storage implementation with database models, clients, and schema management
`tests/unit/storage_clients/_sql/`	Comprehensive test suite for SQL storage functionality
`tests/unit/storages/`	Updates to test fixtures to include SQL storage client testing
`src/crawlee/storage_clients/models.py`	Adds `from_attributes=True` to model configs for SQLAlchemy ORM compatibility
`pyproject.toml`	Adds new `sql` optional dependency group
`src/crawlee/storage_clients/__init__.py`	Adds conditional import for SQLStorageClient

Comments suppressed due to low confidence (1)

tests/unit/storages/test_request_queue.py:23

The test fixture only tests 'sql' storage client, but the removed 'memory' and 'file_system' parameters suggest this may have unintentionally reduced test coverage. Consider including all storage client types to ensure comprehensive testing.

@pytest.fixture(params=['sql'])

src/crawlee/storage_clients/_sql/_request_queue_client.py

Co-authored-by: Copilot <[email protected]>

Mantisus · 2025-08-01T22:11:03Z

When implementing, I opted out of SQLModel for several reasons:

Poor library support. As of today, SQLModel has a huge number of PRs and update requests, some of which are several years old. The latest releases have been mostly cosmetic (updating dependencies, documentation, builds, and checks, etc.).
Model hierarchy issue: if we use SQLModel, it's expected that we'll inherit existing Pydantic models from it. This greatly increases base dependencies (SQLModel, SQLAlchemy, aiosqlite). I don't think we should do this (see the last point).
It doesn't support optimization constraints for database tables, such as string length limits.
Poor typing when using anything other than select - Add an overload to the exec method with _Executable statement for update and delete statements fastapi/sqlmodel#909.
Overall, we can achieve the same behavior using only SQLAlchemy v2+ — https://docs.sqlalchemy.org/en/20/orm/dataclasses.html#integrating-with-alternate-dataclass-providers-such-as-pydantic. However, this retains the inheritance hierarchy and dependency issue.
I think that data models for SQL can be simpler while being better adapted for SQL than the models used in the framework. This way, we can optimize each data model for its task.

Mantisus · 2025-08-01T22:16:12Z

The storage client has been repeatedly tested with SQLLite and a local PostgreSQL (a simple container installation without fine-tuning).
Сode for testing

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SQLStorageClient
from crawlee.storages import RequestQueue, KeyValueStore
from crawlee import service_locator
from crawlee import ConcurrencySettings


LOCAL_POSTGRE = None  # 'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres'
USE_STATE = True
KVS = True
DATASET = True
CRAWLERS = 1
REQUESTS = 10000
DROP_STORAGES = True


async def main() -> None:
    service_locator.set_storage_client(
        SQLStorageClient(
            connection_string=LOCAL_POSTGRE if LOCAL_POSTGRE else None,
        )
    )

    kvs = await KeyValueStore.open()
    queue_1 = await RequestQueue.open(name='test_queue_1')
    queue_2 = await RequestQueue.open(name='test_queue_2')
    queue_3 = await RequestQueue.open(name='test_queue_3')

    urls = [f'https://crawlee.dev/page/{i}' for i in range(REQUESTS)]

    await queue_1.add_requests(urls)
    await queue_2.add_requests(urls)
    await queue_3.add_requests(urls)

    crawler_1 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_1)
    crawler_2 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_2)
    crawler_3 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_3)

    # Define the default request handler
    @crawler_1.router.default_handler
    @crawler_2.router.default_handler
    @crawler_3.router.default_handler
    async def request_handler(context: BasicCrawlingContext) -> None:
        if USE_STATE:
            # Use state to store data
            state_data = await context.use_state()
            state_data['a'] = context.request.url

        if KVS:
            # Use KeyValueStore to store data
            await kvs.set_value(context.request.url, {'url': context.request.url, 'title': 'Example Title'})
        if DATASET:
            await context.push_data({'url': context.request.url, 'title': 'Example Title'})

    crawlers = [crawler_1]
    if CRAWLERS > 1:
        crawlers.append(crawler_2)
    if CRAWLERS > 2:
        crawlers.append(crawler_3)

    # Run the crawler
    data = await asyncio.gather(*[crawler.run() for crawler in crawlers])

    print(data)

    if DROP_STORAGES:
        # Drop all storages
        await queue_1.drop()
        await queue_2.drop()
        await queue_3.drop()
        await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

This allows you to load work with storage without real requests.

Mantisus · 2025-08-01T22:20:41Z

The use of accessed_modified_update_interval is related to optimization. Frequent updates to metadata just to change the access time can overload the database.

Pijukatel

First part review. I will do RQ and tests in second part.

I have only minor comments. My main suggestion is to extract more code that is shared in all 3 clients. It is easier to understand all the clients once the reader easily knows which part of the code is exactly the same in all clients and which part of the code is unique and specific to the client. It also makes it easier to maintain the code.

Drawback would be that understanding just one class in the isolation would be little bit harder. But who wants to understand just one client?