feat: Add SqlStorageClient based on sqlalchemy v2+ #1339
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
from crawlee.crawlers import ParselCrawler | ||
from crawlee.storage_clients import SqlStorageClient | ||
|
||
|
||
async def main() -> None: | ||
# Create a new instance of storage client. | ||
    # This creates an SQLite database file `crawlee.db`, or creates tables in your | ||
    # database if you pass `connection_string` or `engine`. | ||
# Use the context manager to ensure that connections are properly cleaned up. | ||
async with SqlStorageClient() as storage_client: | ||
# And pass it to the crawler. | ||
crawler = ParselCrawler(storage_client=storage_client) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
from sqlalchemy.ext.asyncio import create_async_engine | ||
|
||
from crawlee.configuration import Configuration | ||
from crawlee.crawlers import ParselCrawler | ||
from crawlee.storage_clients import SqlStorageClient | ||
|
||
|
||
async def main() -> None: | ||
# Create a new instance of storage client. | ||
# On first run, also creates tables in your PostgreSQL database. | ||
# Use the context manager to ensure that connections are properly cleaned up. | ||
async with SqlStorageClient( | ||
# Create an `engine` with the desired configuration | ||
engine=create_async_engine( | ||
'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres', | ||
future=True, | ||
pool_size=5, | ||
max_overflow=10, | ||
pool_recycle=3600, | ||
pool_pre_ping=True, | ||
echo=False, | ||
) | ||
) as storage_client: | ||
# Create a configuration with custom settings. | ||
configuration = Configuration( | ||
purge_on_start=False, | ||
) | ||
|
||
# And pass them to the crawler. | ||
crawler = ParselCrawler( | ||
storage_client=storage_client, | ||
configuration=configuration, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,12 +8,15 @@ import ApiLink from '@site/src/components/ApiLink'; | |
import Tabs from '@theme/Tabs'; | ||
import TabItem from '@theme/TabItem'; | ||
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; | ||
import CodeBlock from '@theme/CodeBlock'; | ||
|
||
import MemoryStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/memory_storage_client_basic_example.py'; | ||
import FileSystemStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/file_system_storage_client_basic_example.py'; | ||
import FileSystemStorageClientConfigurationExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/file_system_storage_client_configuration_example.py'; | ||
import CustomStorageClientExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/custom_storage_client_example.py'; | ||
import RegisteringStorageClientsExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/registering_storage_clients_example.py'; | ||
import SQLStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/sql_storage_client_basic_example.py'; | ||
import SQLStorageClientConfigurationExample from '!!raw-loader!./code_examples/storage_clients/sql_storage_client_configuration_example.py'; | ||
|
||
Storage clients provide a unified interface for interacting with <ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, regardless of the underlying implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup. This abstraction makes it easy to switch between different environments, such as local development and cloud production setups. | ||
|
||
|
@@ -23,6 +26,7 @@ Crawlee provides three main storage client implementations: | |
|
||
- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> - Provides persistent file system storage with in-memory caching. | ||
- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> - Stores data in memory with no persistence. | ||
- <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> - Provides persistent storage using a SQL database ([SQLite](https://sqlite.org/) or [PostgreSQL](https://www.postgresql.org/)). Requires installing the extra dependency: `crawlee[sql_sqlite]` for SQLite or `crawlee[sql_postgres]` for PostgreSQL. | ||
- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient) - Manages storage on the [Apify platform](https://apify.com), implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python). | ||
|
||
```mermaid | ||
|
@@ -50,6 +54,8 @@ class FileSystemStorageClient | |
|
||
class MemoryStorageClient | ||
|
||
class SqlStorageClient | ||
|
||
class ApifyStorageClient | ||
|
||
%% ======================== | ||
|
@@ -58,6 +64,7 @@ class ApifyStorageClient | |
|
||
StorageClient --|> FileSystemStorageClient | ||
StorageClient --|> MemoryStorageClient | ||
StorageClient --|> SqlStorageClient | ||
StorageClient --|> ApifyStorageClient | ||
``` | ||
|
||
|
@@ -125,6 +132,183 @@ The `MemoryStorageClient` does not persist data between runs. All data is lost w | |
{MemoryStorageClientBasicExample} | ||
</RunnableCodeBlock> | ||
|
||
### SQL storage client | ||
|
||
:::warning Experimental feature | ||
The `SqlStorageClient` is experimental. Its API and behavior may change in future releases. | ||
::: | ||
|
||
The <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> provides persistent storage using a SQL database (SQLite by default, or PostgreSQL). It supports all Crawlee storage types and enables concurrent access from multiple independent clients or processes. | ||
|
||
:::note Dependencies | ||
The <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> is not included in the core Crawlee package. | ||
To use it, you need to install Crawlee with the appropriate extra dependency: | ||
|
||
- For SQLite support, run: | ||
<code>pip install 'crawlee[sql_sqlite]'</code> | ||
- For PostgreSQL support, run: | ||
<code>pip install 'crawlee[sql_postgres]'</code> | ||
::: | ||
|
||
Review comment: Please add some context before the code snippet. |
||
By default, <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> uses SQLite. | ||
To use PostgreSQL instead, provide a PostgreSQL connection string via the `connection_string` parameter. No other code changes are needed, because the same client works for both databases. | ||
|
||
<RunnableCodeBlock className="language-python" language="python"> | ||
{SQLStorageClientBasicExample} | ||
</RunnableCodeBlock> | ||
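
To illustrate the point that only the connection string changes between backends, the sketch below inspects the `dialect+driver` prefix of a SQLAlchemy-style URL using only the standard library. The helper name `split_backend` is hypothetical and is not part of Crawlee or SQLAlchemy; the URLs are placeholders.

```python
from __future__ import annotations

from urllib.parse import urlsplit


def split_backend(connection_string: str) -> tuple[str, str | None]:
    """Return (dialect, driver) parsed from a SQLAlchemy-style URL scheme.

    Hypothetical helper for illustration only; not part of Crawlee.
    """
    scheme = urlsplit(connection_string).scheme
    # The scheme has the form "dialect+driver"; the driver part is optional.
    dialect, _, driver = scheme.partition('+')
    return dialect, driver or None


# The same calling code can be pointed at either backend.
sqlite_backend = split_backend('sqlite+aiosqlite:///my.db')
postgres_backend = split_backend('postgresql+asyncpg://user:pass@host/db')
```

Either string would be passed unchanged as `connection_string`; nothing else in the crawler setup needs to differ.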
vdusek marked this conversation as resolved.
|
||
|
||
Data is organized in relational tables. Below are the main tables and columns used for each storage type: | ||
|
||
```mermaid | ||
--- | ||
config: | ||
class: | ||
hideEmptyMembersBox: true | ||
--- | ||
|
||
classDiagram | ||
|
||
%% ======================== | ||
%% Storage Clients | ||
%% ======================== | ||
|
||
class SqlDatasetClient { | ||
<<Dataset>> | ||
} | ||
|
||
class SqlKeyValueStoreClient { | ||
<<Key-value store>> | ||
} | ||
|
||
%% ======================== | ||
%% Dataset Tables | ||
%% ======================== | ||
|
||
class datasets { | ||
<<table>> | ||
+ id (PK) | ||
+ name | ||
+ accessed_at | ||
+ created_at | ||
+ modified_at | ||
+ item_count | ||
} | ||
|
||
class dataset_records { | ||
<<table>> | ||
+ order_id (PK) | ||
+ metadata_id (FK) | ||
+ data | ||
} | ||
|
||
%% ======================== | ||
%% Key-Value Store Tables | ||
%% ======================== | ||
|
||
class key_value_stores { | ||
<<table>> | ||
+ id (PK) | ||
+ name | ||
+ accessed_at | ||
+ created_at | ||
+ modified_at | ||
} | ||
|
||
class key_value_store_records { | ||
<<table>> | ||
+ metadata_id (FK, PK) | ||
+ key (PK) | ||
+ value | ||
+ content_type | ||
+ size | ||
} | ||
|
||
%% ======================== | ||
%% Client to Table arrows | ||
%% ======================== | ||
|
||
SqlDatasetClient --> datasets | ||
SqlDatasetClient --> dataset_records | ||
|
||
SqlKeyValueStoreClient --> key_value_stores | ||
SqlKeyValueStoreClient --> key_value_store_records | ||
``` | ||
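
To make the dataset tables in the diagram concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names come from the diagram; the column types, constraints, JSON encoding of `data`, and the helper functions are assumptions for illustration and may differ from the schema that `SqlStorageClient` actually creates.

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript(
    """
    CREATE TABLE datasets (
        id TEXT PRIMARY KEY,
        name TEXT,
        accessed_at TEXT,
        created_at TEXT,
        modified_at TEXT,
        item_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE dataset_records (
        order_id INTEGER PRIMARY KEY AUTOINCREMENT,
        metadata_id TEXT NOT NULL REFERENCES datasets (id),
        data TEXT NOT NULL
    );
    """
)


def create_dataset(dataset_id: str, name: str) -> None:
    """Insert the metadata row that records will point to."""
    with conn:
        conn.execute('INSERT INTO datasets (id, name) VALUES (?, ?)', (dataset_id, name))


def push_item(dataset_id: str, item: dict) -> None:
    """Append one record and bump the item counter, committed together."""
    with conn:
        conn.execute(
            'INSERT INTO dataset_records (metadata_id, data) VALUES (?, ?)',
            (dataset_id, json.dumps(item)),
        )
        conn.execute(
            'UPDATE datasets SET item_count = item_count + 1 WHERE id = ?',
            (dataset_id,),
        )


def get_items(dataset_id: str) -> list:
    """Return items in insertion order, using order_id as the sort key."""
    rows = conn.execute(
        'SELECT data FROM dataset_records WHERE metadata_id = ? ORDER BY order_id',
        (dataset_id,),
    )
    return [json.loads(row[0]) for row in rows]
```

In this sketch, `metadata_id` ties each record to its row in `datasets`, and the auto-incrementing `order_id` preserves the order in which items were pushed.
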
```mermaid | ||
--- | ||
config: | ||
class: | ||
hideEmptyMembersBox: true | ||
--- | ||
|
||
classDiagram | ||
|
||
%% ======================== | ||
%% Storage Clients | ||
%% ======================== | ||
|
||
class SqlRequestQueueClient { | ||
<<Request queue>> | ||
} | ||
|
||
%% ======================== | ||
%% Request Queue Tables | ||
%% ======================== | ||
|
||
class request_queues { | ||
<<table>> | ||
+ id (PK) | ||
+ name | ||
+ accessed_at | ||
+ created_at | ||
+ modified_at | ||
+ had_multiple_clients | ||
+ handled_request_count | ||
+ pending_request_count | ||
+ total_request_count | ||
} | ||
|
||
class request_queue_records { | ||
<<table>> | ||
+ request_id (PK) | ||
+ metadata_id (FK, PK) | ||
+ data | ||
+ sequence_number | ||
+ is_handled | ||
+ time_blocked_until | ||
} | ||
|
||
class request_queue_state { | ||
<<table>> | ||
+ metadata_id (FK, PK) | ||
+ sequence_counter | ||
+ forefront_sequence_counter | ||
} | ||
|
||
%% ======================== | ||
%% Client to Table arrows | ||
%% ======================== | ||
|
||
SqlRequestQueueClient --> request_queues | ||
SqlRequestQueueClient --> request_queue_records | ||
SqlRequestQueueClient --> request_queue_state | ||
``` | ||
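
The `sequence_number`, `is_handled`, and `time_blocked_until` columns hint at how a queue can hand out requests to several clients. The sketch below shows one possible approach with the built-in `sqlite3` module. The column types, the blocking interval, and the selection query are assumptions for illustration, not the queries that `SqlRequestQueueClient` actually runs.

```python
from __future__ import annotations

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    """
    CREATE TABLE request_queue_records (
        request_id TEXT NOT NULL,
        metadata_id TEXT NOT NULL,
        data TEXT NOT NULL,
        sequence_number INTEGER NOT NULL,
        is_handled INTEGER NOT NULL DEFAULT 0,
        time_blocked_until REAL,
        PRIMARY KEY (request_id, metadata_id)
    )
    """
)

BLOCK_SECONDS = 300.0  # Assumed blocking interval, chosen for illustration.


def fetch_next_request(queue_id: str, now: float) -> str | None:
    """Pick the lowest-sequence unhandled, unblocked request and block it."""
    with conn:  # Commits the UPDATE when the block exits.
        row = conn.execute(
            """
            SELECT request_id FROM request_queue_records
            WHERE metadata_id = ? AND is_handled = 0
              AND (time_blocked_until IS NULL OR time_blocked_until <= ?)
            ORDER BY sequence_number
            LIMIT 1
            """,
            (queue_id, now),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            'UPDATE request_queue_records SET time_blocked_until = ? '
            'WHERE request_id = ? AND metadata_id = ?',
            (now + BLOCK_SECONDS, row[0], queue_id),
        )
        return row[0]


def mark_handled(queue_id: str, request_id: str) -> None:
    """Flag a request as done so it is never handed out again."""
    with conn:
        conn.execute(
            'UPDATE request_queue_records '
            'SET is_handled = 1, time_blocked_until = NULL '
            'WHERE request_id = ? AND metadata_id = ?',
            (request_id, queue_id),
        )
```

This sketch uses a single connection. With several processes sharing one database, the select-then-update step would also need row-level locking or a single atomic statement, which is omitted here.
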
|
||
Review comment: Could we also explain somewhere in this section that switching between SQLite and PostgreSQL is done only by providing a proper connection string? The storage client remains the same. |
||
Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> can be set through environment variables or the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class: | ||
|
||
- **`storage_dir`** (env: `CRAWLEE_STORAGE_DIR`, default: `'./storage'`) - The root directory where the default SQLite database will be created if no connection string is provided. | ||
- **`purge_on_start`** (env: `CRAWLEE_PURGE_ON_START`, default: `True`) - Whether to purge default storages on start. | ||
|
||
Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> can be set via constructor arguments: | ||
|
||
- **`connection_string`** (default: SQLite in <ApiLink to="class/Configuration">`Configuration`</ApiLink> storage dir) - SQLAlchemy connection string, e.g. `sqlite+aiosqlite:///my.db` or `postgresql+asyncpg://user:pass@host/db`. | ||
- **`engine`** - Pre-configured SQLAlchemy `AsyncEngine` (optional). | ||
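
As a rough sketch of how the default `connection_string` could relate to `storage_dir`, the snippet below builds a SQLite URL for a database file inside a storage directory. The file name `crawlee.db` is taken from the comment in the basic example; the helper name and the exact path layout are assumptions, not Crawlee's implementation.

```python
from pathlib import Path


def default_sqlite_url(storage_dir: str = './storage', filename: str = 'crawlee.db') -> str:
    """Build a SQLAlchemy-style SQLite URL for a file in `storage_dir`.

    Hypothetical helper; the real default is computed inside Crawlee.
    """
    db_path = Path(storage_dir).resolve() / filename
    # An absolute POSIX path yields four slashes in total, which is the
    # SQLAlchemy form for absolute SQLite paths.
    return f'sqlite+aiosqlite:///{db_path.as_posix()}'
```

Passing an explicit `connection_string` or `engine` replaces this default.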
|
||
Review comment: The following example is specific to PostgreSQL. Please provide some context before the code snippet. |
||
For advanced scenarios, you can configure <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> with a custom SQLAlchemy engine and additional options via the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class. This is useful, for example, when connecting to an external PostgreSQL database or customizing connection pooling. | ||
|
||
<CodeBlock className="language-python" language="python"> | ||
{SQLStorageClientConfigurationExample} | ||
</CodeBlock> | ||
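
The configuration example hard-codes the PostgreSQL credentials for brevity. One possible alternative, sketched below with only the standard library, assembles the connection string from environment variables and percent-encodes the user name and password. The helper name, the choice of variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`), and the fallback values are assumptions chosen for illustration.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def postgres_url_from_env(env: Mapping[str, str] | None = None) -> str:
    """Build an asyncpg connection string from environment-style settings.

    Hypothetical helper; pass a mapping to override `os.environ` in tests.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters such as "@" or "/" do not
    # break URL parsing.
    user = quote_plus(env['PGUSER'])
    password = quote_plus(env['PGPASSWORD'])
    host = env.get('PGHOST', 'localhost')
    port = env.get('PGPORT', '5432')
    database = env.get('PGDATABASE', 'postgres')
    return f'postgresql+asyncpg://{user}:{password}@{host}:{port}/{database}'
```

The resulting string can be passed either to `create_async_engine(...)` or directly as `connection_string`.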
|
||
## Creating a custom storage client | ||
|
||
A storage client consists of two parts: the storage client factory and individual storage type clients. The <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> acts as a factory that creates specific clients (<ApiLink to="class/DatasetClient">`DatasetClient`</ApiLink>, <ApiLink to="class/KeyValueStoreClient">`KeyValueStoreClient`</ApiLink>, <ApiLink to="class/RequestQueueClient">`RequestQueueClient`</ApiLink>) where the actual storage logic is implemented. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,21 @@ | ||
from crawlee._utils.try_import import install_import_hook as _install_import_hook | ||
from crawlee._utils.try_import import try_import as _try_import | ||
|
||
# These imports have only mandatory dependencies, so they are imported directly. | ||
from ._base import StorageClient | ||
from ._file_system import FileSystemStorageClient | ||
from ._memory import MemoryStorageClient | ||
|
||
_install_import_hook(__name__) | ||
|
||
# The following imports are wrapped in try_import to handle optional dependencies, | ||
# ensuring the module can still function even if these dependencies are missing. | ||
with _try_import(__name__, 'SqlStorageClient'): | ||
from ._sql import SqlStorageClient | ||
|
||
__all__ = [ | ||
'FileSystemStorageClient', | ||
'MemoryStorageClient', | ||
'SqlStorageClient', | ||
'StorageClient', | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
from ._dataset_client import SqlDatasetClient | ||
from ._key_value_store_client import SqlKeyValueStoreClient | ||
from ._request_queue_client import SqlRequestQueueClient | ||
from ._storage_client import SqlStorageClient | ||
|
||
__all__ = ['SqlDatasetClient', 'SqlKeyValueStoreClient', 'SqlRequestQueueClient', 'SqlStorageClient'] |