Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, help you store and deduplicate URLs, while result-oriented storages, like <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, focus on storing and retrieving scraping results. This guide explains when to use each type, how to interact with them, and how to control their lifecycle.
## Overview
Crawlee's storage system consists of two main layers:
- **Storages** (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>): high-level interfaces for interacting with the different storage types.
- **Storage clients**: backend implementations that determine where and how the data is actually persisted.
```mermaid
classDiagram
Storage --|> Dataset
Storage --|> KeyValueStore
Storage --|> RequestQueue
```
### Named and unnamed storages
Crawlee supports two types of storages:
- **Named storages**: Storages with a specific name that persist across runs. These are useful when you want to share data between different crawler runs or access the same storage from multiple places.
- **Unnamed storages**: Temporary storages identified by an alias that are scoped to a single run. These are automatically purged at the start of each run (when `purge_on_start` is enabled, which is the default).
83
+
84
+
### Default storage
Each storage type (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>) has a default instance that can be accessed without specifying an `id`, `name`, or `alias`. The default unnamed storage is opened by calling the storage's `open` method without parameters, which is the most common way to use storages in simple crawlers. The special alias `"default"` is equivalent to calling `open` without parameters.
## Request queue

The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run.
## Cleaning up the storages
By default, Crawlee cleans up all unnamed storages (including the default one) at the start of each run, so every crawl begins with a clean state. This behavior is controlled by <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink> (default: `True`). In contrast, named storages are never purged automatically and persist across runs. The exact behavior may vary depending on the storage client implementation.
### When purging happens
Purging happens automatically the first time a storage is accessed during a run, not when the crawler object is created. Note that the precise purging behavior may vary between storage client implementations.
## Conclusion
This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned about the distinction between named storages (persistent across runs) and unnamed storages with aliases (temporary and purged on start). You discovered how to manage requests using the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and store and retrieve scraping results using the <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>. You also learned how to use helper functions to simplify interactions with these storages and how to control storage cleanup behavior.
If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!