Commit 7ead222

updated cassandra docs
1 parent 3d187a8 commit 7ead222

3 files changed

+76
-60
lines changed

docs/source/topics/frontera-settings.rst

Lines changed: 34 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -492,63 +492,56 @@ documents scheduled after the change. All previously queued documents will be cr
492492
Cassandra
493493
---------
494494

495+
.. setting:: CASSANDRABACKEND_CACHE_SIZE
495496

496-
.. setting:: CASSANDRABACKEND_DROP_ALL_TABLES
497+
CASSANDRABACKEND_CACHE_SIZE
498+
^^^^^^^^^^^^^^^^^^^^^^^^^^^
497499

498-
CASSANDRABACKEND_DROP_ALL_TABLES
499-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
500+
Default:: ``10000``
500501

501-
Default: ``False``
502+
Cassandra Metadata LRU cache size. It is used for caching objects that are requested from the DB every time already known
503+
documents are crawled. This mainly saves DB throughput; increase it if you're experiencing problems with too high a
504+
volume of SELECTs to the Metadata table, or decrease it if you need to save memory.
502505

503-
Set to ``True`` if you need to drop of all DB tables on backend instantiation (e.g. every Scrapy spider run).
504506

505-
.. setting:: SQLALCHEMYBACKEND_ENGINE
507+
.. setting:: CASSANDRABACKEND_CLUSTER_HOSTS
506508

507-
CASSANDRABACKEND_CLUSTER_IPS
508-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509+
CASSANDRABACKEND_CLUSTER_HOSTS
510+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509511

510512
Default:: ``['127.0.0.1']``
511513

512-
Set IPs from Cassandra Cluster. Default is localhost. To assign more than one IP use this Syntax: ``['192.168.0.1', '192.168.0.2']``
514+
The list of contact points to try when connecting for cluster discovery. Not all contact points are required; the driver
515+
discovers the rest.
516+
517+
.. setting:: CASSANDRABACKEND_CLUSTER_PORT
513518

514519
CASSANDRABACKEND_CLUSTER_PORT
515520
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
516521

517522
Default:: ``9042``
518523

519-
Set port from Cassandra Cluster / Nodes
524+
The server-side port to open connections to Cassandra.
520525

526+
.. setting:: CASSANDRABACKEND_DROP_ALL_TABLES
521527

522-
CASSANDRABACKEND_GENERATE_STATS
523-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
528+
CASSANDRABACKEND_DROP_ALL_TABLES
529+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
524530

525-
Default:: ``False``
531+
Default: ``False``
526532

527-
Set this to true if you want to create an extra Table for stats collection. In this table there will be pages crawled, links queued etv. counted up.
533+
Set to ``True`` to drop and create all DB tables on backend instantiation.
528534

535+
.. setting:: CASSANDRABACKEND_KEYSPACE
529536

530537
CASSANDRABACKEND_KEYSPACE
531538
^^^^^^^^^^^^^^^^^^^^^^^^^
532539

533-
Default:: ``frontera``
534-
535-
Set cassandra Keyspace
536-
537-
CASSANDRABACKEND_CREATE_KEYSPACE_IF_NOT_EXISTS
538-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
539-
540-
Default:: ``True``
540+
Default:: ``crawler``
541541

542-
Creates Keyspace if it not exist. Set to false if you frontera shouldn't check on every startup.
543-
544-
545-
CASSANDRABACKEND_CRAWL_ID
546-
^^^^^^^^^^^^^^^^^^^^^^^^^
547-
548-
Default:: ``default``
549-
550-
Sets an ID in each table for the actual crawl. If you want to run another crawl from begining in same Table set to another Crawl ID. Its an Text field.
542+
The Cassandra keyspace to use.
551543

544+
.. setting:: CASSANDRABACKEND_MODELS
552545

553546
CASSANDRABACKEND_MODELS
554547
^^^^^^^^^^^^^^^^^^^^^^^
@@ -559,11 +552,19 @@ Default::
559552
'MetadataModel': 'frontera.contrib.backends.cassandra.models.MetadataModel',
560553
'StateModel': 'frontera.contrib.backends.cassandra.models.StateModel',
561554
'QueueModel': 'frontera.contrib.backends.cassandra.models.QueueModel',
562-
'CrawlStatsModel': 'frontera.contrib.backends.cassandra.models.CrawlStatsModel'
555+
'FifoOrLIfoQueueModel': 'frontera.contrib.backends.cassandra.models.FifoOrLIfoQueueModel',
563556
}
564557

565-
This is mapping with Cassandra models used by backends. It is mainly used for customization.
558+
This is a mapping of Cassandra models used by backends. It is mainly used for customization.
559+
560+
.. setting:: CASSANDRABACKEND_REQUEST_TIMEOUT
561+
562+
CASSANDRABACKEND_REQUEST_TIMEOUT
563+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
564+
565+
Default:: ``60``
566566

567+
Timeout in seconds for every request made by the Cassandra driver to Cassandra.
567568

568569
Revisiting backend
569570
------------------
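
As a rough illustration of the settings documented in this hunk, a project settings module might override the Cassandra defaults along these lines. This is only a sketch: the host addresses and the cache size are made-up values, and only the setting names and their documented defaults come from the diff.

```python
# Hypothetical Frontera settings module; values are illustrative only.

# Contact points used for cluster discovery; the driver finds the other nodes.
CASSANDRABACKEND_CLUSTER_HOSTS = ['192.168.0.1', '192.168.0.2']  # assumed addresses
CASSANDRABACKEND_CLUSTER_PORT = 9042       # documented default
CASSANDRABACKEND_KEYSPACE = 'crawler'      # documented default
CASSANDRABACKEND_REQUEST_TIMEOUT = 60      # seconds, documented default
CASSANDRABACKEND_CACHE_SIZE = 50000        # assumed value: larger cache, fewer SELECTs
CASSANDRABACKEND_DROP_ALL_TABLES = False   # keep existing tables between runs
```

Raising ``CASSANDRABACKEND_CACHE_SIZE`` trades memory for fewer reads against the Metadata table, as the setting description notes.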

docs/source/topics/frontier-backends.rst

Lines changed: 38 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -254,33 +254,15 @@ For a complete list of all settings used for SQLAlchemy backends check the :doc:
254254
SQLAlchemy :class:`Backend <frontera.core.components.Backend>` implementation of a random selection
255255
algorithm.
256256

257-
258-
Revisiting backend
259-
^^^^^^^^^^^^^^^^^^
260-
261-
Based on custom SQLAlchemy backend, and queue. Crawling starts with seeds. After seeds are crawled, every new
262-
document will be scheduled for immediate crawling. On fetching every new document will be scheduled for recrawling
263-
after fixed interval set by :setting:`SQLALCHEMYBACKEND_REVISIT_INTERVAL`.
264-
265-
Current implementation of revisiting backend has no prioritization. During long term runs spider could go idle, because
266-
there are no documents available for crawling, but there are documents waiting for their scheduled revisit time.
267-
268-
269-
.. class:: frontera.contrib.backends.sqlalchemy.revisiting.Backend
270-
271-
Base class for SQLAlchemy :class:`Backend <frontera.core.components.Backend>` implementation of revisiting back-end.
272-
273257
.. _frontier-backends-cassandra:
274258

275259
Cassandra backends
276260
^^^^^^^^^^^^^^^^^^
277261

278-
This set of :class:`Backend <frontera.core.components.Backend>` objects will use `Cassandra`_ as storage for
262+
This set of :class:`Backend <frontera.core.components.Backend>` objects will use Cassandra as storage for
279263
:ref:`basic algorithms <frontier-backends-basic-algorithms>`.
280264

281-
Cassandra is a NoSQL Colum-Store Database with Linear scalability and a SQL-Like Query Language.
282-
283-
If you need to use your own `declarative cassandra models`_, you can do it by using the
265+
If you need to use your own `cassandra models`_, you can do it by using the
284266
:setting:`CASSANDRABACKEND_MODELS` setting.
285267

286268
This setting uses a dictionary where ``key`` represents the name of the model to define and ``value`` the model to use.
@@ -290,13 +272,46 @@ For a complete list of all settings used for Cassandra backends check the :doc:`
290272
.. class:: frontera.contrib.backends.cassandra.BASE
291273

292274
Base class for Cassandra :class:`Backend <frontera.core.components.Backend>` objects.
293-
It runs cassandra in multi-spider one worker mode with the FIFO algorithm.
275+
276+
.. class:: frontera.contrib.backends.cassandra.FIFO
277+
278+
Cassandra :class:`Backend <frontera.core.components.Backend>` implementation of `FIFO`_ algorithm.
279+
280+
.. class:: frontera.contrib.backends.cassandra.LIFO
281+
282+
Cassandra :class:`Backend <frontera.core.components.Backend>` implementation of `LIFO`_ algorithm.
283+
284+
.. class:: frontera.contrib.backends.cassandra.BFS
285+
286+
Cassandra :class:`Backend <frontera.core.components.Backend>` implementation of `BFS`_ algorithm.
287+
288+
.. class:: frontera.contrib.backends.cassandra.DFS
289+
290+
Cassandra :class:`Backend <frontera.core.components.Backend>` implementation of `DFS`_ algorithm.
294291

295292
.. class:: frontera.contrib.backends.cassandra.Distributed
296293

297-
Cassandra :class:`Backend <frontera.core.components.Backend>` implementation of the distributed Backend.
294+
Cassandra :class:`Backend <frontera.core.components.Distributed>` implementation of a distributed backend.
295+
296+
Revisiting backend
297+
^^^^^^^^^^^^^^^^^^
298+
299+
There are two revisiting backends, based on the Cassandra and SQLAlchemy backends and queues respectively. Crawling starts
300+
with seeds. After seeds are crawled, every new document will be scheduled for immediate crawling. Once fetched, every
301+
document will be scheduled for recrawling after a fixed interval set by :setting:`SQLALCHEMYBACKEND_REVISIT_INTERVAL` or
302+
:setting:`CASSANDRABACKEND_REVISIT_INTERVAL`.
303+
304+
The current implementation of the revisiting backend has no prioritization. During long-term runs the spider could go idle because
305+
there are no documents available for crawling, even though there are documents waiting for their scheduled revisit time.
306+
307+
308+
.. class:: frontera.contrib.backends.sqlalchemy.revisiting.Backend
309+
310+
Base class for SQLAlchemy :class:`Backend <frontera.core.components.Backend>` implementation of revisiting back-end.
298311

312+
.. class:: frontera.contrib.backends.cassandra.revisiting.Backend
299313

314+
Base class for Cassandra :class:`Backend <frontera.core.components.Backend>` implementation of revisiting back-end.
300315

301316
HBase backend
302317
^^^^^^^^^^^^^
@@ -325,3 +340,4 @@ setting.
325340
.. _SQLAlchemy: http://www.sqlalchemy.org/
326341
.. _any databases supported by SQLAlchemy: http://docs.sqlalchemy.org/en/latest/dialects/index.html
327342
.. _declarative sqlalchemy models: http://docs.sqlalchemy.org/en/latest/orm/extensions/declarative/index.html
343+
.. _cassandra models: https://datastax.github.io/python-driver/cqlengine/models.html
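
The ``CASSANDRABACKEND_MODELS`` mapping described in this file could be customized roughly as follows. This is a minimal sketch: the ``myproject.models.CustomMetadataModel`` path is hypothetical, and the default entries are copied from the settings hunk in this commit.

```python
# Default mapping as shown in this commit (model name -> dotted class path).
DEFAULT_MODELS = {
    'MetadataModel': 'frontera.contrib.backends.cassandra.models.MetadataModel',
    'StateModel': 'frontera.contrib.backends.cassandra.models.StateModel',
    'QueueModel': 'frontera.contrib.backends.cassandra.models.QueueModel',
    'FifoOrLIfoQueueModel': 'frontera.contrib.backends.cassandra.models.FifoOrLIfoQueueModel',
}

# Copy the defaults and replace a single entry with a custom model.
CASSANDRABACKEND_MODELS = dict(DEFAULT_MODELS)
# Hypothetical path to a user-defined model class.
CASSANDRABACKEND_MODELS['MetadataModel'] = 'myproject.models.CustomMetadataModel'
```

Copying the defaults first keeps the other three models unchanged, so only the overridden key differs from the shipped configuration.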

frontera/settings/default_settings.py

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -10,18 +10,17 @@
1010

1111
CANONICAL_SOLVER = 'frontera.contrib.canonicalsolvers.Basic'
1212
CASSANDRABACKEND_CACHE_SIZE = 10000
13+
CASSANDRABACKEND_CLUSTER_HOSTS = ['127.0.0.1']
14+
CASSANDRABACKEND_CLUSTER_PORT = 9042
1315
CASSANDRABACKEND_DROP_ALL_TABLES = False
16+
CASSANDRABACKEND_KEYSPACE = 'crawler'
1417
CASSANDRABACKEND_MODELS = {
1518
'MetadataModel': 'frontera.contrib.backends.cassandra.models.MetadataModel',
1619
'StateModel': 'frontera.contrib.backends.cassandra.models.StateModel',
1720
'QueueModel': 'frontera.contrib.backends.cassandra.models.QueueModel',
1821
'FifoOrLIfoQueueModel': 'frontera.contrib.backends.cassandra.models.FifoOrLIfoQueueModel',
1922
}
20-
CASSANDRABACKEND_REVISIT_INTERVAL = timedelta(days=1)
21-
CASSANDRABACKEND_CLUSTER_HOSTS = ['127.0.0.1']
22-
CASSANDRABACKEND_CLUSTER_PORT = 9042
23-
CASSANDRABACKEND_KEYSPACE = 'crawler'
24-
CASSANDRABACKEND_REQUEST_TIMEOUT = 100
23+
CASSANDRABACKEND_REQUEST_TIMEOUT = 60
2524
CASSANDRABACKEND_REVISIT_INTERVAL = timedelta(days=1)
2625

2726
DELAY_ON_EMPTY = 5.0
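
The ``timedelta`` default above suggests how the revisit interval could be shortened in a project settings module. The sketch below assumes that Frontera's ``BACKEND`` setting selects the backend class; ``next_revisit`` is a hypothetical helper added only to show the arithmetic, not part of Frontera.

```python
from datetime import datetime, timedelta

# Assumption: BACKEND selects the backend class by dotted path.
BACKEND = 'frontera.contrib.backends.cassandra.revisiting.Backend'

# Revisit fetched documents every 12 hours instead of the default one day.
CASSANDRABACKEND_REVISIT_INTERVAL = timedelta(hours=12)


def next_revisit(fetched_at, interval=CASSANDRABACKEND_REVISIT_INTERVAL):
    """Hypothetical helper: time at which a fetched document becomes due again."""
    return fetched_at + interval
```

For example, a document fetched at midnight on 1 January 2016 would be due again at noon the same day under this interval.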
