[WIP] Added Cassandra backend #225
I'm marking this PR as WIP because there is some manual testing that I still have to do for the distributed backend. However, I won't be changing much code, so this PR is ready for early review. @wpxgit, since this is your code too, would you mind helping me test this? |
d931e1d to a8f4572
Current coverage is 71.89% (diff: 87.83%)
@@            master   #225   diff @@
==========================================
  Files         68     72     +4
  Lines       4690   5116   +426
  Methods        0      0
  Messages       0      0
  Branches     636    679    +43
==========================================
+ Hits        3292   3678   +386
- Misses      1256   1285    +29
- Partials     142    153    +11
|
a8f4572 to 9c316ea
Hi Voith, thank you very much for starting this! Frontera will definitely benefit from Cassandra support. I expect the major use case to be the distributed backends run mode, so the main source of inspiration should be HBaseBackend, not sqla. A.
|
@sibiryakov Thanks for your input!
I will add a separate distributed backend based on the HBase backend, but I'll keep the other backends, such as LIFO and FIFO, for quick run-throughs.
I did not think about the multi-platform issues. I'll use the existing encoder in this case. |
How is it going, @voith? |
@sibiryakov I did not get time to work on this. I will work on it over this weekend. I hope to have something by the end of this weekend. |
@voith: it is great to see that you are implementing Cassandra for Frontera! @alex: sorry, I was really busy over the last months. Cassandra for Frontera was on my to-do list, but I decided to go with another solution for myself and haven't found time to complete this. But it's great to see that someone else has taken up the baton... |
@wpxgit thanks for your feedback. I'll look into @sibiryakov I have been a little busy of late. I'm sorry for keeping this on hold! I'll see if I get some time this weekend. |
@voith NP! Let me know if you need anything, you took pretty interesting initiative. |
@voith any news on this? |
@sibiryakov the last time I worked on this I came across this post, which states that using Cassandra as a queue is an anti-pattern. I was a little discouraged after reading it. I don't know if it's worth adding Cassandra as a backend. I would still be willing to work on this if you can somehow convince me that this is not a major issue. |
@voith It's all about the implementation, which we never discussed. I suggest reading the comments on this article, where people point out that using Cassandra for queues isn't impossible; you just need to take some details into account and design accordingly. This PR implies designing the data model, and probably testing with at least tens of gigabytes of data. Worth looking into http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra |
@sibiryakov thank you for getting my hopes back up. I will try to take a stab at this over the weekend. |
I'm closing this as I no longer have the motivation to continue it |
Thanks for trying, anyway! |
So, what's missing really? This feature is pretty desirable. |
Well, it has to work. Someone has to implement a queue suitable for crawling from multiple domains and test it by crawling at least 10M domains, to make sure the queue operates fast enough. |
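For readers unfamiliar with what "a queue suitable for crawling from multiple domains" might involve, here is a minimal in-memory sketch. It assumes the queue is split into partitions by a hash of the hostname, so that all URLs of one host land in the same partition, and that each partition hands out its highest-scored URLs first. The class and method names are hypothetical and are not Frontera's API; a Cassandra-backed version would need a proper data model and the volume testing described above.

```python
import heapq
import zlib
from urllib.parse import urlparse


class PartitionedQueue:
    """Hypothetical sketch: one priority heap per partition, keyed by host."""

    def __init__(self, partitions):
        self.heaps = [[] for _ in range(partitions)]
        self.counter = 0  # tie-breaker: equal scores come out in insertion order

    def partition_for(self, url):
        # All URLs from the same host map to the same partition.
        host = urlparse(url).hostname or ""
        return zlib.crc32(host.encode("utf-8")) % len(self.heaps)

    def schedule(self, url, score):
        # heapq is a min-heap, so negate the score to pop the highest first.
        heap = self.heaps[self.partition_for(url)]
        heapq.heappush(heap, (-score, self.counter, url))
        self.counter += 1

    def get_next_requests(self, partition_id, max_n):
        # Pop up to max_n URLs from one partition, best score first.
        heap = self.heaps[partition_id]
        batch = []
        while heap and len(batch) < max_n:
            batch.append(heapq.heappop(heap)[2])
        return batch


q = PartitionedQueue(4)
q.schedule("http://example.com/a", 0.5)
q.schedule("http://example.com/b", 0.9)
pid = q.partition_for("http://example.com/a")
batch = q.get_next_requests(pid, 10)
```

Keeping a host inside a single partition means one consumer is responsible for that host, which is one way to keep per-domain politeness manageable; whether that fits a given deployment is a design decision this sketch does not settle.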
This PR is a rebase of #128. Although I have completely changed the design and refactored the code, I have added @wpxgit's commits (but squashed them) because this work was originally initiated by him.
I have tried to follow the DRY methodology as much as possible, so I had to refactor some existing code.
I have serialized dicts using Pickle; as a result, this backend won't have the problems discussed in #211.
The PR includes unit tests and some integration tests with the backends integration testing framework.
It's good that frontera has an integration test framework for testing backends in single-threaded mode. However, having a similar framework for the distributed mode is very much needed.
I am open to all sorts of suggestions :)
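As an illustration of the Pickle approach mentioned above, here is a minimal sketch of round-tripping a dict through bytes, as one might before writing it to a binary column. The helper names, the sample dict, and the choice of protocol 2 (readable by both Python 2 and 3) are assumptions made for this example and are not taken from the PR's code.

```python
import pickle


def encode_meta(meta):
    """Serialize a dict to bytes (hypothetical helper, not from the PR)."""
    # Protocol 2 is an assumption: it can be read by both Python 2 and 3.
    return pickle.dumps(meta, protocol=2)


def decode_meta(blob):
    """Restore the dict from the bytes produced by encode_meta."""
    return pickle.loads(blob)


# Sample metadata mixing bytes keys, str keys, None, and nested containers.
meta = {b"depth": 2, "callback": None, "headers": {b"Accept": [b"text/html"]}}
blob = encode_meta(meta)
restored = decode_meta(blob)
```

The round trip preserves key and value types, including the distinction between bytes and str. Note that unpickling data from an untrusted source is unsafe, so this only suits storage that the crawler itself controls.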