Atlas search lookups #325
Conversation
@@ -207,9 +243,36 @@ def _build_aggregation_pipeline(self, ids, group):
         pipeline.append({"$unset": "_id"})
         return pipeline

+    def _compound_searches_queries(self, search_replacements):
I want to preserve this function for the future; we'll probably want to add hybrid search, and this part of the code could be useful. I know it is weird to check that the replacements length is 1 and then iterate over it. Also, the exception could be raised before this point. Let me know if you want me to refactor this code.
I'm fine with it, please just add a docstring to explain the function and the additional comment explaining the need for the checks.
tests/queries_/test_search.py
Outdated
def _tear_down(self, model):
    collection = self._get_collection(model)
    for search_indexes in collection.list_search_indexes():
        collection.drop_search_index(search_indexes["name"])
    collection.delete_many({})
Could you add a comment explaining why this is necessary?
Between tests the data persists. Is this the way to get rid of it, or am I missing something? (Within the same test class.)
I think I need it because of `TransactionTestCase`: it does not wrap each test in a transaction that gets rolled back. But I'm not 100% sure.
`TransactionTestCase`, and `TestCase` when transactions aren't supported, use `flush` to clear the database between tests. `flush` uses `delete_many()`, so yes, it's necessary to clean up the indexes but not the collection. I think `create_search_index` could add the cleanup `collection.drop_search_index(search_indexes["name"])` (or something similar), so that the `list_search_indexes()` isn't needed.
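The suggestion above could be sketched as a small test mixin. Everything here is an assumption for illustration: the helper name, the dict passed to `create_search_index()`, and the PyMongo-style collection API are not the PR's actual code.

```python
import unittest


class SearchIndexCleanupMixin:
    """Sketch: register the index drop at creation time, so teardown no
    longer needs to enumerate indexes with list_search_indexes().
    The collection API mirrors PyMongo's create_search_index() /
    drop_search_index(), but names and shapes here are assumptions."""

    def create_search_index(self, collection, name, definition, type="search"):
        collection.create_search_index(
            {"name": name, "definition": definition, "type": type}
        )
        # unittest runs cleanups in reverse order after each test,
        # even if the test body fails.
        self.addCleanup(collection.drop_search_index, name)
```

With this in place, each test that creates an index gets its matching drop for free, and the loop over `list_search_indexes()` in `_tear_down` becomes unnecessary.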
I will try to fix it. But if I remove this line, some tests fail because the data from the previous test is still in the collection.
The data was not cleaned because I didn't define any `available_apps`. If I define it, I need to create the data in `setUp`, which increases the test runtime.
tests/queries_/test_search.py
Outdated
self.create_search_index(
    Article,
    "equals_headline_index",
    {
        "mappings": {
            "dynamic": False,
            "fields": {"headline": {"type": "token"}, "number": {"type": "number"}},
        }
    },
)
Could we do the index creation/teardown in `setUpClass`? (I would guess indexes aren't modified by any tests?)
tests/queries_/test_search.py
Outdated
def test_constant_score(self):
    constant_score = SearchScoreOption({"constant": {"value": 10}})
    qs = Article.objects.annotate(score=SearchExists(path="body", score=constant_score))
    self.wait_for_assertion(lambda: self.assertCountEqual(qs.all(), [self.article]))
While I like that `wait_for_assertion` is a relatively generic API, it really seems like a lot of boilerplate with `lambda`, `all()`, ... We may want to think about possibly providing some public test class mixin with assertion helpers for users (which we could also use in this file).
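A retry helper along these lines would cut that boilerplate. This is only a sketch; the actual `wait_for_assertion` in the PR may use a different signature, timeout, or backoff strategy.

```python
import time


def wait_for_assertion(assertion, timeout=2.0, interval=0.1):
    """Re-run a zero-argument callable containing assertions until it
    passes or the timeout elapses; re-raises the last AssertionError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

A test then writes `self.wait_for_assertion(lambda: self.assertCountEqual(qs.all(), expected))`, giving Atlas time to finish indexing before the assertion is treated as a real failure.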
I tried to do something like you mention and didn't find a solution, but I will try again.
Well, I tried a delayed assert; it is not perfect, but it's usable.
More generally, what's the reason the query needs to be fetched this way? Executing the same query a few times in a row doesn't return the correct results until some time?
🤔 At some point, Atlas will have synchronized the new data. Then the query will retrieve it, so we need to wait until the new objects are available.
Is there any MongoDB documentation about this? I don't see any mention of have to retry in the example at https://www.mongodb.com/docs/atlas/atlas-search/tutorial/. It seems unbelievable from a usability perspective. How are querysets going to be used outside of tests? Do we need to document a special pattern? There is no distinction between "no results" and "query hasn't synced yet"?
🤔 I summon @Jibola to avoid saying something that is not true. What I tried to say is that when a new index is created or data is added, there is a short delay before it gets indexed. If I run a query immediately after creating a new index, it retrieves nothing, but if I wait a second the value is pulled correctly. I don't know if this indexing delay is documented, but I got the idea from langchain.
Maybe only the index creation needs time, but I don't know. 😬
For Docarray the same was done:
https://github.com/docarray/docarray/blob/main/tests/index/mongo_atlas/__init__.py#L32
Whew, that makes a lot more sense than the previous theory! Depending on how long the waiting could take, we may want to consider having `SchemaEditor.add_index()` do the waiting, since Django migrations assume all operations run synchronously: a data migration that follows a schema migration assumes that the previous operations have completed. (If not, it would be a caveat to document.) If we do have the schema editor wait, you could use it to create the indexes in tests. If not, I guess waiting after test index creation is the way to go.
Well, I found the docs. They say: "This means that data inserted into a MongoDB collection and indexed by Atlas Search will not be available immediately for $search queries."
> Whew, that makes a lot more sense than the previous theory! Depending on how long the waiting could take, we may want to consider having SchemaEditor.add_index() do the waiting, since Django migrations assume all operations run synchronously, since a data migration that follows a schema migration assumes that the previous operations have completed. (If not, it would be a caveat to document.) If we do have schema editor wait, you could use it to create the indexes in tests. If not, I guess waiting after test index creation is the way to go.
Totally get the confusion here! It bamboozled me too the first time I ran into the problem.
I would say that having the SchemaEditor wait is not a bad idea! In practice, I don't see many scenarios (please inform me if otherwise!) where someone makes a migration and begins iterating within 5 seconds -- outside of tests -- but I would want it to be flaggable if at all possible.
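A flaggable wait in the schema editor could sit on top of a generic poller like the sketch below. The `queryable` field in the example comment comes from the index status documents that PyMongo's `list_search_indexes()` yields; the function name and defaults are assumptions.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll a zero-argument predicate until it returns a truthy value or
    the timeout elapses. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# A schema editor could then block after creating a search index, e.g.
# (hypothetical usage, assuming a PyMongo collection and index name):
#   wait_until(lambda: any(
#       ix.get("queryable") for ix in collection.list_search_indexes(name)
#   ))
```

Gating this behind a setting (e.g. a "wait for search indexes" flag) would keep migrations fast for users who don't query immediately after migrating.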
@@ -71,7 +71,7 @@ def col(self, compiler, connection):  # noqa: ARG001
     # Add the column's collection's alias for columns in joined collections.
     has_alias = self.alias and self.alias != compiler.collection_name
     prefix = f"{self.alias}." if has_alias else ""
-    return f"${prefix}{self.target.column}"
+    return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
I had trouble seeing that the `$` prefix was the difference here. Maybe it could be rewritten so as not to repeat `{prefix}{self.target.column}`.
docs/source/ref/models/search.rst
Outdated
Atlas search
============

The database functions in the ``django_mongodb_backend.expressions.search``
They should be importable from `django_mongodb_backend.expressions`, similar to `django.db.models.functions`.
docs/source/ref/models/search.rst
Outdated
The database functions in the ``django_mongodb_backend.expressions.search``
module ease the use of MongoDB Atlas search's `full text and vector search
engine <https://www.mongodb.com/docs/atlas/atlas-search/>`_.
All links to mongodb.com should use intersphinx. I'll push some updates to get you started.
docs/source/ref/models/search.rst
Outdated
``SearchEquals`` objects can be reused and combined with other search
expressions.

See :ref:`search-operations-combinable`
I wonder if we could structure things so we don't need to repeat this boilerplate on every(?) expression.
docs/source/ref/models/search.rst
Outdated
No double blank lines in docs.
docs/source/ref/models/search.rst
Outdated
]>

The ``path`` argument specifies the field to search and can be a string or a
:class:`~django.db.models.expressions.Col`. The ``query`` is the user input
`Col` isn't a public API, which is why building the docs gives "WARNING: py:class reference target not found: django.db.models.expressions.Col". I didn't spot any tests with `path=<non-string>`?
🤔 Hmm, maybe this part is wrong. It could take columns, but they are referenced using `F`. I think I must change this to `F` or string (the string ends up being an `F(string)`).
tests/queries_/test_search.py
Outdated
)

def setUp(self):
    super().setUp()
Unless there's a consideration with inheritance, it's generally not necessary to call `super().setUp()`.
delayedAssertCountEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertCountEqual)
delayedAssertListEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertListEqual)
delayedAssertQuerySetEqual = _delayed_assertion(timeout=2)(
    TransactionTestCase.assertQuerySetEqual
)
Are the non-delayed versions ever used? Maybe it's better to overwrite the original names so we don't have to write "delayedXXXXX" everywhere. Or maybe the waiting could be done in `setUp()` after data is inserted? Unless some test inserts more data, essentially only the first test's waiting is needed, right?
No, all the checks are delayed...
Regarding the second question: right, any test that inserts data needs to wait. If the data is inserted in the class setup, we could wait only once. So if we want to get rid of those delayed assertions, we can wait in the creation part.
tests/queries_/test_search.py
Outdated
@skipUnlessDBFeature("supports_atlas_search")
class SearchEqualsTest(SearchUtilsMixin):
I've tried to be consistent in this project about using "Tests" (plural) in the class names.
🤔 Hmm, I didn't notice that. Will change.
tests/queries_/test_search.py
Outdated
boost_score = SearchScoreOption({"boost": {"value": 3}})

qs = Article.objects.annotate(
    score=SearchEquals(path="headline", value="cross", score=boost_score)
)
I'd inline `boost_score`, or at least omit the blank line. (Only some tests are inconsistent.)
Things look great, but I've gone through about half of the code (due to size). I will check the test code tomorrow!
@@ -71,7 +71,7 @@ def col(self, compiler, connection):  # noqa: ARG001
     # Add the column's collection's alias for columns in joined collections.
     has_alias = self.alias and self.alias != compiler.collection_name
     prefix = f"{self.alias}." if has_alias else ""
-    return f"${prefix}{self.target.column}"
+    return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
What is the difference between these two?
Hard to spot, but there is a dollar sign at the beginning. Will refactor.
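The refactored logic can be shown in isolation. The standalone function and its parameter names below are assumptions for illustration; the actual method lives on the column expression class.

```python
def col_path(column, alias=None, collection_name=None, as_path=False):
    """Sketch: aggregation pipeline references need a leading '$'
    (e.g. "$headline"), while Atlas Search `path` values omit it."""
    # Add the collection alias for columns in joined collections.
    has_alias = alias and alias != collection_name
    prefix = f"{alias}." if has_alias else ""
    dollar = "" if as_path else "$"
    return f"{dollar}{prefix}{column}"
```

Extracting the `"$" if ... else ""` into its own variable (as suggested) makes the single difference between the two branches obvious.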
all_replacements = {**search_replacements, **group_replacements}
self.search_pipeline = self._compound_searches_queries(search_replacements)
Why don't we pass `all_replacements` into `self._compound_searches_queries`?
We could. But then it would need to filter them to check that there is one search or vector search operator. It composes only the search operators that are stored in the replacements. (`search_replacements` has multiple uses; I can try to refactor it a bit.)
django_mongodb_backend/compiler.py
Outdated
if not has_search:
    raise ValueError(
        "Cannot combine two `$vectorSearch` operator. "
        "If you need to combine them, consider restructuring your query logic or "
        "running them as separate queries."
    )
raise ValueError(
    "Only one $search operation is allowed per query. "
    f"Received {len(search_replacements)} search expressions. "
    "To combine multiple search expressions, use either a CompoundExpression for "
    "fine-grained control or CombinedSearchExpression for simple logical combinations."
)
I think these two `ValueError`s need to be switched.
🤔 The second one is the case when there is a vector search but no search. I think I should refactor this; it is a bit confusing, and the `not` at the beginning is not helping.
def _prepare_search_expressions_for_pipeline(self, expression, search_idx, replacements):
    searches = {}
    for sub_expr in self._get_search_expressions(expression):
        if sub_expr not in replacements:
            alias = f"__search_expr.search{next(search_idx)}"
            replacements[sub_expr] = self._get_replace_expr(sub_expr, searches, alias)

def _prepare_search_query_for_aggregation_pipeline(self, order_by):
I know these are private functions, but can they get a docstring? Same with `_get_replace_expr`. It's quite complex code, so it becomes harder to follow.
@@ -71,7 +71,7 @@ def col(self, compiler, connection):  # noqa: ARG001
     # Add the column's collection's alias for columns in joined collections.
     has_alias = self.alias and self.alias != compiler.collection_name
     prefix = f"{self.alias}." if has_alias else ""
-    return f"${prefix}{self.target.column}"
+    return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
Per Tim's comment, how about we just do this?

- return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
+ path = "$" if as_path else ""
+ return f"{path}{prefix}{self.target.column}"
Args:
    path: The document path to compare (as string or expression).
    value: The exact value to match against.
    score: Optional expression to modify the relevance score.
Can we add that this is an Optional[SearchScore] type?
# Apply De Morgan's laws.
operator = node.operator.negate() if negated else node.operator
negated = negated != (node.operator == Operator.NOT)
This logic is a little confusing because it requires some understanding of `negate` and the state changes. I'll leave this as a comment here to be reviewed later.
What's an example of a NOT combinable? I.e., how would I construct `NOT (A AND B)`, or can this only be done via `negate`?
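The De Morgan step the snippet applies can be illustrated in isolation. This is a sketch: the PR's actual `Operator` enum, its values, and its `negate()` may differ.

```python
from enum import Enum


class Operator(Enum):
    AND = "$and"
    OR = "$or"
    NOT = "$not"

    def negate(self):
        # De Morgan's laws: NOT (A AND B) == (NOT A) OR (NOT B), and
        # NOT (A OR B) == (NOT A) AND (NOT B). NOT itself is left alone;
        # instead, crossing a NOT toggles the negation state carried
        # down the expression tree.
        if self is Operator.AND:
            return Operator.OR
        if self is Operator.OR:
            return Operator.AND
        return self


# Mirroring the reviewed lines: when walking a node while negated, flip
# the operator, then toggle the carried negation if the node was a NOT.
negated = True
node_operator = Operator.AND
operator = node_operator.negate() if negated else node_operator
negated = negated != (node_operator == Operator.NOT)
```

So `NOT (A AND B)` is pushed down to `(NOT A) OR (NOT B)` without ever materializing a combined NOT node, which may be why a directly constructed NOT combinable never appears in the tests.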
limit: Maximum number of matching documents to return.
num_candidates: Optional number of candidates to consider during search.
exact: Optional flag to enforce exact matching (default is approximate).
filter: Optional filter expression to narrow candidate documents.
We should clarify that we only take raw MQL for this step. (Unless I have that incorrect and we resolve `SearchExpressions` too?)
Overall PR looks great! I've got some minor corrections, but other than that, it is good to merge from me. Great work! 🚀
It also looks like there's a ReadTheDocs error:
/home/docs/checkouts/readthedocs.org/user_builds/django-mongodb-backend/checkouts/325/docs/source/ref/models/search.rst:654: WARNING: unknown document: 'atlas:atlas-search/scoring/' [ref.doc]
@@ -16,6 +16,12 @@ New features
 - Added :class:`~.fields.PolymorphicEmbeddedModelField` and
   :class:`~.fields.PolymorphicEmbeddedModelArrayField` for storing a model
   instance or list of model instances that may be of more than one model class.
+- Added support for MongoDB Atlas Search expressions, including
+  ``SearchAutocomplete``, :class:`.SearchEquals`, ``SearchVector``, and others.
- ``SearchAutocomplete``, :class:`.SearchEquals`, ``SearchVector``, and others.
+ ``SearchAutocomplete``, :class:`SearchEquals`, ``SearchVector``, and others.
This suggestion isn't correct. Without the leading dot, the class won't be resolved properly.
def create_search_index(cls, model, index_name, definition, type="search"):
    collection = cls._get_collection(model)
    idx = SearchIndexModel(definition=definition, name=index_name, type=type)
    collection.create_search_index(idx)
NIT: For the sake of testing, we can make this a blocking call and check for the index before continuing.
headline = models.CharField(max_length=100)
number = models.IntegerField()
body = models.TextField()
location = models.JSONField(null=True)
NIT: (Fun fact) This could be an `EmbeddedModelField` for a `location` object:

    class Location(EmbeddedModel):
        type = models.CharField(default="Point")
        coordinates = models.ArrayField(FloatField(), max_size=2)
like_docs = [
    {"headline": self.article1.headline, "body": self.article1.body},
    {"headline": self.article2.headline, "body": self.article2.body},
]
like_docs = [{"body": "NASA launches new satellite to explore the galaxy"}]
I think this gets overridden. Also, should `article2` pop up as a valid result?
- like_docs = [
-     {"headline": self.article1.headline, "body": self.article1.body},
-     {"headline": self.article2.headline, "body": self.article2.body},
- ]
- like_docs = [{"body": "NASA launches new satellite to explore the galaxy"}]
+ like_docs = [{"body": "NASA launches new satellite to explore the galaxy"}]
This PR adds the initial implementation of the Atlas operator.
Task: