Atlas search lookups #325

Open · WaVEV wants to merge 21 commits into main
Conversation

@WaVEV (Collaborator) commented Jun 24, 2025

This PR adds the initial implementation of the Atlas operator.

Task:

  • Operators
  • Combinable
  • Vector search
  • Score
  • Docs
  • EmbeddedDocument operator

@WaVEV WaVEV force-pushed the atlas-search-lookups branch from 449b6a3 to ca8a7cf Compare June 26, 2025 02:56
@@ -207,9 +243,36 @@ def _build_aggregation_pipeline(self, ids, group):
pipeline.append({"$unset": "_id"})
return pipeline

def _compound_searches_queries(self, search_replacements):
Collaborator Author:

I want to preserve this function for the future: we'll probably want to add hybrid search, and this part of the code could be useful for that. I know it looks odd to check that the replacements have length 1 and then iterate over them; also, the exception could be raised before this point. Let me know if you want me to refactor this code.

Contributor:

I'm fine with it, please just add a docstring to explain the function and the additional comment explaining the need for the checks.

@WaVEV WaVEV force-pushed the atlas-search-lookups branch 3 times, most recently from 9935b25 to a467a57 Compare July 12, 2025 23:32
@WaVEV WaVEV changed the title [WIP] Atlas search lookups Atlas search lookups Jul 14, 2025
@WaVEV WaVEV force-pushed the atlas-search-lookups branch 4 times, most recently from ea2118b to 206b554 Compare July 21, 2025 19:29
Comment on lines 82 to 86
def _tear_down(self, model):
collection = self._get_collection(model)
for search_indexes in collection.list_search_indexes():
collection.drop_search_index(search_indexes["name"])
collection.delete_many({})
Collaborator:

Could you add a comment explaining why this is necessary?

Collaborator Author (@WaVEV, Jul 22, 2025):

The data persists between tests (in the same test class); is this the way to get rid of it, or am I missing something?

Collaborator Author:

I think I need it because TransactionTestCase doesn't wrap each test in a transaction that gets rolled back. But I'm not 100% sure.

Collaborator:

TransactionTestCase, and TestCase when transactions aren't supported, use flush to clear the database between tests. flush uses delete_many(), so yes, it's necessary to clean up the indexes but not the collection. I think create_search_index could add the cleanup collection.drop_search_index(search_indexes["name"]) (or something similar), so that the list_search_indexes() isn't needed.
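The suggestion above can be sketched as registering the drop as a test cleanup at index-creation time, so teardown no longer needs `list_search_indexes()`. The helper's name and signature here are hypothetical (the PR's real helper builds a pymongo `SearchIndexModel`); `addCleanup` is the standard `unittest` mechanism:

```python
# Sketch of the suggested pattern: register the index drop as a cleanup
# when the index is created. The signature is hypothetical; the real
# helper wraps pymongo's SearchIndexModel.
def create_search_index(test_case, collection, index_name, definition):
    collection.create_search_index(index_name, definition)
    # unittest runs cleanups LIFO after each test, even if the test fails,
    # so the index is always dropped without scanning list_search_indexes().
    test_case.addCleanup(collection.drop_search_index, index_name)
```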

Collaborator Author:

I will try to fix it. But if I remove this line, some tests fail because the data from the previous test is still in the collection.

Collaborator Author:

The data was not cleaned because I hadn't defined available_apps. If I define it, I need to create the data in setUp(), which increases the test runtime.

Comment on lines 94 to 100
self.create_search_index(
Article,
"equals_headline_index",
{
"mappings": {
"dynamic": False,
"fields": {"headline": {"type": "token"}, "number": {"type": "number"}},
}
},
)
Collaborator:

Could we do the index creation/teardown in setUpClass()? (I would guess indexes aren't modified by any tests?)

def test_constant_score(self):
constant_score = SearchScoreOption({"constant": {"value": 10}})
qs = Article.objects.annotate(score=SearchExists(path="body", score=constant_score))
self.wait_for_assertion(lambda: self.assertCountEqual(qs.all(), [self.article]))
Collaborator:

While I like that wait_for_assertion is a relatively generic API, it really seems like a lot of boilerplate with lambda, all(), ... We may want to think about possibly providing some public test class mixin with assertion helpers for users (which we could also use in this file).

Collaborator Author:

I tried to do something like you mentioned and didn't find a solution, but I will try again.

Collaborator Author:

Well, I tried a delayed assert; it's not perfect but usable.

Collaborator:

More generally, what's the reason the query needs to be fetched this way? Executing the same query a few times in a row doesn't return the correct results until some time?

Collaborator Author:

🤔 At some point, Atlas will have synchronized the new data and the query will retrieve it, so we need to wait until the new objects are available.

Collaborator:

Is there any MongoDB documentation about this? I don't see any mention of having to retry in the example at https://www.mongodb.com/docs/atlas/atlas-search/tutorial/. It seems unbelievable from a usability perspective. How are querysets going to be used outside of tests? Do we need to document a special pattern? Is there no distinction between "no results" and "query hasn't synced yet"?

Collaborator Author:

🤔 I summon @Jibola to avoid saying something that isn't true. What I'm trying to say is that when a new index is created or data is added, there's a short delay before it gets indexed. If I run a query immediately after creating a new index, it retrieves nothing, but if I wait a second the value is pulled correctly. I don't know whether this indexing delay is documented, but I got the idea from langchain.

Maybe only the index creation needs time, but I don't know. 😬
Docarray does the same:
https://github.com/docarray/docarray/blob/main/tests/index/mongo_atlas/__init__.py#L32
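This sync lag is why the tests poll instead of asserting once. A minimal sketch of such a helper (the PR's actual wait_for_assertion may differ in signature) retries a failing assertion until a deadline:

```python
import time

def wait_for_assertion(assertion, timeout=5.0, interval=0.25):
    """Retry ``assertion`` (a zero-argument callable that raises
    AssertionError on failure) until it passes or ``timeout`` expires.

    Needed because Atlas Search indexes documents asynchronously: data
    written to a collection isn't immediately visible to $search queries.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # give up: re-raise the last assertion failure
            time.sleep(interval)
```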

Collaborator:

Whew, that makes a lot more sense than the previous theory! Depending on how long the waiting could take, we may want to consider having SchemaEditor.add_index() do the waiting, since Django migrations assume all operations run synchronously: a data migration that follows a schema migration assumes the previous operations have completed. (If not, it would be a caveat to document.) If we do have the schema editor wait, you could use it to create the indexes in tests. If not, I guess waiting after test index creation is the way to go.

Collaborator Author (@WaVEV, Jul 23, 2025):

Well, I found the docs. They say: "This means that data inserted into a MongoDB collection and indexed by Atlas Search will not be available immediately for $search queries."

Contributor:

> Whew, that makes a lot more sense than the previous theory! Depending on how long the waiting could take, we may want to consider having SchemaEditor.add_index() do the waiting, since Django migrations assume all operations run synchronously: a data migration that follows a schema migration assumes the previous operations have completed. (If not, it would be a caveat to document.) If we do have the schema editor wait, you could use it to create the indexes in tests. If not, I guess waiting after test index creation is the way to go.

Totally get the confusion here! It bamboozled me too the first time I ran into the problem.

I would say that having the SchemaEditor wait is not a bad idea! In practice, I don't see many scenarios (please inform me if otherwise!) where someone makes a migration and within 5 seconds begins iterating -- outside of tests -- but I would want it "flaggable" if at all possible.

@WaVEV WaVEV force-pushed the atlas-search-lookups branch 4 times, most recently from 456028d to 65f22e6 Compare July 22, 2025 05:16
@WaVEV WaVEV marked this pull request as ready for review July 24, 2025 19:39
@WaVEV WaVEV force-pushed the atlas-search-lookups branch from eb6eb07 to e7f4d22 Compare July 26, 2025 02:40
@WaVEV WaVEV force-pushed the atlas-search-lookups branch from 0fdb066 to eed2499 Compare August 5, 2025 00:25
@WaVEV WaVEV force-pushed the atlas-search-lookups branch from eed2499 to 99f6548 Compare August 5, 2025 13:35
@@ -71,7 +71,7 @@ def col(self, compiler, connection): # noqa: ARG001
# Add the column's collection's alias for columns in joined collections.
has_alias = self.alias and self.alias != compiler.collection_name
prefix = f"{self.alias}." if has_alias else ""
return f"${prefix}{self.target.column}"
return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
Collaborator:

I had trouble seeing that the $ prefix was the difference here. Maybe it could be rewritten so as not to repeat f"{prefix}{self.target.column}".

Atlas search
================

The database functions in the ``django_mongodb_backend.expressions.search``
Collaborator:

They should be importable from django_mongodb_backend.expressions, similar to django.db.models.functions.


The database functions in the ``django_mongodb_backend.expressions.search``
module ease the use of MongoDB Atlas search's `full text and vector search
engine <https://www.mongodb.com/docs/atlas/atlas-search/>`_.
Collaborator:

All links to mongodb.com should use intersphinx. I'll push some updates to get you started.

Comment on lines 51 to 54
``SearchEquals`` objects can be reused and combined with other search
expressions.

See :ref:`search-operations-combinable`
Collaborator:

I wonder if we could structure things so we don't need to repeat this boilerplate on every(?) expression.

Comment on lines 196 to 197


Collaborator:

No double blank lines in docs.

The ``path`` argument specifies the field to search and can be a string or a
:class:`~django.db.models.expressions.Col`. The ``query`` is the user input
Collaborator:

Col isn't a public API which is why building the docs gives "WARNING: py:class reference target not found: django.db.models.expressions.Col". I didn't spot any tests with path=<non-string>?

Collaborator Author:

🤔 Hmm, maybe this part is wrong. It can take columns, but they are referenced using F. I think I should change this to F or string (the string ends up being an F(string)).

)

def setUp(self):
super().setUp()
Collaborator:

Unless there's a consideration with inheritance, it's generally not necessary to call super().setUp().

Comment on lines +64 to +68
delayedAssertCountEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertCountEqual)
delayedAssertListEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertListEqual)
delayedAssertQuerySetEqual = _delayed_assertion(timeout=2)(
TransactionTestCase.assertQuerySetEqual
)
Collaborator:

Are the non-delayed versions ever used? Maybe it's better to overwrite the original names so we don't have to write "delayedXXXXX" everywhere. Or maybe the waiting could be done in setUp() after data is inserted? Unless some test inserts more data, essentially only the first test's waiting is needed, right?

Collaborator Author:

No, the non-delayed versions aren't used; all the checks are delayed.
Regarding the second question: right, any test that inserts data needs to wait. If the data is inserted in the class-level setup, we could wait only once. So if we want to get rid of those delayed asserts, we can wait in the creation part.



@skipUnlessDBFeature("supports_atlas_search")
class SearchEqualsTest(SearchUtilsMixin):
Collaborator:

I've tried to be consistent in this project about using "Tests" (plural) in the class names.

Collaborator Author:

🤔 Hmm, I didn't notice that. Will change.

Comment on lines 112 to 116
boost_score = SearchScoreOption({"boost": {"value": 3}})

qs = Article.objects.annotate(
score=SearchEquals(path="headline", value="cross", score=boost_score)
)
Collaborator:

I'd inline boost_score, or at least omit the blank line. (Only some tests are inconsistent.)

Contributor (@Jibola) left a comment:

Things look great, but I've gone through about half of the code (due to size). I will check the test code tomorrow!

@@ -71,7 +71,7 @@ def col(self, compiler, connection): # noqa: ARG001
# Add the column's collection's alias for columns in joined collections.
has_alias = self.alias and self.alias != compiler.collection_name
prefix = f"{self.alias}." if has_alias else ""
return f"${prefix}{self.target.column}"
return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
Contributor:

What is the difference between these two?

Collaborator Author:

Hard to spot, but there is a dollar sign at the beginning. Will refactor.

Comment on lines +288 to +289
all_replacements = {**search_replacements, **group_replacements}
self.search_pipeline = self._compound_searches_queries(search_replacements)
Contributor:

Why don't we pass the all_replacements into self._compound_searches_queries?

Collaborator Author:

We could, but then it would need to filter them to check that there is one search or vector search operator. It composes only the search operators that are stored in the replacements. (search_replacements has multiple uses; I can try to refactor it a bit.)

Comment on lines 253 to 264
if not has_search:
raise ValueError(
"Cannot combine two `$vectorSearch` operator. "
"If you need to combine them, consider restructuring your query logic or "
"running them as separate queries."
)
raise ValueError(
"Only one $search operation is allowed per query. "
f"Received {len(search_replacements)} search expressions. "
"To combine multiple search expressions, use either a CompoundExpression for "
"fine-grained control or CombinedSearchExpression for simple logical combinations."
)
Contributor:

I think these two ValueErrors need to be switched.

Collaborator Author:

🤔 The second is the case when has_vector_search is true but has_search is not. I think I should refactor this; it's a bit confusing, and the not at the beginning isn't helping.

Comment on lines +111 to +118
def _prepare_search_expressions_for_pipeline(self, expression, search_idx, replacements):
searches = {}
for sub_expr in self._get_search_expressions(expression):
if sub_expr not in replacements:
alias = f"__search_expr.search{next(search_idx)}"
replacements[sub_expr] = self._get_replace_expr(sub_expr, searches, alias)

def _prepare_search_query_for_aggregation_pipeline(self, order_by):
Contributor:

I know these are private functions, but can they get a docstring? Same with _get_replace_expr. It's quite complex code so it becomes harder to follow.

@@ -71,7 +71,7 @@ def col(self, compiler, connection): # noqa: ARG001
# Add the column's collection's alias for columns in joined collections.
has_alias = self.alias and self.alias != compiler.collection_name
prefix = f"{self.alias}." if has_alias else ""
return f"${prefix}{self.target.column}"
return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
Contributor:

Per Tim's comment, how about we just do this?

Suggested change
return f"{prefix}{self.target.column}" if as_path else f"${prefix}{self.target.column}"
path = "$" if as_path else ""
return f"{path}{prefix}{self.target.column}"


Args:
path: The document path to compare (as string or expression).
value: The exact value to match against.
score: Optional expression to modify the relevance score.
Contributor:

Can we add that this is an Optional[SearchScore] type?

Comment on lines +869 to +871
# Apply De Morgan's Laws.
operator = node.operator.negate() if negated else node.operator
negated = negated != (node.operator == Operator.NOT)
Contributor:

This logic is a little confusing because it requires some understanding of negate and the state changes.
I'll leave this as a comment here to be reviewed later.

What's an example of a NOT combinable?
I.e., how would I construct NOT (A AND B) or can this only be done via negate?
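To illustrate the question: with De Morgan push-down, NOT (A AND B) becomes (NOT A) OR (NOT B). A toy model of the idea follows; the tuple representation and this `negate` function are illustrative only, not the PR's actual Operator/negate() API:

```python
# Toy expression trees: ("AND", l, r), ("OR", l, r), ("NOT", x), or a leaf.
# negate() pushes NOT inward via De Morgan's laws, mirroring the idea in
# the compiler code quoted above.
def negate(node):
    kind = node[0]
    if kind == "NOT":
        return node[1]                                   # NOT NOT A -> A
    if kind == "AND":                                    # NOT (A AND B)
        return ("OR", negate(node[1]), negate(node[2]))  # -> NOT A OR NOT B
    if kind == "OR":                                     # NOT (A OR B)
        return ("AND", negate(node[1]), negate(node[2])) # -> NOT A AND NOT B
    return ("NOT", node)                                 # leaf: wrap it
```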

limit: Maximum number of matching documents to return.
num_candidates: Optional number of candidates to consider during search.
exact: Optional flag to enforce exact matching (default is approximate).
filter: Optional filter expression to narrow candidate documents.
Contributor:

We should clarify that we only take raw mql for this step. (Unless I have that incorrect and we resolve SearchExpressions too?)

Contributor (@Jibola) left a comment:

Overall PR looks great! I've got some minor corrections, but other than that, it is good to merge from me. Great work! 🚀

It also looks like there's a ReadTheDocs error:

/home/docs/checkouts/readthedocs.org/user_builds/django-mongodb-backend/checkouts/325/docs/source/ref/models/search.rst:654: WARNING: unknown document: 'atlas:atlas-search/scoring/' [ref.doc]

@@ -16,6 +16,12 @@ New features
- Added :class:`~.fields.PolymorphicEmbeddedModelField` and
:class:`~.fields.PolymorphicEmbeddedModelArrayField` for storing a model
instance or list of model instances that may be of more than one model class.
- Added support for MongoDB Atlas Search expressions, including
``SearchAutocomplete``, :class:`.SearchEquals`, ``SearchVector``, and others.
Contributor:

Suggested change
``SearchAutocomplete``, :class:`.SearchEquals`, ``SearchVector``, and others.
``SearchAutocomplete``, :class:`SearchEquals`, ``SearchVector``, and others.

Collaborator:

This suggestion isn't correct. Without the leading dot, the class won't be resolved properly.

def create_search_index(cls, model, index_name, definition, type="search"):
collection = cls._get_collection(model)
idx = SearchIndexModel(definition=definition, name=index_name, type=type)
collection.create_search_index(idx)
Contributor (@Jibola, Aug 6, 2025):

NIT: For the sake of testing, we can make this a blocking call and check for the index before continuing.
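A sketch of that blocking variant: poll the index status until Atlas reports it ready. The status fetcher is injected here for testability; in real code it would be `collection.list_search_indexes`, whose status documents include a `"queryable"` flag once the index can serve $search queries:

```python
import time

def wait_until_queryable(list_indexes, index_name, timeout=60.0, interval=1.0):
    """Block until the named search index reports queryable=True.

    ``list_indexes`` is a zero-argument callable returning index status
    documents (a stand-in for pymongo's collection.list_search_indexes).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for idx in list_indexes():
            if idx.get("name") == index_name and idx.get("queryable"):
                return
        time.sleep(interval)
    raise TimeoutError(f"search index {index_name!r} never became queryable")
```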

headline = models.CharField(max_length=100)
number = models.IntegerField()
body = models.TextField()
location = models.JSONField(null=True)
Contributor (@Jibola, Aug 6, 2025):

NIT: (Fun fact) This could be an EmbeddedModelField for a location object.

class Location(EmbeddedModel):
    type = models.CharField(max_length=16, default="Point")
    coordinates = ArrayField(models.FloatField(), max_size=2)

Comment on lines +597 to +601
like_docs = [
{"headline": self.article1.headline, "body": self.article1.body},
{"headline": self.article2.headline, "body": self.article2.body},
]
like_docs = [{"body": "NASA launches new satellite to explore the galaxy"}]
Contributor:

I think this gets overridden.

Also, should article2 pop up as a valid result?

Suggested change
like_docs = [
{"headline": self.article1.headline, "body": self.article1.body},
{"headline": self.article2.headline, "body": self.article2.body},
]
like_docs = [{"body": "NASA launches new satellite to explore the galaxy"}]
like_docs = [{"body": "NASA launches new satellite to explore the galaxy"}]

3 participants