Contributor

@dbernstein dbernstein commented Sep 29, 2025

Description

As part of converting the Overdrive scripts to Celery, we want to be able to efficiently download feed data and store it in a Redis set for downstream processing. This PR advances that goal by providing a means, via the Overdrive API, to pull a complete "page" of book data, with the option to include metadata or circulation data depending on the set that needs to be built. Long-running (i.e. more than a minute or so) Celery tasks require replacing one task with another to keep the flow of Celery tasks moving, ensuring that no one task commandeers a Celery worker for any significant length of time. Therefore, this method will be used to efficiently build a set of book info without making undue demands on the Redis set, while ensuring that each chunk of book data can be retried in case of an error.

Motivation and Context

https://ebce-lyrasis.atlassian.net/browse/PP-3015

How Has This Been Tested?

Unit tests added.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

codecov bot commented Sep 29, 2025

Codecov Report

❌ Patch coverage is 91.75258% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.41%. Comparing base (5fd6f06) to head (0a98bb6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...alace/manager/integration/license/overdrive/api.py | 91.95% | 2 Missing and 5 partials ⚠️ |
| src/palace/manager/util/http/async_http.py | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2767      +/-   ##
==========================================
- Coverage   92.41%   92.41%   -0.01%     
==========================================
  Files         449      449              
  Lines       42628    42720      +92     
  Branches     5955     5967      +12     
==========================================
+ Hits        39396    39480      +84     
- Misses       2120     2123       +3     
- Partials     1112     1117       +5     

@dbernstein dbernstein force-pushed the add_fetch_book_list_method_to_overdrive_api branch from 8149c70 to 2211d83 Compare September 29, 2025 21:04
@dbernstein dbernstein force-pushed the add_fetch_book_list_method_to_overdrive_api branch from 2211d83 to b38ab1d Compare September 30, 2025 17:20
@dbernstein dbernstein marked this pull request as ready for review September 30, 2025 20:55
@dbernstein dbernstein requested a review from a team September 30, 2025 20:55
Member

@jonathangreen jonathangreen left a comment

Made some comments here to address before this one can be merged.

I'm also wondering if you have looked into this comment in JIRA yet: https://ebce-lyrasis.atlassian.net/browse/PP-2183?focusedCommentId=27994. Before you build this whole redis set infrastructure, I'd like to make sure overdrive doesn't directly give us what we need.

        return availability_queue, next_link

    def _get_headers(self, auth_token: str) -> dict[str, str]:
        return {"Authorization": f"Bearer {auth_token}", "User-Agent": "Palace"}
Member

Why add User-Agent here? Adding this will override the more detailed user agent header set by the HTTP class.

Contributor Author

I can remove that. I was refactoring a bit and noticed that we were setting the user agent in another place in this file when formatting the auth header. I can remove if you think it's better.

Member

Looking through the diff and doing a quick `grep -ni "User-Agent" src/palace/manager/integration/license/overdrive/api.py`, I don't see anywhere we were previously setting User-Agent in this file. Are you sure you didn't pull this in from elsewhere?

Member

If you look in the HTTP code, we always set the user agent if it's not already set:

# Set a user-agent if not already present
headers = get_default_headers()
if (additional_headers := kwargs.get("headers")) is not None:
    headers.update(additional_headers)
kwargs["headers"] = headers

So this change would result in the Overdrive code having a user agent of Palace, rather than the Palace Manager/version header that we tell our integration partners that we send.
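
To make the override concrete, here is a minimal sketch of the merge behaviour described in this comment. `get_default_headers` below is a stand-in that returns a placeholder version string, and `merge_headers` is a hypothetical helper name; neither is the real Palace implementation.

```python
from typing import Optional


def get_default_headers() -> dict[str, str]:
    # Stand-in for the real helper; the version string is a placeholder.
    return {"User-Agent": "Palace Manager/x.y.z"}


def merge_headers(additional_headers: Optional[dict[str, str]]) -> dict[str, str]:
    # Same merge order as the quoted snippet: caller-supplied headers are
    # applied last, so they win whenever a key collides with a default.
    headers = get_default_headers()
    if additional_headers is not None:
        headers.update(additional_headers)
    return headers


# Passing User-Agent explicitly replaces the detailed default...
with_override = merge_headers({"Authorization": "Bearer token", "User-Agent": "Palace"})
# ...while leaving it out keeps the default in place.
without_override = merge_headers({"Authorization": "Bearer token"})
```

Dropping the `User-Agent` key from the auth headers is therefore enough to keep the default header intact.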

Member

This empty file should be removed

urls: deque[str] = deque()
pending_requests: list[asyncio.Task[httpx._models.Response]] = []
books: dict[str, Any] = {}
retried_requests: defaultdict[str, int] = defaultdict(int)
Member

This appears unused

Contributor Author

good catch - I factored that out when I moved to the palace async client.

        return list(books.values()), next

    def create_async_client(self, connections: int = 5) -> AsyncClient:
        return AsyncClient.for_web(
Member

Will this be used in the web context? It seems like the plan is to use this as part of a celery task.

Contributor Author

Good point. I didn't realize that was the function of this builder. Will fix.

except BadResponseException as e:
    if e.response.status_code == 404:
        self.log.warning(
            f"404 returned: {e.response.url}: ignoring..."
Member

Why ignore 404 errors? At the very least I think we need a comment explaining why

Contributor Author

My thought here was that when I was using the palace-tools to download feeds I was seeing 404 for some endpoints (availability I believe) which was causing the download to fail. But I suppose a 404 on a product list endpoint should not be ignored. I'll fix and make some comments.
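
One way the distinction drawn here could look is sketched below. `FakeBadResponse` is a hypothetical stand-in for `BadResponseException`, and `should_skip` is an illustrative helper name, not code from this PR.

```python
class FakeBadResponse(Exception):
    """Hypothetical stand-in carrying only what this sketch needs."""

    def __init__(self, status_code: int, url: str) -> None:
        super().__init__(f"{status_code} returned: {url}")
        self.status_code = status_code
        self.url = url


def should_skip(exc: FakeBadResponse, is_product_list: bool) -> bool:
    # A 404 for a single book's availability or metadata is tolerated so that
    # one missing title does not fail the whole page. A 404 on the product
    # list itself is treated as a real error and should propagate.
    return exc.status_code == 404 and not is_product_list


availability_404 = FakeBadResponse(404, "/availability")
product_list_404 = FakeBadResponse(404, "/products")
server_error = FakeBadResponse(500, "/availability")
```

A caller would log and continue when `should_skip` returns `True` and re-raise otherwise.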

Comment on lines +693 to +697
if not next:
    next_url = extractor_class.link(
        response.json(), rel_to_follow
    )
    next = BookInfoEndpoint(next_url) if next_url else None
Member

This seems like a race condition, since any request could have come in here, right? It's whatever request completed successfully first, which might not be the request that was issued first.

Contributor Author

Maybe I have it wrong, but I thought that subsequent requests (for availability and metadata) would only be issued after the product list GET request had been read. Therefore, I can safely assume that the next link would appear only in the first response returned. Am I mistaken there?

That said, I do see another problem (and perhaps this is what you are pointing at): I'm assuming that there will be a non-null next link in the product page, which I should not assume. I'll take a closer look.
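
A minimal sketch of reading the next link from the product list response alone, before any availability or metadata requests are issued, so the result does not depend on which request completes first. The page shape (`{"links": {"next": {"href": ...}}}`) and the `extract_next_link` helper are assumptions for illustration, not the extractor used in this PR.

```python
from typing import Any, Optional


def extract_next_link(product_page: dict[str, Any]) -> Optional[str]:
    # Assumed page shape: {"links": {"next": {"href": "..."}}}.
    # Returns None when the page carries no next link (e.g. the last page),
    # so callers never have to assume the link is present.
    next_link = product_page.get("links", {}).get("next")
    return next_link.get("href") if next_link else None


middle_page = {
    "links": {"next": {"href": "https://example.org/products?offset=50"}},
    "products": [],
}
last_page = {"links": {}, "products": []}

middle_next = extract_next_link(middle_page)
last_next = extract_next_link(last_page)
```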

base_url
)
)
id = product["id"].lower()
Member

Minor: Really shouldn't shadow the built-in Python id function here. It's allowed, but it really should be avoided.
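
A small illustration of why shadowing `id` can bite later in the same scope. The `product` dict and the variable name `product_id` are made up for the example.

```python
product = {"id": "ABC-123"}


def with_shadowing() -> str:
    id = product["id"].lower()  # shadows the built-in id() inside this function
    try:
        id(product)  # this now tries to call a str, not the built-in
    except TypeError as exc:
        return type(exc).__name__
    return "no error"


def without_shadowing() -> tuple[str, int]:
    product_id = product["id"].lower()
    return product_id, id(product)  # the built-in id() is still reachable


shadowed_outcome = with_shadowing()
clean_id, object_identity = without_shadowing()
```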

Comment on lines +667 to +668
client.headers.update(self._get_headers(self._client_oauth_token))
client.base_url = URL(base_url)
Member

Why do this instead of passing them to the client constructor?

Contributor Author

Good idea.

Comment on lines +369 to +380
@property
def headers(self) -> httpx.Headers:
    return self._httpx_client.headers

@property
def base_url(self) -> URL:
    return self._httpx_client.base_url

@base_url.setter
def base_url(self, base_url: URL | str) -> None:
    self._httpx_client.base_url = base_url

Member

Minor: If there isn't a compelling argument why these are needed, I'd rather they be set via the constructor and have these properties removed so we don't expose the internal httpx client state.

next: BookInfoEndpoint | None = None

while pending_requests:
    done, pending = await asyncio.wait(
Member

I needed to use asyncio.wait in the code in palace tools because the queue there was of variable length. Here, if I understand correctly what's happening, you know exactly how many requests you will make after the first request has returned. In this case it seems better to use either asyncio.gather or an asyncio TaskGroup.

That gives you deterministic ordering for the requests, so you don't have to rely on URL matching in order to process the responses.
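
A minimal sketch of the ordering guarantee, assuming a fixed list of requests known up front. `fake_fetch` stands in for the real HTTP call and the URLs are placeholders; only `asyncio.gather`, `asyncio.sleep`, and `asyncio.run` are real APIs here.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    # Stand-in for an HTTP GET; the sleep simulates varying response times.
    await asyncio.sleep(delay)
    return f"response for {url}"


async def fetch_all(requests: list[tuple[str, float]]) -> list[str]:
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which one finishes first.
    return await asyncio.gather(*(fake_fetch(url, delay) for url, delay in requests))


# The first request is the slowest, yet its result still comes back first.
requests = [("/a/metadata", 0.03), ("/a/availability", 0.01), ("/b/metadata", 0.02)]
results = asyncio.run(fetch_all(requests))

# Results can be paired with their requests positionally, with no URL matching.
paired = list(zip((url for url, _ in requests), results))
```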

Contributor Author

I'll look into that.

Contributor Author

I will take a look at the asyncio docs. Thanks for the explanation.

2 participants