-
Notifications
You must be signed in to change notification settings - Fork 8
Add fetch book list method to overdrive api #2767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2767 +/- ##
==========================================
- Coverage 92.41% 92.41% -0.01%
==========================================
Files 449 449
Lines 42628 42720 +92
Branches 5955 5967 +12
==========================================
+ Hits 39396 39480 +84
- Misses 2120 2123 +3
- Partials 1112 1117 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
8149c70
to
2211d83
Compare
… with the option of including metadata and/or availability using an async http client for optimal performance.
2211d83
to
b38ab1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some comments here to address before this one can be merged.
I'm also wondering if you have looked into this comment in JIRA yet: https://ebce-lyrasis.atlassian.net/browse/PP-2183?focusedCommentId=27994. Before you build this whole redis set infrastructure, I'd like to make sure overdrive doesn't directly give us what we need.
return availability_queue, next_link | ||
|
||
def _get_headers(self, auth_token: str) -> dict[str, str]: | ||
return {"Authorization": f"Bearer {auth_token}", "User-Agent": "Palace"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add User-Agent
here? Adding this will override the more detailed user agent header set by the HTTP
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove that. I was refactoring a bit and noticed that we were setting the user agent in another place in this file when formatting the auth header. I can remove if you think it's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the diff and doing a quick grep grep -ni "User-Agent" src/palace/manager/integration/license/overdrive/api.py
I don't see anywhere we were previously setting User-Agent
in this file. Are you sure you didn't pull this in from elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look in the HTTP code, we always set the user agent, if its not already set:
circulation/src/palace/manager/util/http/http.py
Lines 232 to 236 in 3798ec8
# Set a user-agent if not already present | |
headers = get_default_headers() | |
if (additional_headers := kwargs.get("headers")) is not None: | |
headers.update(additional_headers) | |
kwargs["headers"] = headers |
So this change would result in the overdrive code having a user agent of Palace
, rather then the Palace Manager/version
header that we tell our integration partners that we send.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This empty file should be removed
urls: deque[str] = deque() | ||
pending_requests: list[asyncio.Task[httpx._models.Response]] = [] | ||
books: dict[str, Any] = {} | ||
retried_requests: defaultdict[str, int] = defaultdict(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch - I factored that out when I moved to the palace async client.
return list(books.values()), next | ||
|
||
def create_async_client(self, connections: int = 5) -> AsyncClient: | ||
return AsyncClient.for_web( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be used in the web context? It seems like the plan is to use this as part of a celery task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I didn't realize that was the function of this builder. Will fix.
except BadResponseException as e: | ||
if e.response.status_code == 404: | ||
self.log.warning( | ||
f"404 returned: {e.response.url}: ignoring..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ignore 404 errors? At the very least I think we need a comment explaining why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought here was that when I was using the palace-tools to download feeds I was seeing 404 for some endpoints (availability I believe) which was causing the download to fail. But I suppose a 404 on a product list endpoint should not be ignored. I'll fix and make some comments.
if not next: | ||
next_url = extractor_class.link( | ||
response.json(), rel_to_follow | ||
) | ||
next = BookInfoEndpoint(next_url) if next_url else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a race condition, since any request could have come in here right, its whatever request completed successfully first, which might not be the request that was issued first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I have it wrong, but I thought that subsequent requests (for availability and metadata) would only be issued after the product list GET request had been read. Therefore, I can safely assume that if the next link would appear only in the first URI returned. Am I mistaken there?
That said, I do see another problem though (and perhaps this is what you are pointing at): I'm assuming that there will be a non null next link in the product page which I should not assume. I'll take a closer look.
base_url | ||
) | ||
) | ||
id = product["id"].lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Really shouldn't override the built-in python id
keyword here. Its allowed, but it really should be avoided.
client.headers.update(self._get_headers(self._client_oauth_token)) | ||
client.base_url = URL(base_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this instead of passing them to the client constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
@property | ||
def headers(self) -> httpx.Headers: | ||
return self._httpx_client.headers | ||
|
||
@property | ||
def base_url(self) -> URL: | ||
return self._httpx_client.base_url | ||
|
||
@base_url.setter | ||
def base_url(self, base_url: URL | str) -> None: | ||
self._httpx_client.base_url = base_url | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: If there isn't a compelling argument why these are needed, I'd rather they be set via the constructor and have these properties removed so we don't expose the internal httpx client state.
next: BookInfoEndpoint | None = None | ||
|
||
while pending_requests: | ||
done, pending = await asyncio.wait( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to use asyncio.wait
in the code in palace tools, because the length of the queue there was variable length. Here, if I understand correctly whats happening. You know exactly how many requests you will make after the first request has returned. In this case it seems better to use either asyncio.gather
or an asyncio TaskGroup
.
That gives you deterministic ordering for the requests, so you don't have to rely on URL matching in order to process the responses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look at the asyncio docs. Thanks for the explanation.
Description
In order to process of converting the overdrive scripts to celery we want to be able to efficiently download feed data and store it in a redis set for downstream processing. This PR advances that end by providing a means via the Overdrive API to pull a complete "page" of book data, with the option to include metadata or circulation data depending on the set the needs to be built. Long running (ie more than a minute or so) celery tasks require replacing one task with another in order to keep the flow of celery tasks moving by ensuring that no one task commandeers a celery work for any significant length of time. Therefore, this method will be used efficiently build set of book info while not making undue demands on the redis set, all the while ensuring that each chunk of book data can be retried in case of an error.
Motivation and Context
https://ebce-lyrasis.atlassian.net/browse/PP-3015
How Has This Been Tested?
Unit tests added.
Checklist