Added method to download a list of OpenAlex entities #68
Conversation
@J535D165 any update on this? Thanks!
Thanks for this @romain894. Don't hesitate to mark it as Ready for review to catch my attention. You shared an interesting approach to a question that is often asked. There are three interesting parts to that question: how do I collect a big list of identifiers (like OA IDs or DOIs), why is the number of results less than the number of requested IDs, and why is the order not guaranteed?

To address the first question, I started working on logical expressions for filters in PyAlex. One example is the `filter_or` method. Let's make a list of random IDs:

```python
# get a list of 100 random works
sample_ids = [
    work["id"].split("/")[-1] for work in Works().sample(100).get(per_page=100)
]
```

Now get the works in batches of 10 (requires Python 3.12 or later):

```python
from itertools import batched

works = []
for batch in batched(sample_ids, 10):
    works.extend(Works().filter_or(openalex_id=list(batch)).get(per_page=10))
```

This example can easily be extended with other identifiers, rate-limit balancing, and progress monitoring. Even an async implementation could be added in the future (#54).

Ordering is an interesting question. Most users I have worked with benefit more from processing the data first and then (left) joining it with their original data(frames). In my experience this is nearly always the most efficient approach. On the other hand, if the user wants ordering, it's only two extra lines of code:

```python
# get the works in batches of 10
works = []
for batch in batched(sample_ids, 10):
    batch_works = Works().filter_or(openalex_id=list(batch)).get(per_page=10)
    # in case you want the works in the same order as sample_ids;
    # key on the short ID so it matches the values in sample_ids
    map_ids = {work["id"].split("/")[-1]: work for work in batch_works}
    works.extend(map_ids.get(id_, None) for id_ in batch)
```

This brings me to a dilemma: we want to keep PyAlex a very lightweight wrapper for OpenAlex, and we want to make it as usable as possible. Besides that, I think many users don't benefit from ordering the results in this way. In that case, is it worth adding this implementation, which doesn't directly correspond to the OpenAlex API? Should we ask OpenAlex to implement this on their side? Should we add some utility functions to PyAlex? Or should we add good examples on this (my preference at the moment of writing)? Let me know what you think @romain894 (cc @PeterLombaers)
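For readers who prefer the joining approach mentioned above, here is a minimal sketch of what it could look like with pandas (an assumption on my part; the column names and the use of `display_name` are illustrative, and `sample_ids`/`works` are the variables from the batching example):

```python
import pandas as pd

# original data: one row per requested ID (plus any user columns)
df_ids = pd.DataFrame({"openalex_id": sample_ids})

# fetched data: whatever came back from OpenAlex, in arbitrary order
df_works = pd.DataFrame(
    [
        {"openalex_id": work["id"].split("/")[-1], "title": work["display_name"]}
        for work in works
    ]
)

# a left join preserves the original order and yields NaN for missing IDs
df = df_ids.merge(df_works, on="openalex_id", how="left")
```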
Thanks for the contribution @romain894, this is definitely one of the most frequent use cases, so we should include your PR or add clear examples to the README. I think I also prefer adding more examples over this implementation, though. If I want the data corresponding to a big list of identifiers, it would make sense to first create all the requests and then send them asynchronously to OpenAlex, instead of one after the other. To facilitate your use case, the async use case, and possibly other use cases, I think it's better to just have clear examples than to try to implement them all in PyAlex itself. The README could definitely use an update on the examples, though, since right now there is no example using `filter_or`.
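To illustrate the "create all requests first, then send them" idea, here is a minimal sketch that wraps the synchronous PyAlex calls with `asyncio.to_thread` (PyAlex has no async API at the time of writing, so this only runs the blocking requests concurrently; it also does no rate limiting, which a real version would need):

```python
import asyncio
from itertools import batched  # Python 3.12+

from pyalex import Works


def fetch_batch(batch):
    # one blocking request per batch of up to 10 IDs
    return Works().filter_or(openalex_id=list(batch)).get(per_page=10)


async def fetch_all(ids):
    # create all requests up front, then run them concurrently in threads
    tasks = [asyncio.to_thread(fetch_batch, batch) for batch in batched(ids, 10)]
    results = await asyncio.gather(*tasks)
    return [work for batch_works in results for work in batch_works]


works = asyncio.run(fetch_all(sample_ids))
```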
I finished implementing the method and its tests, and added documentation in the README. It works for any type of ID (openalex_id, doi, issn, orcid, and ror). I didn't find other IDs that can be used in OpenAlex, but please point it out if I missed some. I set a different (lower) batch-size limit for IDs other than OpenAlex IDs, as this blog post from OurResearch sets the limit at 50. Also, from my past experience, batch sizes above 70 did not work due to the HTTP request size limit. This contradicts the official documentation, so I'm not sure what we should do. I believe this functionality will be valuable to users: it avoids copy-pasting the code snippet, which eases programming, especially for people less familiar with Python, and reduces possible errors on the user side.
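To make the batch-size point concrete, a sketch of fetching by DOI with the smaller limit might look like the following (assuming `filter_or` accepts a `doi` keyword, which I have not verified here; the DOIs are just examples):

```python
from itertools import batched  # Python 3.12+

from pyalex import Works

dois = ["10.1038/s41586-020-2649-2", "10.7717/peerj.4375"]  # example DOIs

works = []
# batches of 50, per the OurResearch blog post; batch sizes above ~70
# have reportedly failed due to the HTTP request size limit
for batch in batched(dois, 50):
    works.extend(Works().filter_or(doi=list(batch)).get(per_page=50))
```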
I added a method to download a list of OpenAlex entities from a list of IDs. It fetches the IDs 100 at a time, returns None for entities that are not found, and optionally orders the list of entities according to the list of IDs.

It's a draft, as the documentation is not yet added and I would like feedback on this implementation. If it's all right, I'll also add a method to get a list of Works from DOIs (I will check whether it's possible for other entities, e.g. with ROR for institutions). I'm using `tqdm` for the progress bar, with an option in the config to disable it; this can be changed too.

As a side note, the tests are failing due to a bug in OpenAlex that I have run into before: the counts of works differ between `["meta"]["count"]` and `.count()`.
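For a sense of the described behaviour without reading the diff, here is a rough sketch; the method name `get_from_ids` and its signature are hypothetical, not the PR's actual API:

```python
from itertools import batched  # Python 3.12+

from pyalex import Works
from tqdm import tqdm


def get_from_ids(ids, ordered=True, batch_size=100):
    # hypothetical helper mirroring the PR description, not the PR's code
    found = {}
    for batch in tqdm(list(batched(ids, batch_size))):
        works = Works().filter_or(openalex_id=list(batch)).get(per_page=batch_size)
        for work in works:
            found[work["id"].split("/")[-1]] = work
    if ordered:
        # None for IDs that OpenAlex did not return
        return [found.get(id_) for id_ in ids]
    return list(found.values())
```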