Added method to download a list of OpenAlex entities #68
Conversation
@J535D165 any update on this? Thanks!
Thanks for this @romain894. Don't hesitate to mark it as Ready for review to catch my attention. You shared an interesting approach to a question that is often asked. There are three interesting parts to that question: how do I collect a big list of identifiers (like OA IDs or DOIs), why is the number of results less than the number of requested IDs, and why is the order not guaranteed?

To address the first question, I started working on logical expressions for filters in PyAlex. One example is the `filter_or` method. Let's make a list of random IDs:

```python
# get a list of 100 random works
sample_ids = [
    work["id"].split("/")[-1] for work in Works().sample(100).get(per_page=100)
]
```

Now get the works in batches of 10 (requires Python 3.12 or later):

```python
from itertools import batched

works = []
for batch in batched(sample_ids, 10):
    works.extend(Works().filter_or(openalex_id=list(batch)).get(per_page=10))
```

This example can easily be extended with other identifiers, rate-limit balancing, and progress monitoring. Even an async implementation could be added in the future (#54).

Ordering is an interesting question. Most users I have worked with benefit more from processing the data first and then (left) joining it with their original data(frames). In my experience this is nearly always the most efficient approach. On the other hand, if the user wants ordering, it's only two extra lines of code:

```python
# get the works in batches of 10
works = []
for batch in batched(sample_ids, 10):
    batch_works = Works().filter_or(openalex_id=list(batch)).get(per_page=10)
    # in case you want the works in the same order as sample_ids;
    # key on the short ID so it matches the values in sample_ids
    map_ids = {work["id"].split("/")[-1]: work for work in batch_works}
    works.extend(map_ids.get(id_, None) for id_ in batch)
```

This brings me to a dilemma: we want to keep PyAlex a very lightweight wrapper for OpenAlex, and we want to make it as usable as possible. Besides that, I think many users don't benefit from ordering the results in this way. In that case, is it worth adding this implementation, which doesn't directly correspond to the OpenAlex API? Should we ask OpenAlex to implement this on their side? Should we add some utility functions to PyAlex? Or should we add good examples on this (my preference at the moment of writing)? Let me know what you think @romain894 (cc @PeterLombaers)
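For readers who prefer the joining approach mentioned above, here is a minimal sketch of what it could look like with pandas (an assumption on my part; the column names and the use of `display_name` are illustrative, and `sample_ids`/`works` are the variables from the batching example):

```python
import pandas as pd

# original data: one row per requested ID (plus any user columns)
df_ids = pd.DataFrame({"openalex_id": sample_ids})

# fetched data: whatever came back from OpenAlex, in arbitrary order
df_works = pd.DataFrame(
    [
        {"openalex_id": work["id"].split("/")[-1], "title": work["display_name"]}
        for work in works
    ]
)

# a left join preserves the original order and yields NaN for missing IDs
df = df_ids.merge(df_works, on="openalex_id", how="left")
```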
Thanks for the contribution @romain894, this is definitely one of the most frequent use cases, so we should include your PR or add clear examples to the README. I think I also prefer adding more examples over this implementation, though. If I want the data corresponding to a big list of identifiers, it would make sense to first create all the requests and then send them asynchronously to OpenAlex, instead of one after the other. To facilitate your use case, the async use case, and possibly other use cases, I think it's better to just have clear examples than to try to implement them all in PyAlex itself. The README could definitely use an update on the examples, though, since right now there is no example using `filter_or`.
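To illustrate the "create all requests first, then send them" idea, here is a minimal sketch that wraps the synchronous PyAlex calls with `asyncio.to_thread` (PyAlex has no async API at the time of writing, so this only runs the blocking requests concurrently; it also does no rate limiting, which a real version would need):

```python
import asyncio
from itertools import batched  # Python 3.12+

from pyalex import Works


def fetch_batch(batch):
    # one blocking request per batch of up to 10 IDs
    return Works().filter_or(openalex_id=list(batch)).get(per_page=10)


async def fetch_all(ids):
    # create all requests up front, then run them concurrently in threads
    tasks = [asyncio.to_thread(fetch_batch, batch) for batch in batched(ids, 10)]
    results = await asyncio.gather(*tasks)
    return [work for batch_works in results for work in batch_works]


works = asyncio.run(fetch_all(sample_ids))
```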
I finished implementing the method and its tests, and added documentation in the README. It works for any type of ID (openalex_id, doi, issn, orcid, and ror). I didn't find other IDs that can be used in OpenAlex, but please point it out if I missed some. I set a different (lower) batch-size limit for IDs other than OpenAlex IDs, as this blog post from OurResearch sets the limit at 50. Also, from my past experience, batch sizes above 70 did not work due to the HTTP request size limit. This contradicts the official documentation, so I'm not sure what we should do. I believe this functionality will be valuable to users: it avoids copy-pasting the code snippet, which eases programming, especially for people less familiar with Python, and reduces possible errors on the user side.
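To make the batch-size point concrete, a sketch of fetching by DOI with the smaller limit might look like the following (assuming `filter_or` accepts a `doi` keyword, which I have not verified here; the DOIs are just examples):

```python
from itertools import batched  # Python 3.12+

from pyalex import Works

dois = ["10.1038/s41586-020-2649-2", "10.7717/peerj.4375"]  # example DOIs

works = []
# batches of 50, per the OurResearch blog post; batch sizes above ~70
# have reportedly failed due to the HTTP request size limit
for batch in batched(dois, 50):
    works.extend(Works().filter_or(doi=list(batch)).get(per_page=50))
```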
I added a method to download a list of OpenAlex entities from a list of IDs. It fetches the IDs 100 at a time, returns None for entities that are not found, and optionally orders the list of entities according to the list of IDs.

It's a draft, as the documentation is not yet added and I would like feedback on this implementation. If it's all right, I'll also add a method to get a list of Works from DOIs (I will check whether it's possible for other entities, e.g. with ROR for institutions). I'm using `tqdm` for the progress bar, with an option in the config to disable it; this can be changed too.

As a side note, the tests are failing due to a bug in OpenAlex that I have run into before: the counts of works differ between `["meta"]["count"]` and `.count()`.
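For a sense of the described behaviour without reading the diff, here is a rough sketch; the method name `get_from_ids` and its signature are hypothetical, not the PR's actual API:

```python
from itertools import batched  # Python 3.12+

from pyalex import Works
from tqdm import tqdm


def get_from_ids(ids, ordered=True, batch_size=100):
    # hypothetical helper mirroring the PR description, not the PR's code
    found = {}
    for batch in tqdm(list(batched(ids, batch_size))):
        works = Works().filter_or(openalex_id=list(batch)).get(per_page=batch_size)
        for work in works:
            found[work["id"].split("/")[-1]] = work
    if ordered:
        # None for IDs that OpenAlex did not return
        return [found.get(id_) for id_ in ids]
    return list(found.values())
```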