Conversation

@raffaem
Contributor

@raffaem raffaem commented May 28, 2025

By default `get` returns only the first 25 entries
@PeterLombaers
Collaborator

Hi @raffaem! Thanks for noting that the example doesn't return all works of an author, and for giving a way to collect them all. I want to keep the examples section a bit more streamlined though, and avoid every example giving a different handcrafted way to get all the pages from a paginator. The easiest way would be to simply add a note, something like: 'this example gives the first 25 works of an author; to get all works, see the section on pagination.' Alternatively, you could update the example, but make it a bit cleaner. There is no need to write a function in an example; you can get all the works using 2 lines of code. See for example this post.

@raffaem
Contributor Author

raffaem commented Jun 12, 2025

I don't understand what those two lines would be.

Something like this:

for batch in batched(sample_ids, 10):
    works.extend(Works().filter_or(openalex_id=list(batch)).get(per_page=10))

would still download just the first page.

How do you check the last page was reached?

@PeterLombaers
Collaborator

In the example, every batch will contain 10 records. For every step in the for-loop, you will make a request to OpenAlex for the 10 records in the batch and you will get a response with a page size of 10. So the page will contain all the records from the batch. So there is really no such thing as a 'last page' in the example. There is only the last batch. You can try adding print statements to see what is happening, e.g.:

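from itertools import batched  # requires Python 3.12+

from pyalex import Works

# sample_ids is assumed to be an existing list of work identifiers.
works = []
for idx, batch in enumerate(batched(sample_ids, 10)):
    print(f"Batch index: {idx}")
    print(f"Getting identifiers: {batch}")
    # One request per batch; with per_page=10 the page holds the whole batch.
    page = Works().filter_or(openalex_id=list(batch)).get(per_page=10)
    page_identifiers = [record["id"] for record in page]
    print(f"Page contains identifiers: {page_identifiers}")
    works.extend(page)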

Does this make sense to you?
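
To see the batching on its own, without any requests to OpenAlex, here is a minimal sketch. The identifiers are made up for illustration, and the fallback `batched` helper is an assumption for Python versions before 3.12, where `itertools.batched` is not available.

```python
from itertools import islice

try:
    from itertools import batched  # available from Python 3.12
except ImportError:
    def batched(iterable, n):
        """Fallback: yield successive tuples of at most n items."""
        it = iter(iterable)
        while chunk := tuple(islice(it, n)):
            yield chunk

# Hypothetical work identifiers, for illustration only.
sample_ids = [f"W{i}" for i in range(1, 26)]

# 25 identifiers in batches of 10: two full batches and one shorter last batch.
batch_sizes = [len(batch) for batch in batched(sample_ids, 10)]
print(batch_sizes)
```

Each batch has at most 10 identifiers, so a single request with a page size of 10 can return all of them, and only the last batch may be shorter.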

@raffaem
Contributor Author

raffaem commented Jun 13, 2025

I don't understand why we are filtering by the works' OpenAlex IDs when we want to filter by the author's OpenAlex ID, and we don't know in advance how many works that author has published.

@raffaem
Contributor Author

raffaem commented Jun 13, 2025

How would you rewrite my download_author_works function? It takes as input the OpenAlex ID of an author.

@PeterLombaers
Collaborator

Oh sorry, you're totally right. I got confused with a different question and pointed you to the wrong place. What I gave only works if you already have a list of work identifiers.

Finding all the works from an author can be done using the basic example from the pagination section:

from pyalex import Works

works = []
pager = Works().filter(author={"id": "A5083411784"}).paginate(per_page=200)
for page in pager:
    works.extend(page)

If you want access to the index of the current page for log statements, you can wrap the pager in enumerate. If you only want the first n pages, you can wrap the pager in itertools.islice.
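
As a rough sketch of both wrappers, here is a version that runs without network access. `fake_pager` is a hypothetical stand-in that yields lists of records the way the pager above yields pages; with pyalex you would pass the real pager from the example instead.

```python
from itertools import islice


def fake_pager(n_pages=5, per_page=3):
    """Hypothetical stand-in for a paginator: yields one list of records per page."""
    for p in range(n_pages):
        yield [{"id": f"W{p * per_page + i}"} for i in range(per_page)]


works = []
# islice stops after the first 2 pages; enumerate supplies the page index for logging.
for idx, page in enumerate(islice(fake_pager(), 2), start=1):
    print(f"Page {idx}: {len(page)} records")
    works.extend(page)
```

Because islice only pulls what it needs from the iterator, no pages beyond the first two are requested from the underlying pager.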

@J535D165
Owner

I like the simplicity of your example @PeterLombaers. I propose to use that example.

@raffaem
Contributor Author

raffaem commented Jun 16, 2025

Yes

@raffaem raffaem closed this Jun 16, 2025
@raffaem
Contributor Author

raffaem commented Jun 18, 2025

Thanks, that was exactly what I needed!
