Conversation

@sbarhin sbarhin commented Oct 30, 2025

Fixes

Description

This PR automates data fetching from Museums Victoria, as discussed in issue #215. The implementation follows the established patterns of the existing fetch scripts.
The purpose of the script is to fetch all records from the Museums Victoria API and save the response fields needed for the next (processing) phase.

  • Fetches data for all record types (article, item, specimen, species) from the Museums Victoria API
  • Prepares the relevant response fields and saves them to a CSV file under the data/2025Q4/1-fetch directory
  • Next actions will be to process and report the data once the fetching script is approved by reviewers

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sbarhin sbarhin requested review from a team as code owners October 30, 2025 15:49
@sbarhin sbarhin requested review from TimidRobot and possumbilities and removed request for a team October 30, 2025 15:49
@sbarhin
Author

sbarhin commented Oct 30, 2025

@oree-xx I have opened a new pull request. I guess that is much better

@oree-xx
Contributor

oree-xx commented Oct 30, 2025

@sbarhin ohh okay great.

@TimidRobot TimidRobot self-assigned this Oct 31, 2025
@TimidRobot TimidRobot changed the title from "Add musuems_fetch.py" to "Add Museum Victoria fetch" Oct 31, 2025
@sbarhin
Author

sbarhin commented Oct 31, 2025

@TimidRobot There are several file changes in my PR, this is due to pulling from the main branch where I believe you merged a certain PR. These changes have taken effect in my branch hence those file changes in my PR.

@TimidRobot
Member

@TimidRobot There are several file changes in my PR, this is due to pulling from the main branch where I believe you merged a certain PR. These changes have taken effect in my branch hence those file changes in my PR.

Please revisit the documentation on keeping a branch/fork synchronized with upstream and on resolving merge conflicts. This PR won't be reviewed or merged while these issues are present.

@sbarhin
Author

sbarhin commented Oct 31, 2025

@TimidRobot There are several file changes in my PR, this is due to pulling from the main branch where I believe you merged a certain PR. These changes have taken effect in my branch hence those file changes in my PR.

Please revisit the documentation on keeping a branch/fork synchronized with upstream and on resolving merge conflicts. This PR won't be reviewed or merged while these issues are present.

I will do that please. Thank you

@sbarhin
Author

sbarhin commented Oct 31, 2025

@TimidRobot I believe we are good now

@TimidRobot TimidRobot changed the title from "Add Museum Victoria fetch" to "Add Museums Victoria fetch" Nov 1, 2025
@sbarhin
Author

sbarhin commented Nov 2, 2025

The script takes too long to run (I canceled after 10 minutes).

Please add a --limit option so that it can be developed and tested without taking the full time.

@TimidRobot Please, should the --limit apply to individual record types or to the total number of records altogether?

@TimidRobot
Member

The script takes too long to run (I canceled after 10 minutes).
Please add a --limit option so that it can be developed and tested without taking the full time.

@TimidRobot Please, should the --limit apply to individual record types or to the total number of records altogether?

@sbarhin Which implementation will satisfy the stated goal?

@TimidRobot
Member

@sbarhin force-pushed the museums branch from b7d41b6 to 1c644a4 5 days ago

@sbarhin force-pushed the museums branch from 79e6ef5 to 6ee4793 3 days ago

@sbarhin force-pushed the museums branch from aa8774c to 6ee4793

@sbarhin force pushes are generally a worst practice and should be avoided when unnecessary.

@sbarhin
Author

sbarhin commented Nov 5, 2025

The script takes too long to run (I canceled after 10 minutes).
Please add a --limit option so that it can be developed and tested without taking the full time.

@TimidRobot Please, should the --limit apply to individual record types or to the total number of records altogether?

@sbarhin Which implementation will satisfy the stated goal?

I think per individual record type will do
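
A minimal sketch of a per-record-type --limit, assuming an argparse-based command line. The names parse_arguments and records_to_fetch are hypothetical, the totals in the usage note are made up, and the record types are taken from the PR description; this is not the code in the PR:

```python
import argparse

# Record types listed in the PR description.
RECORD_TYPES = ["article", "item", "specimen", "species"]


def parse_arguments(argv=None):
    """Parse command-line arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Fetch Museums Victoria records"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="maximum number of records to fetch per record type",
    )
    return parser.parse_args(argv)


def records_to_fetch(limit, totals):
    """Apply the same cap to every record type independently."""
    return {
        record_type: total if limit is None else min(limit, total)
        for record_type, total in totals.items()
    }


if __name__ == "__main__":
    args = parse_arguments()
    # Made-up totals, only to show the effect of the cap.
    example_totals = {record_type: 1000 for record_type in RECORD_TYPES}
    print(records_to_fetch(args.limit, example_totals))
```

With a cap per type, a run such as --limit 50 touches every record type while staying short, which is the stated goal of developing and testing without the full run time. A cap on the overall total could be exhausted by the first record type alone.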

@sbarhin
Author

sbarhin commented Nov 6, 2025

@TimidRobot Please what are the next steps? Seems everything is resolved now

licence_data = media_item.get("licence")

# COUNTING THE UNIQUE LICENCE TYPES
license_short_name = licence_data.get("shortName")
Member

Unfortunately, the license short name is not reliable (does not include version number).
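
A small illustration of why counting by short name can merge distinct licences. Only the "licence" and "shortName" keys appear in the snippet above; the "name" field and all sample values below are assumptions made for illustration and are not confirmed by the API documentation or this thread:

```python
from collections import Counter

# Hypothetical sample data. The "name" field and these values are
# assumptions; only "licence" and "shortName" appear in the PR snippet.
media_items = [
    {"licence": {"shortName": "CC BY",
                 "name": "Creative Commons Attribution 4.0 International"}},
    {"licence": {"shortName": "CC BY",
                 "name": "Creative Commons Attribution 3.0 Australia"}},
    {"licence": None},  # media item without licence data
]


def count_licences(items, key):
    """Count licences by the given field, skipping missing values."""
    counts = Counter()
    for item in items:
        licence = item.get("licence") or {}
        value = licence.get(key)
        if value:
            counts[value] += 1
    return counts


by_short_name = count_licences(media_items, "shortName")
by_full_name = count_licences(media_items, "name")
```

In this made-up data, by_short_name has a single "CC BY" entry while by_full_name keeps the two versions apart. Which field actually carries the version would need to be checked against real API responses.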

@sbarhin
Author

sbarhin commented Nov 6, 2025

@TimidRobot Please I have made the necessary changes

@sbarhin
Author

sbarhin commented Nov 14, 2025

@TimidRobot Hello there

records_processed = 0
current_page = 1
total_pages = None
per_page = min(PER_PAGE, args.limit) if args.limit else PER_PAGE
Member

@TimidRobot TimidRobot Nov 14, 2025

I don't see any value in changing per_page. All it does is slow down the execution of the script. It does not address the issue of the script taking a very long time to complete.

Again, your comment suggested the limit would apply to the total records of each type.

Author

According to the documentation, the default per_page is 100. When --limit is given, say 50, we fetch 50 records per page for each record type instead of the default 100. No further fetch is done once we hit the limit (50), regardless of the number of pages for that record type.
If per_page isn't altered, then --limit won't work: it would fetch 100 records anyway.

Author

Yes, --limit applies to the total records for each type, regardless of the total pages for that record type. So if I don't override per_page with the limit (when provided), it would fetch 100 records per page and proceed to the next page if there are more records for that type.

Author

[image] @TimidRobot Please see this

Author

@TimidRobot hello there
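
One possible way to apply a per-type limit without changing per_page, sketched under assumptions: iter_records, fetch_page, and fake_fetch_page are hypothetical names, and the fake fetcher stands in for the real API call so the example runs offline. It is not the PR's implementation:

```python
from typing import Callable, Iterator, Optional

PER_PAGE = 100  # page size mentioned in the thread


def iter_records(
    fetch_page: Callable[[int, int], list],
    limit: Optional[int] = None,
    per_page: int = PER_PAGE,
) -> Iterator[dict]:
    """Yield records page by page, stopping after `limit` records."""
    yielded = 0
    page = 1
    while limit is None or yielded < limit:
        records = fetch_page(page, per_page)
        if not records:
            break  # no more pages for this record type
        for record in records:
            yield record
            yielded += 1
            if limit is not None and yielded >= limit:
                return  # limit reached: discard the rest of the page
        page += 1


def fake_fetch_page(page, per_page, total=250):
    """Stand-in for an API request: returns one page of fake records."""
    start = (page - 1) * per_page
    stop = min(start + per_page, total)
    return [{"id": i} for i in range(start, stop)]


first_fifty = list(iter_records(fake_fetch_page, limit=50))
```

Here the page size stays at 100 and the surplus records of the last page are simply dropped, so a limit of 50 costs one request and a limit of 150 costs two. Whether the extra records in that final page matter depends on the API's response size and rate limits, which the thread does not state.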

Comment on lines +83 to +87
def get_requests_session():
    """
    Returns a configured requests session with retries and a User-Agent.
    """
    return shared.get_session()
Member

This function doesn't actually do anything and should be removed.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Add Museum Victoria as data source

3 participants