Added genetic diversity fields - Fixes #1610 #1611

Open
wants to merge 26 commits into
base: staging
from

Conversation

arschat
Member

@arschat arschat commented Feb 21, 2025

Release notes

#1610

For human_specific.json schema:

  • ethnicity_question
  • ethnicity_parents
  • primary_language
  • mother_father_language
  • current_residence
  • place_of_birth

For residence.json schema:

  • country
  • granular
  • duration
  • area_type

For medical_history.json schema:

  • diet_meat_consumption
  • reproduction_history

For reproduction_history.json schema:

  • menarche_age
  • menopause_status
  • parity
  • gravidity
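
For reviewers, a minimal sketch of how the fields listed above might nest in a donor document. It is an illustration only: the placeholder values, the `field_paths` helper, and the assumption that `current_residence` holds a `residence.json` module are hypothetical and not taken from the schema files in this PR.

```python
# Hypothetical donor fragment; every value is a placeholder, not real data.
RESIDENCE = {  # fields listed for residence.json
    "country": "<placeholder>",
    "granular": "<placeholder>",
    "duration": "<placeholder>",
    "area_type": "<placeholder>",
}

donor = {
    "human_specific": {
        "ethnicity_question": "<placeholder>",
        "ethnicity_parents": "<placeholder>",
        "primary_language": "<placeholder>",
        "mother_father_language": "<placeholder>",
        # Assumption: current_residence reuses the residence module.
        "current_residence": dict(RESIDENCE),
        "place_of_birth": "<placeholder>",
    },
    "medical_history": {
        "diet_meat_consumption": "<placeholder>",
        "reproduction_history": {  # fields listed for reproduction_history.json
            "menarche_age": "<placeholder>",
            "menopause_status": "<placeholder>",
            "parity": "<placeholder>",
            "gravidity": "<placeholder>",
        },
    },
}


def field_paths(node, prefix=""):
    """Yield the dotted path of every leaf field in a nested dict."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from field_paths(value, path)
        else:
            yield path


for path in field_paths(donor):
    print(path)
```

Listing the dotted paths this way makes it easy to see which fields sit directly in a module and which are one level deeper.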

Reviews requested

  • Need 4 Reviewers to approve because this is a major update

Collaborator

@idazucchi idazucchi left a comment

Nice job! It's well organised, I've left some comments so we can discuss a few things

@arschat arschat requested a review from amnonkhen May 14, 2025 13:53
@arschat arschat assigned NoopDog and hannes-ucsc and unassigned NoopDog and hannes-ucsc May 14, 2025
@HumanCellAtlas HumanCellAtlas deleted a comment from idazucchi May 14, 2025
Contributor

@hannes-ucsc hannes-ucsc left a comment

Which project/dataset is this for?

@arschat
Member Author

arschat commented May 23, 2025

Hi @hannes-ucsc, this PR is not for a specific project/dataset.

These are the metadata fields recommended by the HCA Genetic Diversity TaskForce to record the genetic, geographic and, more generally, human diversity in HCA studies.

Do you have any specific concerns?

@hannes-ucsc
Contributor

I may be out of the loop, but if this isn't going to be used for any actual projects, why is it being added to the schema?

I am worried that this isn't sufficiently modular, which risks turning the human-specific and medical history modules into a kitchen sink of fields, i.e. a flat, unstructured list of fields, some related and some not. This will make comprehending the schema (and the JSON documents compliant with it) increasingly difficult. For example, menarche_age, menopause_status, parity and gravidity are all clearly related but they are not encapsulated in a module. Another example is the place_of_birth_… fields. The fact that their names all share a prefix is somewhat of a design smell, indicating that they, too, want to be encapsulated. The ethnicity-related fields added here appear to relate to a questionnaire of some sort, similarly suggesting that they should be encapsulated in a module.
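
To make the encapsulation point concrete, here is a rough sketch of moving the four reproduction-related fields out of a flat field list and into their own module. The `encapsulate` helper, the module name `reproduction_history`, and the placeholder values are illustrative assumptions, not code or data from the schema repository.

```python
def encapsulate(flat, module_name, field_names):
    """Return a copy of `flat` with `field_names` moved under `module_name`."""
    # Keep every field that is not part of the new module at the top level.
    nested = {k: v for k, v in flat.items() if k not in field_names}
    # Collect the related fields into one sub-document.
    module = {k: flat[k] for k in field_names if k in flat}
    if module:
        nested[module_name] = module
    return nested


# Flat layout: related and unrelated fields sit side by side (placeholder values).
flat_history = {
    "diet_meat_consumption": "<placeholder>",
    "menarche_age": "<placeholder>",
    "menopause_status": "<placeholder>",
    "parity": "<placeholder>",
    "gravidity": "<placeholder>",
}

REPRODUCTION_FIELDS = ("menarche_age", "menopause_status", "parity", "gravidity")

nested_history = encapsulate(flat_history, "reproduction_history", REPRODUCTION_FIELDS)
print(sorted(nested_history))
```

In the nested layout the top level carries two entries instead of five, and the relationship between the four fields is explicit in the structure rather than implied by their names.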

@arschat
Member Author

arschat commented May 27, 2025

Thank you for your comment Hannes.

Bionetwork coordinators have requested around 250 fields of Tier 2 bionetwork-specific metadata that do not exist in our schema yet. Since the Tier 2 collection has not officially started, we cannot be sure how frequently those fields will be filled. We've shared our concerns about the feasibility of the metadata collection with the bionetwork coordinators, but we trust their confidence in the collection.

Regarding modularity: there are indeed some fields that could be clustered together, but we chose this modelling to avoid extensive "module-in-a-module" structure since we've avoided this until now (with the exception of ontology modules). I am happy to encapsulate similar fields in modules, either inside the human_specific/medical_history modules or as separate modules in donor_organism, if you prefer this modelling.

@hannes-ucsc
Contributor

hannes-ucsc commented May 27, 2025

> Bionetwork coordinators have requested around 250 fields of Tier 2 bionetwork-specific metadata that do not exist

If I understand you correctly, this PR is just the first slate of a series of fairly involved changes. Since it is extremely difficult to fix the schema once metadata using it has been released, it is very important that we get this right from the beginning. Is there any substantive documentation about this effort that you could share?

> avoid extensive "module-in-a-module" structure since we've avoided this until now

Given the sheer number of new fields you cite, lack of modularity is a serious concern. I don't see why nesting modules is problematic. Hierarchical structures are commonplace in computer science—in biology, too, for that matter—and have proven to be a useful modeling approach.

@arschat
Member Author

arschat commented Jun 18, 2025

Hi @hannes-ucsc, I understand your concerns about whether these definitions are going to be adopted by contributors in this way or not.
However, this PR is about the additional Genetic Diversity fields, which have been suggested by the Human Cell Atlas Genetic Diversity Taskforce, have been accepted by the HCA Organization Committee, and are going to be requested across all bionetworks' Tier 2 metadata. Thus, they are unlikely to change.

Given your feedback on modularity, I refactored some fields in a more modular way. Let me know if this works better for you. I could split this into a separate PR for each module, but please let me know your comments here before we move on to new PRs.


6 participants