Edited the split command for parsing GFF #511

EreboPSilva · 2025-09-23T17:05:16Z

Complex structures in the attributes of RefSeq GFF files (shown in the examples below) caused the range to be mis-assigned, which then failed because the end coordinate was smaller than the start coordinate.

After some looking around, it turns out the split that originally handled the attribute was not designed to manage complex, nested attributes. After changing that, while keeping a condition for the normal non-nested cases, all instances in my test were handled as expected.

Simple case:
[[rest of the line]]transl_except=(pos:16975745..16975747%2Caa:Other)

Nested case:
[[rest of the line]]transl_except=(pos:complement(join(11655574%2C11655751..11655752))%2Caa:Other),(pos:complement(11655454..11655456)%2Caa:Other)
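As an illustration only (this is not the code changed in this PR, and `split_top_level` is a hypothetical name), here is a minimal Python sketch of one way to separate the entries of a `transl_except` value. It assumes `%2C` is the only escaping that matters. It decodes the value first, then splits only on commas that sit outside any parentheses:

```python
from urllib.parse import unquote


def split_top_level(value: str) -> list[str]:
    """Split a transl_except value into its parenthesised entries.

    Hypothetical helper: commas nested inside parentheses are kept,
    only commas at depth 0 separate two entries.
    """
    decoded = unquote(value)  # turns %2C back into ","
    entries, current, depth = [], [], 0
    for char in decoded:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            entries.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:
        entries.append("".join(current))
    return entries


simple = "(pos:16975745..16975747%2Caa:Other)"
nested = (
    "(pos:complement(join(11655574%2C11655751..11655752))%2Caa:Other),"
    "(pos:complement(11655454..11655456)%2Caa:Other)"
)
simple_entries = split_top_level(simple)
nested_entries = split_top_level(nested)
```

With this approach the simple case yields one entry and the nested case yields two, each still wrapped in its own outer parentheses.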

Requirements

When creating your Pull request, please fill out the template below:

PR details

Fix for a very specific (and probably not very widespread) edge case.

Include a short description
(See above for details.) Amendment of a split function that was not covering all possible cases.

Include links to JIRA tickets
No Jira ticket.

Testing

Have you tested it?
Tested it on a small subset of the problematic GFF (the scaffold containing the entity that originally flagged the issue).

Assign to the weekly GitHub reviewer

@JackCurragh

To avoid issues with complex structures in the attributes that caused issues down the line with the expected structure of subsequent range assignment.

EreboPSilva · 2025-09-23T17:24:46Z

Looking into bettering this further.

EreboPSilva · 2025-09-24T15:47:11Z

The problem here is transl_except (or translation exceptions, such as selenocysteines, stop codons to be ignored, substitutions, ...). This info is in the attributes column in the GFF file.

Essentially, what we want from this are 3 values: start of the exception, end of the exception (it will tend to be a codon in most cases, so a length of 3), and type (which tells us the nature of the exception: seleno, substitution, whatnot, ...).

This follows a more or less stable structure in the file I've analysed:

(pos:complement(N..N),aa:TYPE)
(pos:N..N,aa:TYPE)
(pos:N,aa:TYPE) -> For a single nucleotide

Also, a single CDS can have several of these, presented as a comma-separated list of the above structures.

The split function that parses the attribute info to get these values now accepts all these structures. Worth mentioning: due to the format that the core DB expects, a single nucleotide will be recorded using the position as both start and end. The validation "rule" in the script requires that start be less than end + 1, so this still fulfils it.
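To make the three accepted structures concrete, here is a hedged Python sketch (not the PR's actual code; `parse_entry`, `PLAIN` and `COMPLEMENT` are names made up for illustration). It assumes the entry has already been decoded, so `%2C` is a literal comma. It returns the three values described above, recording a single nucleotide with the same start and end:

```python
import re

# N or N..N; the second group is absent for a single nucleotide.
RANGE = r"(\d+)(?:\.\.(\d+))?"
PLAIN = re.compile(rf"\(pos:{RANGE},aa:([^()]+)\)")
COMPLEMENT = re.compile(rf"\(pos:complement\({RANGE}\),aa:([^()]+)\)")


def parse_entry(entry: str):
    """Return (start, end, type) for a simple entry, or None otherwise.

    Hypothetical helper: strand information from complement(...) is
    dropped here, only the coordinates and the type are kept.
    """
    match = PLAIN.fullmatch(entry) or COMPLEMENT.fullmatch(entry)
    if match is None:
        return None  # e.g. join(...) forms are not handled
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    # Same check as described above: start must be less than end + 1.
    if not start < end + 1:
        return None
    return start, end, match.group(3)


ranged = parse_entry("(pos:16975745..16975747,aa:Other)")
reverse = parse_entry("(pos:complement(11655454..11655456),aa:Other)")
single = parse_entry("(pos:11655574,aa:Other)")
```

Because a single nucleotide gets `end == start`, the `start < end + 1` check passes for it, while a range whose end is smaller than its start is rejected.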

Now, the problem is that there is another possibility: the exception can span non-contiguous coordinates. Speaking biologically, if a codon is split between 2 exons, the attribute will include something along the lines of:

(pos:join(N..N,N),aa:TYPE)

And different variations of the above, such as N,N..N, or complement versions of it.
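As a rough illustration of why these entries do not reduce to a single start and end, here is a small Python sketch (hypothetical helper `join_segments`, not part of this PR). It assumes a decoded entry and pulls out the individual pieces inside `join(...)`:

```python
import re


def join_segments(entry: str) -> list[tuple[int, int]]:
    """Return the (start, end) pieces inside join(...), or [] if none.

    Hypothetical helper for illustration; it ignores complement(...).
    """
    match = re.search(r"join\(([^()]*)\)", entry)
    if match is None:
        return []
    segments = []
    for piece in match.group(1).split(","):
        start, _, end = piece.partition("..")
        # A bare N is a single nucleotide: use it as start and end.
        segments.append((int(start), int(end or start)))
    return segments


split_codon = join_segments(
    "(pos:complement(join(11655574,11655751..11655752)),aa:Other)"
)
no_join = join_segments("(pos:16975745..16975747,aa:Other)")
```

For the nested example from the PR description this gives two pieces, one nucleotide long and two nucleotides long, so the three bases of the codon sit in separate coordinate ranges.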

The way we hand this information to the DB can't support this (from what I have seen; I could be wrong). So the current version skips and ignores these. I've thought of skipping just the translation exception, but I fear that if we do, we'll cause issues down the line when our own pipeline tries to translate something and comes up with a stop codon, or something...

The point being, I need to investigate further how this info is handled down the line after it's handed to the DB. If it's fine, we simply add a new format for these cases; if it messes something up, we keep skipping these. They represent a very small percentage of the GFF, so the loss is minimal.

to accommodate more cases and exclude the joins that cause issues

EreboPSilva · 2025-09-24T15:54:14Z

Pushed the last changes and added a long explanation of the problem, what I've solved, and what I [sadly] haven't.
It's not pretty, but I don't want to invest more time on this.

JackCurragh

This seems like the best we can do with the current data model. I wonder if an issue should be created on core data model to highlight this deficiency.

With regard to getting it merged, I would just request that the DEBUG statements are cleaned up and a comment distilling what you explained in the PR and why we have to have this solution is added.

Otherwise it looks good to me

Edited the split command for parsing GFF

89926d9

To avoid issues with complex structures in the attributes that caused issues down the line with the expected structure of subsequent range assignment.

EreboPSilva closed this Sep 23, 2025

refining the split

57aac69

to accommodate more cases and exclude the joins that cause issues

EreboPSilva reopened this Sep 24, 2025

EreboPSilva requested a review from JackCurragh September 24, 2025 15:52

JackCurragh requested changes Oct 2, 2025
