Edited the split command for parsing GFF #511
To avoid issues with complex structures in the attributes, which caused issues down the line with the expected structure of the subsequent range assignment.
Looking into improving this further.
The problem here is transl_except (translation exceptions, such as selenocysteines, stop codons to be ignored, substitutions, ...). This info is in the attributes column of the GFF file. Essentially, what we want from it are 3 values: start of the exception, end of the exception (it will be a codon in most cases, so a length of 3), and type (which tells us the nature of the exception: selenocysteine, substitution, and so on). This follows a more or less stable structure in the file I've analysed:
Also, a single CDS can have several of these, presented as a comma-separated list of the above structures. The split function that parses the attribute info to get these values now accepts all these structures. (Worth mentioning: due to the format that the core DB expects, a single nucleotide will be recorded using its position as both start and end. The validation rule in the script requires that start be less than end + 1, so this still fulfils it.) Now, the problem is that there is another possibility: the exception can take place across non-contiguous coordinates. Biologically speaking, if a codon is split between 2 exons, the attribute will include something along the lines of:
There are also different variations of the above. The way we hand this information to the DB can't support this (from what I have seen; I could be wrong). So the current version skips and ignores these. I've thought of skipping just the translation exception, but I fear that if we do, we'll cause issues down the line when our own pipeline tries to translate something and comes up with a stop codon, or something similar. The point being, I need to investigate further how this info is handled after it's handed to the DB. If it's fine, we simply add a new format for these cases; if it messes something up, we keep skipping these. They represent a very small percentage of the GFF, so the loss is minimal.
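The rule described above (a single position becomes start == end, a range must satisfy start < end + 1, and join(...) positions are skipped) can be sketched roughly as follows. This is only an illustration, not the code in this PR: `position_to_range` is a hypothetical helper name, and it assumes the position string has already been separated from the `aa:` part.

```python
import re

# Hypothetical helper, not the function changed in this PR.
def position_to_range(pos):
    """Return (start, end) for a single position or a range.

    Returns None for join(...) positions and for anything that
    fails the start < end + 1 check, so the caller can skip it.
    """
    # A complement(...) wrapper only marks the strand; unwrap it.
    if pos.startswith("complement(") and pos.endswith(")"):
        pos = pos[len("complement("):-1]

    # Codon split across exons: not representable as one range, so skip.
    if pos.startswith("join("):
        return None

    match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", pos)
    if match is None:
        return None

    start = int(match.group(1))
    # A single nucleotide is recorded with the position as both start and end.
    end = int(match.group(2)) if match.group(2) else start

    # Validation rule from the description: start must be less than end + 1.
    if not start < end + 1:
        return None
    return start, end
```

Returning None for joins keeps the skip decision in one place, so if the DB turns out to cope with split codons, only that branch would need to change.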
to accommodate more cases and exclude the joins that cause issues
Pushed the last changes and added a long explanation of the problem, what I've solved and what I [sadly] haven't.
This seems like the best we can do with the current data model. I wonder if an issue should be created on the core data model to highlight this deficiency.
With regard to getting it merged, I would just request that the DEBUG statements are cleaned up and that a comment is added distilling what you explained in the PR and why we have to have this solution.
Otherwise it looks good to me.
Complex structures in the attributes of RefSeq GFF (shown in the examples below) caused the range to be mis-assigned, which then failed because the end coordinate was smaller than the start coordinate.
After some looking around, it turns out the split that originally handled the attribute was not designed to manage complex, nested attributes. After changing that, while keeping a condition for the normal non-nested cases, all instances in my test were handled as expected.
Simple case:
[[rest of the line]]transl_except=(pos:16975745..16975747%2Caa:Other)
Nested case:
[[rest of the line]]transl_except=(pos:complement(join(11655574%2C11655751..11655752))%2Caa:Other),(pos:complement(11655454..11655456)%2Caa:Other)
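For illustration, here is a minimal sketch of one way to split such a value without breaking on the commas nested inside join(...). It is not the split used in this PR. The names `split_top_level` and `parse_transl_except` are hypothetical, and the sketch assumes the value after `transl_except=` is passed in on its own, with `%2C` decoded to a comma first.

```python
from urllib.parse import unquote

# Hypothetical helpers for illustration; not the functions in this PR.
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


def parse_transl_except(value):
    """Return a list of (pos, aa) string pairs from a transl_except value."""
    entries = []
    # Decode %2C to "," so nested and top-level commas are told apart by depth.
    for entry in split_top_level(unquote(value)):
        inner = entry.strip()[1:-1]  # drop the outer parentheses
        # Raises ValueError if the entry is not exactly "pos:...,aa:...".
        pos_field, aa_field = split_top_level(inner)
        entries.append((pos_field.split(":", 1)[1], aa_field.split(":", 1)[1]))
    return entries


simple = "(pos:16975745..16975747%2Caa:Other)"
nested = (
    "(pos:complement(join(11655574%2C11655751..11655752))%2Caa:Other),"
    "(pos:complement(11655454..11655456)%2Caa:Other)"
)
print(parse_transl_except(simple))
print(parse_transl_except(nested))
```

Tracking parenthesis depth means the simple and nested cases go through the same code path, rather than needing a separate condition for each.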
Requirements
When creating your Pull request, please fill out the template below:
PR details
Fix for a very specific (and probably not very widespread) edge case.
Include a short description
(See above for details.) Amendment of a split function that was not covering all possible cases.
Include links to JIRA tickets
No Jira ticket.
Testing
Have you tested it?
Tested it on a small subset of the problematic GFF (the scaffold containing the entity that originally flagged the issue).
Assign to the weekly GitHub reviewer
@JackCurragh