@leannehaggerty (Member)

@Ensembl/plantazoa can you please check that this new consistency check works for your stable IDs? If not, I can update it so that it only runs on genebuild cores.

Updated the stable ID check to verify that the base prefix is consistent across all feature types (genes, transcripts, exons and translations).

Tested:
Changed the prefix of one gene from ENSIMR to ENSXXX; the check then fails:

 perl $ENSCODE/ensembl-datacheck/scripts/run_datachecks.pl -host mysql-ens-genebuild-prod-6 -port 4532 -user ensro -dbname leanne_gca964332325v1_core_113_1 -dbtype core -n GeneStableID
GeneStableID ..
# Subtest: GeneStableID
    # Subtest: acropora_muricata_gca964332325v1, core, leanne_gca964332325v1_core_113_1, EnsemblMetazoa
        ok 1 - gene table has non-NULL stable IDs
        ok 2 - gene table has unique stable IDs
        ok 3 - transcript table has non-NULL stable IDs
        ok 4 - transcript table has unique stable IDs
        ok 5 - exon table has non-NULL stable IDs
        ok 6 - exon table has unique stable IDs
        ok 7 - translation table has non-NULL stable IDs
        ok 8 - translation table has unique stable IDs
        # Observed base prefixes per table:
        #   exon         : ENSIMR
        #   gene         : ENSIMR, ENSXXX
        #   transcript   : ENSIMR
        #   translation  : ENSIMR
        not ok 9 - Stable ID base prefix is consistent across genes, transcripts, exons and translations

        #   Failed test 'Stable ID base prefix is consistent across genes, transcripts, exons and translations'
        #   at /hps/software/users/ensembl/genebuild/leanne/repositories_tmp/ensembl-datacheck/lib/Bio/EnsEMBL/DataCheck/Checks/GeneStableID.pm line 133.
        1..9
        # Looks like you failed 1 test of 9.
    not ok 1 - acropora_muricata_gca964332325v1, core, leanne_gca964332325v1_core_113_1, EnsemblMetazoa

    #   Failed test 'acropora_muricata_gca964332325v1, core, leanne_gca964332325v1_core_113_1, EnsemblMetazoa'
    #   at /hps/software/users/ensembl/genebuild/leanne/repositories_tmp//ensembl-datacheck/lib/Bio/EnsEMBL/DataCheck/DbCheck.pm line 737.
    1..1
    # Looks like you failed 1 test of 1.
not ok 1 - GeneStableID

#   Failed test 'GeneStableID'
#   at /hps/software/users/ensembl/genebuild/leanne/repositories_tmp//ensembl-datacheck/lib/Bio/EnsEMBL/DataCheck/BaseCheck.pm line 170.
1..1
Failed 1/1 subtests

Test Summary Report
-------------------
GeneStableID (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
Files=1, Tests=1,  4 wallclock secs ( 0.23 usr +  0.03 sys =  0.26 CPU)
Result: FAIL

Info about the new check:
Given a stable_id, we first drop any .version, then we strip trailing digits. From the remaining letters/underscores, if the last character is one of G T E P (the feature-type letter), we remove that one letter to get the base prefix. Examples:

  • ENSG00000123456.3 → ENSG00000123456 → remove digits → ENSG → remove feature letter G → ENS
  • ENSMUST00000123456 → remove digits → ENSMUST → remove T → ENSMUS
  • BRAKERXXXT00001234 → remove digits → BRAKERXXXT → remove T → BRAKERXXX
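
The derivation described above can be sketched in Perl roughly as follows. This is a minimal illustration, not the implementation in GeneStableID.pm: the helper names `base_prefix` and `distinct_prefixes` and the sample IDs in the last step are assumptions made up for the example.

```perl
use strict;
use warnings;

# Illustrative only: base_prefix() and distinct_prefixes() are hypothetical
# helpers written for this example, not the code in GeneStableID.pm.

# Derive the base prefix of a stable ID.
sub base_prefix {
    my ($stable_id) = @_;
    my $prefix = $stable_id;
    $prefix =~ s/\.\d+$//;     # 1. drop a trailing ".version"
    $prefix =~ s/\d+$//;       # 2. strip trailing digits
    $prefix =~ s/[GTEP]$//;    # 3. drop one trailing feature-type letter
    return $prefix;
}

# Return the sorted list of distinct base prefixes found in a list of IDs.
sub distinct_prefixes {
    my @stable_ids = @_;
    my %seen;
    $seen{ base_prefix($_) } = 1 for @stable_ids;
    my @sorted = sort keys %seen;
    return @sorted;
}

print base_prefix('ENSG00000123456.3'),  "\n";    # ENS
print base_prefix('ENSMUST00000123456'), "\n";    # ENSMUS

# Made-up sample IDs: the second gene carries a different prefix.
my @prefixes = distinct_prefixes('ENSIMRG00000000001', 'ENSXXXG00000000002');
if (@prefixes == 1) {
    print "consistent\n";
}
else {
    print "inconsistent: @prefixes\n";    # inconsistent: ENSIMR ENSXXX
}
```

The check passes only when a single base prefix is observed; more than one distinct prefix corresponds to the failing case shown in the test output.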

@EreboPSilva (Member) left a comment

Looks good to me.

@vsitnik (Contributor) left a comment

stable_id_prefix_consistency_check should be run conditionally, based on the 'genebuild.method' meta key.


$self->translation_stable_id_check($species_id);
# NEW: check base prefix consistency across feature types
$self->stable_id_prefix_consistency_check($species_id);
Contributor

Could you make it conditional on genebuild.method, like we have it here?

$self->stable_id_check('gene', $species_id);
$self->stable_id_check('transcript', $species_id);
$self->stable_id_check('exon', $species_id);
$self->translation_stable_id_check($species_id);
Contributor

This needs to be checked for metazoa/plants data. It most likely does not hold for microbes when translations have the same sequence, since they use sequence-derived hashes as IDs.
Update: checked. It seems to hold.
