Input Files

Azat Badretdin edited this page Sep 14, 2020 · 23 revisions

To annotate a genome with PGAP using the instructions in Quick Start, two types of input files are necessary:

Genome assembly sequence file

The sequences constituting the genome assembly should be provided in a fasta file.

  • Each sequence in the file must have a definition line beginning with '>' and a unique identifier (SeqID), e.g. >contig001 or >contig002.
  • The genome assembly size (measured as the count of bases in the input fasta sequences ignoring Ns) must be within the reasonable range expected for the organism. If the size range is not known for the genus species, the minimum and maximum size allowed are 15 Kb and 100 Mb respectively.

The SeqIDs must:

  1. Be less than 50 characters long
  2. Only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
  3. Be unique within a genome

The sequences themselves must meet these requirements:

  • All sequences must be 199 nucleotides or more.
  • There should be no N at the beginning or end of each sequence.
  • No sequence should be all Ns.
  • Stretches of 10 Ns or more will be considered gaps of known length.
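
The requirements above can be checked before running the pipeline. The following is a minimal sketch in Python, not part of PGAP: the function names `read_fasta` and `check_assembly` are hypothetical, and the default size limits are the 15 Kb and 100 Mb bounds quoted above for organisms without a known size range. It does not report gaps, and PGAP's own validation may differ in detail.

```python
import re

# Characters allowed in a SeqID according to the list above.
SEQID_PATTERN = re.compile(r"[A-Za-z0-9_.:*#-]+")


def read_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_assembly(text, min_size=15_000, max_size=100_000_000):
    """Return a list of problems found; an empty list means no rule was violated."""
    problems, seen, non_n_bases = [], set(), 0
    for header, seq in read_fasta(text):
        fields = header.split()
        seqid = fields[0] if fields else ""
        seq = seq.upper()
        if not seqid:
            problems.append("definition line without a SeqID")
        if len(seqid) >= 50:
            problems.append(f"{seqid}: SeqID is not shorter than 50 characters")
        if seqid and not SEQID_PATTERN.fullmatch(seqid):
            problems.append(f"{seqid}: SeqID contains disallowed characters")
        if seqid in seen:
            problems.append(f"{seqid}: SeqID is not unique")
        seen.add(seqid)
        if len(seq) < 199:
            problems.append(f"{seqid}: sequence is shorter than 199 nucleotides")
        if seq and set(seq) == {"N"}:
            problems.append(f"{seqid}: sequence is all Ns")
        elif seq.startswith("N") or seq.endswith("N"):
            problems.append(f"{seqid}: sequence begins or ends with N")
        # The assembly size ignores Ns, as described above.
        non_n_bases += len(seq) - seq.count("N")
    if not min_size <= non_n_bases <= max_size:
        problems.append(f"assembly size {non_n_bases} is outside {min_size}-{max_size}")
    return problems
```

Calling `check_assembly(open("Ecoli1_genomic.fna").read())` returns the list of problems for that file. When the expected size range of the organism is known, pass it through `min_size` and `max_size`.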

Metadata files

The attributes of the genome assembly must be provided as two files in YAML format.
⚠️ For indentation, use spaces rather than tabs!

Generic YAML file

Provides the necessary information for running the pipeline.
All fields are required.

  • fasta - Path to the genome assembly fasta file (see above, Genome assembly sequence file)
  • submol - Path to the file specifying the origin of the genome assembly (see below)

Example

fasta:
    class: File
    location: Ecoli1_genomic.fna
submol:
    class: File
    location: E_coli1.yaml
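
When many genomes are annotated, the generic YAML file can be generated rather than written by hand. The sketch below builds the same two-field structure as the example above with plain string formatting. The function name `generic_yaml` is hypothetical, and the sketch assumes file names that need no YAML quoting (no colons, leading special characters, or similar).

```python
def generic_yaml(fasta_name, submol_name):
    """Return the text of a generic YAML file pointing at the two given files."""
    # Indentation uses spaces only, as required for PGAP input files.
    return (
        "fasta:\n"
        "    class: File\n"
        f"    location: {fasta_name}\n"
        "submol:\n"
        "    class: File\n"
        f"    location: {submol_name}\n"
    )


if __name__ == "__main__":
    print(generic_yaml("Ecoli1_genomic.fna", "E_coli1.yaml"), end="")
```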

Metadata YAML file (submol)

submol in the generic YAML file described above points to another file in the same directory, the metadata file. This metadata file provides the information to include in the output of PGAP. Some of this information is optional.

Currently the metadata in the submol.yaml file only supports the 7-bit ASCII subset of Unicode. We are working on supporting the full UTF-8 character set.
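
Two of the formatting constraints on the metadata file, spaces rather than tabs for indentation and 7-bit ASCII only, are easy to check mechanically. A minimal sketch follows; the function name `lint_metadata_text` is hypothetical and the check is independent of PGAP itself.

```python
def lint_metadata_text(text):
    """Return (line_number, issue) pairs for tabs in indentation and non-ASCII text."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace of the line, whatever characters it is made of.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            issues.append((number, "tab in indentation"))
        if not line.isascii():
            issues.append((number, "non-ASCII character"))
    return issues
```

An empty result means neither problem was found. It does not mean the file is valid YAML or that all required fields are present.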

  • topology - optional. Topology of the sequences included in the fasta file. Possible values are linear or circular. Circular means that the first base in the sequence is adjacent to the last base. Please provide the topology in the metadata YAML file only if it is applicable to ALL sequences in the fasta file. If some sequences in the assembled genome are circular and others linear, include the topology in the definition line of each sequence in the fasta file with the tag value pair [topology=circular] or [topology=linear], after the SeqID and a space (e.g. >seq1 [topology=circular]). If the topology is provided in neither the metadata YAML nor the fasta file, the sequences will be presumed to be linear.
  • organism
    genus_species - binomial name or, if the species is unknown, genus for the sequenced organism. This identifier must be valid in NCBI Taxonomy (see Taxonomy information for how to find out if the name is valid).
    strain - optional. Strain of the sequenced organism
  • contact_info - optional, but include if intending to submit to GenBank. The main contact for this genome assembly
    last_name - Last name
    first_name - First name
    email - Email address
    organization - Organization or consortium submitting the genome assembly
    department - Department or division submitting the genome assembly
    phone - optional. Phone number
    fax - optional. Fax number
    street - Street address
    city - City
    state - State or region
    postal_code - Postal code
    country - Country
  • authors - optional, but include if intending to submit to GenBank. Author(s) of the genome assembly. Authors can be different from the contact.
    last_name - Last name
    first_name - First name
    middle_initial - optional. First letter of middle name.
  • consortium - optional. Name of the project that generated the genome assembly
  • comment - optional. Free text comment about the genome assembly. Appears in the COMMENT section of each GenBank sequence record.
  • bioproject - optional. BioProject ID (PRJXX) for the project, if available
  • biosample - optional. BioSample ID (SAMXXX) for the sequenced sample, if available
  • locus_tag_prefix - optional. Prefix of one to 9 letters to use for naming genes on this genome assembly. If an official locus tag prefix was already reserved from an INSDC organization (GenBank, ENA or DDBJ) for the given BioSample and BioProject pair, provide it here. Otherwise, provide a string of your choice. If no value is provided, the prefix 'pgaptmp' will be used. See more details in this Note about locus tags.
  • sra - optional. Sequence reads used to build the assembly
    accession - Sequence Read Archive (SRA) accession for the run (with SRR, ERR or DRR prefix)
  • publications - optional. Publication describing the genome assembly
    pmid - PubMed ID for the publication
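
The topology rules described for the topology field can be sketched as a small lookup: a [topology=...] tag in the definition line, otherwise the value from the metadata YAML file, otherwise linear. The function name `resolve_topology` is hypothetical. Giving the definition-line tag precedence when both sources are present is an assumption of this sketch; the text above does not say how PGAP handles that case.

```python
import re

# Matches the tag value pair described above, e.g. [topology=circular].
TOPOLOGY_TAG = re.compile(r"\[topology=(linear|circular)\]")


def resolve_topology(definition_line, metadata_topology=None):
    """Return 'linear' or 'circular' for one sequence."""
    match = TOPOLOGY_TAG.search(definition_line)
    if match:
        # Assumption: a per-sequence tag wins over the file-wide metadata value.
        return match.group(1)
    if metadata_topology in ("linear", "circular"):
        return metadata_topology
    # Neither source provides a topology: sequences are presumed linear.
    return "linear"
```

For example, `resolve_topology(">seq1 [topology=circular]")` gives `'circular'`, while `resolve_topology(">seq2")` falls back to `'linear'`.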

Example:

topology: 'circular'
organism:
    genus_species: 'Escherichia coli'
    strain: 'my_strain'
contact_info:
    last_name: 'Doe'
    first_name: 'Jane'
    email: '[email protected]'
    organization: 'NIH'
    department: 'NCBI'
    phone: '301-555-0245'
    fax: '301-555-1234'
    street: '9000 Rockville Pike'
    city: 'Bethesda'
    state: 'MD'
    postal_code: '20850'
    country: 'USA'
authors:
    - author:
        last_name: 'Doe'
        first_name: 'Jane'
        middle_initial: 'A'
    - author:
        last_name: 'Doe'
        first_name: 'John'
consortium: 'E. coli genome group'
bioproject: 'PRJ9999999'
biosample: 'SAMN99999999'
locus_tag_prefix: 'pgaptmp'
sra:
    - accession: 'SRR9999999'
    - accession: 'ERR9999999'
publications:
    - publication:
        pmid: 29112715

Taxonomy information

How do you find out whether your organism of interest is registered in NCBI Taxonomy?

  1. Go to NCBI Taxonomy
  2. Enter the organism name in the search box and press Search
  3. Click on the result
  4. Verify that the rank is 'genus' or more specific

Note about locus tags

  • You can run PGAP with the locus tag prefix (LTP) of your choice, whether or not you plan to submit the annotated genome to GenBank.
  • If you plan to submit to GenBank and wish to have the final locus tags in the PGAP output, register the BioProject and then the BioSample at https://submit.ncbi.nlm.nih.gov/subs/ PRIOR to running PGAP, and provide the returned BioProject, BioSample and LTP in the input YAML file.
  • If you run PGAP with an arbitrarily chosen LTP and later decide to submit the PGAP-annotated genome to GenBank, the LTP will be automatically changed to the one assigned to the BioProject:BioSample pair during processing of the genome.