
Submitting UCE Data to NCBI GenBank

Carl Oliveros edited this page Aug 29, 2017 · 15 revisions
Title:       Submitting UCE Data to NCBI GenBank  
Project:     faircloth-lab documentation project  
Author:      Carl Oliveros, Brant Faircloth  
Affiliation: faircloth-lab  
Web:         http://faircloth-lab.org  
Date:        22 June 2017

Purpose

These are the steps to follow to submit data from enriched UCE contigs (and other sorts of enriched, contig-like data) to the NCBI Targeted Locus Study database, which is part of NCBI GenBank.

Steps

  1. Register an NCBI BioProject (if you have not already done so)

  2. Register NCBI BioSamples for a BioProject (if you have not already done so)

  3. Find the alignment files for the incomplete matrix containing data from ALL of the loci you enriched (this should be the untrimmed contigs).

  4. Prepare an ncbi.conf file containing the metadata that applies to all contigs from each sample. It should look similar to:

    [metadata]
    molecule:DNA
    moltype:genomic
    specimen_voucher:{}
    bioproject:PRJNA304409
    biosample:{}
    
    [organisms]
    abroscopus_albogularis_28164:Abroscopus albogularis
    acanthiza_murina_12152:Acanthiza murina
    
    [biosamples]
    abroscopus_albogularis_28164:SAMN04301695
    acanthiza_murina_12152:SAMN04301696
    
    [vouchers]
    abroscopus_albogularis_28164:KU:28164
    acanthiza_murina_12152:KU:12152
    

    Use only valid institution codes (which can be found here) in the voucher information. You can also include an [exclude taxa] or an [exclude loci] section in the ncbi.conf file if you wish to exclude some samples or loci.
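
Before running the next step, it can help to confirm that every sample listed under [organisms] also has a BioSample and a voucher entry. The following is a minimal sketch, assuming ncbi.conf can be read as INI-style text with ":" as the key/value delimiter; `check_ncbi_conf` is a hypothetical helper and is not part of phyluce.

```python
import configparser


def check_ncbi_conf(text):
    """Return a list of problems found in ncbi.conf-style text (empty if none)."""
    # Assumption: the file is INI-like with "key:value" pairs.
    # Keys are split on the first ":" only, so "sample:KU:28164" keeps "KU:28164".
    parser = configparser.ConfigParser(
        delimiters=(":",), allow_no_value=True, interpolation=None
    )
    parser.optionxform = str  # keep sample names exactly as written
    parser.read_string(text)

    required = ("metadata", "organisms", "biosamples", "vouchers")
    problems = [f"missing section [{s}]" for s in required if not parser.has_section(s)]
    if problems:
        return problems

    # Every sample named under [organisms] needs a biosample and a voucher.
    for sample in parser["organisms"]:
        for section in ("biosamples", "vouchers"):
            if not parser[section].get(sample):
                problems.append(f"{sample}: no entry in [{section}]")
    return problems
```

Calling `check_ncbi_conf(open("ncbi.conf").read())` returns an empty list when every sample is fully described. The sketch does not check whether institution codes are valid.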

  5. Run the phyluce_ncbi_prep_uce_align_files_for_ncbi_targeted_locus_db program against the config file and the folder of alignments that you are annotating.

    phyluce_ncbi_prep_uce_align_files_for_ncbi_targeted_locus_db \
        --alignments /path/to/your/alignments \
        --conf ncbi.conf \
        --output fsatbl_directory \
        --input-format nexus
    

    This program will create one .fsa file and one .tbl file for each of your samples.
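
Since each sample should end up with both a .fsa and a .tbl file, a quick check of the output folder can catch samples that are missing one of the two. This is a minimal sketch, assuming the two files for a sample share the same base name; `unpaired_samples` is a hypothetical helper.

```python
from pathlib import Path


def unpaired_samples(directory):
    """Return sorted base names that have a .fsa or a .tbl file, but not both."""
    folder = Path(directory)
    fsa = {p.stem for p in folder.glob("*.fsa")}
    tbl = {p.stem for p in folder.glob("*.tbl")}
    # The symmetric difference holds names present in only one of the two sets.
    return sorted(fsa ^ tbl)
```

For example, `unpaired_samples("fsatbl_directory")` returns an empty list when every sample has both files.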

  6. Create a metadata file at http://www.ncbi.nlm.nih.gov/WebSub/template.cgi. Fill in the form and download the result as template.sbt.

    template.sbt

  7. Create a text file named comment.txt with contents similar to:

    We identified thousands of ultra-conserved elements (UCEs) in 106 birds to
    examine songbird diversification.  All of the UCEs from one bird are
    included in a single TLS project.
    
  8. Use the wizard at https://submit.ncbi.nlm.nih.gov/structcomment/nongenomes/ to create an assembly structured comment and save it as assembly.cmt.

    assembly.cmt

  9. Get tbl2asn. For a description of all command line arguments of tbl2asn, go to http://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/. Make a separate folder named sequin that will hold the *.sqn files you are about to generate. Now, run tbl2asn on each of the files in the fsatbl_directory using the command:

    tbl2asn -t template.sbt \
    -p fsatbl_directory \
    -Y comment.txt \
    -w assembly.cmt \
    -H y -a s -V v -r sequin
    

    Check the validation files (*.val) in the sequin folder and correct any errors. You can ignore warnings.
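
With many samples, reading each validation file by hand gets tedious. The sketch below collects only the error lines from every *.val file. It assumes that each message line starts with its severity (for example "ERROR:" or "WARNING:"), which you should confirm against your own output; `find_validation_errors` is a hypothetical helper.

```python
from pathlib import Path


def find_validation_errors(sequin_dir):
    """Map each .val file name to the lines in it that report an error."""
    errors = {}
    for val_file in sorted(Path(sequin_dir).glob("*.val")):
        lines = val_file.read_text().splitlines()
        # Assumption: severity is the first word on the line.
        # Warnings are skipped, as they can be ignored for this workflow.
        bad = [ln for ln in lines if ln.startswith(("ERROR", "REJECT", "FATAL"))]
        if bad:
            errors[val_file.name] = bad
    return errors
```

An empty dictionary from `find_validation_errors("sequin")` means no error-level lines were found under that assumption.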

  10. Submit all *.sqn files using SequinMacroSend by filling in your information along with the following addition as a "note":

```
This submission is meant for the Targeted Locus Study database and the contigs
associated with this submission are from target enriched ultraconserved element
loci (sensu Faircloth et al. 2012). Several of these UCE sequences may be <200
bp in length.  In previous emails between Brant Faircloth, Michael Baxter, Rich
McVeigh, and DeAnne Olsen Cravaritis, it was decided (not sure who) that for
these types of loci (associated with ultra-conserved elements) accepting
sequences < 200 bp was allowed.
```

If you have more than 19 files, it appears that you need to upload them in batches. Once submitted, GenBank staff will provide your accession numbers and/or feedback on your submission.
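
To plan those uploads, the *.sqn files can be grouped into batches of at most 19. This is a minimal sketch; the default batch size comes from the observation above rather than from a documented limit, and `batch_sqn_files` is a hypothetical helper.

```python
from pathlib import Path


def batch_sqn_files(sequin_dir, batch_size=19):
    """Split the sorted *.sqn files in a folder into lists of at most batch_size."""
    files = sorted(Path(sequin_dir).glob("*.sqn"))
    # Step through the sorted list in strides of batch_size.
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```

Each inner list returned by `batch_sqn_files("sequin")` can then be uploaded as one batch.
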
  11. Provided NCBI staff do not notify you of any problems, YOU ARE DONE!! You can safely ignore the steps below; you should not have to deal with Sequin.

Optional Steps (only perform these if requested by NCBI staff)

In the case that you are asked to perform vector screening by NCBI staff, you will need to perform the following steps:

  1. Open the Sequin program and, for each of the *.sqn files in the sequin folder (yes, each and every one of them), perform the following steps:

    1. Open the file using the "Read Existing Record" button:

    sequin-open

    2. Select the "Edit" menu, then select "Edit submitter info." Enter a release date on the Submission tab, then click "Accept":

    sequin-submitter

    3. Select the Search menu, then select Vector Screen and Vector Search & Trim Tool. Click on the Search Univec button and wait for the search results:

    search-univec

    4. After the vector search has completed, click on Select Only Strong and Moderate, then click on Trim Selected Sequences. After trimming has completed, click on Dismiss and close the Trimmed Locations window.

    5. Select the Search menu, then select Validate. Resolve any errors that are found.

    6. Select the File menu, then select Save As. Add a -final suffix to the file name (e.g. split_1-final.sqn). Click Yes when asked to "propagate descriptors", then close the window.

    sequin-propogate-descriptors

    7. Repeat the above steps for all *.sqn files in the sequin folder.
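
Because these steps are repeated by hand for every file, it is easy to miss one. The sketch below lists the *.sqn files that do not yet have a -final counterpart. It assumes the trimmed files are saved in the same folder as the originals; `missing_final` is a hypothetical helper.

```python
from pathlib import Path


def missing_final(sequin_dir):
    """Return names of *.sqn files that have no matching *-final.sqn file."""
    folder = Path(sequin_dir)
    # Skip files that are themselves the saved -final versions.
    originals = [p for p in folder.glob("*.sqn") if not p.stem.endswith("-final")]
    return sorted(
        p.name for p in originals
        if not (folder / f"{p.stem}-final.sqn").exists()
    )
```

An empty list from `missing_final("sequin")` suggests that every file has been processed and saved.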