Converting sequences with squizz

Sometimes I need a DNA or amino acid sequence (or an alignment) in several different formats. A neat little program that can convert between sequence formats is squizz. Its primary function is to check the format of sequence files, but it can also do some conversions.

The program is available at SBo (SlackBuilds.org) and, as the description there states, the most common formats are supported:

  • sequence formats: EMBL, FASTA, GCG, GDE, GENBANK, IG, NBRF, PIR (codata), RAW, and SWISSPROT.
  • alignment formats: CLUSTAL, FASTA, MEGA, MSF, NEXUS, PHYLIP (interleaved / sequential) and STOCKHOLM.

To see them as a list, type:

squizz -l

As an example, let’s take the E. coli 6-phosphogluconate dehydrogenase gene, complete CDS (GenBank: M23181.1). Download it in GenBank format and save it as M23181.1.genbank:

LOCUS       ECOGNDF                 1013 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  E.coli 6-phosphogluconate dehydrogenase gene, complete cds.
ACCESSION   M23181
VERSION     M23181.1  GI:146243
KEYWORDS    6-phosphogluconate dehydrogenase.
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1013)
  AUTHORS   Miller,R.D., Dykhuizen,D.E. and Hartl,D.L.
  TITLE     Fitness effects of a deletion mutation increasing transcription of
            the 6-phosphogluconate dehydrogenase gene in Escherichia coli
  JOURNAL   Mol. Biol. Evol. 5 (6), 691-703 (1988)
   PUBMED   2464736
COMMENT     Original source text: E.coli K12 DNA.
FEATURES             Location/Qualifiers
     source          1..1013
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /db_xref="taxon:562"
     gene            639..1013
                     /gene="6-phosphogluconate dehydrogenase"
     CDS             639..>1013
                     /gene="6-phosphogluconate dehydrogenase"
                     /codon_start=1
                     /transl_table=11
                     /product="6-phosphogluconate dehydrogenase"
                     /protein_id="AAA23924.1"
                     /db_xref="GI:146244"
                     /translation="MSKQQIGVVGMAVMGRNLALNIESRGYTVSIFNRSREKTEEVIA
                     ENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLDKGDIIIDGGN
                     TFFQDTIRRNRELSAEGFNFIGT"
ORIGIN      
        1 agatctcact aaaaactggg gataacgcct taaatggcga agaaacggtc taaataggct
       61 gattcaaggc atttacggga gaaaaaatcg gctcaaacat gaagaaatga aatgactgag
      121 tcagccgaga agaatttccc cgcttattcg caccttcctt aataaaacaa aaattcctaa
      181 agaaagtatc tattctgata cggttgttga ttggtgcgca ggatcattta tgctggtacg
      241 tttttcagat tttgtgcgtg taaatggctt cgatcaaggt tactttatgt actgtgaaga
      301 tattgacctg tgcttgaggc ttagcctggc tggtgtcaga cttcattatg ttcccgcttt
      361 tcatgcgata cattatgctc atcatgacaa tcgaagtttt ttttcaaaag ccttcagatg
      421 gcacttaaaa agtactttta gatatttagc cagaaaacgt attttatcaa atcgcaactt
      481 tgatcgaatt tcatcagttt ttcacccgta atataaagcc gtaagcatat aagcatggat
      541 aagctattta tactttaata agtactttgt atacttattt gcgaacattc caggccgcga
      601 gcattcagcg cggtgatcac acctgacagg agtatgtaat gtccaagcaa cagatcggcg
      661 tagtcggtat ggcagtgatg ggacgcaacc ttgcgctcaa catcgaaagc cgtggttata
      721 ccgtctctat tttcaaccgt tcccgtgaga agacggaaga agtgattgcc gaaaatccag
      781 gcaagaaact ggttccttac tatacggtga aagagtttgt cgaatctctg gaaacgcctc
      841 gtcgcatcct gttaatggtg aaagcaggtg caggcacgga tgctgctatt gattccctca
      901 aaccatatct cgataaagga gacatcatca ttgatggtgg taacaccttc ttccaggaca
      961 ctattcgtcg taatcgtgag ctttcagcag agggctttaa cttcatcggt acc
//

Let’s say I want to convert the DNA sequence from the above GenBank file to FASTA format. This is easily done with:

squizz M23181.1.genbank -c FASTA > M23181.1.fasta

The format of the input file (M23181.1.genbank) is detected automatically. The -c option specifies the format the sequence will be converted to (here, FASTA). By default the converted sequence is printed to the terminal; the > M23181.1.fasta redirect saves it to a file instead. The sequence is now converted:

>ECOGNDF E.coli 6-phosphogluconate dehydrogenase gene, complete cds
agatctcactaaaaactggggataacgccttaaatggcgaagaaacggtctaaataggctgattcaaggcatttacggga
gaaaaaatcggctcaaacatgaagaaatgaaatgactgagtcagccgagaagaatttccccgcttattcgcaccttcctt
aataaaacaaaaattcctaaagaaagtatctattctgatacggttgttgattggtgcgcaggatcatttatgctggtacg
tttttcagattttgtgcgtgtaaatggcttcgatcaaggttactttatgtactgtgaagatattgacctgtgcttgaggc
ttagcctggctggtgtcagacttcattatgttcccgcttttcatgcgatacattatgctcatcatgacaatcgaagtttt
ttttcaaaagccttcagatggcacttaaaaagtacttttagatatttagccagaaaacgtattttatcaaatcgcaactt
tgatcgaatttcatcagtttttcacccgtaatataaagccgtaagcatataagcatggataagctatttatactttaata
agtactttgtatacttatttgcgaacattccaggccgcgagcattcagcgcggtgatcacacctgacaggagtatgtaat
gtccaagcaacagatcggcgtagtcggtatggcagtgatgggacgcaaccttgcgctcaacatcgaaagccgtggttata
ccgtctctattttcaaccgttcccgtgagaagacggaagaagtgattgccgaaaatccaggcaagaaactggttccttac
tatacggtgaaagagtttgtcgaatctctggaaacgcctcgtcgcatcctgttaatggtgaaagcaggtgcaggcacgga
tgctgctattgattccctcaaaccatatctcgataaaggagacatcatcattgatggtggtaacaccttcttccaggaca
ctattcgtcgtaatcgtgagctttcagcagagggctttaacttcatcggtacc
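
If there are several GenBank files to convert, the same command can be wrapped in a small shell loop. The following is only a sketch: the function name genbank_to_fasta_all and the .genbank / .fasta suffixes are my own illustrative choices, and it assumes squizz is on the PATH and is called exactly as in the command above.

```shell
# Convert every *.genbank file in a directory to FASTA with squizz.
# Hypothetical helper: the function name and the file suffixes are
# illustrative assumptions, not squizz conventions.
genbank_to_fasta_all() {
    dir="${1:-.}"                      # directory to scan, default: current one
    for f in "$dir"/*.genbank; do
        [ -e "$f" ] || continue        # no matching files: skip the literal glob
        out="${f%.genbank}.fasta"      # swap the suffix for the output name
        # Same invocation as above: input format auto-detected, -c sets the output
        if squizz "$f" -c FASTA > "$out"; then
            echo "converted $f -> $out"
        else
            echo "squizz failed on $f" >&2
            rm -f "$out"               # do not leave a partial file behind
        fi
    done
}
```

Calling genbank_to_fasta_all with no argument works on the current directory; passing a directory name as the first argument works on that directory instead.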

Let’s convert it to something else, say EMBL:

squizz M23181.1.genbank -c EMBL > M23181.1.embl

Here’s the result:

ID   M23181; SV 1; XXX; XXX; XXX; XXX; 1013 BP.
AC   M23181;
DE   E.coli 6-phosphogluconate dehydrogenase gene, complete cds
KW   6-phosphogluconate dehydrogenase.
SQ   Sequence 1013 BP; 295 A; 206 C; 219 G; 293 T; 0 other;
     agatctcact aaaaactggg gataacgcct taaatggcga agaaacggtc taaataggct        60
     gattcaaggc atttacggga gaaaaaatcg gctcaaacat gaagaaatga aatgactgag       120
     tcagccgaga agaatttccc cgcttattcg caccttcctt aataaaacaa aaattcctaa       180
     agaaagtatc tattctgata cggttgttga ttggtgcgca ggatcattta tgctggtacg       240
     tttttcagat tttgtgcgtg taaatggctt cgatcaaggt tactttatgt actgtgaaga       300
     tattgacctg tgcttgaggc ttagcctggc tggtgtcaga cttcattatg ttcccgcttt       360
     tcatgcgata cattatgctc atcatgacaa tcgaagtttt ttttcaaaag ccttcagatg       420
     gcacttaaaa agtactttta gatatttagc cagaaaacgt attttatcaa atcgcaactt       480
     tgatcgaatt tcatcagttt ttcacccgta atataaagcc gtaagcatat aagcatggat       540
     aagctattta tactttaata agtactttgt atacttattt gcgaacattc caggccgcga       600
     gcattcagcg cggtgatcac acctgacagg agtatgtaat gtccaagcaa cagatcggcg       660
     tagtcggtat ggcagtgatg ggacgcaacc ttgcgctcaa catcgaaagc cgtggttata       720
     ccgtctctat tttcaaccgt tcccgtgaga agacggaaga agtgattgcc gaaaatccag       780
     gcaagaaact ggttccttac tatacggtga aagagtttgt cgaatctctg gaaacgcctc       840
     gtcgcatcct gttaatggtg aaagcaggtg caggcacgga tgctgctatt gattccctca       900
     aaccatatct cgataaagga gacatcatca ttgatggtgg taacaccttc ttccaggaca       960
     ctattcgtcg taatcgtgag ctttcagcag agggctttaa cttcatcggt acc             1013
//
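
The SQ line above reports the total length (1013 BP) and the count of each base, which gives a quick way to cross-check the FASTA file made earlier. Below is a rough sketch using only standard tools (grep, tr, wc); the function names fasta_length and base_count are hypothetical helpers, not part of squizz, and they assume a FASTA file holding a single sequence.

```shell
# Count residues in a FASTA file, ignoring header lines and whitespace.
# Hypothetical helper, assumes a single-sequence FASTA file.
fasta_length() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Count how often one base occurs, case-insensitively.
# Give the base in lowercase, e.g.: base_count file.fasta a
base_count() {
    grep -v '^>' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -cd "$2" \
        | wc -c | tr -d ' '
}
```

For the files produced above, fasta_length M23181.1.fasta should report 1013 if the conversion kept the whole sequence, and base_count M23181.1.fasta a should agree with the A count given on the SQ line.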

Squizz can also convert between different multiple sequence alignment formats. So useful!
