Converting sequences with squizz
Posted: 2016-07-06 Filed under: academic | Tags: amino acid, convert, DNA, msa, protein, sequence, squizz
Sometimes I need a DNA or amino acid sequence (or an alignment) in several different formats. A neat little program that can convert between sequence formats is squizz. Its primary function is to check sequence file formats, but it can also do some conversions.
The program is available at SBo (SlackBuilds.org) and, as the description there states, the most common formats are supported:
- sequence formats: EMBL, FASTA, GCG, GDE, GENBANK, IG, NBRF, PIR (codata), RAW, and SWISSPROT.
- alignment formats: CLUSTAL, FASTA, MEGA, MSF, NEXUS, PHYLIP (interleaved / sequential) and STOCKHOLM.
To see them as a list, type:
squizz -l
As an example, let’s take the E. coli 6-phosphogluconate dehydrogenase gene, complete CDS (GenBank: M23181.1). Download it in GenBank format and save it as M23181.1.genbank:
LOCUS ECOGNDF 1013 bp DNA linear BCT 26-APR-1993
DEFINITION E.coli 6-phosphogluconate dehydrogenase gene, complete cds.
ACCESSION M23181
VERSION M23181.1 GI:146243
KEYWORDS 6-phosphogluconate dehydrogenase.
SOURCE Escherichia coli
ORGANISM Escherichia coli
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
REFERENCE 1 (bases 1 to 1013)
AUTHORS Miller,R.D., Dykhuizen,D.E. and Hartl,D.L.
TITLE Fitness effects of a deletion mutation increasing transcription of
the 6-phosphogluconate dehydrogenase gene in Escherichia coli
JOURNAL Mol. Biol. Evol. 5 (6), 691-703 (1988)
PUBMED 2464736
COMMENT Original source text: E.coli K12 DNA.
FEATURES Location/Qualifiers
source 1..1013
/organism="Escherichia coli"
/mol_type="genomic DNA"
/strain="K-12"
/db_xref="taxon:562"
gene 639..1013
/gene="6-phosphogluconate dehydrogenase"
CDS 639..>1013
/gene="6-phosphogluconate dehydrogenase"
/codon_start=1
/transl_table=11
/product="6-phosphogluconate dehydrogenase"
/protein_id="AAA23924.1"
/db_xref="GI:146244"
/translation="MSKQQIGVVGMAVMGRNLALNIESRGYTVSIFNRSREKTEEVIA
ENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLDKGDIIIDGGN
TFFQDTIRRNRELSAEGFNFIGT"
ORIGIN
1 agatctcact aaaaactggg gataacgcct taaatggcga agaaacggtc taaataggct
61 gattcaaggc atttacggga gaaaaaatcg gctcaaacat gaagaaatga aatgactgag
121 tcagccgaga agaatttccc cgcttattcg caccttcctt aataaaacaa aaattcctaa
181 agaaagtatc tattctgata cggttgttga ttggtgcgca ggatcattta tgctggtacg
241 tttttcagat tttgtgcgtg taaatggctt cgatcaaggt tactttatgt actgtgaaga
301 tattgacctg tgcttgaggc ttagcctggc tggtgtcaga cttcattatg ttcccgcttt
361 tcatgcgata cattatgctc atcatgacaa tcgaagtttt ttttcaaaag ccttcagatg
421 gcacttaaaa agtactttta gatatttagc cagaaaacgt attttatcaa atcgcaactt
481 tgatcgaatt tcatcagttt ttcacccgta atataaagcc gtaagcatat aagcatggat
541 aagctattta tactttaata agtactttgt atacttattt gcgaacattc caggccgcga
601 gcattcagcg cggtgatcac acctgacagg agtatgtaat gtccaagcaa cagatcggcg
661 tagtcggtat ggcagtgatg ggacgcaacc ttgcgctcaa catcgaaagc cgtggttata
721 ccgtctctat tttcaaccgt tcccgtgaga agacggaaga agtgattgcc gaaaatccag
781 gcaagaaact ggttccttac tatacggtga aagagtttgt cgaatctctg gaaacgcctc
841 gtcgcatcct gttaatggtg aaagcaggtg caggcacgga tgctgctatt gattccctca
901 aaccatatct cgataaagga gacatcatca ttgatggtgg taacaccttc ttccaggaca
961 ctattcgtcg taatcgtgag ctttcagcag agggctttaa cttcatcggt acc
//
Let’s say I want to convert the DNA sequence from the above GenBank file to FASTA format. This is easily done with:
squizz M23181.1.genbank -c FASTA > M23181.1.fasta
The format of the input file (M23181.1.genbank) is detected automatically. The -c option specifies the format to convert to (FASTA). By default the converted sequence is printed to the terminal; the > M23181.1.fasta redirect saves it to a file instead. The sequence is now converted:
>ECOGNDF E.coli 6-phosphogluconate dehydrogenase gene, complete cds
agatctcactaaaaactggggataacgccttaaatggcgaagaaacggtctaaataggctgattcaaggcatttacggga
gaaaaaatcggctcaaacatgaagaaatgaaatgactgagtcagccgagaagaatttccccgcttattcgcaccttcctt
aataaaacaaaaattcctaaagaaagtatctattctgatacggttgttgattggtgcgcaggatcatttatgctggtacg
tttttcagattttgtgcgtgtaaatggcttcgatcaaggttactttatgtactgtgaagatattgacctgtgcttgaggc
ttagcctggctggtgtcagacttcattatgttcccgcttttcatgcgatacattatgctcatcatgacaatcgaagtttt
ttttcaaaagccttcagatggcacttaaaaagtacttttagatatttagccagaaaacgtattttatcaaatcgcaactt
tgatcgaatttcatcagtttttcacccgtaatataaagccgtaagcatataagcatggataagctatttatactttaata
agtactttgtatacttatttgcgaacattccaggccgcgagcattcagcgcggtgatcacacctgacaggagtatgtaat
gtccaagcaacagatcggcgtagtcggtatggcagtgatgggacgcaaccttgcgctcaacatcgaaagccgtggttata
ccgtctctattttcaaccgttcccgtgagaagacggaagaagtgattgccgaaaatccaggcaagaaactggttccttac
tatacggtgaaagagtttgtcgaatctctggaaacgcctcgtcgcatcctgttaatggtgaaagcaggtgcaggcacgga
tgctgctattgattccctcaaaccatatctcgataaaggagacatcatcattgatggtggtaacaccttcttccaggaca
ctattcgtcgtaatcgtgagctttcagcagagggctttaacttcatcggtacc
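A quick way to check that nothing was lost in the conversion is to count the residues in the FASTA file and compare the total with the length on the LOCUS line of the GenBank record (1013 bp here). The sketch below uses plain awk, not any squizz feature, and the function name fasta_length is made up for illustration. It assumes a single-record FASTA file; with several records it would report the combined length.

```shell
# Hypothetical helper (not part of squizz): sum the lengths of all
# non-header lines in a FASTA file and print the total.
fasta_length() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Usage, assuming the file from the conversion above exists:
#   fasta_length M23181.1.fasta
```

If the number printed differs from the length in the LOCUS line, the converted file is worth a closer look.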
Let’s convert it to something else, say EMBL:
squizz M23181.1.genbank -c EMBL > M23181.1.embl
Here’s the result:
ID M23181; SV 1; XXX; XXX; XXX; XXX; 1013 BP.
AC M23181;
DE E.coli 6-phosphogluconate dehydrogenase gene, complete cds
KW 6-phosphogluconate dehydrogenase.
SQ Sequence 1013 BP; 295 A; 206 C; 219 G; 293 T; 0 other;
agatctcact aaaaactggg gataacgcct taaatggcga agaaacggtc taaataggct 60
gattcaaggc atttacggga gaaaaaatcg gctcaaacat gaagaaatga aatgactgag 120
tcagccgaga agaatttccc cgcttattcg caccttcctt aataaaacaa aaattcctaa 180
agaaagtatc tattctgata cggttgttga ttggtgcgca ggatcattta tgctggtacg 240
tttttcagat tttgtgcgtg taaatggctt cgatcaaggt tactttatgt actgtgaaga 300
tattgacctg tgcttgaggc ttagcctggc tggtgtcaga cttcattatg ttcccgcttt 360
tcatgcgata cattatgctc atcatgacaa tcgaagtttt ttttcaaaag ccttcagatg 420
gcacttaaaa agtactttta gatatttagc cagaaaacgt attttatcaa atcgcaactt 480
tgatcgaatt tcatcagttt ttcacccgta atataaagcc gtaagcatat aagcatggat 540
aagctattta tactttaata agtactttgt atacttattt gcgaacattc caggccgcga 600
gcattcagcg cggtgatcac acctgacagg agtatgtaat gtccaagcaa cagatcggcg 660
tagtcggtat ggcagtgatg ggacgcaacc ttgcgctcaa catcgaaagc cgtggttata 720
ccgtctctat tttcaaccgt tcccgtgaga agacggaaga agtgattgcc gaaaatccag 780
gcaagaaact ggttccttac tatacggtga aagagtttgt cgaatctctg gaaacgcctc 840
gtcgcatcct gttaatggtg aaagcaggtg caggcacgga tgctgctatt gattccctca 900
aaccatatct cgataaagga gacatcatca ttgatggtgg taacaccttc ttccaggaca 960
ctattcgtcg taatcgtgag ctttcagcag agggctttaa cttcatcggt acc 1013
//
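Since only the value given to -c changes between the two conversions above, a small loop can produce several formats in one go. This is a minimal sketch: convert_all and the SQUIZZ variable are names I made up, and the only squizz feature it relies on is the -c option already shown. Each output file is named after the input with its last extension replaced by the lower-cased format name.

```shell
# Illustrative wrapper, not part of squizz. SQUIZZ can be overridden to
# point at a different squizz binary; by default it is looked up in PATH.
SQUIZZ=${SQUIZZ:-squizz}

convert_all() {
    input=$1
    shift
    base=${input%.*}    # strip the last extension, e.g. ".genbank"
    for fmt in "$@"; do
        # Lower-case the format name to use it as the file extension
        ext=$(printf '%s' "$fmt" | tr '[:upper:]' '[:lower:]')
        # Same invocation as in the post: input file, then -c FORMAT
        "$SQUIZZ" "$input" -c "$fmt" > "$base.$ext" || return 1
    done
}

# Usage:
#   convert_all M23181.1.genbank FASTA EMBL
```

With the example record this would write M23181.1.fasta and M23181.1.embl, the same file names used above.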
Squizz can also convert between different multiple sequence alignment formats. So useful!
