Converting sequences with squizz
Posted: 2016-07-06 Filed under: academic | Tags: amino acid, convert, DNA, msa, protein, sequence, squizz
Sometimes I need to have a DNA or amino acid sequence (or an alignment) in several different formats. A neat little program that can convert between different sequence formats is squizz. Its primary function is to check sequence file formats, but it can also do some conversions.
The program is available at SBo (SlackBuilds.org), and as the description there states, the most common formats are supported:
- sequence formats: EMBL, FASTA, GCG, GDE, GENBANK, IG, NBRF, PIR (codata), RAW, and SWISSPROT.
- alignment formats: CLUSTAL, FASTA, MEGA, MSF, NEXUS, PHYLIP (interleaved / sequential) and STOCKHOLM.
To see them as a list, type:
squizz -l
As an example, let’s take the E. coli 6-phosphogluconate dehydrogenase gene, complete CDS (GenBank: M23181.1). Download it in GenBank format:
LOCUS       ECOGNDF                 1013 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  E.coli 6-phosphogluconate dehydrogenase gene, complete cds.
ACCESSION   M23181
VERSION     M23181.1  GI:146243
KEYWORDS    6-phosphogluconate dehydrogenase.
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1013)
  AUTHORS   Miller,R.D., Dykhuizen,D.E. and Hartl,D.L.
  TITLE     Fitness effects of a deletion mutation increasing transcription of
            the 6-phosphogluconate dehydrogenase gene in Escherichia coli
  JOURNAL   Mol. Biol. Evol. 5 (6), 691-703 (1988)
   PUBMED   2464736
COMMENT     Original source text: E.coli K12 DNA.
FEATURES             Location/Qualifiers
     source          1..1013
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /db_xref="taxon:562"
     gene            639..1013
                     /gene="6-phosphogluconate dehydrogenase"
     CDS             639..>1013
                     /gene="6-phosphogluconate dehydrogenase"
                     /codon_start=1
                     /transl_table=11
                     /product="6-phosphogluconate dehydrogenase"
                     /protein_id="AAA23924.1"
                     /db_xref="GI:146244"
                     /translation="MSKQQIGVVGMAVMGRNLALNIESRGYTVSIFNRSREKTEEVIA
                     ENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLDKGDIIIDGGN
                     TFFQDTIRRNRELSAEGFNFIGT"
ORIGIN
        1 agatctcact aaaaactggg gataacgcct taaatggcga agaaacggtc taaataggct
       61 gattcaaggc atttacggga gaaaaaatcg gctcaaacat gaagaaatga aatgactgag
      121 tcagccgaga agaatttccc cgcttattcg caccttcctt aataaaacaa aaattcctaa
      181 agaaagtatc tattctgata cggttgttga ttggtgcgca ggatcattta tgctggtacg
      241 tttttcagat tttgtgcgtg taaatggctt cgatcaaggt tactttatgt actgtgaaga
      301 tattgacctg tgcttgaggc ttagcctggc tggtgtcaga cttcattatg ttcccgcttt
      361 tcatgcgata cattatgctc atcatgacaa tcgaagtttt ttttcaaaag ccttcagatg
      421 gcacttaaaa agtactttta gatatttagc cagaaaacgt attttatcaa atcgcaactt
      481 tgatcgaatt tcatcagttt ttcacccgta atataaagcc gtaagcatat aagcatggat
      541 aagctattta tactttaata agtactttgt atacttattt gcgaacattc caggccgcga
      601 gcattcagcg cggtgatcac acctgacagg agtatgtaat gtccaagcaa cagatcggcg
      661 tagtcggtat ggcagtgatg ggacgcaacc ttgcgctcaa catcgaaagc cgtggttata
      721 ccgtctctat tttcaaccgt tcccgtgaga agacggaaga agtgattgcc gaaaatccag
      781 gcaagaaact ggttccttac tatacggtga aagagtttgt cgaatctctg gaaacgcctc
      841 gtcgcatcct gttaatggtg aaagcaggtg caggcacgga tgctgctatt gattccctca
      901 aaccatatct cgataaagga gacatcatca ttgatggtgg taacaccttc ttccaggaca
      961 ctattcgtcg taatcgtgag ctttcagcag agggctttaa cttcatcggt acc
//
Let’s say I want to convert the DNA sequence from the above GenBank file to FASTA format. This is easily done with:
squizz M23181.1.genbank -c FASTA > M23181.1.fasta
The format of the input file (M23181.1.genbank) is detected automatically. The -c option specifies the format that the sequence will be converted to (FASTA). By default the converted sequence is printed to the terminal; the > M23181.1.fasta redirect saves it to a file instead. The sequence is now converted:
>ECOGNDF E.coli 6-phosphogluconate dehydrogenase gene, complete cds
agatctcactaaaaactggggataacgccttaaatggcgaagaaacggtctaaataggctgattcaaggcatttacggga
gaaaaaatcggctcaaacatgaagaaatgaaatgactgagtcagccgagaagaatttccccgcttattcgcaccttcctt
aataaaacaaaaattcctaaagaaagtatctattctgatacggttgttgattggtgcgcaggatcatttatgctggtacg
tttttcagattttgtgcgtgtaaatggcttcgatcaaggttactttatgtactgtgaagatattgacctgtgcttgaggc
ttagcctggctggtgtcagacttcattatgttcccgcttttcatgcgatacattatgctcatcatgacaatcgaagtttt
ttttcaaaagccttcagatggcacttaaaaagtacttttagatatttagccagaaaacgtattttatcaaatcgcaactt
tgatcgaatttcatcagtttttcacccgtaatataaagccgtaagcatataagcatggataagctatttatactttaata
agtactttgtatacttatttgcgaacattccaggccgcgagcattcagcgcggtgatcacacctgacaggagtatgtaat
gtccaagcaacagatcggcgtagtcggtatggcagtgatgggacgcaaccttgcgctcaacatcgaaagccgtggttata
ccgtctctattttcaaccgttcccgtgagaagacggaagaagtgattgccgaaaatccaggcaagaaactggttccttac
tatacggtgaaagagtttgtcgaatctctggaaacgcctcgtcgcatcctgttaatggtgaaagcaggtgcaggcacgga
tgctgctattgattccctcaaaccatatctcgataaaggagacatcatcattgatggtggtaacaccttcttccaggaca
ctattcgtcgtaatcgtgagctttcagcagagggctttaacttcatcggtacc
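A quick sanity check after a conversion is to confirm that no residues were lost. The GenBank LOCUS line reports 1013 bp, so the FASTA file should contain the same number of characters outside its header line. This is only a rough sketch of such a check; fasta_length is a helper name I made up for illustration and is not part of squizz. It assumes a single-record FASTA file with no trailing whitespace on the sequence lines.

```shell
# Hypothetical helper (not part of squizz): print the total number of
# sequence characters in a FASTA file, skipping ">" header lines.
fasta_length() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}
```

Running fasta_length M23181.1.fasta and comparing the number with the length on the LOCUS line is enough to catch a truncated or mangled conversion.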
Let’s convert it to something else, say EMBL:
squizz M23181.1.genbank -c EMBL > M23181.1.embl
Here’s the result:
ID   M23181; SV 1; XXX; XXX; XXX; XXX; 1013 BP.
AC   M23181;
DE   E.coli 6-phosphogluconate dehydrogenase gene, complete cds
KW   6-phosphogluconate dehydrogenase.
SQ   Sequence 1013 BP; 295 A; 206 C; 219 G; 293 T; 0 other;
     agatctcact aaaaactggg gataacgcct taaatggcga agaaacggtc taaataggct        60
     gattcaaggc atttacggga gaaaaaatcg gctcaaacat gaagaaatga aatgactgag       120
     tcagccgaga agaatttccc cgcttattcg caccttcctt aataaaacaa aaattcctaa       180
     agaaagtatc tattctgata cggttgttga ttggtgcgca ggatcattta tgctggtacg       240
     tttttcagat tttgtgcgtg taaatggctt cgatcaaggt tactttatgt actgtgaaga       300
     tattgacctg tgcttgaggc ttagcctggc tggtgtcaga cttcattatg ttcccgcttt       360
     tcatgcgata cattatgctc atcatgacaa tcgaagtttt ttttcaaaag ccttcagatg       420
     gcacttaaaa agtactttta gatatttagc cagaaaacgt attttatcaa atcgcaactt       480
     tgatcgaatt tcatcagttt ttcacccgta atataaagcc gtaagcatat aagcatggat       540
     aagctattta tactttaata agtactttgt atacttattt gcgaacattc caggccgcga       600
     gcattcagcg cggtgatcac acctgacagg agtatgtaat gtccaagcaa cagatcggcg       660
     tagtcggtat ggcagtgatg ggacgcaacc ttgcgctcaa catcgaaagc cgtggttata       720
     ccgtctctat tttcaaccgt tcccgtgaga agacggaaga agtgattgcc gaaaatccag       780
     gcaagaaact ggttccttac tatacggtga aagagtttgt cgaatctctg gaaacgcctc       840
     gtcgcatcct gttaatggtg aaagcaggtg caggcacgga tgctgctatt gattccctca       900
     aaccatatct cgataaagga gacatcatca ttgatggtgg taacaccttc ttccaggaca       960
     ctattcgtcg taatcgtgag ctttcagcag agggctttaa cttcatcggt acc             1013
//
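The SQ line in the EMBL output carries a base composition (295 A; 206 C; 219 G; 293 T; 0 other), which adds up to the expected 1013 BP. If I want to recompute those numbers independently, something like the following awk sketch should do. base_counts is a hypothetical helper of mine, not a squizz feature, and it assumes it is pointed at the FASTA file from the earlier conversion. Any character other than a, c, g or t (including stray whitespace) is counted as "other".

```shell
# Hypothetical helper (not part of squizz): count a, c, g, t and everything
# else in the sequence lines of a FASTA file, ignoring case and ">" headers.
base_counts() {
    awk '!/^>/ {
        seq = tolower($0)
        # gsub returns the number of replacements, i.e. the base count,
        # and removes the counted bases from seq.
        a += gsub(/a/, "", seq)
        c += gsub(/c/, "", seq)
        g += gsub(/g/, "", seq)
        t += gsub(/t/, "", seq)
        # Whatever is left on the line is neither a, c, g nor t.
        other += length(seq)
    } END {
        printf "%d A; %d C; %d G; %d T; %d other\n", a, c, g, t, other
    }' "$1"
}
```

The output is laid out like the tail of the SQ line on purpose, so base_counts M23181.1.fasta can be compared with it by eye.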
Squizz can also convert between different multiple sequence alignment formats. So useful!
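When I need the same file in several formats at once, the -c call can be wrapped in a small shell function. This is a minimal sketch rather than anything squizz provides: convert_to_formats and its output naming scheme (the input name with its extension replaced by the lowercased format name) are my own assumptions. It relies only on the -c option used above and on squizz detecting the input format by itself, so I would expect it to behave the same for an alignment file, but I have only shown it for the sequence example here.

```shell
# Hypothetical wrapper (not part of squizz): convert one input file to each
# of the listed formats using "squizz <file> -c <FORMAT>".
# Output goes to <input without extension>.<format in lowercase>.
convert_to_formats() {
    input=$1
    shift
    base=${input%.*}
    for fmt in "$@"; do
        ext=$(printf '%s' "$fmt" | tr '[:upper:]' '[:lower:]')
        out="$base.$ext"
        # Never redirect onto the input file itself; that would truncate it.
        if [ "$out" = "$input" ]; then
            continue
        fi
        squizz "$input" -c "$fmt" > "$out" || return 1
    done
}
```

For the example above, convert_to_formats M23181.1.genbank FASTA EMBL would write M23181.1.fasta and M23181.1.embl next to the original file.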