FASTA syntax highlight

FASTA is among the most common text formats for nucleotide or amino acid sequences. I use medit, and a feature I have long wanted is syntax highlighting for FASTA files: colour-coded nucleotides and amino acids, similar to how the editor displays scripting or programming languages. After a long search, all I found was the fasta.lang on GNOME’s wiki and the fastalang project on GitHub. Both are for Gedit and support only nucleotide sequences. Neither did the job for me, although I very much appreciate both projects, so I decided to do things myself.

I want amino acids to be colour coded too, not just nucleotides. This creates a problem when assigning colours: A, T, C and G stand for adenine, thymine, cytosine and guanine, but they are also the one-letter codes for the amino acids Ala (alanine), Thr (threonine), Cys (cysteine) and Gly (glycine). Still, I think I managed to pick colours good enough to work around this problem.

Here is the fasta.lang file I made (there’s a download link at the bottom of the page):

<?xml version="1.0" encoding="UTF-8"?>
<!--
 Copyright 2017 Petar Petrov
 All rights reserved.

 Thanks to https://github.com/wrf/fastalang

 Redistribution and use of this script, with or without modification, is
 permitted provided that the following conditions are met:

 1. Redistributions of this script must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED
  WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
  MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
  EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
  ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
<language id="fasta" _name="FASTA" version="2.0" _section="Scientific">
	<metadata>
		<property name="mimetypes">text/fasta</property>
		<property name="globs">*.fa;*.faa;*.fst;*.fas;*.fasta</property>
	</metadata>
	<styles>
		<style id="header" _name="Header" map-to="def:type"/>
		<style id="a" _name="Ala (alanine) or A (adenine)" map-to="fasta:a"/>
		<style id="c" _name="Cys (cysteine) or C (cytosine)" map-to="fasta:c"/>
		<style id="g" _name="Gly (glycine) or G (guanine)" map-to="fasta:g"/>
		<style id="t" _name="Thr (threonine) or T (thymine)" map-to="fasta:t"/>
		<style id="n" _name="Asn (Asparagine) or any Nucleotide" map-to="fasta:n"/>
		<style id="r" _name="Arg (Arginine)" map-to="fasta:r"/>
		<style id="d" _name="Asp (Aspartic acid)" map-to="fasta:d"/>
		<style id="q" _name="Gln (Glutamine)" map-to="fasta:q"/>
		<style id="e" _name="Glu (Glutamic acid)" map-to="fasta:e"/>
		<style id="h" _name="His (Histidine)" map-to="fasta:h"/>
		<style id="i" _name="Ile (Isoleucine)" map-to="fasta:i"/>
		<style id="l" _name="Leu (Leucine)" map-to="fasta:l"/>
		<style id="k" _name="Lys (Lysine)" map-to="fasta:k"/>
		<style id="m" _name="Met (Methionine)" map-to="fasta:m"/>
		<style id="f" _name="Phe (Phenylalanine)" map-to="fasta:f"/>
		<style id="p" _name="Pro (Proline)" map-to="fasta:p"/>
		<style id="s" _name="Ser (Serine)" map-to="fasta:s"/>
		<style id="w" _name="Trp (Tryptophan)" map-to="fasta:w"/>
		<style id="y" _name="Tyr (Tyrosine)" map-to="fasta:y"/>
		<style id="v" _name="Val (Valine)" map-to="fasta:v"/>
		<style id="x" _name="Any amino acid or not nucleotide" map-to="fasta:x"/>

	</styles>

	<default-regex-options case-sensitive="false"/>
	<definitions>
		<context id="header" style-ref="header" end-at-line-end="true">
			<start>&gt;</start>
		</context>

		<context id="a" style-ref="a">
			<match>[A]+</match>
		</context>

		<context id="c" style-ref="c">
			<match>[C]+</match>
		</context>

		<context id="g" style-ref="g">
			<match>[G]+</match>
		</context>

		<context id="t" style-ref="t">
			<match>[TU]+</match>
		</context>

		<context id="n" style-ref="n">
			<match>[N]+</match>
		</context>

		<context id="r" style-ref="r">
			<match>[R]+</match>
		</context>
		
		<context id="d" style-ref="d">
			<match>[D]+</match>
		</context>
		
		<context id="q" style-ref="q">
			<match>[Q]+</match>
		</context>

		<context id="e" style-ref="e">
			<match>[E]+</match>
		</context>
		
		<context id="h" style-ref="h">
			<match>[H]+</match>
		</context>

		<context id="i" style-ref="i">
			<match>[I]+</match>
		</context>

		<context id="l" style-ref="l">
			<match>[L]+</match>
		</context>
		
		<context id="k" style-ref="k">
			<match>[K]+</match>
		</context>
		
		<context id="m" style-ref="m">
			<match>[M]+</match>
		</context>

		<context id="f" style-ref="f">
			<match>[F]+</match>
		</context>
		
		<context id="p" style-ref="p">
			<match>[P]+</match>
		</context>

		<context id="s" style-ref="s">
			<match>[S]+</match>
		</context>
		
		<context id="w" style-ref="w">
			<match>[W]+</match>
		</context>
		
		<context id="y" style-ref="y">
			<match>[Y]+</match>
		</context>

		<context id="v" style-ref="v">
			<match>[V]+</match>
		</context>
		
		<context id="x" style-ref="x">
			<match>[X]+</match>
		</context>

		<context id="fasta">
			<include>
				<context ref="header"/>
				<context ref="a"/>
				<context ref="c"/>
				<context ref="g"/>
				<context ref="t"/>
				<context ref="n"/>
				<context ref="r"/>
				<context ref="d"/>
				<context ref="q"/>
				<context ref="e"/>
				<context ref="h"/>
				<context ref="i"/>
				<context ref="l"/>
				<context ref="k"/>
				<context ref="m"/>
				<context ref="f"/>
				<context ref="p"/>
				<context ref="s"/>
				<context ref="w"/>
				<context ref="y"/>
				<context ref="v"/>
				<context ref="x"/>
			</include>
		</context>
	</definitions>
</language>
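
The contexts above define one rule per letter. A more compact variant could group residues that are meant to share a colour into a single context. The fragment below is only a sketch of that idea: the ids `positive` and `negative` are made-up names, and it assumes that a style declared without `map-to` is looked up in the colour scheme as `fasta:<id>`, which has not been verified in medit.

```xml
<!-- In fasta.lang, inside <styles>: one style per group (hypothetical ids) -->
<style id="positive" _name="Positively charged (Arg, His, Lys)"/>
<style id="negative" _name="Negatively charged (Asp, Glu)"/>

<!-- In fasta.lang, inside <definitions>: one context per group -->
<context id="positive" style-ref="positive">
	<match>[RHK]+</match>
</context>

<context id="negative" style-ref="negative">
	<match>[DE]+</match>
</context>

<!-- In the colour scheme file: one entry per group instead of one per residue -->
<style name="fasta:positive" foreground="scarletred1"/>
<style name="fasta:negative" foreground="skyblue2"/>
```

The grouped contexts would take the place of the separate `r`, `h`, `k`, `d` and `e` contexts, and the `fasta` context would need to reference `positive` and `negative` instead of those five.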

Unfortunately, fasta.lang cannot be used with the existing colour schemes right away, since they lack the language-specific styles it refers to. My colour scheme of choice is Tango. For medit, it is found in /usr/share/medit-1/language-specs/tango.xml. I took it from there, renamed the file to biotango.xml and modified it as follows (there’s a download link at the bottom of the page):

<?xml version="1.0" encoding="UTF-8"?>
<!--

 Copyright (C) 2006-2007 GtkSourceView team
 Author: Michael Monreal <michael.monreal@gmail.com>

 Modified for 'fasta.lang' by Petar Petrov 2017
 
 This library is free software; you can redistribute it and/or
 modify it under the terms of the GNU Library General Public
 License as published by the Free Software Foundation; either
 version 2 of the License, or (at your option) any later version.

 This library is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 Library General Public License for more details.

 You should have received a copy of the GNU Library General Public
 License along with this library; if not, write to the
 Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 Boston, MA 02111-1307, USA.

-->
<style-scheme id="biotango" _name="BioTango" version="1.0">
  <author>Michael Monreal</author>
  <_description>Color scheme using Tango color palette</_description>

  <!-- Tango Palette -->
  <color name="butter1"                     value="#fce94f"/>
  <color name="butter2"                     value="#edd400"/>
  <color name="butter3"                     value="#c4a000"/>
  <color name="chameleon1"                  value="#8ae234"/>
  <color name="chameleon2"                  value="#73d216"/>
  <color name="chameleon3"                  value="#4e9a06"/>
  <color name="orange1"                     value="#fcaf3e"/>
  <color name="orange2"                     value="#f57900"/>
  <color name="orange3"                     value="#ce5c00"/>
  <color name="skyblue1"                    value="#729fcf"/>
  <color name="skyblue2"                    value="#3465a4"/>
  <color name="skyblue3"                    value="#204a87"/>
  <color name="plum1"                       value="#ad7fa8"/>
  <color name="plum2"                       value="#75507b"/>
  <color name="plum3"                       value="#5c3566"/>
  <color name="chocolate1"                  value="#e9b96e"/>
  <color name="chocolate2"                  value="#c17d11"/>
  <color name="chocolate3"                  value="#8f5902"/>
  <color name="scarletred1"                 value="#ef2929"/>
  <color name="scarletred2"                 value="#cc0000"/>
  <color name="scarletred3"                 value="#a40000"/>
  <color name="aluminium1"                  value="#eeeeec"/>
  <color name="aluminium2"                  value="#d3d7cf"/>
  <color name="aluminium3"                  value="#babdb6"/>
  <color name="aluminium4"                  value="#888a85"/>
  <color name="aluminium5"                  value="#555753"/>
  <color name="aluminium6"                  value="#2e3436"/>

  <!-- legacy styles for old lang files: do NOT use them in lang files -->
  <style name="Others"                      foreground="chameleon3" bold="true"/>
  <style name="Others 2"                    foreground="chameleon3"/>
  <style name="Others 3"                    foreground="plum3"/>

  <!-- Bracket Matching -->
  <style name="bracket-match"               foreground="aluminium1" background="aluminium3" bold="true"/>
  <style name="bracket-mismatch"            foreground="aluminium1" background="scarletred3" bold="true"/>

  <!-- Right Margin -->
  <style name="right-margin"                foreground="aluminium5" background="aluminium4"/>
  
  <!-- Search Matching -->
  <style name="search-match"                background="butter1"/>  

  <!-- Comments -->
  <style name="def:comment"                 foreground="skyblue3"/>
  <style name="def:shebang"                 foreground="skyblue3" bold="true"/>
  <style name="def:doc-comment-element"     italic="true"/>

  <!-- Constants -->
  <style name="def:constant"                foreground="plum1"/>
  <style name="def:special-char"            foreground="plum3"/>

  <!-- Identifiers -->
  <style name="def:identifier"              foreground="skyblue1"/>

  <!-- Statements -->
  <style name="def:statement"               foreground="scarletred3" bold="true"/>

  <!-- Types -->
  <style name="def:type"                    foreground="chameleon3" bold="true"/>

  <!-- Others -->
  <style name="def:preprocessor"            foreground="chocolate3"/>
  <style name="def:error"                   background="scarletred2" bold="true"/>
  <style name="def:note"                    background="orange1" bold="true"/>
  <style name="def:underlined"              italic="true" underline="true"/>

  <!-- Language specific -->
  <style name="diff:added-line"             foreground="chameleon3"/>
  <style name="diff:removed-line"           foreground="plum3"/>
  <style name="diff:changed-line"           use-style="def:preprocessor"/>
  <style name="diff:diff-file"              use-style="def:type"/>
  <style name="diff:location"               use-style="def:statement"/>
  <style name="diff:special-case"           use-style="def:constant"/>

  <style name="xml:tags"                    foreground="chameleon3"/>
  <style name="xml:namespace"               bold="true"/>

  <style name="js:object"                   foreground="chameleon3" bold="true"/>
  <style name="js:constructors"             foreground="chameleon3"/>

  <style name="latex:display-math"          foreground="plum3"/>
  <style name="latex:command"               foreground="chameleon3" bold="true"/>
  <style name="latex:include"               use-style="def:preprocessor"/>

  <style name="sh:variable"                 foreground="plum3"/>
  <style name="sh:variable-definition"      foreground="chameleon3"/>

  <style name="fasta:a"                     foreground="chameleon3" />
  <style name="fasta:c"                     foreground="butter3" />
  <style name="fasta:g"                     foreground="chocolate3" />
  <style name="fasta:t"                     foreground="plum3" />
  <style name="fasta:n"                     foreground="aluminium3" />
  <style name="fasta:r"                     foreground="scarletred1" />
  <style name="fasta:d"                     foreground="skyblue2" />
  <style name="fasta:q"                     foreground="aluminium3" />
  <style name="fasta:e"                     foreground="skyblue2" />  
  <style name="fasta:h"                     foreground="scarletred1" />
  <style name="fasta:i"                     foreground="chameleon2" />
  <style name="fasta:l"                     foreground="chameleon2" />
  <style name="fasta:k"                     foreground="scarletred1" />
  <style name="fasta:m"                     foreground="chameleon1" />
  <style name="fasta:f"                     foreground="chameleon3" />
  <style name="fasta:p"                     foreground="orange1" />
  <style name="fasta:s"                     foreground="plum1" />
  <style name="fasta:w"                     foreground="chameleon3" />
  <style name="fasta:y"                     foreground="plum1" />
  <style name="fasta:v"                     foreground="orange1" />
  <style name="fasta:x"                     foreground="#000000" />
</style-scheme>

I saved both files to ~/.local/share/medit-1/language-specs/ and selected the new colour scheme from Preferences > Color Scheme > BioTango.

How does it look?

I want to say it very clearly:
I know (almost) nothing about XML syntax, lang files and so on. I simply inspected what others had done and adjusted things to my needs. So if you know the subject well and are reading this, please let me know whether I made the modifications correctly.

I made a tarball containing both files (use at your own risk):
biolang-0.1.tar.gz

This is just a start, but I am very interested in improving my fasta.lang and BioTango. I could play with the colours, assign a more sophisticated scheme to the amino acids, or make several versions that follow the colour conventions of SeaView, Ugene and other programs. I like the Tango colour scheme, but the other standard schemes can be modified to support fasta.lang as well. I suppose fasta.lang and BioTango will work in Gedit (at least version 2), but I have not tried. If someone makes it work in Gedit, or wants me to investigate ;), please let me know.
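
Instead of modifying every colour scheme, it might be enough to point the FASTA styles at default styles through `map-to`, which is how the `header` style already picks up `def:type`. The fragment below is a sketch under that assumption. The pairings are arbitrary and chosen only for illustration, and how medit resolves them is untested.

```xml
<styles>
	<style id="header" _name="Header" map-to="def:type"/>
	<!-- Illustrative pairings only: each def: style used here is one that tango.xml already defines -->
	<style id="a" _name="Ala (alanine) or A (adenine)" map-to="def:comment"/>
	<style id="c" _name="Cys (cysteine) or C (cytosine)" map-to="def:identifier"/>
	<style id="g" _name="Gly (glycine) or G (guanine)" map-to="def:preprocessor"/>
	<style id="t" _name="Thr (threonine) or T (thymine)" map-to="def:special-char"/>
</styles>
```

The obvious limitation is that there are far fewer default styles than residues, so a full amino acid alphabet mapped this way would have several letters sharing a colour. A dedicated scheme such as BioTango remains the way to give each residue its own colour.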

I will appreciate feedback. And help?


One Comment on “FASTA syntax highlight”

  1. Shaarazad says:

    This looks good! My advice would be maybe to highlight methionine and tryptophan more, because they’re the ones you phase or build your model from, it would be easier to spot them right away for use in model building.

