Cleaning up with sed 2
Posted: 2017-07-08 Filed under: academic | Tags: clean, fasta, grep, sed, sequence
I have a list of nucleotide sequences in FASTA format, obtained from NCBI. I want to clean up the sequence names, leaving only the accession numbers. Additionally, I want to extract the accession numbers to a separate list. I am not that familiar with sed, since I use it mainly in my SlackBuilds to fix small things. Therefore, for my future reference, I decided to sum things up.
As an example, let’s take a few sequences:
>lcl|AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45] [protein_id=CAB96211.1] [location=161..3901] ATGGCAGGTTATTTTGGTCTGAATATCCTGCTCTGGGCTGGAATCGTCACTTTGGCCTCCTCCTCCACTA CCTCTCCTGCACTCACCACTCATGCAAGCACCGTGACAGAGGCTAATGTCACGCTTTTAGTAGCGCCCCT GACCTCCACTCACAACTCATCATCAAACCTCTCCACCAATCTAAGCCTGAGTTCGACCACCCCAGCCGCT GCCAGTGAGACCACGGTCTCCTTGACGACCCCAACGACCCCGATGGCCCTGACTAACATCTCAGCTATAA CAAGCTCAACAAATGCTACCAGTGATGCCCCCACGCAGCCTCCTGCAGCTCCACAGTGTTCTTACACTCT GAAGCCCATTCCGTTTGGCTTCAGACTCACCATCAGTAATTTTACAACTGCTGGATTATACAACCTAACT CTGCAGGAGCAGGACGATATGACAGAGACAACATATTCTCCTGATAAAACATCTTTGGACATCTTGGACC TGAAACCCTGCACCAGCTACAGGCACTGGCTCTCTCTCACACTTCAACACAGCGCCATCGACTGCAATCA GACTGTAAATACCACCGCTACCCTGCAACCGAAGATGGAAGACTTTCAAAATGTCAAATCCTCACCTGGA AGCATCTGTTATCAGAGCAAATGGAACATCCGATCTCTCCTAACTCAACCCAACAACATGCAACTGCGCG ACGATGACACTTTCTGCATCAGATACGATTCAAAGGACATTTGCACGAATTTCACAAAAAACGTTTCCAC AAGCTGTGGTTCGTTTCCCTTCACCAAATTCATCGATCTTGTTCCAAGTCACATAACTATGAGACGTTTA GGTGTATATCCTGAAAAAATAGAGACAACGTTCCCTCCAAACTGTAAAAACTGGACCGTGGAACACGTCT GTTCAGAAAAGGGAAGCCCCCACCAGAATAAGAGCCTGACAGAGTTGGAGCCCTACACAGACTACAGCTG TGCTGCTGTCGTCACGCGTGGGACCGTCTCCATGTGGACGGAGGACATTGACATCAGGATCAACTGCAAT CTGCTAATATATGAATGCAGTAAATCCAATGTTTTAACCAACACCTCCATGAATCTGCACTGGAACTTGA CGAGCTCCGTCTGTAAAGACCTTTTTACCCGGTTCAACTTTTCCTACAACTGCTACTGTGACAACCCTGT AAAAGGAAGTAAAGATGTTCCTCTAAAGTCTCAGCAAGTATCGTGTTCATTTTCTGCCCTTGATGCCTTC GAAGACTACACGTGTGGGATCAACGTCAAATACAAGAAGTTCTCCGTCCTCGACCGGGTGTTTCTTTATA GGACTGAACCTGGAATACCAGAAAGGCCACCTCAGCTGCATTTGGATGTTCATGAGCACAATCAGATTAC AGTTAAAGTTGATAAAATCAGCAGATTTAACGGACCGTCGAAGTTTTACATTGTGCGCCTGTATGAAGGC AAAACCCATAAGGAAACCAGGAATGGAACCAAGCCGTCATTTGTATTTAAAGATCTCAGTTACTCCACAG AGTACACCGTGCAGGCGTCCACATTCAATGGCCATTTTGAGAGCAGTCCACACACAAGAACCACTTCCAC CTTCTACAATGAACCCGCTCTGATCAATTTTCTGATCTTCCTCATCATCGTCACATCTGTGGCTCTGATA CTTGTGGCTTATAAGATTTATGTCCTGAAGAGGAAGAGGTCCCGAGACACCAGTGAAAGCATGATTCTTA TACCCAAAACTAACGATGAAGAAAAGCTGATATTTGTTGAGCCCCTTACATCAGAAACGCTGTTGGATGC 
CTACAAGAGAAAGATTGCTGATGAAGGAAGACTTTTCCTGGCTGAATTCCAGAGCATCCCAAGAATATTC TCCAAGTACACCGTGAAAGAGGCCAAAAAGTCCCACAATGTCCCTAAGAACCGCTATGTGGACATCCTGC CATATGATTATAATCGGGTCCAACTGACCACTGGGAACGGCAGCGCAGGCTGTGACTACATCAACGCCAG CTTTATAGACGGGTTCAAGGAATTAAAAAAGTACATAGCAGCTCAAGGTCCGAAGGAGGAGACTGTGAGT AACTTTTGGAGGATGATCTGGGAGCAGCAGACCTCCATCATCGTCATGGTTTCACGCTGTGAAGAGGGAA ACAGGATAAAGTGTGCTCAGTACTGGCCATCAGAAGATCGAGACACCGAGATCTTTGAGGAGTTTATAGT AAAGCTGACCTCAGAAGACCACTACCCCGACTACATCATTCGCCATCTGAGTCTAACAAATAAGAAGGAT AAGGGTTCAGAGAGGGAGGTGACTCACATCCAGTTCATGAGCTGGCCCGACCACGGCGTTCCCGAGGAGG CGCAACTCCTCCTGAAACTGAGGCGCCGAGTAAACTCATTCAAGAATTTCTTCAGTGGTCCCATCGTCGT CCACTGCAGTGCTGGAGTCGGCAGGACGGGTACCTACATCGGCATCGATGCCATGATGGAGAGTCTGGAG GCCGAAGGAAGGGTGGACATCTACGGCTACGTGGTCATGCTCCGCCGACAAAGATGTCTGATGGTTCAGG TTGAGGCTCAGTACATCCTGATTCACCAGGCGCTGCTGGAGCACACCCAGTTTGGTGAGACCGAGAACAC CCTGCAGGAGCTCCACAGCACGCTGAACACACTCAAACAGAGAAGCTCCGACAATGAGCCAACTTTACTG GAGGATGAGTTTGAACGACTCCCTAACTTCAAAAACTGGAGGACCTTCAACACCGGCGTTACAGAAGAAA ACAAGAAGAAGAATCGCACTTCATCCGTCATCCCATATGACTACAACAGAGTATTGCTGAAACTCGACGA AGAAAAAAGCCATGAGAGCGATCCCGATCCCGATGACGATTACGATACGTCCTCATACGATGACATCGAG GACAGCGCCAAGTACATCAATGCCTCCCTCTTAAGTGGTTACTGGGGCCCGCACACCTTCATCTCAGCAC AGACGCCTCTGCCAGACACTGTGGCTGACTTCTGGTCCATGATTCTCCAGAAAAGAGTATCTATTGTTGT CATGCTCTCCGACTGCTCTCAGGGAGACGAGGACTGTGTTTACTGGGGTGAGGACAAGAAGACAGTTGAG GACATTGAAGTGGAGACTGAGAGCACAGACAACTCCCCCAATTTCATCCTCCGCAACATGATGATCCATC ACCTGAAGACCGACGCGCACCAGCAAGTGAAACACTTCCAGTTCCTGAAGTGGGGGGACAAAGAGGTGCC GGAGAAACCACAAGATCTGGCCGACCTGATCAAGGAGATCAAGCACAGATGTGGCTACACGTGGCCGAGG AGCACGCCCATCATTGTCCACTGCAACGACGGCTCCTCCCGTTCAGGGGCCTTCTGTGCCCTGTGGAACC TTCTGGACAACGCTGAGAAAGAGAAGATGGTGGATGTTTTCCAGGTGGTCAAGACTCTTCGGAAGGAGAG ACAGGGCATGTGCCCCAGTCTGGAGCAGTACCAGTTCTTGTATGATGCACTGGAAGTGGTTTATCCAGTC CAGAACGGGGATGTGAAAGCAGCACAGAACTCTGTCCAGATTGTCAATGAAACAGCAGAGCAGCAGGCCA GCACTACCCGCACTGACCACCAGGAGGCAGAAGAGGGTGCCGATGGAGACGTTTCCACGGCGACTGGGGA GAAGAGCAGCACTGTCACCGTTGAGGTTTGA >lcl|AJ243430.1_cds_CAB96212.1_1 [gene=PTPRc] 
[protein=CD45] [protein_id=CAB96212.1] [location=join(599..659,2831..2977,3110..3208,3355..3384,3577..3825,3908..4126,4212..4319,4397..4528,4608..4754,5751..5900,6001..6209,6280..6340,6413..6521,6600..6637,6833..6942,7447..7537,7724..7800,7926..7962,8039..8136,8230..8355,8478..8635,8778..8913,9359..9508,9911..10001,10095..10234,10483..10617,10808..10924,11009..11166,11399..11534,11676..11894)] ATGGCAGGTTATTTTGGTCTGAATATCCTGCTCTGGGCTGGAATCGTCACTTTGGCCTCCTCCTCCACTA CCTCTCCTGCACTCACCACTCATGCAAGCACCGTGACAGAGGCTAATGTCACGCTTTTAGTAGCGCCCCT GACCTCCACTCACAACTCATCATCAAACCTCTCCACCAATCTAAGCCTGAGTTCGACCACCCCAGCCGCT GCCAGTGAGACCACGGTCTCCTTGACGACCCCAACGACCCCGATGGCCCTGACTAACATCTCAGCTATAA CAAGCTCAACAAATGCTACCAGTGATGCCCCCACGCAGCCTCCTGCAGCTCCACAGTGTTCTTACACTCT GAAGCCCATTCCGTTTGGCTTCAGACTCACCATCAGTAACTTTACAACTGCTGGATTATACAACCTAACT CTGCAGGAGCAGGACGATATGACAACACATTCTCCTGATAAAACATCTTTGGACATCTTGGACCTGAAAC CCTGCACCAGCTACAGGCACTGGCTCTCTCTCACACTTCAACACAGCGCCATCGACTGCAACCAGACTGT AAATACCACCGCTACCCTGCAACCGAAGATGGAAGACTTTCAAAATGTCAAATCCTCACCTGGAAGCATC TGTTATCAGAGCAAATGGAACATCCGATCTCTCCTAACTCAACCCAACAACATGCAACTGCGCGACGATG ACACTTTCTGCATCAGATACGATTCAAAGGACATTTGCACGAATTTCACAAAAAACGTTTCCACAAGCTG TGGTTCGTTTCCCTTCACCAAATTCATCGATCTTGTTCCAAGTCACATAACTATGAGACGTTTAGGTGTA TATCCTGAAAAAATAGAGACAACGTTCCCTCCAAACTGTAAAAACTGGACCGTGGAACACGTCGTCTGTT CAGAAAAGGGAAGCCCCCACCAGAATAAGAGCCTGAAAGAGTTGGAGCCCTACACAGACTACAGCTGTGC TGCTGTCGTCACGCGTGGGACCGTCTCCATGTGGACGGAGGACATTGACATCAGGATCAACTGCAATCTG ATAATATATGAATGCAGTAAATCCAATGTTTTAACCAACACCTCCATGAATCTGCACTGGAACTTGACGA GCTCCGTCTGTAAAGACCTTTTTACCCGGTTCAACTTTTCCTACAACTGCTACTGTGACAACCCTGTAAA AGGAAGTAAAGATGTTTTTATAAAGTCTCAGCAAGTATCGTGTTCATTTTCTTCCCTTGATGCCTTCAAA GACTACACGTGTGGGATCAACGTCAAATACAAGAAGTTCTCCGTCCTCGACCGGGTGTTTCTTTATAGGA CTGAACCTGGAATACCAGAAAGGCCACCTCAGCTGCATTTGGATGTTTATGAGCACAATCAGATTACAGT TAAAGTTGATAAAATCAGCAGATTTAACGGACCGTCGAAGTTTTACATTGTGCGCCTGTATGAAGGCAAA ACCCATAAGGAAACCAGGAATGGAACCAAGCCGTCATTTGTATTTAAAGATCTCAGTTACTCCACAGAGT 
ACACCGTGCAGGCGTCCACATTCAATGGCCATTTTGAGAGCAGTCCACACACAAGAACCACTTCCACCTT CTACAATGAACCCGCTCTGATCAATTTTCTGATCTTCCTCATCATCGTCACATCTGTGGCTCTGATACTT GTGGCTTATAAGATTTATGTCCTGAAGAGGAAGAGGTCCCGAGACACCAGTGAAAGCATGATTCTTATAC CCAAAACTAACGATGAAGAAAAGCTGATATTTGTTGAGCCCCTTACATCAGAAACGCTGTTGGATGCCTA CAAGAGAAAGATTGCTGATGAAGGAAGACTTTTCCTGGCTGAATTCCAGAGCATCCCAAGAATATTCTCC AAGTACACCGTGAAAGAGGCCAAAAAGTCCCACAATGTCCCTAAGAACCGCTATGTGGACATCCTGCCAT ATGATTATAATCGGGTCCAACTGACCACTGGGAACGGCAGCGCAGGCTGTGACTACATCAACGCCAGCTT TATAGACGGGTTCAAGGAATTAAAAAAGTACATAGCAGCTCAAGGTCCGAAGGAGGAGACTGTGAGTAAC TTTTGGAGGATGATCTGGGAGCAGCAGACCTCCATCATCGTCATGGTTTCACGCTGTGAAGAGGGAAACA GGATAAAGTGTGCTCAGTACTGGCCATCAGAAGATCGAGACACCGAGATCTTTGAGGAGTTTATAGTAAA GCTGACCTCAGAAGACCACTACCCCGACTACATCATTCGCCATCTGAGTCTAACAAATAAGAAGGATAAG GGTTCAGAGAGGGAGGTGACTCACATCCAGTTCATGAGCTGGCCCGACCACGGCGTTCCCGAGGAGGCGC AACTCCTCCTGAAACTGAGGCGCCGAGTAAACTCATTCAAGAATTTCTTCAGTGGTCCCATCGTCGTCCA CTGCAGTGCTGGAGTCGGCAGGACGGGTACCTACATCGGCATCGATGCCATGATGGAGAGTCTGGAGGCC GAAGGAAGGGTGGACATCTACGGCTACGTGGTCATGCTCCGCCGACAAAGATGTCTGATGGTTCAGGTTG AGGCTCAGTACATCCTGATTCACCAGGCGCTGCTGGAGCACACCCAGTTTGGTGAGACCGAGAACACCCT GCAGGAGCTCCACAGCACGCTGAACACACTCAAACAGAGAAGCTCCGACAATGAGCCAACTTTACTGGAG GATGAGTTTGAACGACTCCCTAACTTCAAAAACTGGAGGACCTTCAACACCGGCGTTACAGAAGAAAACA AGAAGAAGAATCGCACTTCATCCGTCATCCCATATGACTACAACAGAGTATTGCTGAAACTCGACGAAGA AAAAAGCCATGAGAGCGATCCCGATCCCGATGACGATTACGATACGTCCTCATACGATGACATCGAGGAC AGCGCCAAGTACATCAATGCCTCCCTCTTAAGTGGTTACTGGGGCCCGCACACCTTCATCTCAGCACAGA CGCCTCTGCCAGACACTGTGGCTGACTTCTGGTCCATGATTCTCCAGAAAAGAGTATCTATTGTTGTCAT GCTCTCCGACTGCTCTCAGGGAGACGAGGACTGTGTTTACTGGGGTGAGGACAAGAAGACAGTTGAGGAC ATTGAAGTGGAGACTGAGAGCACAGACAACTCCCCCAATTTCATCCTCCGCAACATGATGATCCATCACC TGAAGACCGACGCGCACCAGCAAGTGAAACACTTCCAGTTCCTGAAGTGGGGGGACAAAGAGGTGCCGGA GAAACCACAAGATCTGGCCGACCTGATCAAGGAGATCAAGCACAGATGTGGCTACACGTGGCCGAGGAGC ACGCCCATCATTGTCCACTGCAACGACGGCTCCTCCCGTTCAGGGGCCTTCTGTGCCCTGTGGAACCTTC TGGACAACGCTGAGAAAGAGAAGATGGTGGATGTTTTCCAGGTGGTCAAGACTCTTCGGAAGGAGAGACA 
GGGCATGTGCCCCAGTCTGGAGCAGTACCAGTTCTTGTATGATGCACTGGAAGTGGTTTATCCAGTCCAG AACGGGGATGTGAAAGCAGCACAGAACTCTGTCCAGATTGTCAATGAAACAGCAGAGCAGCAGGCCAGCA CTACCCGCACTGACCACCAGGAGGCAGAAGAGGGTGCCGATGGAGACGTTTCCACGGCGACTGGGGAGAA GAGCAGCACTGTCACCGTTGAGGTTTGA >lcl|AB031424.1_cds_BAA92179.1_1 [protein=CD45] [protein_id=BAA92179.1] [location=149..3799] ATGGCGAGATTTTTTGCGCTAGGACCCTTGGTTTTGGTCCTATGCTTGGTCTTATTCCTCTCACCCACAA GCACTTCTGAACAAGACACAGACACAAAAGTAACAGGTTCTTCTAACTCACCCACCATCATGCAATCCGC TACTCCTACTCCTGCCTTACCTACAGGTTCAGTAAAGGGAGTTAAAGCAAATGATTCAGGCACCACAGAA GTCCCATGTCAATACCGACTTATAGTGAATGGCTATGAAAAAAATTCCCTTTTGGTGAATATCAATGGCT CCAAAAGCCAAAATTACACAATCAGGATAAAGGATAAGAAGAGTGAGAAGAAAATCCCTGTCAGTATTGT ATCTGACACAACTACCATCGACATACTGTTTGAATGGCTAAAACCATGCACTAATTACAGTGTAAACATA GATAACTGCAATGTTGTCGGAGAAAACCATTTTAATTTGAATGTCTTGAAAAGTGTAAACCATTCTGCAG AACCTAAGGGTGATGAGCAAGTGTGTTTTAAAGATAGCAAATGGAATTTGACAAAATGTGTTAATATAAC AACTGAAGACTCTTGTGTTCAGAAGCCCATTCAGATTAATCTAGAAACATGCAACTACACAATGAATGTC CACATGCCCCCAGTCAAACCTGAAATAACCTTCAGTGAATCTATTCCAACTAAGTTTGTGTGGAGTAATA AGCCACTCAGTTCAGACAATGTAACGAGTCAGTGTAAAGACACATTTAAGGTCAAGTGCACTGATGACAC TGAGTATGATTTGGATAAGGATGTGTTTTTGAAACCTTCAGAAGAATACACATGCACTGGAAAATATAAA TATTACAATACGTCTATTGATAGCAAATCAAAACAAGTTAACATCCATTGCGAATGGATAAATGAAACTG TGTTTAGTGAAGTAACCAGCAACTCAATTGTGATTTTCTGGAATCTGTCCCCAGGGGATAAATGCAAAGG AATTACATGGGAAAAATTTTCAGCCATTTGTAAAAAATCTCATGATCAAAAATGCAGAAAAGTAAATGAC ACAACTGCAGAATGCGTATTTTCTGAATTAAAGTCGTATATCAATTACACATGCACACTTTCAGCCGAGG TCAACAAATCCATTGGCGTCTTCCCTATTTATACCAGAGATTTTAATCAAAGAACAAATTCTTCAAATCC CAGTTTTGAAGACCATGATAATGTCAAAGTGAAGGAAACATCTCATAATTCTTTTAATGTAAATTGCAAA AATTTGAAACCTGATGAATGGAATGGACCTAAGGGAATATATATAGCCAAACTCTTAAGTAATAGACCAC CTGTAACAAAAGAAAATCCTAAATCCTGTTCATTTACATTTGAAAATCTGTACTACCACACAGAATATAC TATTGAGATTATAGCCATAAATGGAAATAAAAATGAATCTTCTGCTAAAGCTACCCATTCCACTAGATAT AACGATAAGGCTCTCATTGGATTCCTGGTGTTCCTCATCCTTGTGACGTCACTTGCACTCCTTTTTGTGT TATACAAACTTTTCCTGCTCAAAAGAAAACGGACCGCTGAGGATGAAGAGATTCTTCTTACACAACCCCT 
GCGCAGAGTTGAGCCCATCTATGCTGGCGGTTTAGTGGAGGCTTACAAGAACAAGATCGCTGATGAGGGC CGTCTGTTTATGGATGAGTTTCAGAGCATTCCCCGAATTTTCTCCAATTACACCATCAAAGAAGCGAAGA AATCAGAAAACCAGTACAAGAACCGCTACGTTGACATTCTGCCTTATGACTATAATCGGGTAACTCTCTC AACTGGAGGTGAAGACAACTACATCAATGCCAGCTTTATCGAGGGATATCGAGAGCCAAAGAAATACATT GCAGCCCAAGGACCCAAGGATGAGACAGTTGTTGATTTCTGGCGAATGGTTTGGGAGCAAAAGTCATCCA TTATAGTCATGGTCACCCGTTGTGAGGAAGGAAACAAGACCAAATGCGCTCAGTACTGGCCATCTCTGGA CAGAGAGGCTGAGATTTTTGAAGAGTTTGTGGTGAAAATCCGATCTGAGGAACACTGCCCTGATTATGTC ATTCGCCATCTGATCCTGAACAATAAGCGAGAGAAGGGGTCAGAAAGAGAGGTCACTCATATCCAGTTCA TCAGCTGGCCTGACCATGGTGTGCCTGGAGACCCCAGCCTGCTTCTGAAGCTCAGGCGAAGAGTCAACTC TTTCAAGAACTTCTTCAGTGGCCCGGTGGTGGTCCACTGCAGTGCAGGGGTCGGCCGCACAGGAACATAT ATGAGCATTGATGCAATGATTGAATCTCTTGAGGCAGAAGGCAGAGTGGACATCTATGGATTCGTAGCAA AACTACGTCGCCAGCGTTGCCTCATGGTTCAAGTGGAGGCCCAGTACATACTGATCCACACTGCTCTGAT TGAGTACAATCAGTTTGGAGAGACAGAGGTCTCACTGTCTGAATTCCACTCAGTGCTCAACACTCTCAGA CAGAAAAACGGCAGTGACCCCAGTTTGCTTGAAAAGGAGTTCCAGAAACTCCCTAAATTTAAAAAATGGA GAACAATGAACACGGGAAGCAGTGAGGACAACAAGAAGAAAAACCGCGACTCAGCTGTTATCCCATACGA CTTTAATAGAGTCATATTTAGACTGGACATTGAGGGCAACCAGACCAGTGACCCTGAGGATGAGGAGGAG TATTCTTCAGACGAGGAAGAAGAATCAAACCAATACATCAACGCCTCATTCATTGATGGCTACTGGTGTA ATAAGAGTCTGATTGCAGCACAGGGGCCTTTACAAAGCACAGCGGAAGAGTTCCTGCTCATGCTGTACCA GCAACAAACTAAAACACTGGTCATGCTCACAGACTGCCAAGAGGACGGCAAGGATTATTGTTTTCAGTAT TGGGGAGATGAAAAGAAAACGTACGGAGACATGGAAATTGAGGTAAAAAAGACAGAAACCTTCCCAACGT ACGTGAGACGGCATCTGGAAATACAGTCCTTAAAGAAGAAAGAGGTCTTGAAGGTGGACCAGTACCAGTT CCTGAAATGGAGAGGCCGCGAGCTTCCAGAGAACGCTCAGGAGTTGGTAGAGATGGCAAGTATCATCAGA GAGAACGGCCATTATGACAACAGCAAAACAAACCGGAATGTCCCCATAGTGGTGCACTGCAACAACGGCT CATCTCGTACCGGAATCTTCTGTGCCTTATGGAATCTCCTGGACAGCGCCTACACAGAGAAGCTGGTGGA CGTCCTCCAAGAGGTCAAAAACTTGCGCAAGCTGAGGCAGGGGATGGTGGAAACAATCGAGCAGTATCAG TTCCTCTACACGGCTCTAGAAGGCGCTTTCCCTGTGCAGAACGGTGCCGTAAAGACGCCCCCTGCAAAGG ACGCCGCGCAGGTCATCAACGAAACCACGGCGCTCCTCACCGAGCCCAACAGCACTTCCGGGGCTGACCA GAAGGAGGCGGAAGAGAGCACAGCCGCAGAGAGCAATGAGCAGGGGGCCAAGGAGAGCAGCACTGCTGAG 
GCTCCAGCAGCAGAGGGAGACGGAGAGAAAGCCACATCCGAGGGCCCGACCAATGGCCCTACCGCCACTG TGGAGGTTTAA
First, remove the lcl| (local) sequence identifier prefix:
sed -i "s:lcl|::g" cd45.fa
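The command above removes every occurrence of lcl| anywhere in the file. A slightly stricter variant, sketched below, anchors the pattern to the start of a header line so that nothing else can match. The file names sample.fa and sample_nolcl.fa and the shortened sequences are made up for illustration; they are not the real cd45.fa.

```shell
# Hypothetical miniature input: two records with shortened sequences.
printf '%s\n' \
  '>lcl|AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45]' \
  'ATGGCAGGTTATTTTGGTCTGAATATCC' \
  '>lcl|AB031424.1_cds_BAA92179.1_1 [protein=CD45]' \
  'ATGGCGAGATTTTTTGCGCTAGGACCC' > sample.fa

# Replace ">lcl|" with ">" only at the very start of a line.
# In a basic regular expression the "|" is a literal character.
sed 's/^>lcl|/>/' sample.fa > sample_nolcl.fa

head -n 1 sample_nolcl.fa
# prints: >AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45]
```

Writing to a new file instead of using -i keeps the original untouched while checking the result.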
Next, remove everything after the _cds tag (including the tag itself), leaving just the accession number for each sequence. Let’s save the output as a new file:
sed "s:_cds.*::g" cd45.fa > cd45_accession.fa
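Since only header lines contain _cds, the substitution can also be limited to them with a sed address. This is a small sketch on a made-up two-line file (sample.fa and sample_accession.fa are hypothetical names, and the sequence is shortened), not a replacement for the command above:

```shell
# Hypothetical miniature input: one record with a shortened sequence.
printf '%s\n' \
  '>AJ243430.1_cds_CAB96212.1_1 [gene=PTPRc] [protein=CD45]' \
  'ATGGCAGGTTATTTTGGTCTGAATATCC' > sample.fa

# /^>/ is an address: the substitution runs only on lines starting with ">",
# so sequence lines are passed through unchanged.
sed '/^>/ s/_cds.*//' sample.fa > sample_accession.fa

cat sample_accession.fa
# prints:
# >AJ243430.1
# ATGGCAGGTTATTTTGGTCTGAATATCC
```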
Now extract only the header lines, which at this point hold just the accession numbers, to a new file:
grep -o ">.*" cd45_accession.fa > cd45_accession.list
Finally, remove the > symbol preceding each accession number in the list:
sed -i "s:>::g" cd45_accession.list
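The grep and sed steps for building the list can probably be merged into a single sed call. The sketch below assumes a cleaned file like the one produced earlier; sample_accession.fa and sample_accession.list are hypothetical names with shortened sequences.

```shell
# Hypothetical cleaned input: headers already reduced to accession numbers.
printf '%s\n' \
  '>AJ243429.1' 'ATGGCAGGTTATTTTGG' \
  '>AJ243430.1' 'ATGGCAGGTTATTTTGG' \
  '>AB031424.1' 'ATGGCGAGATTTTTTGC' > sample_accession.fa

# -n turns off automatic printing; the p flag prints a line only when the
# substitution succeeded, i.e. only header lines, already without the ">".
sed -n 's/^>//p' sample_accession.fa > sample_accession.list

cat sample_accession.list
# prints:
# AJ243429.1
# AJ243430.1
# AB031424.1
```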
That’s it!
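For future reference, the whole procedure could be condensed into two sed calls plus a quick sanity check. This is only a sketch under the same assumptions as the steps above (headers start with >lcl| and the accession number ends at _cds); the sample.* file names and the shortened sequences are invented for the example.

```shell
# Hypothetical miniature input: two records with shortened sequences.
printf '%s\n' \
  '>lcl|AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45]' \
  'ATGGCAGGTTATTTTGGTCTGAATATCC' \
  '>lcl|AB031424.1_cds_BAA92179.1_1 [protein=CD45]' \
  'ATGGCGAGATTTTTTGCGCTAGGACCC' > sample.fa

# Pass 1: drop the lcl| prefix and everything from _cds onwards (headers only).
sed -e 's/^>lcl|/>/' -e '/^>/ s/_cds.*//' sample.fa > sample_accession.fa

# Pass 2: write the bare accession numbers to a list.
sed -n 's/^>//p' sample_accession.fa > sample_accession.list

# Sanity check: the list should have one entry per header in the input.
headers=$(grep -c '^>' sample.fa)
entries=$(grep -c '' sample_accession.list)
if [ "$headers" -eq "$entries" ]; then
  echo "OK: $entries accession numbers extracted"
else
  echo "Mismatch: $headers headers, $entries list entries" >&2
fi
```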