Cleaning up with sed 2

I have a list of nucleotide sequences in FASTA format, obtained from NCBI. I want to clean up the sequence names, leaving only the accession numbers. Additionally, I want to extract the accession numbers to a separate list. I am not that familiar with sed, since I use it mainly in my SlackBuilds to fix small things. Therefore, for my future reference, I decided to sum things up.

As an example, let’s take a few sequences:

>lcl|AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45] [protein_id=CAB96211.1] [location=161..3901]
ATGGCAGGTTATTTTGGTCTGAATATCCTGCTCTGGGCTGGAATCGTCACTTTGGCCTCCTCCTCCACTA
CCTCTCCTGCACTCACCACTCATGCAAGCACCGTGACAGAGGCTAATGTCACGCTTTTAGTAGCGCCCCT
GACCTCCACTCACAACTCATCATCAAACCTCTCCACCAATCTAAGCCTGAGTTCGACCACCCCAGCCGCT
GCCAGTGAGACCACGGTCTCCTTGACGACCCCAACGACCCCGATGGCCCTGACTAACATCTCAGCTATAA
CAAGCTCAACAAATGCTACCAGTGATGCCCCCACGCAGCCTCCTGCAGCTCCACAGTGTTCTTACACTCT
GAAGCCCATTCCGTTTGGCTTCAGACTCACCATCAGTAATTTTACAACTGCTGGATTATACAACCTAACT
CTGCAGGAGCAGGACGATATGACAGAGACAACATATTCTCCTGATAAAACATCTTTGGACATCTTGGACC
TGAAACCCTGCACCAGCTACAGGCACTGGCTCTCTCTCACACTTCAACACAGCGCCATCGACTGCAATCA
GACTGTAAATACCACCGCTACCCTGCAACCGAAGATGGAAGACTTTCAAAATGTCAAATCCTCACCTGGA
AGCATCTGTTATCAGAGCAAATGGAACATCCGATCTCTCCTAACTCAACCCAACAACATGCAACTGCGCG
ACGATGACACTTTCTGCATCAGATACGATTCAAAGGACATTTGCACGAATTTCACAAAAAACGTTTCCAC
AAGCTGTGGTTCGTTTCCCTTCACCAAATTCATCGATCTTGTTCCAAGTCACATAACTATGAGACGTTTA
GGTGTATATCCTGAAAAAATAGAGACAACGTTCCCTCCAAACTGTAAAAACTGGACCGTGGAACACGTCT
GTTCAGAAAAGGGAAGCCCCCACCAGAATAAGAGCCTGACAGAGTTGGAGCCCTACACAGACTACAGCTG
TGCTGCTGTCGTCACGCGTGGGACCGTCTCCATGTGGACGGAGGACATTGACATCAGGATCAACTGCAAT
CTGCTAATATATGAATGCAGTAAATCCAATGTTTTAACCAACACCTCCATGAATCTGCACTGGAACTTGA
CGAGCTCCGTCTGTAAAGACCTTTTTACCCGGTTCAACTTTTCCTACAACTGCTACTGTGACAACCCTGT
AAAAGGAAGTAAAGATGTTCCTCTAAAGTCTCAGCAAGTATCGTGTTCATTTTCTGCCCTTGATGCCTTC
GAAGACTACACGTGTGGGATCAACGTCAAATACAAGAAGTTCTCCGTCCTCGACCGGGTGTTTCTTTATA
GGACTGAACCTGGAATACCAGAAAGGCCACCTCAGCTGCATTTGGATGTTCATGAGCACAATCAGATTAC
AGTTAAAGTTGATAAAATCAGCAGATTTAACGGACCGTCGAAGTTTTACATTGTGCGCCTGTATGAAGGC
AAAACCCATAAGGAAACCAGGAATGGAACCAAGCCGTCATTTGTATTTAAAGATCTCAGTTACTCCACAG
AGTACACCGTGCAGGCGTCCACATTCAATGGCCATTTTGAGAGCAGTCCACACACAAGAACCACTTCCAC
CTTCTACAATGAACCCGCTCTGATCAATTTTCTGATCTTCCTCATCATCGTCACATCTGTGGCTCTGATA
CTTGTGGCTTATAAGATTTATGTCCTGAAGAGGAAGAGGTCCCGAGACACCAGTGAAAGCATGATTCTTA
TACCCAAAACTAACGATGAAGAAAAGCTGATATTTGTTGAGCCCCTTACATCAGAAACGCTGTTGGATGC
CTACAAGAGAAAGATTGCTGATGAAGGAAGACTTTTCCTGGCTGAATTCCAGAGCATCCCAAGAATATTC
TCCAAGTACACCGTGAAAGAGGCCAAAAAGTCCCACAATGTCCCTAAGAACCGCTATGTGGACATCCTGC
CATATGATTATAATCGGGTCCAACTGACCACTGGGAACGGCAGCGCAGGCTGTGACTACATCAACGCCAG
CTTTATAGACGGGTTCAAGGAATTAAAAAAGTACATAGCAGCTCAAGGTCCGAAGGAGGAGACTGTGAGT
AACTTTTGGAGGATGATCTGGGAGCAGCAGACCTCCATCATCGTCATGGTTTCACGCTGTGAAGAGGGAA
ACAGGATAAAGTGTGCTCAGTACTGGCCATCAGAAGATCGAGACACCGAGATCTTTGAGGAGTTTATAGT
AAAGCTGACCTCAGAAGACCACTACCCCGACTACATCATTCGCCATCTGAGTCTAACAAATAAGAAGGAT
AAGGGTTCAGAGAGGGAGGTGACTCACATCCAGTTCATGAGCTGGCCCGACCACGGCGTTCCCGAGGAGG
CGCAACTCCTCCTGAAACTGAGGCGCCGAGTAAACTCATTCAAGAATTTCTTCAGTGGTCCCATCGTCGT
CCACTGCAGTGCTGGAGTCGGCAGGACGGGTACCTACATCGGCATCGATGCCATGATGGAGAGTCTGGAG
GCCGAAGGAAGGGTGGACATCTACGGCTACGTGGTCATGCTCCGCCGACAAAGATGTCTGATGGTTCAGG
TTGAGGCTCAGTACATCCTGATTCACCAGGCGCTGCTGGAGCACACCCAGTTTGGTGAGACCGAGAACAC
CCTGCAGGAGCTCCACAGCACGCTGAACACACTCAAACAGAGAAGCTCCGACAATGAGCCAACTTTACTG
GAGGATGAGTTTGAACGACTCCCTAACTTCAAAAACTGGAGGACCTTCAACACCGGCGTTACAGAAGAAA
ACAAGAAGAAGAATCGCACTTCATCCGTCATCCCATATGACTACAACAGAGTATTGCTGAAACTCGACGA
AGAAAAAAGCCATGAGAGCGATCCCGATCCCGATGACGATTACGATACGTCCTCATACGATGACATCGAG
GACAGCGCCAAGTACATCAATGCCTCCCTCTTAAGTGGTTACTGGGGCCCGCACACCTTCATCTCAGCAC
AGACGCCTCTGCCAGACACTGTGGCTGACTTCTGGTCCATGATTCTCCAGAAAAGAGTATCTATTGTTGT
CATGCTCTCCGACTGCTCTCAGGGAGACGAGGACTGTGTTTACTGGGGTGAGGACAAGAAGACAGTTGAG
GACATTGAAGTGGAGACTGAGAGCACAGACAACTCCCCCAATTTCATCCTCCGCAACATGATGATCCATC
ACCTGAAGACCGACGCGCACCAGCAAGTGAAACACTTCCAGTTCCTGAAGTGGGGGGACAAAGAGGTGCC
GGAGAAACCACAAGATCTGGCCGACCTGATCAAGGAGATCAAGCACAGATGTGGCTACACGTGGCCGAGG
AGCACGCCCATCATTGTCCACTGCAACGACGGCTCCTCCCGTTCAGGGGCCTTCTGTGCCCTGTGGAACC
TTCTGGACAACGCTGAGAAAGAGAAGATGGTGGATGTTTTCCAGGTGGTCAAGACTCTTCGGAAGGAGAG
ACAGGGCATGTGCCCCAGTCTGGAGCAGTACCAGTTCTTGTATGATGCACTGGAAGTGGTTTATCCAGTC
CAGAACGGGGATGTGAAAGCAGCACAGAACTCTGTCCAGATTGTCAATGAAACAGCAGAGCAGCAGGCCA
GCACTACCCGCACTGACCACCAGGAGGCAGAAGAGGGTGCCGATGGAGACGTTTCCACGGCGACTGGGGA
GAAGAGCAGCACTGTCACCGTTGAGGTTTGA

>lcl|AJ243430.1_cds_CAB96212.1_1 [gene=PTPRc] [protein=CD45] [protein_id=CAB96212.1] [location=join(599..659,2831..2977,3110..3208,3355..3384,3577..3825,3908..4126,4212..4319,4397..4528,4608..4754,5751..5900,6001..6209,6280..6340,6413..6521,6600..6637,6833..6942,7447..7537,7724..7800,7926..7962,8039..8136,8230..8355,8478..8635,8778..8913,9359..9508,9911..10001,10095..10234,10483..10617,10808..10924,11009..11166,11399..11534,11676..11894)]
ATGGCAGGTTATTTTGGTCTGAATATCCTGCTCTGGGCTGGAATCGTCACTTTGGCCTCCTCCTCCACTA
CCTCTCCTGCACTCACCACTCATGCAAGCACCGTGACAGAGGCTAATGTCACGCTTTTAGTAGCGCCCCT
GACCTCCACTCACAACTCATCATCAAACCTCTCCACCAATCTAAGCCTGAGTTCGACCACCCCAGCCGCT
GCCAGTGAGACCACGGTCTCCTTGACGACCCCAACGACCCCGATGGCCCTGACTAACATCTCAGCTATAA
CAAGCTCAACAAATGCTACCAGTGATGCCCCCACGCAGCCTCCTGCAGCTCCACAGTGTTCTTACACTCT
GAAGCCCATTCCGTTTGGCTTCAGACTCACCATCAGTAACTTTACAACTGCTGGATTATACAACCTAACT
CTGCAGGAGCAGGACGATATGACAACACATTCTCCTGATAAAACATCTTTGGACATCTTGGACCTGAAAC
CCTGCACCAGCTACAGGCACTGGCTCTCTCTCACACTTCAACACAGCGCCATCGACTGCAACCAGACTGT
AAATACCACCGCTACCCTGCAACCGAAGATGGAAGACTTTCAAAATGTCAAATCCTCACCTGGAAGCATC
TGTTATCAGAGCAAATGGAACATCCGATCTCTCCTAACTCAACCCAACAACATGCAACTGCGCGACGATG
ACACTTTCTGCATCAGATACGATTCAAAGGACATTTGCACGAATTTCACAAAAAACGTTTCCACAAGCTG
TGGTTCGTTTCCCTTCACCAAATTCATCGATCTTGTTCCAAGTCACATAACTATGAGACGTTTAGGTGTA
TATCCTGAAAAAATAGAGACAACGTTCCCTCCAAACTGTAAAAACTGGACCGTGGAACACGTCGTCTGTT
CAGAAAAGGGAAGCCCCCACCAGAATAAGAGCCTGAAAGAGTTGGAGCCCTACACAGACTACAGCTGTGC
TGCTGTCGTCACGCGTGGGACCGTCTCCATGTGGACGGAGGACATTGACATCAGGATCAACTGCAATCTG
ATAATATATGAATGCAGTAAATCCAATGTTTTAACCAACACCTCCATGAATCTGCACTGGAACTTGACGA
GCTCCGTCTGTAAAGACCTTTTTACCCGGTTCAACTTTTCCTACAACTGCTACTGTGACAACCCTGTAAA
AGGAAGTAAAGATGTTTTTATAAAGTCTCAGCAAGTATCGTGTTCATTTTCTTCCCTTGATGCCTTCAAA
GACTACACGTGTGGGATCAACGTCAAATACAAGAAGTTCTCCGTCCTCGACCGGGTGTTTCTTTATAGGA
CTGAACCTGGAATACCAGAAAGGCCACCTCAGCTGCATTTGGATGTTTATGAGCACAATCAGATTACAGT
TAAAGTTGATAAAATCAGCAGATTTAACGGACCGTCGAAGTTTTACATTGTGCGCCTGTATGAAGGCAAA
ACCCATAAGGAAACCAGGAATGGAACCAAGCCGTCATTTGTATTTAAAGATCTCAGTTACTCCACAGAGT
ACACCGTGCAGGCGTCCACATTCAATGGCCATTTTGAGAGCAGTCCACACACAAGAACCACTTCCACCTT
CTACAATGAACCCGCTCTGATCAATTTTCTGATCTTCCTCATCATCGTCACATCTGTGGCTCTGATACTT
GTGGCTTATAAGATTTATGTCCTGAAGAGGAAGAGGTCCCGAGACACCAGTGAAAGCATGATTCTTATAC
CCAAAACTAACGATGAAGAAAAGCTGATATTTGTTGAGCCCCTTACATCAGAAACGCTGTTGGATGCCTA
CAAGAGAAAGATTGCTGATGAAGGAAGACTTTTCCTGGCTGAATTCCAGAGCATCCCAAGAATATTCTCC
AAGTACACCGTGAAAGAGGCCAAAAAGTCCCACAATGTCCCTAAGAACCGCTATGTGGACATCCTGCCAT
ATGATTATAATCGGGTCCAACTGACCACTGGGAACGGCAGCGCAGGCTGTGACTACATCAACGCCAGCTT
TATAGACGGGTTCAAGGAATTAAAAAAGTACATAGCAGCTCAAGGTCCGAAGGAGGAGACTGTGAGTAAC
TTTTGGAGGATGATCTGGGAGCAGCAGACCTCCATCATCGTCATGGTTTCACGCTGTGAAGAGGGAAACA
GGATAAAGTGTGCTCAGTACTGGCCATCAGAAGATCGAGACACCGAGATCTTTGAGGAGTTTATAGTAAA
GCTGACCTCAGAAGACCACTACCCCGACTACATCATTCGCCATCTGAGTCTAACAAATAAGAAGGATAAG
GGTTCAGAGAGGGAGGTGACTCACATCCAGTTCATGAGCTGGCCCGACCACGGCGTTCCCGAGGAGGCGC
AACTCCTCCTGAAACTGAGGCGCCGAGTAAACTCATTCAAGAATTTCTTCAGTGGTCCCATCGTCGTCCA
CTGCAGTGCTGGAGTCGGCAGGACGGGTACCTACATCGGCATCGATGCCATGATGGAGAGTCTGGAGGCC
GAAGGAAGGGTGGACATCTACGGCTACGTGGTCATGCTCCGCCGACAAAGATGTCTGATGGTTCAGGTTG
AGGCTCAGTACATCCTGATTCACCAGGCGCTGCTGGAGCACACCCAGTTTGGTGAGACCGAGAACACCCT
GCAGGAGCTCCACAGCACGCTGAACACACTCAAACAGAGAAGCTCCGACAATGAGCCAACTTTACTGGAG
GATGAGTTTGAACGACTCCCTAACTTCAAAAACTGGAGGACCTTCAACACCGGCGTTACAGAAGAAAACA
AGAAGAAGAATCGCACTTCATCCGTCATCCCATATGACTACAACAGAGTATTGCTGAAACTCGACGAAGA
AAAAAGCCATGAGAGCGATCCCGATCCCGATGACGATTACGATACGTCCTCATACGATGACATCGAGGAC
AGCGCCAAGTACATCAATGCCTCCCTCTTAAGTGGTTACTGGGGCCCGCACACCTTCATCTCAGCACAGA
CGCCTCTGCCAGACACTGTGGCTGACTTCTGGTCCATGATTCTCCAGAAAAGAGTATCTATTGTTGTCAT
GCTCTCCGACTGCTCTCAGGGAGACGAGGACTGTGTTTACTGGGGTGAGGACAAGAAGACAGTTGAGGAC
ATTGAAGTGGAGACTGAGAGCACAGACAACTCCCCCAATTTCATCCTCCGCAACATGATGATCCATCACC
TGAAGACCGACGCGCACCAGCAAGTGAAACACTTCCAGTTCCTGAAGTGGGGGGACAAAGAGGTGCCGGA
GAAACCACAAGATCTGGCCGACCTGATCAAGGAGATCAAGCACAGATGTGGCTACACGTGGCCGAGGAGC
ACGCCCATCATTGTCCACTGCAACGACGGCTCCTCCCGTTCAGGGGCCTTCTGTGCCCTGTGGAACCTTC
TGGACAACGCTGAGAAAGAGAAGATGGTGGATGTTTTCCAGGTGGTCAAGACTCTTCGGAAGGAGAGACA
GGGCATGTGCCCCAGTCTGGAGCAGTACCAGTTCTTGTATGATGCACTGGAAGTGGTTTATCCAGTCCAG
AACGGGGATGTGAAAGCAGCACAGAACTCTGTCCAGATTGTCAATGAAACAGCAGAGCAGCAGGCCAGCA
CTACCCGCACTGACCACCAGGAGGCAGAAGAGGGTGCCGATGGAGACGTTTCCACGGCGACTGGGGAGAA
GAGCAGCACTGTCACCGTTGAGGTTTGA

>lcl|AB031424.1_cds_BAA92179.1_1 [protein=CD45] [protein_id=BAA92179.1] [location=149..3799]
ATGGCGAGATTTTTTGCGCTAGGACCCTTGGTTTTGGTCCTATGCTTGGTCTTATTCCTCTCACCCACAA
GCACTTCTGAACAAGACACAGACACAAAAGTAACAGGTTCTTCTAACTCACCCACCATCATGCAATCCGC
TACTCCTACTCCTGCCTTACCTACAGGTTCAGTAAAGGGAGTTAAAGCAAATGATTCAGGCACCACAGAA
GTCCCATGTCAATACCGACTTATAGTGAATGGCTATGAAAAAAATTCCCTTTTGGTGAATATCAATGGCT
CCAAAAGCCAAAATTACACAATCAGGATAAAGGATAAGAAGAGTGAGAAGAAAATCCCTGTCAGTATTGT
ATCTGACACAACTACCATCGACATACTGTTTGAATGGCTAAAACCATGCACTAATTACAGTGTAAACATA
GATAACTGCAATGTTGTCGGAGAAAACCATTTTAATTTGAATGTCTTGAAAAGTGTAAACCATTCTGCAG
AACCTAAGGGTGATGAGCAAGTGTGTTTTAAAGATAGCAAATGGAATTTGACAAAATGTGTTAATATAAC
AACTGAAGACTCTTGTGTTCAGAAGCCCATTCAGATTAATCTAGAAACATGCAACTACACAATGAATGTC
CACATGCCCCCAGTCAAACCTGAAATAACCTTCAGTGAATCTATTCCAACTAAGTTTGTGTGGAGTAATA
AGCCACTCAGTTCAGACAATGTAACGAGTCAGTGTAAAGACACATTTAAGGTCAAGTGCACTGATGACAC
TGAGTATGATTTGGATAAGGATGTGTTTTTGAAACCTTCAGAAGAATACACATGCACTGGAAAATATAAA
TATTACAATACGTCTATTGATAGCAAATCAAAACAAGTTAACATCCATTGCGAATGGATAAATGAAACTG
TGTTTAGTGAAGTAACCAGCAACTCAATTGTGATTTTCTGGAATCTGTCCCCAGGGGATAAATGCAAAGG
AATTACATGGGAAAAATTTTCAGCCATTTGTAAAAAATCTCATGATCAAAAATGCAGAAAAGTAAATGAC
ACAACTGCAGAATGCGTATTTTCTGAATTAAAGTCGTATATCAATTACACATGCACACTTTCAGCCGAGG
TCAACAAATCCATTGGCGTCTTCCCTATTTATACCAGAGATTTTAATCAAAGAACAAATTCTTCAAATCC
CAGTTTTGAAGACCATGATAATGTCAAAGTGAAGGAAACATCTCATAATTCTTTTAATGTAAATTGCAAA
AATTTGAAACCTGATGAATGGAATGGACCTAAGGGAATATATATAGCCAAACTCTTAAGTAATAGACCAC
CTGTAACAAAAGAAAATCCTAAATCCTGTTCATTTACATTTGAAAATCTGTACTACCACACAGAATATAC
TATTGAGATTATAGCCATAAATGGAAATAAAAATGAATCTTCTGCTAAAGCTACCCATTCCACTAGATAT
AACGATAAGGCTCTCATTGGATTCCTGGTGTTCCTCATCCTTGTGACGTCACTTGCACTCCTTTTTGTGT
TATACAAACTTTTCCTGCTCAAAAGAAAACGGACCGCTGAGGATGAAGAGATTCTTCTTACACAACCCCT
GCGCAGAGTTGAGCCCATCTATGCTGGCGGTTTAGTGGAGGCTTACAAGAACAAGATCGCTGATGAGGGC
CGTCTGTTTATGGATGAGTTTCAGAGCATTCCCCGAATTTTCTCCAATTACACCATCAAAGAAGCGAAGA
AATCAGAAAACCAGTACAAGAACCGCTACGTTGACATTCTGCCTTATGACTATAATCGGGTAACTCTCTC
AACTGGAGGTGAAGACAACTACATCAATGCCAGCTTTATCGAGGGATATCGAGAGCCAAAGAAATACATT
GCAGCCCAAGGACCCAAGGATGAGACAGTTGTTGATTTCTGGCGAATGGTTTGGGAGCAAAAGTCATCCA
TTATAGTCATGGTCACCCGTTGTGAGGAAGGAAACAAGACCAAATGCGCTCAGTACTGGCCATCTCTGGA
CAGAGAGGCTGAGATTTTTGAAGAGTTTGTGGTGAAAATCCGATCTGAGGAACACTGCCCTGATTATGTC
ATTCGCCATCTGATCCTGAACAATAAGCGAGAGAAGGGGTCAGAAAGAGAGGTCACTCATATCCAGTTCA
TCAGCTGGCCTGACCATGGTGTGCCTGGAGACCCCAGCCTGCTTCTGAAGCTCAGGCGAAGAGTCAACTC
TTTCAAGAACTTCTTCAGTGGCCCGGTGGTGGTCCACTGCAGTGCAGGGGTCGGCCGCACAGGAACATAT
ATGAGCATTGATGCAATGATTGAATCTCTTGAGGCAGAAGGCAGAGTGGACATCTATGGATTCGTAGCAA
AACTACGTCGCCAGCGTTGCCTCATGGTTCAAGTGGAGGCCCAGTACATACTGATCCACACTGCTCTGAT
TGAGTACAATCAGTTTGGAGAGACAGAGGTCTCACTGTCTGAATTCCACTCAGTGCTCAACACTCTCAGA
CAGAAAAACGGCAGTGACCCCAGTTTGCTTGAAAAGGAGTTCCAGAAACTCCCTAAATTTAAAAAATGGA
GAACAATGAACACGGGAAGCAGTGAGGACAACAAGAAGAAAAACCGCGACTCAGCTGTTATCCCATACGA
CTTTAATAGAGTCATATTTAGACTGGACATTGAGGGCAACCAGACCAGTGACCCTGAGGATGAGGAGGAG
TATTCTTCAGACGAGGAAGAAGAATCAAACCAATACATCAACGCCTCATTCATTGATGGCTACTGGTGTA
ATAAGAGTCTGATTGCAGCACAGGGGCCTTTACAAAGCACAGCGGAAGAGTTCCTGCTCATGCTGTACCA
GCAACAAACTAAAACACTGGTCATGCTCACAGACTGCCAAGAGGACGGCAAGGATTATTGTTTTCAGTAT
TGGGGAGATGAAAAGAAAACGTACGGAGACATGGAAATTGAGGTAAAAAAGACAGAAACCTTCCCAACGT
ACGTGAGACGGCATCTGGAAATACAGTCCTTAAAGAAGAAAGAGGTCTTGAAGGTGGACCAGTACCAGTT
CCTGAAATGGAGAGGCCGCGAGCTTCCAGAGAACGCTCAGGAGTTGGTAGAGATGGCAAGTATCATCAGA
GAGAACGGCCATTATGACAACAGCAAAACAAACCGGAATGTCCCCATAGTGGTGCACTGCAACAACGGCT
CATCTCGTACCGGAATCTTCTGTGCCTTATGGAATCTCCTGGACAGCGCCTACACAGAGAAGCTGGTGGA
CGTCCTCCAAGAGGTCAAAAACTTGCGCAAGCTGAGGCAGGGGATGGTGGAAACAATCGAGCAGTATCAG
TTCCTCTACACGGCTCTAGAAGGCGCTTTCCCTGTGCAGAACGGTGCCGTAAAGACGCCCCCTGCAAAGG
ACGCCGCGCAGGTCATCAACGAAACCACGGCGCTCCTCACCGAGCCCAACAGCACTTCCGGGGCTGACCA
GAAGGAGGCGGAAGAGAGCACAGCCGCAGAGAGCAATGAGCAGGGGGCCAAGGAGAGCAGCACTGCTGAG
GCTCCAGCAGCAGAGGGAGACGGAGAGAAAGCCACATCCGAGGGCCCGACCAATGGCCCTACCGCCACTG
TGGAGGTTTAA

First, clear the lcl (local) sequence identifier:

sed -i "s:lcl|::g" cd45.fa
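Since -i rewrites the file in place, it may be worth previewing the result first. The following is only a sketch: sample.fa is a hypothetical, shortened stand-in for cd45.fa, and the substitution is restricted to header lines (those starting with >), which is a bit stricter than the command above.

```shell
# Build a tiny stand-in for cd45.fa (hypothetical sample, sequences shortened)
printf '%s\n' \
  '>lcl|AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45]' \
  'ATGGCAGGTTATTTTGG' \
  '>lcl|AB031424.1_cds_BAA92179.1_1 [protein=CD45]' \
  'ATGGCGAGATTTTTTGC' > sample.fa

# Print only the header lines as they would look after the substitution.
# Without -i the file itself is left untouched.
sed -n '/^>/ s:^>lcl|:>:p' sample.fa
```

If the printed headers look right, the same expression can be run with -i.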

Next, remove everything after the _cds tag (including the tag itself), leaving just the accession number for each sequence. Let’s save the output as a new file:

sed "s:_cds.*::g" cd45.fa > cd45_accession.fa
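The two substitutions so far could probably also be done in a single sed call using two -e expressions. Again, this is just a sketch on a hypothetical, shortened sample (sample.fa and sample_accession.fa are made-up names), with both substitutions limited to header lines:

```shell
# Build a tiny stand-in for cd45.fa (hypothetical sample, sequences shortened)
printf '%s\n' \
  '>lcl|AJ243429.1_cds_CAB96211.1_1 [gene=PTPRc] [protein=CD45]' \
  'ATGGCAGGTTATTTTGG' \
  '>lcl|AB031424.1_cds_BAA92179.1_1 [protein=CD45]' \
  'ATGGCGAGATTTTTTGC' > sample.fa

# First expression drops the lcl| prefix, second one cuts off _cds and
# everything after it. Sequence lines pass through unchanged.
sed -e '/^>/ s:^>lcl|:>:' -e '/^>/ s:_cds.*::' sample.fa > sample_accession.fa

cat sample_accession.fa
```

This leaves the original file untouched, so there is no need for -i here.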

Now, it’s time to extract the accession numbers only to a new file:

grep -o ">.*" cd45_accession.fa > cd45_accession.list

Finally, remove the > symbol preceding each accession number in the list:

sed -i "s:>::g" cd45_accession.list
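The grep and the last sed step can likely be merged as well: with -n, sed prints nothing by default, and the p flag prints only the lines where the substitution succeeded. A minimal sketch, assuming the headers have already been reduced to accession numbers (file names are hypothetical):

```shell
# Hypothetical cleaned file, as produced by the previous steps
printf '%s\n' \
  '>AJ243429.1' \
  'ATGGCAGGTTATTTTGG' \
  '>AB031424.1' \
  'ATGGCGAGATTTTTTGC' > sample_accession.fa

# Strip the leading > and print only the lines where it was found,
# i.e. the headers; sequence lines are skipped.
sed -n 's:^>::p' sample_accession.fa > sample_accession.list

cat sample_accession.list
```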

That’s it!


