RegEx with ATOM
- Course : Evolutionary Genetics UniBas (LV 25600-01 / HS2019)
- Topic : Regular Expressions with Text Editor ATOM
- Version: 191015 Jean-Claude Walser
The following is a short introduction to regular expressions. The introduction is divided into four sections:
- A. Examples
- B. Explore
- C. Exercises - Find Solution(s)
- D. Your Turn (Assignment) - Please Be Creative!
In the Explore part you have all the pieces (input, search term, replacement, and output). You have to figure out how it works. See the examples first to get an idea.
For the exercises, the input(s) (i.e. what you have) and output(s) (i.e. what you want) are given. Try to figure out the find and replace terms. Some exercises have multiple steps: start with the first input on top, convert it into the text block below it, then use that block as the input for the next step. Continue until you have reached the last text block. Solutions are further below, but please give it a serious try first!
The first two parts of the introduction should give you enough confidence to create your own RegEx exercise.
Note 1: This introduction was written with ATOM in mind but it might work for other text editors. Make sure you have Regex, Case Sensitive, and Within Current Selection activated!
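If you want to check a find/replace pair outside ATOM, the following minimal Python sketch may help. It is not part of the course material. The helper name `atom_sub` is hypothetical; it assumes ATOM-style group references (`$1`, `$2`, ...) and translates them into Python's `\g<1>` form, because Python's `re` module does not understand `$1`. The `re.MULTILINE` flag is set on the assumption that `^` and `$` should match on every line, as they do in the editor. The second input string is made-up illustration data.

```python
import re

def atom_sub(find, replace, text):
    """Apply an ATOM-style find/replace pair to text (hypothetical helper)."""
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... group references.
    py_replace = re.sub(r"\$(\d+)", r"\\g<\1>", replace)
    # MULTILINE lets ^ and $ match at the start and end of every line.
    return re.sub(find, py_replace, text, flags=re.MULTILINE)

# Pattern without groups (same input as in the examples section).
print(atom_sub(r"\w+", "x", "123 456 ABC def ,,..--++__??!!"))

# Pattern with groups: swap two numbers (illustration data).
print(atom_sub(r"(\d+) (\d+)", "$2 $1", "123 456"))
```

Other regex engines differ in small details (group syntax, handling of line endings), so treat the result as a cross-check rather than a guarantee that ATOM behaves identically.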
Note 2: EOL - End of line encoding. Which character is considered the end of a line (newline)?
- LF (\n): Line Feed or Newline Character
- CR (\r): Carriage Return
- Unix: Unix systems consider \n a line terminator. Unix considers \r as going back to the start of the same line.
- Older Mac OSs consider \r a newline terminator, but newer OSs are more compliant with Unix systems and use \n as the newline.
- Windows has a different style of newline (of course): it uses the combination of both CR and LF as the newline sequence, \r\n.
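To find out which line ending a file actually uses, a small Python sketch like the one below can be used. The function names `detect_eol` and `to_lf` are hypothetical, and the byte strings are illustration data. The check looks for `\r\n` first, because a Windows file contains both `\r` and `\n`.

```python
def detect_eol(raw: bytes) -> str:
    """Name the line-ending style found in raw bytes (hypothetical helper)."""
    if b"\r\n" in raw:
        return "CRLF"   # Windows: carriage return + line feed
    if b"\r" in raw:
        return "CR"     # older Mac OS: carriage return only
    if b"\n" in raw:
        return "LF"     # Unix and newer macOS: line feed only
    return "none"

def to_lf(raw: bytes) -> bytes:
    """Convert CRLF and CR line endings to LF."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

sample = b">Seq1\r\nATGC\r\n"
print(detect_eol(sample))
print(to_lf(sample))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before you inspect them.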
A - Examples
A1: \w (word character)
Input: 123 456 ABC def ,,..--++__??!!
find: \w
replace: x
Output: xxx xxx xxx xxx ,,..--++xx??!!
What did we learn from this example?
=> \w matches a single word character (letter, digit, or underscore). Special characters (e.g. -, +, or ?) and whitespace are ignored.
A2: \w+ (one or more word characters)
Input: 123 456 ABC def ,,..--++__??!!
find: \w+
replace: x
Output: x x x x ,,..--++x??!!
What did we learn from this example?
=> \w+ matches runs of one or more word characters but again ignores special characters.
A3: \W (special character)
Input: 123 456 ABC def ,,..--++__??!!
find: \W
replace: x
Output: 123x456xABCxdefxxxxxxxxx__xxxx
What did we learn from this example?
=> \W matches special characters (e.g. + or ?) and whitespace but ignores digits, letters, and _.
A4: \W+ (special characters)
Input: 123 456 ABC def ,,..--++__??!!
find: \W+
replace: x
Output: 123x456xABCxdefx__x
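=> \W+ matches runs of one or more special characters. It ignores regular (word) characters.
Summary:
- \w one regular (word) character: a letter, a digit, or _
- \w+ one or more regular characters
- \W one character that is not a regular character
- \W+ one or more characters that are not regular characters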
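The four examples can be reproduced with Python's `re.sub`, shown here as a minimal sketch. It assumes that Python's `\w` and `\W` behave like the editor's for plain ASCII input, which holds for the input string used in this section.

```python
import re

text = "123 456 ABC def ,,..--++__??!!"

# Apply each pattern from A1-A4 with the replacement "x".
results = {pattern: re.sub(pattern, "x", text)
           for pattern in (r"\w", r"\w+", r"\W", r"\W+")}

for pattern, output in results.items():
    print(pattern.ljust(4), "->", output)
```

Comparing the four printed lines side by side makes the effect of `+` visible: without it every character is replaced on its own, with it each run collapses into a single `x`.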
B - Explore
B1: . (dot)
1 2 3 4 5
1,2,3,4,5
1-2-3-4-5
find: .
replace: x
xxxxxxxxx
xxxxxxxxx
xxxxxxxxx
What is the meaning of . (dot)? Can we extend it with .+ (dot-plus)?
B2: {n} (curly brackets)
abc aabc aaabc
find: a{2}
replace: x
abc xbc xabc
What is the meaning of a{n}?
B3: {x,y} (more curly brackets)
abc aabc aaabc aaaabc
find: a{2,4}
replace: a
abc abc abc abc
What is the meaning of {x,y}?
B4: [x] (square brackets)
>Seq1 ATGCACGTTAGTTGCA >Seq2 ATGCACGTTCGTTGCA >Seq3 ATGCACGTTCGTTGCA >Seq4 ATGCACGTTGGTTGCA
find: TT[ATCG]GTT
replace: NNNNNN
What is the meaning of [ATCG]?
B5: [^X] (more square brackets)
>Seq1 ATGCACGTTAGTTGCA >Seq2 ATGCACGTTCGTTGCA >Seq3 ATGCACGTTTGTTGCA >Seq4 ATGCACGTTGGTTGCA
find: TT[^A]GTT
replace: NNNNNN
What is the meaning of [^X]?
B6: [X-Y] (even more square brackets)
>Seq1 ATGCACGTTAGTTGCA >Seq2 ATGCACGTTCGTTGCA >Seq3 ATGCACGTTCGTTGCA >Seq4 ATGCACGTTGGTTGCA
find: TT[A-Z]GTT
replace: NNNNNN
What is the meaning of [A-Z]?
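Once you have formed your own idea about B4 to B6, you can test it with a short Python sketch such as the following. It uses the four records from B5 and reports which of them each pattern would modify. The helper name `changed` is hypothetical, and the sketch assumes that Python treats these bracket expressions the same way as the editor.

```python
import re

# Records from B5; Seq3 carries a T at the variable position.
seqs = {
    "Seq1": "ATGCACGTTAGTTGCA",
    "Seq2": "ATGCACGTTCGTTGCA",
    "Seq3": "ATGCACGTTTGTTGCA",
    "Seq4": "ATGCACGTTGGTTGCA",
}

def changed(pattern):
    """Return the names of the records that the pattern would modify."""
    return [name for name, seq in seqs.items()
            if re.sub(pattern, "NNNNNN", seq) != seq]

for pattern in ("TT[ATCG]GTT", "TT[^A]GTT", "TT[A-Z]GTT"):
    print(pattern, "->", changed(pattern))
```

Try adding a record with a lowercase letter or an N at the variable position to see where the three patterns start to differ.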
B7: ^
Seq1
ATGCACGTTAGTTGCA
Seq2
ATTCACGTTCGTTGCA
Seq3
ATCCACGTTCGTTGCA
Seq4
ATCCACGTTGGTTGCA
find: ^ATG
replace: NNN
What is the meaning of ^?
B8: $
Seq1
ATGCACGTTAGTTGCA
Seq2
ATTCACGTTCGTTATG
Seq3
ATCCACGTTCGTTGCA
Seq4
ATCCACGTTGGTTATG
find: ATG$
replace: NNN
What is the meaning of $?
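If you try B7 and B8 in Python instead of ATOM, note that `^` and `$` only work line by line when the `re.MULTILINE` flag is set. The sketch below uses the first two records from B8 and assumes that each name and each sequence sits on its own line.

```python
import re

# First two records from B8, one name or sequence per line (assumed layout).
text = "Seq1\nATGCACGTTAGTTGCA\nSeq2\nATTCACGTTCGTTATG\n"

# Without re.MULTILINE, ^ matches only at the very start of the whole text.
no_flag = re.sub(r"^ATG", "NNN", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
start = re.sub(r"^ATG", "NNN", text, flags=re.MULTILINE)
end = re.sub(r"ATG$", "NNN", text, flags=re.MULTILINE)

print(no_flag == text)
print(start)
print(end)
```

The first call leaves the text untouched because the text as a whole starts with `Seq1`, not with `ATG`.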
C - Exercises
Do not worry, solutions are at the bottom!
C1 Simple replace
SEQ 01 AAAAAAAAAAAA SEQ 02 CCCCCCCCCCCC SEQ 03 GGGGGGGGGGGG
find:?
replace:?
>SEQ_01 AAAAAAAAAAAA >SEQ_02 CCCCCCCCCCCC >SEQ_03 GGGGGGGGGGGG
C2 Move
123X 456X 789X
find:?
replace:?
X123 X456 X789
find:?
replace:?
1X23 4X56 7X89
find:?
replace:?
12X3 45X6 78X9
find:?
replace:?
123X 456X 789X
C3 Re-arrange
A,B,C a,b,c 1,2,3
find:?
replace:?
C,A,B c,a,b 3,1,2
C4 Re-format
Mus musculus Agalma elegans Frillagalma vitiazi Cordagalma tottoni Shortia galacifolia
find:?
replace:?
M. musculus A. elegans F. vitiazi C. tottoni S. galacifolia
find:?
replace:?
Mus -> M. musculus -> M_musculus Agalma -> A. elegans -> A_elegans Frillagalma -> F. vitiazi -> F_vitiazi Cordagalma -> C. tottoni -> C_tottoni Shortia -> S. galacifolia -> S_galacifolia
C5 Remove multiple characters
Zürich 47.3667° N, 8.5500° E Basel 47.5667° N, 7.6000° E St.Gallen 47.4167° N, 9.3667° E Lausanne 46.5198° N, 6.6335° E Lugano 46.0000° N, 8.9500° E
find:?
replace:?
Zürich 47.3667°, 8.5500° Basel 47.5667°, 7.6000° St.Gallen 47.4167°, 9.3667° Lausanne 46.5198°, 6.6335° Lugano 46.0000°, 8.9500°
C6 Re-arrange
21 17'24.68"N
157 51'41.50"W
38 30'36.62"N
28 17'16.87"W
8 59'53.30"S
157 58'13.70"W
10 24'47.84"N
51 21'54.61"E
22 52'41.65"S
48 9'46.62"E
find:?
replace:?
21 17'24.68"N	157 51'41.50"W
38 30'36.62"N	28 17'16.87"W
8 59'53.30"S	157 58'13.70"W
10 24'47.84"N	51 21'54.61"E
22 52'41.65"S	48 9'46.62"E
find:?
replace:?
21 17'24.68"N	-157 51'41.50"
38 30'36.62"N	-28 17'16.87"
-8 59'53.30"	-157 58'13.70"
10 24'47.84"N	51 21'54.61"E
-22 52'41.65"	48 9'46.62"E
find:?
replace:?
21 17'24.68"	-157 51'41.50"
38 30'36.62"	-28 17'16.87"
-8 59'53.30"	-157 58'13.70"
10 24'47.84"	51 21'54.61"
-22 52'41.65"	48 9'46.62"
C7 Format header (NCBI full to accession and species)
>gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT >gi|270271668|gb|GQ119987.1| Frillagalma sp. BO-2009 isolate Agfr06 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT >gi|62866985|gb|AY937366.1| Cordagalma cordiforme cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT
find:
replace:
>KF962059 Agalma_elegans GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT >GQ119987 Frillagalma_sp AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT >AY937366 Cordagalma_cordiforme AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT
C8 Remove empty lines
>Seq1 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq2 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq3 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq4 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq5 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
find:
replace:
>Seq1 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq2 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq3 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq4 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC >Seq5 ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
C9 Adjust decimal
3.14159265359
find:
replace:
3.14
C10 Reduce
d dd ddd dddd ddddd ddddd
find:
replace:
d d d d d d
C11 Space
1 2 3 4 5 6 7 1-2--3---4----5-----6------7
find:
replace:
1 2 3 4 5 6 7 1-2 3 4 5 6 7
C12 Poly A
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
find:
replace:
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTGC >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT
C13 Poly AA
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
find:
replace:
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTG >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCT
D - Assignment
Now you have seen many examples and worked your way through many exercises. Use this knowledge and apply it to a problem of your liking. You might recycle some of the exercises from above. You can also use the table file and move, change, and replace records. In case you cannot make it work and need help, describe the problem you encountered.
Solutions
C1 Simple Replace
find: ^SEQ   (followed by one space)
replace: >SEQ_
C2 Move
find: (\d\d\d)(\w)
replace: $2$1
find: (\w)(\d)(\d+)
replace: $2$1$3
find: (\d)(\w)(\d)(\d)
replace: $1$3$2$4
find: (\d+)(\w)(\d)
replace: $1$3$2
C3 Re-arrange
find: (\w,\w),(\w)
replace: $2,$1
or
find: (\w),(\w),(\w)
replace: $3,$1,$2
C4 Re-format
find: (\w)\w+ (\w+)
replace: $1. $2
find: (\w)(\w+) (\w+)
replace: $1$2 -> $1. $3 -> $1_$3
C5 Remove multiple characters
find:  [NE]   (one space before the bracket)
replace:
C6 Re-arrange
find: (\"[NS])\n
replace: $1\t
Note: \n (or \r) is the end of line; \t is a tab.
find: ([0-9]+ [0-9 \' \" \.]+)[WS]
replace: -$1
find: [NE]
replace:
C7 Format header
find: (>)gi\|\d+\|gb\|(\w+)\.1\| (\w+) (\w+).*
replace: $1$2 $3_$4
C8 Remove empty lines
find: ^$\r   (or ^$\n, depending on the file's line endings)
replace:
C9 Adjust decimal
find: (\d\.)(\d{2})\d+
replace: $1$2
C10 Reduce
find: d{2,}
replace: d
C11 Space
find: [ -]{2,}
replace: (a single space)
C12 Poly A
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
find: (\w+[TGC])A*
replace: $1
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTGC >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT
C13 Poly AA
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
find: (\w+[TGC])A*[TGC]A*
replace: $1
>SEQ01 ATGCGATCGACTGATCGATCGTGACTAGCTG >SEQ02 ATCCCGATGGAATGATCGATCAAAACTAGCT
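Some of the solutions above can be cross-checked in Python, as in the following minimal sketch. It covers only a selection (C2 step 1, C3, C9, C10, and the first C12 sequence) and writes group references in Python's `\1` form instead of ATOM's `$1`. The `checks` list and its layout are just one possible way to organise such a test.

```python
import re

checks = [
    # (label, find, replace, input, expected output)
    ("C2 step 1", r"(\d\d\d)(\w)", r"\2\1",
     "123X 456X 789X", "X123 X456 X789"),
    ("C3", r"(\w),(\w),(\w)", r"\3,\1,\2",
     "A,B,C a,b,c 1,2,3", "C,A,B c,a,b 3,1,2"),
    ("C9", r"(\d\.)(\d{2})\d+", r"\1\2",
     "3.14159265359", "3.14"),
    ("C10", r"d{2,}", "d",
     "d dd ddd dddd ddddd ddddd", "d d d d d d"),
    ("C12", r"(\w+[TGC])A*", r"\1",
     "ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA",
     "ATGCGATCGACTGATCGATCGTGACTAGCTGC"),
]

# Run every find/replace pair and compare the result with the expected text.
results = [(label, re.sub(find, repl, src) == expected)
           for label, find, repl, src, expected in checks]

for label, ok in results:
    print(label, "OK" if ok else "MISMATCH")
```

The same list can be extended with your own exercise from part D, which gives you a quick way to confirm that your find and replace terms produce the output you intended.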