RegEx with ATOM

Course : Evolutionary Genetics UniBas (LV 25600-01 / HS2019)
Topic : Regular Expressions with Text Editor ATOM
Version: 191015 Jean-Claude Walser

What follows is a short introduction to regular expressions. The introduction is divided into four sections:

A. Examples
B. Explore
C. Exercises - Find Solution(s)
D. Your Turn (Assignment) - Please Be Creative!

In the explore part you have all the pieces (input, search term, replacement, and output). You have to figure out how it works. See the examples first to get an idea.

For the exercises, the input (i.e. what you have) and the output (i.e. what you want) are given. Try to figure out the find and replace terms. Some exercises have multiple steps: start with the first input at the top, convert it into the text block below it, and then use that result as the input for the next step. Continue until you have reached the last text block. Solutions are further below, but please give it a serious try first!

The first three parts of the introduction should give you enough confidence to create your own RegEx exercise.

Note 1: This introduction was written with ATOM in mind but it might work for other text editors. Make sure you have Regex, Case Sensitive, and Within Current Selection activated!

Note 2: EOL - end-of-line encoding. Which character is considered the end of a line (newline)?

LF \n Line Feed or Newline Character
CR \r Carriage Return
Unix: Unix systems consider \n as a line terminator. Unix considers \r as going back to the start of the same line.
Older Mac OSs consider \r as a newline terminator but newer OSs are more compliant with Unix systems and use \n as the newline.
Windows has a different style of newline (of course), it uses the combination of both CR and LF as the newline character – ‘\r\n’.
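To find out which convention a text uses before writing a pattern with \n or \r, the three cases can be counted separately. The following is a minimal sketch in Python (an assumption; the course itself uses ATOM), and the function names `count_line_endings` and `to_unix` are made up for illustration.

```python
import re

def count_line_endings(text):
    """Count Windows (CRLF), Unix (LF), and old Mac (CR) line endings."""
    crlf = len(re.findall(r"\r\n", text))
    # Lone LF: a \n that is not preceded by \r
    lf = len(re.findall(r"(?<!\r)\n", text))
    # Lone CR: a \r that is not followed by \n
    cr = len(re.findall(r"\r(?!\n)", text))
    return {"CRLF": crlf, "LF": lf, "CR": cr}

def to_unix(text):
    """Convert every line ending to a single LF."""
    return re.sub(r"\r\n|\r", "\n", text)

sample = "line1\r\nline2\nline3\rline4"
print(count_line_endings(sample))
print(to_unix(sample).split("\n"))
```

When reading a real file in Python, open it with `newline=""` so that the line endings are not translated before they are counted.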

A - Examples

A1: \w (word character)

Input: 123 456 ABC def ,,..--++__??!!

find: \w replace: x

Output: xxx xxx xxx xxx ,,..--++xx??!!

What did we learn from this example?

=> \w matches word characters (letters, digits, and the underscore _); special characters (e.g. -, +, or ?) and whitespace are ignored.

A2: \w+ (one or more word characters)

Input: 123 456 ABC def ,,..--++__??!!

find: \w+ replace: x

Output: x x x x ,,..--++x??!!

What did we learn from this example?

=> \w+ matches runs of one or more word characters but again ignores special characters.

A3: \W (non-word character)

Input: 123 456 ABC def ,,..--++__??!!

find: \W replace: x

Output: 123x456xABCxdefxxxxxxxxx__xxxx

What did we learn from this example?

=> \W matches special characters (e.g. + or ?) and whitespace but ignores digits, letters, and _.

A4: \W+ (one or more non-word characters)

Input: 123 456 ABC def ,,..--++__??!!

find: \W+ replace: x

Output: 123x456xABCxdefx__x

=> \W+ matches runs of special characters (including whitespace). It ignores word characters.

Summary:

\w one word character (letter, digit, or _)

\w+ one or more word characters

\W one non-word character

\W+ one or more non-word characters
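The four examples can also be reproduced outside the editor. This is a small sketch using Python's `re` module (an assumption; ATOM uses JavaScript regular expressions, which behave the same way for this ASCII input).

```python
import re

text = "123 456 ABC def ,,..--++__??!!"

# Replace every match of each pattern with "x" and print the result
for pattern in [r"\w", r"\w+", r"\W", r"\W+"]:
    print(pattern, "->", re.sub(pattern, "x", text))
```

Each printed line should match the Output line of the corresponding example A1 to A4.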

B - Explore

B1: . (dot)

1   2   3   4   5
1,2,3,4,5
1-2-3-4-5

find: . replace: x

xxxxxxxxx
xxxxxxxxx
xxxxxxxxx

What is the meaning of . (dot)? Can we extend it with .+ (dot-plus)?

B2: {n} (curly brackets)

abc aabc aaabc

find: a{2} replace: x

abc xbc xabc

What is the meaning of a{n}?

B3: {x,y} (more curly brackets)

abc aabc aaabc aaaabc

find: a{2,4} replace: a

abc abc abc abc

What is the meaning of {x,y}?

B4: [x] (square brackets)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTCGTTGCA
>Seq4
ATGCACGTTGGTTGCA

find: TT[ATCG]GTT replace: NNNNNN

What is the meaning of [ATCG]?

B5: [^X] (more square brackets)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTTGTTGCA
>Seq4
ATGCACGTTGGTTGCA

find: TT[^A]GTT replace: NNNNNN

What is the meaning of [^X]?

B6: [X-Y] (even more square brackets)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTCGTTGCA
>Seq4
ATGCACGTTGGTTGCA

find: TT[A-Z]GTT replace: NNNNNN

What is the meaning of [A-Z]?

B7: ^

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTGCA
>Seq3
ATCCACGTTCGTTGCA
>Seq4
ATCCACGTTGGTTGCA

find: ^ATG replace: NNN

What is the meaning of ^?

B8: $

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTATG
>Seq3
ATCCACGTTCGTTGCA
>Seq4
ATCCACGTTGGTTATG

find: ATG$ replace: NNN

What is the meaning of $?

C - Exercises

Do not worry, solutions are at the bottom!

C1 Simple replace

SEQ 01
AAAAAAAAAAAA
SEQ 02
CCCCCCCCCCCC
SEQ 03
GGGGGGGGGGGG

find:? replace:?

>SEQ_01
AAAAAAAAAAAA
>SEQ_02
CCCCCCCCCCCC
>SEQ_03
GGGGGGGGGGGG

C2 Move

123X
456X
789X

find:? replace:?

X123
X456
X789

find:? replace:?

1X23
4X56
7X89

find:? replace:?

12X3
45X6
78X9

find:? replace:?

123X
456X
789X

C3 Re-arrange

A,B,C
a,b,c
1,2,3

find:? replace:?

C,A,B
c,a,b
3,1,2

C4 Re-format

Mus musculus
Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia

find:? replace:?

M. musculus
A. elegans
F. vitiazi
C. tottoni
S. galacifolia

find:? replace:? (start again from the original input)

Mus -> M. musculus -> M_musculus
Agalma -> A. elegans -> A_elegans
Frillagalma -> F. vitiazi -> F_vitiazi
Cordagalma -> C. tottoni -> C_tottoni
Shortia -> S. galacifolia -> S_galacifolia

C5 Remove multiple characters

Zürich 47.3667° N, 8.5500° E
Basel 47.5667° N, 7.6000° E
St.Gallen 47.4167° N, 9.3667° E
Lausanne 46.5198° N, 6.6335° E
Lugano 46.0000° N, 8.9500° E

find:? replace:?

Zürich 47.3667°, 8.5500°
Basel 47.5667°, 7.6000°
St.Gallen 47.4167°, 9.3667°
Lausanne 46.5198°, 6.6335°
Lugano 46.0000°, 8.9500°

C6 Re-arrange

 21 17'24.68"N
157 51'41.50"W
 38 30'36.62"N
 28 17'16.87"W
  8 59'53.30"S
157 58'13.70"W
 10 24'47.84"N
 51 21'54.61"E
 22 52'41.65"S
 48  9'46.62"E

find:? replace:?

 21 17'24.68"N  157 51'41.50"W
 38 30'36.62"N   28 17'16.87"W
  8 59'53.30"S  157 58'13.70"W
 10 24'47.84"N   51 21'54.61"E
 22 52'41.65"S   48  9'46.62"E

find:? replace:?

 21 17'24.68"N  -157 51'41.50"
 38 30'36.62"N   -28 17'16.87"
  -8 59'53.30"  -157 58'13.70"
 10 24'47.84"N   51 21'54.61"E
 -22 52'41.65"   48  9'46.62"E

find:? replace:?

21 17'24.68"    -157 51'41.50"
38 30'36.62"     -28 17'16.87"
-8 59'53.30"    -157 58'13.70"
10 24'47.84"     51 21'54.61"
-22 52'41.65"    48  9'46.62"

C7 Format header (NCBI full to accession and species)

>gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA
ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT
>gi|270271668|gb|GQ119987.1| Frillagalma sp. BO-2009 isolate Agfr06 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA
GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT
>gi|62866985|gb|AY937366.1| Cordagalma cordiforme cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG
GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT

find:? replace:?

>KF962059 Agalma_elegans
GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA
ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT
>GQ119987 Frillagalma_sp
AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA
GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT
>AY937366 Cordagalma_cordiforme
AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG
GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT

C8 Remove empty lines

>Seq1
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq2
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq3
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq4
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq5
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

find:? replace:?

>Seq1
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq2
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq3
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq4
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq5
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

C9 Adjust decimal

3.14159265359

find:? replace:?

3.14

C10 Reduce

d dd ddd dddd ddddd ddddd

find:? replace:?

d d d d d d

C11 Space

1 2  3   4    5     6      7
1-2--3---4----5-----6------7

find:? replace:?

1 2 3 4 5 6 7
1-2 3 4 5 6 7

C12 Poly A

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

find:? replace:?

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGC
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT

C13 Poly AA

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

find:? replace:?

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTG
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCT

D - Assignment

Now that you have seen many examples and worked your way through many exercises, apply this knowledge to a problem of your liking. You may recycle some of the exercises from above. You can also use the table file and move, change, and replace records. If you cannot make it work and need help, describe the problem you encountered.

Solutions

C1 Simple Replace

find: ^SEQ (\d+)
replace: >SEQ_$1


C2 Move

find: (\d\d\d)(\w)
replace: $2$1

find: (\w)(\d)(\d+)
replace: $2$1$3

find: (\d)(\w)(\d)(\d)
replace: $1$3$2$4

find: (\d+)(\w)(\d)
replace: $1$3$2
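The four C2 steps can be chained and checked in a script. This is a sketch in Python (an assumption; the course uses ATOM). Note that Python's `re` module writes backreferences as \1, \2 where ATOM uses $1, $2.

```python
import re

# (find, replace) pairs for the four steps; \1 in Python corresponds to $1 in ATOM
steps = [
    (r"(\d\d\d)(\w)", r"\2\1"),          # 123X -> X123
    (r"(\w)(\d)(\d+)", r"\2\1\3"),       # X123 -> 1X23
    (r"(\d)(\w)(\d)(\d)", r"\1\3\2\4"),  # 1X23 -> 12X3
    (r"(\d+)(\w)(\d)", r"\1\3\2"),       # 12X3 -> 123X
]

text = "123X\n456X\n789X"
history = [text]
for find, replace in steps:
    text = re.sub(find, replace, text)
    history.append(text)
    print(text.replace("\n", " "))
```

After the last step the text is back to the original input, which is a quick way to confirm that the whole chain works.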


C3 Re-arrange

find: (\w,\w),(\w)
replace: $2,$1

find: (\w),(\w),(\w)
replace: $3,$1,$2

C4 Re-format

find: (\w)\w+ (\w+)
replace: $1. $2

find: (\w)(\w+) (\w+)
replace: $1$2 -> $1. $3 -> $1_$3
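Both C4 patterns are applied to the full species names; the second one cannot start from the abbreviated output because the genus name is no longer available there. This is a sketch in Python (an assumption, with \1 in place of ATOM's $1) using three of the names from the exercise.

```python
import re

names = "Agalma elegans\nFrillagalma vitiazi\nShortia galacifolia"

# Step 1: abbreviate the genus to its first letter
short = re.sub(r"(\w)\w+ (\w+)", r"\1. \2", names)

# Alternative on the original names: genus -> abbreviated -> underscore form
table = re.sub(r"(\w)(\w+) (\w+)", r"\1\2 -> \1. \3 -> \1_\3", names)

print(short)
print(table)
```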


C5 Remove multiple characters

find:  [NE]
replace:

# > Note: the search term starts with a space, so the space in front of N or E is removed as well


C6 Re-arrange

find: (\"[NS])\n
replace: $1\t

# > Note: \n end of line (use \r\n for files with Windows line endings)
#         \t tab

find: ([0-9]+ [0-9 \' \" \.]+)[WS]
replace: -$1

find: [NE]
replace:
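The three C6 steps can be run one after the other in a script. This is a sketch in Python (an assumption; \1 replaces ATOM's $1), using the first six coordinate lines with the leading padding spaces removed for brevity.

```python
import re

coords = '''21 17'24.68"N
157 51'41.50"W
8 59'53.30"S
157 58'13.70"W
10 24'47.84"N
51 21'54.61"E
'''

# Step 1: join latitude and longitude with a tab
step1 = re.sub(r'("[NS])\n', r"\1\t", coords)

# Step 2: W and S become a leading minus sign
step2 = re.sub(r"""([0-9]+ [0-9 '".]+)[WS]""", r"-\1", step1)

# Step 3: drop the remaining N and E
step3 = re.sub(r"[NE]", "", step2)

print(step3)
```

The order matters: step 3 would also remove the N that step 1 relies on to recognise the end of a latitude line.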


C7 Format header

find: (>)gi\|\d+\|gb\|(\w+)\.1\| (\w+) (\w+).*
replace: $1$2 $3_$4
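To reproduce the target headers of exercise C7 (accession, a space, then genus and species joined by an underscore), the pattern can be tested on two of the headers. This is a sketch in Python (an assumption; \1 replaces ATOM's $1); the dot before the version number is escaped so that it only matches a literal dot.

```python
import re

headers = [
    ">gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 "
    "cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial",
    ">gi|270271668|gb|GQ119987.1| Frillagalma sp. BO-2009 isolate Agfr06 "
    "cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial",
]

pattern = r"(>)gi\|\d+\|gb\|(\w+)\.1\| (\w+) (\w+).*"
short_headers = [re.sub(pattern, r"\1\2 \3_\4", h) for h in headers]

for h in short_headers:
    print(h)
```

The pattern only handles accession version .1, as in the exercise; other version numbers would need \.\d+ instead of \.1.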


C8 Remove empty lines

find: ^\n
replace:

# > Note: for files with Windows line endings use ^\r\n
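Removing empty lines can also be checked in a script. This is a sketch in Python (an assumption; the course uses ATOM) with shortened sequences. In Python, ^ only matches at the start of every line when the `re.MULTILINE` flag is set, and the optional \r makes the same pattern work for Unix and Windows line endings.

```python
import re

fasta = ">Seq1\nATCG\n\n>Seq2\nGGCC\n\n>Seq3\nTTAA\n"

# ^ at the start of a line, followed directly by a line ending = empty line
cleaned = re.sub(r"^\r?\n", "", fasta, flags=re.MULTILINE)

print(cleaned)
```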


C9 Adjust decimal

find: (\d\.)(\d{2})\d+
replace: $1$2


C10 Reduce

find: d{2,}
replace: d


C11 Space

find: [ -]{2,}
replace: (a single space)
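One pattern that produces the target output of exercise C11 is [ -]{2,} replaced by a single space: it matches runs of two or more spaces or hyphens and leaves single ones alone. This is a sketch in Python (an assumption; the course uses ATOM) to verify it.

```python
import re

text = "1 2  3   4    5     6      7\n1-2--3---4----5-----6------7"

# Runs of two or more spaces/hyphens collapse into one space
result = re.sub(r"[ -]{2,}", " ", text)

print(result)
```

The hyphen is placed last inside the square brackets so that it is read as a literal character and not as a range.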


C12 Poly A

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

find: (\w+[TGC])A*
replace: $1

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGC
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT


C13 Poly AA

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

find: (\w+[TGC])A*[TGC]A*
replace: $1

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTG
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCT
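The two poly-A patterns can be compared side by side on the same sequences. This is a sketch in Python (an assumption; \1 replaces ATOM's $1).

```python
import re

seqs = [
    "ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA",
    "ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA",
]

# C12: remove only the trailing run of A
poly_a = [re.sub(r"(\w+[TGC])A*", r"\1", s) for s in seqs]

# C13: also remove the last non-A base and the run of A in front of it
poly_aa = [re.sub(r"(\w+[TGC])A*[TGC]A*", r"\1", s) for s in seqs]

for a, aa in zip(poly_a, poly_aa):
    print(a, aa)
```

Because \w+ is greedy, the group first takes the whole sequence and then backs off just far enough for the rest of the pattern to match, which is why only the end of each sequence is trimmed.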