RegEx - UniBas - Evolutionary Genetics

Learning Objectives

Understand the basics of regular expressions (regex): Learn the syntax and structure of regular expressions, including common symbols and their meanings.
Master pattern matching: Develop the ability to create patterns that match specific sequences of characters and learn how to apply them to different types of text data.
Use quantifiers and capture groups: Understand how to use quantifiers to specify repetition of matches and capture groups to extract specific parts of a match.

Regex (short for Regular Expressions) is a powerful pattern matching and text manipulation tool. It allows users to search, match, and extract specific sequences of characters from text. In this tutorial, you'll learn how to use Regex through examples, interactive exercises, and tasks designed to build your confidence and creativity in working with text patterns.

The tutorial is structured into the following sections:

[A] Examples: Start by reviewing some simple examples to get a feel for how Regex works in practice.
[B] Explore: In this section, you’ll investigate how different pieces (input, search term, replacement, and output) interact. Use the examples as a guide to help you understand the mechanics.
[C] Your turn: Here, you'll apply what you've learned to solve challenges. You'll be given inputs (what you have) and outputs (what you want). Your task is to figure out the correct "find" and "replace" terms. Some exercises are multi-step, requiring you to convert text progressively—use each result as input for the next step until you reach the final output.

Enable RegEx

This tutorial is designed with the Atom text editor in mind but works with other editors as well. Just ensure that Regex, Case Sensitive, and Within Current Selection options are enabled in your editor.

End-of-Line (EOL) Encoding

Different systems interpret the end of a line (newline) in varying ways, which is important when working with text files:

LF (\n): Line Feed, used by Unix-based systems as a newline character.
CR (\r): Carriage Return, used by older Mac OS versions as a newline.
CR + LF (\r\n): A combination used by Windows to indicate the end of a line.

Understanding these differences is crucial when applying Regex to text files, especially across different operating systems.
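To make the three conventions concrete, here is a minimal Python sketch that converts any of them to LF before further processing. It assumes Python's built-in `re` module rather than an editor's find/replace, and the function name `normalize_newlines` is made up for this illustration.

```python
import re

def normalize_newlines(text: str) -> str:
    """Convert Windows (CR+LF) and old Mac (CR) line endings to Unix (LF)."""
    # \r\n is listed first so a Windows line ending is replaced as one unit
    # and does not end up as two separate newlines.
    return re.sub(r"\r\n|\r", "\n", text)

mixed = "line1\r\nline2\rline3\n"
print(repr(normalize_newlines(mixed)))  # 'line1\nline2\nline3\n'
```

After this step a pattern that uses `\n` for the end of a line behaves the same regardless of where the file came from.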

A - Examples

A1: `\w` (word character)

Input: 123 456 ABC def ,,..--++__??!!

find: \w replace: x

Output: xxx xxx xxx xxx ,,..--++xx??!!

What did we learn from this example?

\w matches any word character (letters, digits, or underscores), but it ignores special characters (e.g., -, +, ?) and whitespace.

A2: `\w+` (one or more word characters)

Input: 123 456 ABC def ,,..--++__??!!

find: \w+ replace: x

Output: x x x x ,,..--++x??!!

What did we learn from this example?

\w+ matches groups of word characters (one or more consecutive letters, digits, or underscores), but still ignores special characters and whitespace.

A3: `\W` (non-word character)

Input: 123 456 ABC def ,,..--++__??!!

find: \W replace: x

Output: 123x456xABCxdefxxxxxxxxx__xxxx

What did we learn from this example?

\W matches non-word characters, including special characters (e.g., +, ?), punctuation, and whitespace, but it ignores letters, digits, and underscores.

A4: `\W+` (one or more non-word characters)

Input: 123 456 ABC def ,,..--++__??!!

find: \W+ replace: x

Output: 123x456xABCxdefx__x

What did we learn from this example?

\W+ matches groups of one or more non-word characters, including special characters and whitespace, but it ignores letters, digits, and underscores.

Summary

\w : Matches a single word character (letter, digit, or underscore).
\w+: Matches one or more consecutive word characters.
\W : Matches a single non-word character (special character, whitespace).
\W+: Matches one or more consecutive non-word characters.
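
The four examples can also be reproduced outside the editor. The sketch below assumes Python's built-in `re` module, where `re.sub(find, replace, text)` plays the role of find and replace; the variable names are chosen for this illustration only.

```python
import re

text = "123 456 ABC def ,,..--++__??!!"

# Apply each of the four find terms with the replacement "x".
results = {pattern: re.sub(pattern, "x", text)
           for pattern in (r"\w", r"\w+", r"\W", r"\W+")}

for pattern, output in results.items():
    print(f"{pattern:4} -> {output}")
```

The printed lines should correspond to the outputs of A1 to A4. For plain ASCII input like this, Python's `\w` behaves as in the editor; for accented letters the result can differ between regex engines.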

B - Explore

B1: `.` (dot)

1   2   3   4   5
1,2,3,4,5
1-2-3-4-5

find: . replace: x

xxxxxxxxx
xxxxxxxxx
xxxxxxxxx

What did we learn: The dot (.) matches any character except a newline.

Can we extend this with .+ (dot followed by plus)? What would that do?
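
One way to check your answer is a short Python sketch. It assumes Python's `re` module instead of the editor and uses only the comma and hyphen lines of the input.

```python
import re

lines = "1,2,3,4,5\n1-2-3-4-5"

# "." replaces every character on a line but leaves the newline alone.
each_char = re.sub(r".", "x", lines)

# ".+" consumes a whole line in a single match,
# so each line collapses to one replacement.
whole_line = re.sub(r".+", "x", lines)

print(each_char)
print(whole_line)
```

In both cases the newline between the two lines survives, because the dot does not match it.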

B2: `{n}` (curly brackets)

abc aabc aaabc

find: a{2} replace: x

abc xbc xabc

What did we learn: a{2} matches exactly two consecutive a characters.

What happens when you change {2} to {3} or {1}?

B3: `{x,y}` (range with curly brackets)

abc aabc aaabc aaaabc

find: a{2,4} replace: a

abc abc abc abc

What did we learn: a{2,4} matches between 2 and 4 consecutive a characters.

What happens if you increase or decrease the range?
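
To experiment with the numbers quickly, the quantifiers from B2 and B3 can be tried in a small Python sketch (assuming Python's `re` module; the variable names are illustrative).

```python
import re

text = "abc aabc aaabc aaaabc"

# Exactly two a's per match: four a's in a row are found as two matches.
exactly_two = re.sub(r"a{2}", "x", text)

# Between two and four a's per match; the engine takes as many as it can.
two_to_four = re.sub(r"a{2,4}", "a", text)

print(exactly_two)  # abc xbc xabc xxbc
print(two_to_four)  # abc abc abc abc
```

Changing the numbers inside the curly brackets and re-running the sketch shows how the matches shift.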

B4: `[x]` (square brackets for character sets)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTCGTTGCA
>Seq4
ATGCACGTTGGTTGCA

find: TT[ATCG]GTT replace: NNNNNN

What did we learn: [ATCG] matches any single character within the set (A, T, C, or G).

How would the output change if you used [AG] instead?

B5: `[^X]` (negation inside square brackets)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTTGTTGCA
>Seq4
ATGCACGTTGGTTGCA

find: TT[^A]GTT replace: NNNNNN

What did we learn: [^A] matches any character except A.

How would this behave if you used [^CT]?

B6: `[X-Y]` (ranges inside square brackets)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTCGTTGCA
>Seq4
ATGCACGTTGGTTGCA

find: TT[A-Z]GTT replace: NNNNNN

What did we learn: [A-Z] matches any uppercase letter from A to Z.

How would this behave if you used [a-Z] instead?
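
The three bracket patterns from B4 to B6 can be compared side by side with a short Python sketch. It assumes Python's `re` module and uses the four sequences from B5; `re.subn` returns the new text together with the number of replacements. The last part shows how Python reacts to `[a-Z]`; an editor may report the problem differently.

```python
import re

fasta = (">Seq1\nATGCACGTTAGTTGCA\n"
         ">Seq2\nATGCACGTTCGTTGCA\n"
         ">Seq3\nATGCACGTTTGTTGCA\n"
         ">Seq4\nATGCACGTTGGTTGCA")

# Count how many sequences each pattern masks.
counts = {}
for pattern in ("TT[ATCG]GTT", "TT[^A]GTT", "TT[A-Z]GTT"):
    masked, n = re.subn(pattern, "NNNNNN", fasta)
    counts[pattern] = n
    print(pattern, "->", n, "replacements")

# A range must run from a lower to a higher character code.
# "a" comes after "Z", so Python rejects [a-Z] as a bad range.
try:
    re.compile(r"[a-Z]")
    invalid_range_rejected = False
except re.error:
    invalid_range_rejected = True
print("[a-Z] rejected:", invalid_range_rejected)
```

Only `TT[^A]GTT` leaves Seq1 untouched, because the base between `TT` and `GTT` in Seq1 is an A.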

B7: `^` (caret for start of string)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTGCA
>Seq3
ATCCACGTTCGTTGCA
>Seq4
ATCCACGTTGGTTGCA

find: ^ATG replace: NNN

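What did we learn: The caret (^) matches the start of a line. In a text editor this means every line of the file; many regex engines match only the start of the whole text unless a multiline option is set.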

What happens if you try ^ATT?

B8: `$` (dollar for end of string)

>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTATG
>Seq3
ATCCACGTTCGTTGCA
>Seq4
ATCCACGTTGGTTATG

find: ATG$ replace: NNN

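What did we learn: The dollar sign ($) matches the end of a line. In a text editor this means every line of the file; many regex engines match only the end of the whole text unless a multiline option is set.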

What happens if you try to find a string that starts and ends with ATG?
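
Outside an editor, the anchors usually need a multiline option before they work line by line. The sketch below assumes Python's `re` module, where that option is the `re.MULTILINE` flag, and uses two shortened records from B7 and B8.

```python
import re

fasta = ">Seq1\nATGCACGTTAGTTGCA\n>Seq2\nATTCACGTTCGTTATG"

# Without the flag, ^ only matches at the very start of the text,
# which is ">" here, so nothing is replaced.
no_flag = re.sub(r"^ATG", "NNN", fasta)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
start_masked = re.sub(r"^ATG", "NNN", fasta, flags=re.MULTILINE)
end_masked = re.sub(r"ATG$", "NNN", fasta, flags=re.MULTILINE)

print(no_flag == fasta)  # True
print(start_masked)
print(end_masked)
```

Combining both anchors in one pattern, for example `^ATG.*ATG$`, is one way to approach the question above.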

C - Challenges

Do not worry, solutions are at the bottom!

C1: Simple Replace

Find a pattern that allows you to modify the text so the sequence names are in the format >SEQ_01 and so on. Write your find and replace commands.

Original Text:

SEQ 01
AAAAAAAAAAAA
SEQ 02
CCCCCCCCCCCC
SEQ 03
GGGGGGGGGGGG

Expected Output:

>SEQ_01
AAAAAAAAAAAA
>SEQ_02
CCCCCCCCCCCC
>SEQ_03
GGGGGGGGGGGG

C2: Move

Move X to the beginning of each line.

Original Text:

123X
456X
789X

Expected Output:

X123
X456
X789

C3: Re-arrange

Re-arrange the order of elements to be C,A,B, c,a,b, and 3,1,2.

Original Text:

A,B,C
a,b,c
1,2,3

Expected Output:

C,A,B
c,a,b
3,1,2

C4: Re-format

Abbreviate the first word to its initial.

Original Text:

Mus musculus
Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia

Expected Output:

M. musculus
A. elegans
F. vitiazi
C. tottoni
S. galacifolia

C5: Remove multiple characters

Remove the hemisphere letters (N, E) and the space in front of them; keep the degree symbol and the comma.

Original Text:

Zürich 47.3667° N, 8.5500° E
Basel 47.5667° N, 7.6000° E
St.Gallen 47.4167° N, 9.3667° E
Lausanne 46.5198° N, 6.6335° E
Lugano 46.0000° N, 8.9500° E

Expected Output:

Zürich 47.3667°, 8.5500°
Basel 47.5667°, 7.6000°
St.Gallen 47.4167°, 9.3667°
Lausanne 46.5198°, 6.6335°
Lugano 46.0000°, 8.9500°

C6: Re-arrange coordinates

Re-arrange the coordinates into two columns.
Convert the west and south coordinates to negative.

Original Text:

 21 17'24.68"N
157 51'41.50"W
 38 30'36.62"N
 28 17'16.87"W
  8 59'53.30"S
157 58'13.70"W
 10 24'47.84"N
 51 21'54.61"E
 22 52'41.65"S
 48  9'46.62"E

Intermediate Output:

 21 17'24.68"N  157 51'41.50"W
 38 30'36.62"N   28 17'16.87"W
  8 59'53.30"S  157 58'13.70"W
 10 24'47.84"N   51 21'54.61"E
 22 52'41.65"S   48  9'46.62"E

Expected Output:

 21 17'24.68"N  -157 51'41.50"
 38 30'36.62"N   -28 17'16.87"
  -8 59'53.30"  -157 58'13.70"
 10 24'47.84"N   51 21'54.61"E
 -22 52'41.65"   48  9'46.62"E

C7: Format header (gb)

Extract the accession number (without version number) and species name.

Original Text:

>gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA
ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT
>gi|270271668|gb|GQ119987.1| Frillagalma sp. BO-2009 isolate Agfr06 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA
GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT
>gi|62866985|gb|AY937366.1| Cordagalma cordiforme cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG
GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT

Expected Output:

>KF962059 Agalma_elegans
GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA
ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT
>GQ119987 Frillagalma_sp
AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA
GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT
>AY937366 Cordagalma_cordiforme
AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG
GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT

C8: Remove empty lines

Remove empty lines between the sequences.

Original Text:

>Seq1
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq2
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq3
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq4
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

>Seq5
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

Expected Output:

>Seq1
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq2
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq3
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq4
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq5
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC

C9: Adjust decimal

Shorten the decimal to two places (the extra digits are cut off, not rounded).

Original Text:

3.14159265359

Expected Output:

3.14

C10: Reduce

Reduce the number of ds to a single d.

Original Text:

d dd ddd dddd ddddd ddddd

Expected Output:

d d d d d d

C11: Space

Reduce multiple spaces/hyphens to a single space.

Original Text:

1 2  3   4    5     6      7
1-2--3---4----5-----6------7

Expected Output:

1 2 3 4 5 6 7
1-2 3 4 5 6 7

C12: Poly A

Remove the poly-A tail.

Original Text:

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

Expected Output:

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGC
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT

C13: Poly AA

Remove the poly-A tail together with the last non-A base before it and any As directly in front of that base.

Original Text:

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

Expected Output:

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTG
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCT

Solutions

C1: Simple Replace

find: ^SEQ (\d+)
replace: >SEQ_$1


C2: Move

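find: (\d\d\d)(\w)
replace: $2$1

The pair above solves the challenge. The three pairs below are a follow-up: applied one after another to the result, they walk X back to the end one position at a time (X123 -> 1X23 -> 12X3 -> 123X).

find: (\w)(\d)(\d+)
replace: $2$1$3

find: (\d)(\w)(\d)(\d)
replace: $1$3$2$4

find: (\d+)(\w)(\d)
replace: $1$3$2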


C3: Re-arrange

find: (\w,\w),(\w)
replace: $2,$1

find: (\w),(\w),(\w)
replace: $3,$1,$2
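
The capture-group solutions for C2 and C3 carry over to Python with one change in notation: the editor writes group references as `$1`, while Python's `re` module writes them as `\1` or `\g<1>`. The sketch below is an illustration under that assumption; the helper `editor_to_python` is made up for this tutorial and only handles plain `$N` references.

```python
import re

def editor_to_python(replacement: str) -> str:
    # Rewrite each $N as \g<N>, Python's unambiguous group reference.
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)

def find_replace(find: str, replace: str, text: str) -> str:
    """Apply an editor-style find/replace pair to text."""
    return re.sub(find, editor_to_python(replace), text)

moved = find_replace(r"(\d\d\d)(\w)", "$2$1", "123X\n456X\n789X")
rearranged = find_replace(r"(\w),(\w),(\w)", "$3,$1,$2", "A,B,C\na,b,c\n1,2,3")

print(moved)       # X123, X456, X789 on separate lines
print(rearranged)  # C,A,B / c,a,b / 3,1,2 on separate lines
```

With `find_replace`, the other find and replace pairs in this section can be pasted in unchanged, as long as the replacement contains no backslashes.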

C4: Re-format

find: (\w)\w+ (\w+)
replace: $1. $2

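find: (\w)(\w+) (\w+)
replace: $1. $3

With three capture groups, other replace terms give other formats: $1$2 keeps only the first word (Agalma), and $1_$3 joins the initial and the second word with an underscore (A_elegans).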


C5: Remove multiple characters

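find:  [NE]
replace:

# > Note: the find term starts with a space, so the space in front of N or E is removed too; the replace term is empty.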


C6: Re-arrange

find: (\"[NS])\n
replace: $1\t

# > Note: \n end of line (use \r\n or \r if the file uses those line endings)
#         \t tab

find: ([0-9]+ [0-9 \' \" \.]+)[WS]
replace: -$1

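Optional last step, which also drops the remaining N and E (the Expected Output above is shown before this step):

find: [NE]
replace: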


C7: Format header

find: (>)gi\|\d+\|gb\|(\w+)\.\d+\| (\w+) (\w+).*
replace: $1$2 $3_$4
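
For a whole FASTA file it can be handy to run the header clean-up as a script. The sketch below assumes Python's `re` module (group references are written `\1` instead of `$1`) and the GenBank header layout shown in the challenge; the function name `shorten_header` is made up for this illustration. It escapes the dot before the version number and accepts any version, not only `.1`.

```python
import re

HEADER = re.compile(r">gi\|\d+\|gb\|(\w+)\.\d+\| (\w+) (\w+).*")

def shorten_header(line: str) -> str:
    """Return '>ACCESSION Genus_species' for a matching header line."""
    # Lines that do not match (the sequence lines) are returned unchanged.
    return HEADER.sub(r">\1 \2_\3", line)

header = (">gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 "
          "cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial")
print(shorten_header(header))  # >KF962059 Agalma_elegans
```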


C8: Remove empty lines

find: ^\r?\n
replace:


C9: Adjust decimal

find: (\d\.)(\d{2})\d+
replace: $1$2
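
A regex only cuts digits off; it does not round. The Python sketch below (assuming the `re` module, with `\1\2` in place of `$1$2`) shows the difference on a second number chosen for this illustration, where the two results are not the same.

```python
import re

def truncate(value: str) -> str:
    """Keep two decimal places by dropping the remaining digits."""
    return re.sub(r"(\d\.)(\d{2})\d+", r"\1\2", value)

print(truncate("3.14159265359"))    # 3.14
print(truncate("2.71828"))          # 2.71 (cut off)
print(f"{float('2.71828'):.2f}")    # 2.72 (rounded)
```

For the value in the challenge both approaches happen to give 3.14.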


C10: Reduce

find: d{2,}
replace: d


C11: Space

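find: [ -]{2,}
replace: (a single space)

# > Note: [ -] matches a space or a hyphen; {2,} requires at least two in a row, so the single hyphen in 1-2 is left alone.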
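
Both reduction challenges, C10 and C11, rely on the open-ended quantifier `{2,}` (two or more). The sketch below checks them against the Expected Output; it assumes Python's `re` module, and for C11 it uses the pattern `[ -]{2,}` with a single space as the replacement.

```python
import re

# C10: two or more d's in a row become a single d.
reduced = re.sub(r"d{2,}", "d", "d dd ddd dddd ddddd ddddd")

# C11: two or more spaces or hyphens in a row become one space.
# A single space or a single hyphen does not match and stays as it is.
text = "1 2  3   4    5     6      7\n1-2--3---4----5-----6------7"
spaced = re.sub(r"[ -]{2,}", " ", text)

print(reduced)  # d d d d d d
print(spaced)
```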


C12: Poly A

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

find: (\w+[TGC])A*
replace: $1

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGC
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT


C13: Poly AA

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA

find: (\w+[TGC])A*[TGC]A*
replace: $1

>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTG
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCT
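
Both poly-A patterns can be checked in one go with a short Python sketch. It assumes Python's `re` module, where `$1` is written `\1`. The header lines are left alone because `SEQ01` and `SEQ02` contain no T, G or C for `[TGC]` to match.

```python
import re

fasta = (
    ">SEQ01\n"
    "ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA\n"
    ">SEQ02\n"
    "ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA"
)

# C12: keep everything up to the last non-A base, drop the A's after it.
poly_a_removed = re.sub(r"(\w+[TGC])A*", r"\1", fasta)

# C13: additionally drop the last non-A base and the A's in front of it.
poly_aa_removed = re.sub(r"(\w+[TGC])A*[TGC]A*", r"\1", fasta)

print(poly_a_removed)
print(poly_aa_removed)
```

The greedy `\w+` first takes the whole line and then gives characters back until the rest of the pattern fits, which is why the match ends at the last possible T, G or C.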
