Learning Objectives
- Understand the basics of regular expressions (regex): learn the syntax and structure of regular expressions, including common symbols and their meanings.
- Master pattern matching: develop the ability to create patterns that match specific sequences of characters, and learn how to apply them to different types of text data.
- Use quantifiers and capture groups: understand how to use quantifiers to specify repetition of matches and capture groups to extract specific parts of a match.
Regex (short for Regular Expressions) is a powerful pattern matching and text manipulation tool. It allows users to search, match, and extract specific sequences of characters from text. In this tutorial, you'll learn how to use Regex through examples, interactive exercises, and tasks designed to build your confidence and creativity in working with text patterns.
The tutorial is structured into the following sections:
- [A] Examples: Start by reviewing some simple examples to get a feel for how Regex works in practice.
- [B] Explore: In this section, you’ll investigate how different pieces (input, search term, replacement, and output) interact. Use the examples as a guide to help you understand the mechanics.
- [C] Your turn: Here, you'll apply what you've learned to solve challenges. You'll be given inputs (what you have) and outputs (what you want). Your task is to figure out the correct "find" and "replace" terms. Some exercises are multi-step, requiring you to convert text progressively—use each result as input for the next step until you reach the final output.
Enable RegEx
This tutorial is designed with the Atom text editor in mind but works with other editors as well. Just ensure that Regex, Case Sensitive, and Within Current Selection options are enabled in your editor.
End-of-Line (EOL) Encoding
Different systems interpret the end of a line (newline) in varying ways, which is important when working with text files:
- LF (\n): Line Feed, used by Unix-based systems as a newline character.
- CR (\r): Carriage Return, used by older Mac OS versions as a newline.
- CR + LF (\r\n): A combination used by Windows to indicate the end of a line.
Understanding these differences is crucial when applying Regex to text files, especially across different operating systems.
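As a minimal sketch of why line endings matter, the following Python snippet converts the Windows and old Mac endings to the Unix form before any pattern is applied. Python's `re` module is used here only for illustration (the tutorial itself works in a text editor), and the helper name `normalize_eol` is a made-up name for this example.

```python
import re

def normalize_eol(text: str) -> str:
    """Convert CR+LF (Windows) and lone CR (old Mac) line endings to LF (Unix)."""
    # "\r\n" is listed first so a CR+LF pair is replaced once, not as two separate breaks.
    return re.sub(r"\r\n|\r", "\n", text)

mixed = "line1\r\nline2\rline3\n"
print(repr(normalize_eol(mixed)))  # 'line1\nline2\nline3\n'
```

After such a step, a pattern that relies on `\n` behaves the same regardless of where the file came from.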
A - Examples
A1: \w (word character)
Input: 123 456 ABC def ,,..--++__??!!
find: \w
replace: x
Output: xxx xxx xxx xxx ,,..--++xx??!!
What did we learn from this example?
\w matches any single word character (letter, digit, or underscore); it does not match punctuation and other special characters (e.g., -, +, ?) or whitespace.
A2: \w+ (one or more word characters)
Input: 123 456 ABC def ,,..--++__??!!
find: \w+
replace: x
Output: x x x x ,,..--++x??!!
What did we learn from this example?
\w+ matches runs of one or more consecutive word characters (letters, digits, or underscores), so each run is replaced as a whole; it still does not match special characters or whitespace.
A3: \W (non-word character)
Input: 123 456 ABC def ,,..--++__??!!
find: \W
replace: x
Output: 123x456xABCxdefxxxxxxxxx__xxxx
What did we learn from this example?
\W matches any single non-word character, including punctuation and other special characters (e.g., +, ?) and whitespace; it does not match letters, digits, or underscores.
A4: \W+ (one or more non-word characters)
Input: 123 456 ABC def ,,..--++__??!!
find: \W+
replace: x
Output: 123x456xABCxdefx__x
What did we learn from this example?
\W+ matches runs of one or more consecutive non-word characters, including special characters and whitespace; it does not match letters, digits, or underscores.
Summary
- \w: Matches a single word character (letter, digit, or underscore).
- \w+: Matches one or more consecutive word characters.
- \W: Matches a single non-word character (special character, whitespace).
- \W+: Matches one or more consecutive non-word characters.
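The four examples above can also be reproduced outside an editor. The sketch below uses Python's `re.sub`, which is an assumption of this example rather than part of the tutorial's editor workflow. Note that Python's `\w` is Unicode-aware by default, so it would also match accented letters; for the plain ASCII input used here the results are the same as in the examples.

```python
import re

text = "123 456 ABC def ,,..--++__??!!"

# Apply each of the four patterns from section A with the replacement "x".
results = {pattern: re.sub(pattern, "x", text)
           for pattern in (r"\w", r"\w+", r"\W", r"\W+")}

for pattern, output in results.items():
    print(f"{pattern:4} -> {output}")
```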
B - Explore
B1: . (dot)
1 2 3 4 5
1,2,3,4,5
1-2-3-4-5
find: .
replace: x
xxxxxxxxx
xxxxxxxxx
xxxxxxxxx
What did we learn: The dot (.) matches any single character except a newline.
- Can we extend this with .+ (dot followed by plus)? What would that do?
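For readers who want to check their answer programmatically, here is a small sketch in Python (an assumption of this example; the tutorial itself uses an editor). By default Python's dot also excludes the newline, so the line structure is preserved.

```python
import re

text = "1 2 3 4 5\n1,2,3,4,5\n1-2-3-4-5"

# "." replaces every character individually; the newlines are left alone.
each_char = re.sub(r".", "x", text)

# ".+" consumes a whole run of non-newline characters in one match.
whole_line = re.sub(r".+", "x", text)

print(each_char)
print(whole_line)
```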
B2: {n} (curly brackets)
abc aabc aaabc
find: a{2}
replace: x
abc xbc xabc
What did we learn: a{2} matches exactly two consecutive a characters.
- What happens when you change {2} to {3} or {1}?
B3: {x,y} (range with curly brackets)
abc aabc aaabc aaaabc
find: a{2,4}
replace: a
abc abc abc abc
What did we learn: a{2,4} matches between 2 and 4 consecutive a characters.
- What happens if you increase or decrease the range?
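The two quantifier examples (B2 and B3) can be checked with a short Python sketch. Using Python's `re.sub` is an assumption of this example; the quantifier syntax is the same as in the editor.

```python
import re

# B2: exactly two consecutive "a" characters are replaced by "x".
exact = re.sub(r"a{2}", "x", "abc aabc aaabc")

# B3: runs of two to four "a" characters are collapsed to a single "a".
ranged = re.sub(r"a{2,4}", "a", "abc aabc aaabc aaaabc")

print(exact)   # abc xbc xabc
print(ranged)  # abc abc abc abc
```

In "aaabc" only the first two a characters form the a{2} match, which is why one a survives after the x.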
B4: [x] (square brackets for character sets)
>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTCGTTGCA
>Seq4
ATGCACGTTGGTTGCA
find: TT[ATCG]GTT
replace: NNNNNN
What did we learn: [ATCG] matches any single character within the set (A, T, C, or G).
- How would the output change if you used [AG] instead?
B5: [^X] (negation inside square brackets)
>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTTGTTGCA
>Seq4
ATGCACGTTGGTTGCA
find: TT[^A]GTT
replace: NNNNNN
What did we learn: [^A] matches any single character except A.
- How would this behave if you used [^CT]?
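The B4 and B5 examples list no output, so the following sketch shows one way to inspect it. It uses Python's `re.sub` (an assumption of this example) on the four sequences from B5.

```python
import re

fasta = """>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTTGTTGCA
>Seq4
ATGCACGTTGGTTGCA"""

# A character set: any of A, T, C, G is accepted in the middle position.
any_base = re.sub(r"TT[ATCG]GTT", "NNNNNN", fasta)

# A negated set: any character except A is accepted in the middle position.
not_a = re.sub(r"TT[^A]GTT", "NNNNNN", fasta)

print(any_base)
print(not_a)
```

With [ATCG] all four sequences are masked; with [^A] the first sequence (TTAGTT) is left unchanged.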
B6: [X-Y] (ranges inside square brackets)
>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATGCACGTTCGTTGCA
>Seq3
ATGCACGTTCGTTGCA
>Seq4
ATGCACGTTGGTTGCA
find: TT[A-Z]GTT
replace: NNNNNN
What did we learn: [A-Z] matches any single uppercase letter from A to Z.
- How would this behave if you used [a-Z] instead?
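A short Python sketch can help explore ranges. It is an illustration under the assumption that Python's `re` module is used; other regex engines may report the problem differently. Ranges follow character-code order, so [A-Z] does not cover lowercase letters, and a range written backwards in that order, such as [a-Z], is rejected by Python.

```python
import re

seq = "ATGCACGTTCGTTGCA"

# [A-Z] accepts any uppercase letter in the middle position.
masked = re.sub(r"TT[A-Z]GTT", "NNNNNN", seq)

# The same pattern does not match lowercase text.
lowercase_unchanged = re.sub(r"TT[A-Z]GTT", "NNNNNN", seq.lower())

# "a" comes after "Z" in character-code order, so this range is invalid in Python.
try:
    re.compile(r"[a-Z]")
    range_ok = True
except re.error as exc:
    range_ok = False
    print("invalid range:", exc)

print(masked, lowercase_unchanged, range_ok)
```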
B7: ^ (caret for start of string)
>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTGCA
>Seq3
ATCCACGTTCGTTGCA
>Seq4
ATCCACGTTGGTTGCA
find: ^ATG
replace: NNN
What did we learn: The caret (^) matches the start of a string; in an editor's multi-line search, that means the start of each line.
- What happens if you try ^ATT?
B8: $ (dollar for end of string)
>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTATG
>Seq3
ATCCACGTTCGTTGCA
>Seq4
ATCCACGTTGGTTATG
find: ATG$
replace: NNN
What did we learn: The dollar sign ($) matches the end of a string; in an editor's multi-line search, that means the end of each line.
- What happens if you try to find a string that starts and ends with ATG?
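Outside an editor, anchors may need an explicit switch to work line by line. The sketch below assumes Python's `re` module, where ^ and $ anchor to the whole string unless the `re.MULTILINE` flag is given. The two records are taken from the B7 and B8 inputs.

```python
import re

fasta = """>Seq1
ATGCACGTTAGTTGCA
>Seq2
ATTCACGTTCGTTATG"""

# With re.MULTILINE, ^ and $ match at the start and end of every line.
start_masked = re.sub(r"^ATG", "NNN", fasta, flags=re.MULTILINE)
end_masked = re.sub(r"ATG$", "NNN", fasta, flags=re.MULTILINE)

# Without the flag, ^ only matches at the very start of the text (here ">").
no_flag = re.sub(r"^ATG", "NNN", fasta)

print(start_masked)
print(end_masked)
print(no_flag == fasta)  # True: nothing was replaced
```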
C - Challenges
Do not worry, solutions are at the bottom!
C1: Simple Replace
- Find a pattern that allows you to modify the text so the sequence names are in the format >SEQ_01 and so on. Write your find and replace commands.
Original Text:
SEQ 01
AAAAAAAAAAAA
SEQ 02
CCCCCCCCCCCC
SEQ 03
GGGGGGGGGGGG
Expected Output:
>SEQ_01
AAAAAAAAAAAA
>SEQ_02
CCCCCCCCCCCC
>SEQ_03
GGGGGGGGGGGG
C2: Move
- Move X to the beginning of each line.
Original Text:
123X
456X
789X
Expected Output:
X123
X456
X789
C3: Re-arrange
- Re-arrange the order of elements to be C,A,B, c,a,b, and 3,1,2.
Original Text:
A,B,C
a,b,c
1,2,3
Expected Output:
C,A,B
c,a,b
3,1,2
C4: Re-format
- Abbreviate the first word to its initial.
Original Text:
Mus musculus
Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Expected Output:
M. musculus
A. elegans
F. vitiazi
C. tottoni
S. galacifolia
C5: Remove multiple characters
- Remove the hemisphere letters (N and E) and the space before them, keeping the degree symbol and the comma.
Original Text:
Zürich 47.3667° N, 8.5500° E
Basel 47.5667° N, 7.6000° E
St.Gallen 47.4167° N, 9.3667° E
Lausanne 46.5198° N, 6.6335° E
Lugano 46.0000° N, 8.9500° E
Expected Output:
Zürich 47.3667°, 8.5500°
Basel 47.5667°, 7.6000°
St.Gallen 47.4167°, 9.3667°
Lausanne 46.5198°, 6.6335°
Lugano 46.0000°, 8.9500°
C6: Re-arrange coordinates
- Re-arrange the coordinates into two columns.
- Convert the south and west coordinates to negative values.
Original Text:
21 17'24.68"N
157 51'41.50"W
38 30'36.62"N
28 17'16.87"W
8 59'53.30"S
157 58'13.70"W
10 24'47.84"N
51 21'54.61"E
22 52'41.65"S
48 9'46.62"E
Intermediate Output:
21 17'24.68"N 157 51'41.50"W
38 30'36.62"N 28 17'16.87"W
8 59'53.30"S 157 58'13.70"W
10 24'47.84"N 51 21'54.61"E
22 52'41.65"S 48 9'46.62"E
Expected Output:
21 17'24.68"N -157 51'41.50"
38 30'36.62"N -28 17'16.87"
-8 59'53.30" -157 58'13.70"
10 24'47.84"N 51 21'54.61"E
-22 52'41.65" 48 9'46.62"E
C7: Format header (gb)
- Extract the accession number (without version number) and species name.
Original Text:
>gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA
ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT
>gi|270271668|gb|GQ119987.1| Frillagalma sp. BO-2009 isolate Agfr06 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA
GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT
>gi|62866985|gb|AY937366.1| Cordagalma cordiforme cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial
AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG
GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT
Expected Output:
>KF962059 Agalma_elegans
GATTATAAGACTTGAGTTAGCAGGACCTGGAACAATGTTAGGAGATGATCATATTTATAACGTCGTAGTA
ACAGCCCATGCTTTTGTTATGATATTTTTCCTAGTTATGCCAGTCTTAATAGGGGGTTTTGGTAATTGAT
>GQ119987 Frillagalma_sp
AACTTTATATTTGGTTTTTGGTTTTTTTTCTGGTATGGTGGGAACTGCTTTGAGTATGTTAATTAGATTA
GAATTATCTAGTTCAGGTTCGATGTTTTGTGATGATCATTTATATAACGTAATTGTTACAGCACATGCTT
>AY937366 Cordagalma_cordiforme
AACATTATATATTATTTTCGGTTTATTTTCTGGTATGATAGGTACTAGTTTAAGTATGATTATTAGATTG
GAGTTGAGTAGTCCAGGAACAATGCTTGGAGATGATCATTTGTATAATGTTATTGTTACTGCCCACGCTT
C8: Remove empty lines
- Remove empty lines between the sequences.
Original Text:
>Seq1
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq2
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq3
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq4
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq5
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
Expected Output:
>Seq1
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq2
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq3
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq4
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
>Seq5
ATCGATCGATCGATCGATCGATCGATCGATCTGTTTTATCGACTGATGAC
C9: Adjust decimal
- Shorten the decimal to two places (the digits are cut off, not mathematically rounded).
Original Text:
3.14159265359
Expected Output:
3.14
C10: Reduce
- Reduce each run of d characters to a single d.
Original Text:
d dd ddd dddd ddddd ddddd
Expected Output:
d d d d d d
C11: Space
- Reduce each run of two or more spaces or hyphens to a single space (a single hyphen stays as it is).
Original Text:
1 2  3   4    5     6      7
1-2--3---4----5-----6------7
Expected Output:
1 2 3 4 5 6 7
1-2 3 4 5 6 7
C12: Poly A
- Remove the poly-A tail.
Original Text:
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
Expected Output:
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGC
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT
C13: Poly AA
- Remove the poly-AA tail.
Original Text:
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
Expected Output:
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTG
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCT
Solutions
C1: Simple Replace
find: ^SEQ (\d+)
replace: >SEQ_$1
C2: Move
find: (\d\d\d)(\w)
replace: $2$1
find: (\w)(\d)(\d+)
replace: $2$1$3
find: (\d)(\w)(\d)(\d)
replace: $1$3$2$4
find: (\d+)(\w)(\d)
replace: $1$3$2
C3: Re-arrange
find: (\w,\w),(\w)
replace: $2,$1
find: (\w),(\w),(\w)
replace: $3,$1,$2
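The capture-group solutions for C2 and C3 can be tried in Python as a rough check. This is a sketch under the assumption that Python's `re` module is used; there, groups in the replacement are written `\1`, `\2` instead of the editor's `$1`, `$2`.

```python
import re

# C2: capture the three digits and the trailing character, then swap them.
moved = re.sub(r"(\d\d\d)(\w)", r"\2\1", "123X\n456X\n789X")

# C3: capture the three comma-separated elements and reorder them.
rearranged = re.sub(r"(\w),(\w),(\w)", r"\3,\1,\2", "A,B,C\na,b,c\n1,2,3")

print(moved)
print(rearranged)
```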
C4: Re-format
find: (\w)\w+ (\w+)
replace: $1. $2
find: (\w)(\w+) (\w+)
replace: $1. $3 (variants: $1$2 keeps only the first word; $1_$3 gives, for example, A_elegans)
C5: Remove multiple characters
find: ° [NE]
replace: °
C6: Re-arrange
find: (\"[NS])\n
replace: $1\t
Note: \n matches the end of a line (use \r\n or \r if the file has Windows or old Mac line endings); \t inserts a tab.
find: ([0-9]+ [0-9 \' \" \.]+)[WS]
replace: -$1
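find: [NE]
replace:
Note: this last step is optional; it strips the remaining N and E letters, which the Expected Output above still shows.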
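The multi-step idea behind C6 (each result feeds the next step) can be sketched in Python. This assumes Python's `re` module, LF line endings, and a shortened input of six lines; the variable names are made up for this example, and `\1` is Python's spelling of the editor's `$1`.

```python
import re

coords = """21 17'24.68"N
157 51'41.50"W
8 59'53.30"S
157 58'13.70"W
10 24'47.84"N
51 21'54.61"E"""

# Step 1: join each latitude line (ending in N or S) with the line that follows it.
step1 = re.sub(r'("[NS])\n', r"\1\t", coords)

# Step 2: put a minus sign in front of west and south values and drop the letter.
step2 = re.sub(r"""([0-9]+ [0-9 '".]+)[WS]""", r"-\1", step1)

# Step 3 (optional): drop the remaining N and E letters.
step3 = re.sub(r"[NE]", "", step2)

print(step2)
print(step3)
```

Keeping each step in its own variable makes it easy to see which pattern is responsible when an intermediate result looks wrong.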
C7: Format header
find: (>)gi\|\d+\|gb\|(\w+)\.\d+\| (\w+) (\w+).*
replace: $1$2 $3_$4
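As a hedged check of the header reformatting, the sketch below runs a variant of the pattern in Python. Two details are choices made for this example: the dot before the version number is escaped (`\.\d+`), and the replacement puts a space between accession and species so that the result matches the Expected Output of C7. Python writes group references as `\1` rather than `$1`.

```python
import re

headers = [
    ">gi|608606245|gb|KF962059.1| Agalma elegans voucher XMAE1 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial",
    ">gi|270271668|gb|GQ119987.1| Frillagalma sp. BO-2009 isolate Agfr06 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial",
]

# Groups: 1 = ">", 2 = accession without version, 3 = genus, 4 = species.
pattern = r"(>)gi\|\d+\|gb\|(\w+)\.\d+\| (\w+) (\w+).*"

short = [re.sub(pattern, r"\1\2 \3_\4", h) for h in headers]
print(short)
```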
C8: Remove empty lines
find: ^\r?\n
replace:
C9: Adjust decimal
find: (\d\.)(\d{2})\d+
replace: $1$2
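It may help to see that this pattern cuts digits off rather than rounding them. The sketch below uses Python's `re.sub` (an assumption of this example); the second input value is made up to show the difference.

```python
import re

pattern = r"(\d\.)(\d{2})\d+"

# Keeps the digit before the point and the first two decimals, drops the rest.
pi_short = re.sub(pattern, r"\1\2", "3.14159265359")

# Hypothetical second value: true rounding would give 2.72, the regex gives 2.71.
e_short = re.sub(pattern, r"\1\2", "2.71828")

print(pi_short, e_short)
```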
C10: Reduce
find: d{2,}
replace: d
C11: Space
find: [ -]{2,}
replace: (a single space)
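One pattern that produces the Expected Output shown for C11 is a character set containing a space and a hyphen, repeated two or more times. The sketch below checks this with Python's `re.sub`; the shortened input and the choice of pattern are assumptions of this example.

```python
import re

text = "1 2  3   4    5\n1-2--3---4----5"

# A run of two or more spaces/hyphens becomes one space; single ones are untouched.
collapsed = re.sub(r"[ -]{2,}", " ", text)

print(collapsed)
```

The hyphen is placed last inside the brackets so that it is read as a literal character and not as a range.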
C12: Poly A
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
find: (\w+[TGC])A*
replace: $1
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGC
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAAT
C13: Poly AA
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA
find: (\w+[TGC])A*[TGC]A*
replace: $1
>SEQ01
ATGCGATCGACTGATCGATCGTGACTAGCTG
>SEQ02
ATCCCGATGGAATGATCGATCAAAACTAGCT
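Both tail-removal patterns rely on greedy matching followed by backtracking: \w+ first takes the whole sequence and then gives characters back until the rest of the pattern fits. A short Python sketch (Python's `re` module is an assumption of this example, with `\1` in place of `$1`) confirms the outputs shown above.

```python
import re

seqs = [
    "ATGCGATCGACTGATCGATCGTGACTAGCTGCAAAAAAAAAAAAAAAA",
    "ATCCCGATGGAATGATCGATCAAAACTAGCTAAATAAAAAAAAAAAAA",
]

# C12: keep everything up to the last non-A base, drop the trailing run of A.
poly_a = [re.sub(r"(\w+[TGC])A*", r"\1", s) for s in seqs]

# C13: additionally drop the last non-A base and any A run directly before it.
poly_aa = [re.sub(r"(\w+[TGC])A*[TGC]A*", r"\1", s) for s in seqs]

print(poly_a)
print(poly_aa)
```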