Learning Objectives
◇ Learn why remote servers are essential for large-scale data processing, collaboration, and secure access to shared computational resources.
◇ Gain the skills to securely transfer files and folders between a local machine and a remote server using the scp command.
◇ Develop the ability to perform basic operations like viewing, editing, and analyzing nucleotide sequence files on a remote server using command-line tools.
File Exchange
In this section you'll learn how to send files between your local computer and a remote server. You can have multiple Terminal windows open at the same time, allowing you to easily switch between working on your local computer and the remote server.
(1) Open a terminal on your local machine and create a text file.
echo "Let me see the world" > go.txt
(2) Now, use the scp command to securely copy the file to your home directory on the remote server.
scp go.txt guest??@gdc-vserver.ethz.ch:/home/guest??
Important: Replace guest?? with your actual guest username on the server.
(3) Once the file is sent, switch back to your remote terminal and check if the file arrived successfully.
ls -al ${HOME}
cat ${HOME}/go.txt
(4) Let’s add some text to the file while still on the remote server:
echo "I visited the GDC server and I liked it" >> ${HOME}/go.txt
(5) Now, let's download the updated file from the remote server to your local computer, but save it under a different name to avoid overwriting the original:
scp guest??@gdc-vserver.ethz.ch:/home/guest??/go.txt back.txt
Again, replace guest?? with your actual username.
(6) Check the contents of both the original and the renamed file on your local machine:
cat go.txt back.txt
You should see that back.txt contains the additional line you added on the remote server.
SCP Client
While scp is a great way to transfer files between remote systems via the terminal, some users may prefer a graphical interface. Tools like Cyberduck offer a user-friendly, GUI-based solution for file transfers, making it easier to navigate files visually without typing commands. Cyberduck supports a range of protocols, including SFTP (which is based on SSH, just like scp), making it a secure and reliable option. It's particularly helpful if you're not familiar with terminal commands or simply prefer drag-and-drop file management.
SCP Usage Guidance
There are a few good reasons to archive (e.g., zip) files or entire folders before using scp to transfer them. First, compression makes the data smaller, so it transfers faster, especially when you're dealing with large files or lots of small files. This also makes better use of your bandwidth, which matters when network capacity is limited or data travels over long distances. Zipping also keeps the directory structure, including subdirectories and file permissions, so everything stays the same during the transfer. It simplifies the process too: you only need to transfer one file, which reduces the risk of losing or corrupting individual files. Putting all the files into one archive also makes the transfer more reliable by reducing the chance of incomplete data transmission. Finally, even though scp already encrypts the connection, you can add password protection or encryption to the zip file itself, which gives you another layer of protection for sensitive data.
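As a minimal sketch of this advice, the block below packs a small folder into one compressed archive before transfer. It uses tar with gzip rather than zip (a swapped-in but closely related tool). The folder name scp_demo and its contents are made up for illustration, and the scp line is left as a comment because it needs your own guest account.

```shell
# Build a small example folder (hypothetical names) in the current directory.
mkdir -p scp_demo/project/data
echo "Let me see the world" > scp_demo/project/data/go.txt

# Pack and compress the whole folder into a single archive.
tar -czf scp_demo/project.tar.gz -C scp_demo project

# List the archive contents to confirm the directory structure was kept.
tar -tzf scp_demo/project.tar.gz

# Transfer the single archive (replace guest?? with your username):
# scp scp_demo/project.tar.gz guest??@gdc-vserver.ethz.ch:/home/guest??
# On the server, unpack it with: tar -xzf project.tar.gz
```

For sending a folder without archiving it first, scp -r copies a directory recursively, at the cost of one transfer per file.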
Spreadsheet Manipulations
In biology and other scientific fields, data is often stored in spreadsheets. While lots of people are happy using programs like Microsoft Excel to work with this data, relying on spreadsheet editors can be risky. Spreadsheets aren't always the best choice for large datasets: they can be slow, are prone to errors from manual manipulation, and make analyses hard to reproduce. In research, reproducibility is really important, and spreadsheet software doesn't always support it. It's difficult to track every action and ensure that others can replicate your analysis exactly. This is where the terminal offers a more reliable and efficient alternative, providing tools that keep your data handling both transparent and reproducible.
In this session, we'll take a look at the basics of working with spreadsheet data using a tool designed for this purpose: csvtk. It is a great toolkit for working with CSV files (Comma-Separated Values), which is the standard format for tabular data on the terminal. csvtk can convert Excel files to CSV, extract specific data, merge tables, create plots, and even run basic statistical analyses, all from the terminal. I'll show you how it works with an example.
(1) To get started, just download a dummy spreadsheet file. This file has all the data we'll be working with throughout the session. In the terminal, head to your home directory and download the file with these commands:
cd ${HOME}
curl -O https://www.gdc-docs.ethz.ch/UniBS/EvolutionaryGenetics/BioInf/TestFile.xlsx
You can inspect the file, but it won’t be human-readable in its current format:
head TestFile.xlsx
Unfortunately, this won't give you much insight as the file is in Excel format. csvtk is the tool for the job.
(2) To make the data easier to work with, we’ll convert the Excel file to CSV format using csvtk:
csvtk xlsx2csv -a TestFile.xlsx
This command shows you all the sheets in the Excel file, with each sheet representing a different set of data. For instance, you might see something like:
index sheet
1 weight
2 size
3 count
(3) Now, let’s extract each sheet as a separate CSV file. This will allow us to work with the data in a more manageable format:
csvtk xlsx2csv -i 1 -o TestFile_GW.csv TestFile.xlsx
csvtk xlsx2csv -i 2 -o TestFile_S.csv TestFile.xlsx
csvtk xlsx2csv -i 3 -o TestFile_C.csv TestFile.xlsx
This will create three separate files: one for group (i.e., sex) and weight, one for size, and one for counts.
(4) Once the sheets are extracted, we can combine them into a single file. This is really useful if you want to analyse or visualise the data together.
csvtk join TestFile_GW.csv TestFile_S.csv TestFile_C.csv > TestFile_GWSC.csv
We've now got a single file that combines sex, weight, size and count data.
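To make the idea of joining concrete, here is a small sketch that uses the standard coreutils join command instead of csvtk. The two toy files, their values, and the shared key column name (Sample) are invented for illustration. Unlike csvtk join, coreutils join expects both files to be sorted on the key column, and the --header option is specific to GNU coreutils.

```shell
# Two toy tables that share the key column "Sample" (values are made up).
printf 'Sample,Sex,Weight\nS1,F,10\nS2,M,12\n' > weight_toy.csv
printf 'Sample,Size\nS1,3\nS2,4\n' > size_toy.csv

# Join on the first column; --header treats line 1 of each file as a header.
join --header -t, weight_toy.csv size_toy.csv > joined_toy.csv

# Each sample now has its sex, weight, and size on one row.
cat joined_toy.csv
```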
(5) You can get a clear, well-formatted view of the data using the csvtk pretty command:
csvtk pretty -S 3line -m Sample TestFile_GWSC.csv
(6) To get a better idea of what the data looks like, we can create a simple box plot. I'll show you how to plot the differences in body size by sex.
csvtk plot box TestFile_GWSC.csv -g Sex -f Size --height 3 --width 5 --horiz --title "Body Size Difference" --xlab "Size" --ylab "Sex" > box_size.png
# Get a local copy of the plot:
# scp guest??@gdc-vserver.ethz.ch:/home/guest??/box_size.png .
(7) If you prefer a scatter plot showing the relationship between weight and size:
csvtk plot line TestFile_GWSC.csv -x "Weight" -y "Size" -g Sex --title "Scatter" --scatter > scatter_weight.png
csvtk can also crunch numbers for you. For instance, you can work out the relationships between different variables.
csvtk corr -f 2,3 TestFile_GWSC.csv
This will show you the correlation between weight and size, which might come back as 0.9272, indicating a strong positive link.
(8) You can also check the correlation between weight and count, or size and count.
csvtk corr -f 2,4 TestFile_GWSC.csv
csvtk corr -f 3,4 TestFile_GWSC.csv
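For readers curious about what a correlation coefficient involves, the sketch below computes a Pearson correlation with awk on a tiny made-up table. That csvtk corr uses exactly this formula is an assumption here. The toy values are perfectly linear, so the result should be 1.

```shell
# Toy data: Size is exactly twice Weight (values are made up).
printf 'Weight,Size\n1,2\n2,4\n3,6\n4,8\n' > corr_toy.csv

# Accumulate the sums needed for Pearson's r, skipping the header line.
awk -F, 'NR > 1 { n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
         END {
           r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
           printf "%.4f\n", r
         }' corr_toy.csv | tee corr_toy.txt
```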
These steps show just a few of the great features of csvtk. Using terminal tools like this helps you handle spreadsheet data more safely, efficiently, and, most importantly, in a way that's easy to reproduce. This makes sure that your data analysis is clear and can be repeated by others, which is really important in biological research and beyond. There are lots of other tools out there for working with spreadsheets, but csvtk is a great place to start if you want to work with tabular data on the command line.
Basic Sequence Examples
In this session, we'll take a look at how to use some simple but really effective commands to explore nucleotide sequences. These examples will show you how to interact with large sequence files in FASTA format, which is a common file type used to store nucleotide or protein sequences. This is really useful in bioinformatics, as it allows you to quickly analyse and manipulate sequence data.
(1) First, we'll use the curl command, which lets you transfer data from or to a server, to download a sample nucleotide sequence file. The file we're working with is called RDP_16S_Archaea_Subset.fasta. It contains 16S ribosomal RNA sequences of Archaea.
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/RDP_16S_Archaea_Subset.fasta
This command gets the file from the given URL and puts it in your current directory.
(2) Once you've downloaded the file, it's a good idea to check that it's downloaded correctly. You can check this by listing the file with detailed information, such as its size.
ls -lh RDP_16S_Archaea_Subset.fasta
The -lh options display the file size in a human-readable format, along with other useful metadata like permissions, date, and time of the download.
(3) Next, let's quickly go over what the file contains. To save time, we'll first take a quick look at the beginning and end of the file. We can view the first 10 lines using the head command.
head -n 10 RDP_16S_Archaea_Subset.fasta
Similarly, to check the last 10 lines, use the tail command:
tail -n 10 RDP_16S_Archaea_Subset.fasta
This gives us a snapshot of the file’s structure without overwhelming us with too much information at once.
(4) To get a better handle on the file, we can use the wc (word count) command to crunch some numbers and get a feel for the file's size and other stats. We'll also work out the number of lines, characters and the length of the longest line.
wc -lmL RDP_16S_Archaea_Subset.fasta
-l counts the number of lines.
-m counts the number of characters.
-L gives you the length of the longest line in the file.
You can use the man wc command to learn more about these options and others.
(5) In FASTA files, headers (which begin with the > character) provide descriptions of the sequences that follow. To extract all the headers from the file, we can use the grep command, which searches for patterns within text:
grep ">" RDP_16S_Archaea_Subset.fasta
This will return all the lines that contain >, which in a FASTA file are the header lines, allowing you to quickly see the list of sequences and their descriptions.
❖ Challenge: Can you figure out how to count the number of sequences in the file using grep?
Insights
Remember, there is never just one solution to a problem.
# (1a) Extract header and count the lines
grep ">" RDP_16S_Archaea_Subset.fasta | wc -l
# (1b) Count with grep
grep ">" -c RDP_16S_Archaea_Subset.fasta
❖ Challenge: Merge all the nucleotide sequences from RDP_16S_Archaea_Subset.fasta into one fasta sequence.
What we have:
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
What we need:
>Sequence
ATCGACGTCCCGT
ATCCACGTCTCGTTTTACTG
AACATCAC
ATCCACGTCTCGTNNTA
Insights
# We need a header for the new sequence
echo ">Sequence" > Sequences_Merged.fa
grep ">" -v RDP_16S_Archaea_Subset.fasta >> Sequences_Merged.fa
# -v (--invert-match) select non-matching lines
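The solution above keeps the original line breaks inside the merged record. If a single unbroken sequence line is wanted instead, one possible variation is to strip the newlines with tr. The sketch below uses the small three-record example from the challenge, saved under the made-up name toy.fa.

```shell
# Recreate the toy input from the challenge text.
printf '>Sequence #1\nATCGACGTCCCGT\n>Sequence #2\nATCCACGTCTCGTTTTACTG\nAACATCAC\n>Sequence #3\nATCCACGTCTCGTNNTA\n' > toy.fa

# Write one header, then all non-header lines with their newlines removed,
# then a final newline so the file ends cleanly.
{
  echo ">Sequence"
  grep -v ">" toy.fa | tr -d '\n'
  echo
} > toy_merged.fa

cat toy_merged.fa
```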
By using simple terminal commands, you can efficiently explore and manipulate sequence files without relying on specialized software. The commands demonstrated here (ls, head, tail, wc, and grep) are incredibly versatile and can be applied to many types of data beyond nucleotide sequences.
Extended Sequence Examples
In this section, we'll take a closer look at nucleotide sequences using a wider range of tools and commands. One really useful thing you can do is search for short sequence motifs, like stop codons, in a FASTA file. In this hands-on session, we'll show you how to work with motifs using basic terminal commands.
Motifs are short, recurring patterns in DNA or protein sequences that often have a biological function. For example, stop codons such as TAG, TGA, and TAA signal the end of a protein-coding sequence. Being able to find these motifs in your sequence data is essential for various analyses.
(1) To begin, let’s focus on a single sequence within our multi-sequence file. We’ll extract the first FASTA record from the file to create a more manageable dataset:
head -n 13 RDP_16S_Archaea_Subset.fasta > S000444351.fa
This command grabs the first 13 lines of RDP_16S_Archaea_Subset.fasta, which correspond to the first FASTA record. The resulting file, S000444351.fa, will be the one we work with for this example.
(2) Now that we have our first sequence, let’s search for a specific motif: the stop codon tag. We can use the grep command to search for occurrences of this motif in our extracted sequence:
grep "tag" -c S000444351.fa
The -c option counts the number of lines that contain the motif. So the output tells us that 9 lines contain the string "tag", but not how many occurrences of the motif there are in total, because several can appear on the same line.
(3) To better understand where and how the motif occurs within the sequence, use the following grep command to highlight all the matches in color:
grep "tag" --color S000444351.fa
This visual output makes it easier to spot each occurrence of the motif within the sequence.
(4) As noted earlier, using grep -c only counts the number of matching lines, not the total number of motifs within the sequence. If a line contains multiple instances of the motif, they won’t be counted individually. To accurately count all occurrences, we need a different approach:
grep -o "tag" S000444351.fa | wc -l
Here, the -o option tells grep to output only the matching parts of each line, one match per output line, and wc -l counts those lines (i.e., the number of individual matches). The output indicates that there are 10 occurrences of the tag motif, regardless of how many appear on each line.
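The difference between counting lines and counting matches is easy to check on a tiny made-up file where the answer can be verified by eye. The file name motif_toy.fa and its contents are hypothetical.

```shell
# Toy record: the second line holds two copies of "tag", the fourth holds one.
printf '>toy\ntagctagc\nacgtacgt\nttagg\n' > motif_toy.fa

# Count lines that contain the motif: prints 2.
grep -c "tag" motif_toy.fa

# Count every individual match: prints 3.
grep -o "tag" motif_toy.fa | wc -l
```

Note that grep -o reports non-overlapping matches within a line, so overlapping motifs would still be undercounted.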
(5) In biological sequences, it’s common to search for multiple motifs at once. For instance, if we want to find all stop codons (tag, tga, and taa), we can modify the grep command to search for multiple patterns:
grep -e "tag" -e "tga" -e "taa" -o S000444351.fa | wc -l
This command uses the -e option to specify multiple patterns and counts all occurrences of any of the stop codons. Alternatively, you can use egrep (equivalent to grep -E, and flagged as obsolescent by recent GNU grep releases) or modify the grep pattern to achieve the same result:
egrep "tag|tga|taa" -o S000444351.fa | wc -l
Or:
grep "tag\|tga\|taa" -o S000444351.fa | wc -l
These commands are functionally equivalent, and the output will show the total number of occurrences of all specified motifs.
❖ Challenge: We can use character classes with square brackets [] in grep to search for variations in patterns. But why would the following command not work as expected?
grep -e "t[ag][ag]" -o S000444351.fa | wc -l
Insights
The search term would match tag, tga, and taa, but also tgg, which we're not interested in.
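A short sketch on four made-up codons shows the over-matching, together with one possible pattern that covers only the three stop codons. The file name codons_toy.txt is hypothetical.

```shell
# One codon per line: three stop codons plus tgg (tryptophan).
printf 'tag\ntga\ntaa\ntgg\n' > codons_toy.txt

# The two character classes also accept tgg: prints 4.
grep -o "t[ag][ag]" codons_toy.txt | wc -l

# "ta" followed by a or g, or exactly tga: prints 3.
grep -o -E "ta[ag]|tga" codons_toy.txt | wc -l
```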
Searching for Primer Sites
In this session, we’ll extend our exploration of nucleotide sequences by searching for potential primer binding sites within a FASTA file. This is a crucial step in many molecular biology workflows, such as PCR or sequencing, where primers are designed to bind specific regions of DNA.
In the previous session, we focused on short 3-nucleotide motifs within a single sequence. Now, we will search for a longer, more specific primer sequence across all sequences in a multi-sequence FASTA file.
(1) Before searching for the primer site, it’s important to establish how many sequences are present in the file. This gives us a baseline for evaluating how many sequences contain the primer.
grep -c ">" RDP_16S_Archaea_Subset.fasta
The output indicates that there are 129 sequences in the FASTA file. Our goal is to determine how many of these sequences contain a full-length primer binding site without any mismatches.
(2) Let’s search for a potential primer site, 5'-GGCGTTAGTGCCCATCTAGT-3', to see how many sequences in the file contain this exact sequence. We will use grep to search the file:
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset.fasta
No matches were found. Why? The reason is that grep is case-sensitive by default, and nucleotide sequences in FASTA files are often represented in lowercase. So, unless the sequence in the file exactly matches the case of the primer sequence, it won’t be counted.
(3) To fix this, we need to make the entire file either uppercase or lowercase so that the case-sensitivity issue is no longer a factor. We can use the tr command to convert all lowercase characters to uppercase:
cat RDP_16S_Archaea_Subset.fasta | tr a-z A-Z > RDP_16S_Archaea_Subset_Caps.fasta
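As an alternative to converting the file, grep can ignore case with the -i option. The sketch below shows the effect on a one-record toy file (case_toy.fa is a made-up name). Converting with tr, as above, is still useful when later tools expect uppercase input.

```shell
# Toy record with the primer site written in lowercase.
printf '>seq1\nggcgttagtgcccatctagt\n' > case_toy.fa

# Case-sensitive search: no match, so the count is 0.
# (grep exits with a non-zero status when nothing matches, hence "|| true".)
grep -c "GGCGTTAGTGCCCATCTAGT" case_toy.fa || true

# Case-insensitive search with -i: the lowercase sequence now matches, count is 1.
grep -c -i "GGCGTTAGTGCCCATCTAGT" case_toy.fa
```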
Text Manipulation
There are many command-line utilities for text manipulation, and tr (translate) is one of them. It is part of the core utilities and available in all Linux distributions. The tr command reads a byte stream from standard input (stdin), translates or deletes characters, and then writes the result to standard output (stdout).
## Change / Remove with tr
echo "AUG" | tr U T # change Us (mRNA) into Ts (cDNA)
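echo "TAGCT ATCTT" | tr '[:space:]' '\n' # replace whitespace with newlines
echo "ATCGA TAGAA" | tr '[:space:]' '\t' # replace whitespace with tabs
echo "Tue 16.06.2020" | tr '[:punct:]' '/' # replace punctuation (here, the dots) with /
echo "Tue 16/06/2020" | tr -d '[:alpha:]' # remove letters
# The character classes are quoted so the shell cannot expand them as filename patterns.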
Tip: Have a look at the manual page (man tr) to get more details.
We can then check the first two lines of the modified file to confirm the change:
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
(4) With all sequences converted to uppercase, we can now re-run the search:
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta
We now get 27 hits. However, what if some sequences are broken across multiple lines? Since grep searches line by line, any occurrences of the primer site that span multiple lines will be missed.
(5) To address this, we need to reformat the FASTA file so that all sequence data is on a single line for each record. We can use a script to do this:
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/SingleFasta.sh
chmod a+x SingleFasta.sh
bash SingleFasta.sh RDP_16S_Archaea_Subset_Caps.fasta > RDP_16S_Archaea_Subset_Caps_Single.fa
Now, compare the first two lines of both the original and reformatted files:
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps_Single.fa
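To illustrate what such a reformatting step does, here is a minimal awk sketch that joins the wrapped lines of each record. It is an assumption about the general technique, not the contents of SingleFasta.sh, and the toy file names are made up.

```shell
# Toy FASTA in which the motif GTTT spans a line break in the first record.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' > wrapped_toy.fa

# On a header line: print the sequence collected so far, then the header.
# On any other line: append it to the current sequence.
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' wrapped_toy.fa > single_toy.fa

cat single_toy.fa

# The motif is missed in the wrapped file (count 0) and found afterwards (count 1).
grep -c "GTTT" wrapped_toy.fa || true
grep -c "GTTT" single_toy.fa
```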
(6) With the file now in single-line format, we can re-run the primer search:
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa
We now have 29 hits, which is slightly more than before, indicating that some primer sites were previously missed because they spanned multiple lines.
In many cases, it’s useful to allow for some variability in the primer sequence. For example, some positions in the primer may tolerate different nucleotides. We can introduce "wobble bases" by using square brackets [] to specify alternatives.
(7) Let’s modify the search to account for a wobble base at the 9th position, allowing for either T or G:
grep "GGCGTTAG[TG]GCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa
By relaxing the search criteria slightly, we now find 32 sequences that contain the primer site when either T or G is accepted at that position.
Using Primer3 for Finding PCR Primers
In this section, we reverse our approach by seeking out potential primer sequences rather than matching existing ones. We use Primer3 to identify these primer sequences.
(1) Start by downloading the Primer3 settings template:
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/PCR.settings.template
cat PCR.settings.template
(2) Prepare your query sequence in the input format Primer3 expects:
echo -n "SEQUENCE_TEMPLATE=" > add.tmp # Add ID tag
bash SingleFasta.sh S000444351.fa > S000444351.fasta # Remove eols
grep -v ">" S000444351.fasta >> add.tmp # Extract sequence without fasta header
echo "=" >> add.tmp # Ensure the file ends with an equal sign
(3) Add your query sequence to the Primer3 settings:
cat PCR.settings.template add.tmp > PCR.settings
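The record appended to the settings follows Primer3's tag=value input format: the sequence sits on one line after SEQUENCE_TEMPLATE=, and a line containing only = closes the record. The sketch below builds such a fragment from a made-up toy sequence, using tr instead of SingleFasta.sh to remove the line breaks. The file names are hypothetical.

```shell
# Toy query sequence wrapped over two lines.
printf '>toy\nACGTACGT\nTTGACCAA\n' > toy_query.fa

# Tag, then the sequence without header and line breaks, then the "=" terminator.
{
  printf 'SEQUENCE_TEMPLATE='
  grep -v ">" toy_query.fa | tr -d '\n'
  printf '\n=\n'
} > add_toy.tmp

cat add_toy.tmp
```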
(4) Execute Primer3 to find PCR primers:
primer3_core --default_version=1 --format_output --output primers.txt < PCR.settings
(5) Check the output file for primer sequences:
less primers.txt
# Use [q] to quit the less viewer
Using infoseq for Sequence Information
The infoseq tool from the EMBOSS package provides detailed information about sequence data.
(1) Ensure infoseq is available on your system and display basic usage information:
which infoseq
infoseq --help
(2) Report length of sequences:
infoseq -only -length -nocolumns RDP_16S_Archaea_Subset.fasta > L.tmp
less L.tmp
# Note: "-only -length" restricts the output to sequence lengths
(3) Length range:
sort -n L.tmp | head # lower range
sort -n L.tmp | tail # upper range
csvtk plot box L.tmp > box_size.png
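The range can also be found in a single pass without sorting. The awk sketch below tracks the smallest and largest value while reading. It assumes a file that contains only numbers, one per line, so any heading line would have to be removed first. The toy lengths are made up.

```shell
# Toy list of sequence lengths (values are made up).
printf '1350\n1420\n1287\n1502\n' > lengths_toy.tmp

# Start with the first value, then update min and max line by line.
awk 'NR == 1 { min = max = $1 }
     $1 < min { min = $1 }
     $1 > max { max = $1 }
     END { print "min:", min, "max:", max }' lengths_toy.tmp | tee range_toy.txt
```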
❖ Challenge: Find the range of %GC content for the RDP sequences.
Insights
## Calculate %GC Content
infoseq -only -pgc -nocolumns RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp
# -pgc: Calculate percent GC content (see infoseq --help for details)
# -n: Sort numerically
## Find Max and Min %GC
head -n 2 pGC.tmp # Display the smallest %GC content (two lines, since a heading line may sort first)
tail -n 1 pGC.tmp # Display the largest %GC content
csvtk plot box pGC.tmp > box_pgc.png
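For a sense of what the %GC value means, the sketch below computes it by hand for one short made-up sequence: count the G and C characters, then divide by the total length. This is an illustration of the definition, not a description of how infoseq computes it.

```shell
# Toy sequence: 4 of its 6 bases are G or C.
seq="ATGCGC"

# Keep only G and C characters (either case) and count them.
gc=$(printf '%s' "$seq" | tr -cd 'GCgc' | wc -c)

# Total sequence length.
len=${#seq}

# Percent GC with one decimal place: prints 66.7.
awk -v gc="$gc" -v len="$len" 'BEGIN { printf "%.1f\n", 100 * gc / len }' | tee gc_toy.txt
```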
Fun Time
The Terminal doesn't have to be all work and no play! Here are some fun tricks and tweaks to break up the routine and make your command line experience more enjoyable.
One-Liner Tweak
Do you love Linux?
yes "I love Linux!"
Roll a die:
shuf -i 1-6 -n 1
Prime factors of the numbers 10, 50, and 100:
factor 10 50 100
Progress Bar:
echo "Some long task in progress..." | pv -qL 5
Take a string and reverse the order of its characters:
echo "123456789" | rev
echo "a man, a plan, a canal, panama" | tr -d '[:punct:]' | tr -d ' ' | rev
Reverse and swap the nucleotide bases (A ↔ T, G ↔ C):
echo "ATGCAT" | rev | tr 'ATGC' 'TACG'
Tick .. tick
# Print the current time every 3 seconds in an infinite loop.
while true; do echo "$(date '+%T')"; sleep 3; done
# Stop it by pressing [Ctrl] + [C].
Count Down
for i in {60..1}; do echo "$i seconds remaining"; sleep 1; done; echo "Time is up"
A place by the fire
aafire
Weather Check
# Get the current weather for a city.
curl wttr.in/zurich
Fun fact: There are several cities named Zurich worldwide, like Zurich in Rooks County, Kansas! If you don’t specify, you might see the weather for a different Zurich depending on your location.
Steam Engine
Everyone knows 'ls' lists files, but have you tried 'sl' for a fun surprise?
sl # Try typing 'sl' by mistake and see what happens!
What is the cow saying?
Cowsay generates ASCII art of a cow with your message in a speech bubble.
# Give a direct message to the cow.
cowsay "Hello there!"
# Make the cow say text from a file.
echo "tfjzf uif npp-nfou!" | tr 'b-x' 'a-y' > moo.txt
cowsay `cat moo.txt`
# Customize the cow! Meet your tuxedo cow.
ls | cowsay -f tux # There are many more cow characters to explore!
# Get a fortune from a cow.
clear ; fortune | cowsay -f eyes
# See all the different cows available.
for i in $(cowsay -l); do cowsay -f $i "$i"; done
## Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay
Snake Game
# A classic snake game.
nsnake