Linux 102 - Genetic Diversity Centre (GDC)

Learning Objectives

◇ Learn why remote servers are essential for large-scale data processing, collaboration, and secure access to shared computational resources.
◇ Gain the skills to securely transfer files and folders between a local machine and a remote server using the scp command.
◇ Develop the ability to perform basic operations like viewing, editing, and analyzing nucleotide sequence files on a remote server using command-line tools.

File Exchange

In this section you'll learn how to send files between your local computer and a remote server. You can have multiple Terminal windows open at the same time, allowing you to easily switch between working on your local computer and the remote server.

(1) Open a terminal on your local machine and create a text file.

echo "Let me see the world" > go.txt

(2) Now, use the scp command to securely copy the file to your home directory on the remote server.

scp go.txt guest??@gdc-vserver.ethz.ch:/home/guest??

Important: Replace guest?? with your actual guest username on the server.

(3) Once the file is sent, switch back to your remote terminal and check if the file arrived successfully.

ls -al ${HOME}
cat ${HOME}/go.txt

(4) Let’s add some text to the file while still on the remote server:

echo "I visited the GDC server and I liked it" >> ${HOME}/go.txt

(5) Now, let's download the updated file from the remote server to your local computer, but save it under a different name to avoid overwriting the original:

scp guest??@gdc-vserver.ethz.ch:/home/guest??/go.txt back.txt

Again, replace guest?? with your actual username.

(6) Check the contents of both the original and the renamed file on your local machine:

cat go.txt back.txt

You should see that back.txt contains the additional line you added on the remote server.

SCP Usage Guidance

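When you need to transfer many files or whole folders, consider bundling them into a compressed archive (for example with tar or zip) before running scp. This has several advantages:
Faster transfers: Compressing files reduces their size, which can significantly speed up transfers, especially for large data sets or folders containing many small files.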
Preserves structure and metadata: Archiving preserves the original directory structure, file hierarchy and permissions, ensuring that nothing is lost or altered in transit.
Streamlined process: Bundling everything into a single archive simplifies transfer and minimises the risk of missing, partial or corrupted files.
Increased reliability: Transferring one file instead of many reduces the chance of errors or interruptions that can occur with bulk file transfers.
Improved security: While scp encrypts files during transfer, adding encryption or a password to the archive itself provides an extra layer of protection - particularly useful for sensitive or confidential data.
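
The points above can be tried out locally. This is a minimal sketch, assuming tar with gzip support is available; the folder name demo_project and its files are made up for illustration, and the scp line is left as a comment because it needs a real account:

```shell
# Create a small example folder (hypothetical names, for illustration only).
mkdir -p demo_project/data
echo "sample one" > demo_project/data/s1.txt
echo "sample two" > demo_project/data/s2.txt

# Bundle and compress the folder into a single archive file.
tar -czf demo_project.tar.gz demo_project

# List the archive contents to confirm the directory structure is preserved.
tar -tzf demo_project.tar.gz

# Transfer the archive (not run here; replace guest?? with your username):
# scp demo_project.tar.gz guest??@gdc-vserver.ethz.ch:/home/guest??

# Unpack the archive at its destination, here a local folder for demonstration.
mkdir -p unpacked
tar -xzf demo_project.tar.gz -C unpacked
cat unpacked/demo_project/data/s1.txt
```

The same tar -xzf command would be run on the remote server after the transfer.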

A Note on Security

When you connect to a remote server with scp or ssh, your data travels over the network through an encrypted connection. Using SSH keys instead of passwords is a safer and more convenient way to authenticate.

Smarter File Transfers

Copying files with scp works well for quick, one-off transfers. But as your projects grow, you might want tools that are a bit more flexible. Two useful alternatives are rsync and sftp:

- rsync is great when you are copying large directories or updating files. It only transfers the parts that have changed, which saves time and bandwidth. You can also use it to mirror folders between your computer and a server.

- sftp (Secure File Transfer Protocol) works like a file browser over the terminal. You can connect to a server, list files, navigate directories, and upload or download files interactively—similar to using a graphical tool like Cyberduck or FileZilla.
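
To get a feel for how rsync only sends what has changed, you can try it between two local folders first. This is a sketch under assumptions: the folder names sync_src and sync_dest are made up, the remote target path in the comment is hypothetical, and the block skips the demonstration if rsync is not installed:

```shell
# Two local folders stand in for "your computer" and "the server" (made-up names).
mkdir -p sync_src sync_dest
echo "first version" > sync_src/notes.txt

if command -v rsync >/dev/null 2>&1; then
    # First run: everything in sync_src is copied (-a keeps permissions and times).
    rsync -av sync_src/ sync_dest/

    # Add a file and run again: only the new file needs to be sent.
    echo "new file" > sync_src/extra.txt
    rsync -av sync_src/ sync_dest/

    # Remote form (not run here; replace guest?? with your username):
    # rsync -av sync_src/ guest??@gdc-vserver.ethz.ch:/home/guest??/sync_dest/
else
    echo "rsync is not installed on this machine"
fi
```

The trailing slash on sync_src/ means "the contents of this folder"; without it, rsync would create a sync_src folder inside the destination.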

Graphical Alternatives - SCP Client

While scp is a great way to transfer files between remote systems via the terminal, some users may prefer a graphical interface for simplicity. Tools such as Cyberduck or FileZilla provide an easy-to-use GUI-based solution for file transfers, making it easier to navigate files visually without having to rely on command line switches. Cyberduck supports multiple protocols, including SFTP (which uses SSH, like scp), providing a secure and reliable option for users who prefer drag-and-drop file management or are less familiar with the Terminal.

Spreadsheet Manipulations

In biology and other scientific fields, data is often stored in spreadsheets. While lots of people are happy using programs like Microsoft Excel to work with this data, relying on spreadsheet editors can be risky. Spreadsheets aren't always the best for working with large datasets. They can be slow, prone to errors from manual manipulation, and lack the ability to be reproduced. In research, reproducibility is really important, and spreadsheet software doesn't always support this. It's difficult to track every action and ensure that others can replicate your analysis exactly. This is where the terminal offers a more reliable and efficient alternative, providing tools that ensure your data handling is both transparent and reproducible.

In this session, we'll take a look at the basics of working with spreadsheet data using a tool designed for this purpose: csvtk. It is a great toolkit for working with CSV files (Comma-Separated Values), which is the standard format for tabular data on the terminal. Csvtk can convert Excel files to CSV, extract specific data, merge tables, create plots, and even run basic statistical analyses, all from the terminal. I'll show you how it works with an example.

(1) To get started, just download a dummy spreadsheet file. This file has all the data we'll be working with throughout the session. In the terminal, head to your home directory and download the file with these commands:

cd ${HOME}
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/TestFile.xlsx

You can inspect the file, but it won’t be human-readable in its current format:

head TestFile.xlsx

Unfortunately, this won't give you much insight as the file is in Excel format. csvtk is the tool for the job.

(2) To make the data easier to work with, we'll convert the Excel file to CSV format using csvtk. First, list the sheets the file contains:

csvtk xlsx2csv -a TestFile.xlsx

This command shows you all the sheets in the Excel file, with each sheet representing a different set of data. For instance, you might see something like:

index   sheet
1   weight
2   size
3   count

(3) Now, let’s extract each sheet as a separate CSV file. This will allow us to work with the data in a more manageable format:

csvtk xlsx2csv -i 1 -o TestFile_GW.csv TestFile.xlsx
csvtk xlsx2csv -i 2 -o TestFile_S.csv  TestFile.xlsx
csvtk xlsx2csv -i 3 -o TestFile_C.csv  TestFile.xlsx

This will create three separate files: one for group (i.e., sex) and weight, one for size, and one for counts.

(4) Once the sheets are extracted, we can combine them into a single file. This is really useful if you want to analyse or visualise the data together.

csvtk join TestFile_GW.csv TestFile_S.csv TestFile_C.csv > TestFile_GWSC.csv

We've now got a single file that combines sex, weight, size and count data.

(5) You can get a clear, well-formatted view of the data using the csvtk pretty command:

csvtk pretty -S 3line -m Sample TestFile_GWSC.csv

(6) To get a better idea of what the data looks like, we can create a simple box plot. I'll show you how to plot the differences in body size by sex.

csvtk plot box TestFile_GWSC.csv -g Sex -f Size --height 3 --width 5 --horiz --title "Body Size Difference" --xlab "Size" --ylab "Sex" > box_size.png
# Get a local copy of the plot:
# scp guest??@gdc-vserver.ethz.ch:/home/guest??/box_size.png .

(7) If you prefer a scatter plot showing the relationship between weight and size:

csvtk plot line TestFile_GWSC.csv -x "Weight" -y "Size" -g Sex --title "Scatter" --scatter > scatter_weight.png

csvtk can also crunch numbers for you. For instance, you can work out the relationships between different variables.

csvtk corr -f 3,4 TestFile_GWSC.csv

This will show you the correlation between weight and size, which might come back as 0.9272, indicating a strong positive link.
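
To see what such a number means, the calculation can be reproduced by hand. The sketch below assumes that csvtk corr reports the Pearson correlation coefficient; it uses awk on a tiny made-up table (toy_corr.csv is not part of the course data):

```shell
# A tiny made-up table; the values are illustrative, not taken from TestFile.xlsx.
cat > toy_corr.csv <<'EOF'
Sample,Weight,Size
s1,1,1
s2,2,3
s3,3,2
s4,4,5
EOF

# Pearson r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
awk -F, 'NR > 1 {
    n++; sx += $2; sy += $3
    sxy += $2 * $3; sxx += $2 * $2; syy += $3 * $3
}
END {
    r = (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    printf "%.4f\n", r
}' toy_corr.csv > toy_corr.out
cat toy_corr.out   # prints 0.8315
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate none.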

(8) You can also check the correlation between size and count, or weight and count.

csvtk corr -f 4,5 TestFile_GWSC.csv
csvtk corr -f 3,5 TestFile_GWSC.csv

These steps show just a few of the great features of csvtk. Using terminal tools like this helps you handle spreadsheet data more safely, efficiently, and—most importantly—in a way that's easy to reproduce. This makes sure that your data analysis is clear and can be repeated by others, which is really important in biological research and beyond. There are lots of other tools out there for working with spreadsheets, but csvtk is a great place to start if you want to work with tabular data on the command line.

Basic Sequence Examples

In this session, we'll take a look at how to use some simple but really effective commands to explore nucleotide sequences. These examples will show you how to interact with large sequence files in FASTA format, which is a common file type used to store nucleotide or protein sequences. This is really useful in bioinformatics, as it allows you to quickly analyse and manipulate sequence data.

(1) First, we'll use the curl command, which lets you transfer data from or to a server, to download a sample nucleotide sequence file. The file we're working with is called RDP_16S_Archaea_Subset.fasta. It contains 16S ribosomal RNA sequences of Archaea.

curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/RDP_16S_Archaea_Subset.fasta

This command gets the file from the given URL and puts it in your current directory.

(2) Once you've downloaded the file, it's a good idea to check that it's downloaded correctly. You can check this by listing the file with detailed information, such as its size.

ls -lh RDP_16S_Archaea_Subset.fasta

The -lh options display the file size in a human-readable format, along with other useful metadata like permissions, date, and time of the download.

(3) Next, let's quickly go over what the file contains. To save time, we'll first take a quick look at the beginning and end of the file. We can view the first 10 lines using the head command.

head -n 10 RDP_16S_Archaea_Subset.fasta

Similarly, to check the last 10 lines, use the tail command:

tail -n 10 RDP_16S_Archaea_Subset.fasta

This gives us a snapshot of the file’s structure without overwhelming us with too much information at once.

(4) To get a better handle on the file, we can use the wc (word count) command to crunch some numbers and get a feel for the file's size and other stats. We'll also work out the number of lines, characters and the length of the longest line.

wc -lmL RDP_16S_Archaea_Subset.fasta

-l counts the number of lines.
-m counts the number of characters.
-L gives you the length of the longest line in the file.

You can use the man wc command to learn more about these options and others.

(5) In FASTA files, headers (which begin with the > character) provide descriptions of the sequences that follow. To extract all the headers from the file, we can use the grep command, which searches for patterns within text:

grep ">" RDP_16S_Archaea_Subset.fasta

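This will return every line that contains >. In a FASTA file these are normally only the header lines, so you quickly get the list of sequences and their descriptions. To match strictly at the start of a line, use the pattern "^>" instead.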
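
Headers often carry more text than you need. As a small sketch on a made-up file (toy_headers.fa and its identifiers are invented for illustration), the header lines can be reduced to the bare sequence identifiers with cut and tr:

```shell
# A made-up two-record FASTA file.
cat > toy_headers.fa <<'EOF'
>S0001 Methanococcus example
acgtacgt
ggcc
>S0002 Halobacterium example
ttgacc
EOF

# "^>" anchors the match to the start of the line, so only header lines are returned.
grep "^>" toy_headers.fa

# Keep only the identifier: first space-separated field, without the leading ">".
grep "^>" toy_headers.fa | cut -d ' ' -f 1 | tr -d '>'
```

This assumes that the identifier is the first word of the header, which is common but not guaranteed for every FASTA file.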

❖ Challenge: Can you figure out how to count the number of sequences in the file using grep?

Insights

Remember, there is never just one solution to a problem.

# (1a) Extract header and count the lines
grep ">" RDP_16S_Archaea_Subset.fasta | wc -l
# (1b) Count with grep
grep ">" -c RDP_16S_Archaea_Subset.fasta

❖ Challenge: Merge all the nucleotide sequences from RDP_16S_Archaea_Subset.fasta into one fasta sequence.

What we have:

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA

What we need:

>Sequence
ATCGACGTCCCGT
ATCCACGTCTCGTTTTACTG
AACATCAC
ATCCACGTCTCGTNNTA

Insights

# We need a header for the new sequence
echo ">Sequence" > Sequences_Merged.fa
grep ">" -v RDP_16S_Archaea_Subset.fasta >> Sequences_Merged.fa
# -v (--invert-match) select non-matching lines

By using simple terminal commands, you can efficiently explore and manipulate sequence files without relying on specialized software. The commands demonstrated here (ls, head, tail, wc, and grep) are incredibly versatile and can be applied to many types of data beyond nucleotide sequences.

Extended Sequence Examples

In this section, we'll take a closer look at nucleotide sequences using a wider range of tools and commands. One really useful thing you can do is search for short sequence motifs, like stop codons, in a FASTA file. In this hands-on session, we'll show you how to work with motifs using basic terminal commands.

Motifs are short, recurring patterns in DNA or protein sequences that often have a biological function. For example, stop codons such as TAG, TGA, and TAA signal the end of a protein-coding sequence. Being able to find these motifs in your sequence data is essential for various analyses.

(1) To begin, let’s focus on a single sequence within our multi-sequence file. We’ll extract the first FASTA record from the file to create a more manageable dataset:

head -n 13 RDP_16S_Archaea_Subset.fasta > S000444351.fa

This command grabs the first 13 lines of RDP_16S_Archaea_Subset.fasta, which corresponds to the first FASTA sequence. The resulting file, S000444351.fa, will be the one we work with for this example.
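
Counting lines by hand only works if you already know how long the first record is. As an alternative sketch (not the approach used in this course), awk can pick out the first record whatever its length; toy_records.fa is a made-up file for illustration:

```shell
# A made-up FASTA file with two records.
cat > toy_records.fa <<'EOF'
>rec1
acgt
ttga
>rec2
ggcc
EOF

# Count header lines as they appear; print lines only while the count is 1.
awk '/^>/ { n++ } n == 1' toy_records.fa > toy_first.fa
cat toy_first.fa
```

Changing n == 1 to n == 2 would extract the second record instead.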

(2) Now that we have our first sequence, let's search for a specific motif: the stop codon tag. We can use the grep command to search for occurrences of this motif in our extracted sequence:

grep "tag" -c S000444351.fa

The -c option counts the number of lines that contain the motif. The output tells us that 9 lines contain the string "tag", but it does not tell us how many times the motif occurs in total, because a line with several matches is still counted only once.

(3) To better understand where and how the motif occurs within the sequence, use the following grep command to highlight all the matches in color:

grep "tag" --color S000444351.fa

This visual output makes it easier to spot each occurrence of the motif within the sequence.

(4) As noted earlier, using grep -c only counts the number of matching lines, not the total number of motifs within the sequence. If a line contains multiple instances of the motif, they won’t be counted individually. To accurately count all occurrences, we need a different approach:

grep -o "tag" S000444351.fa | wc -l

Here, the -o option tells grep to output only the matching parts of each line, one match per output line, and wc -l counts those lines (i.e., the number of individual matches). In this example the output is 10, meaning there are 10 occurrences of the tag motif, regardless of how many appear on each line.
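
The difference between counting lines and counting matches is easy to verify on a toy file (toy_motif.fa is made up; its first sequence line contains the motif twice):

```shell
# Made-up sequence lines: the first contains "tag" twice, the second once.
cat > toy_motif.fa <<'EOF'
>toy
ttagctagaa
cctagg
acgtacgt
EOF

grep -c "tag" toy_motif.fa           # prints 2: two lines contain the motif
grep -o "tag" toy_motif.fa | wc -l   # counts 3 individual matches
```

Note that grep -o reports non-overlapping matches only, which is fine for a motif like tag that cannot overlap with itself.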

(5) In biological sequences, it’s common to search for multiple motifs at once. For instance, if we want to find all stop codons (tag, tga, and taa), we can modify the grep command to search for multiple patterns:

grep -e "tag" -e "tga" -e "taa" -o S000444351.fa | wc -l

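This command uses the -e option to specify multiple patterns and counts all occurrences of any of the stop codons. Alternatively, you can use extended regular expressions with grep -E (the older egrep command does the same but is marked obsolescent in recent GNU grep releases) or modify the basic grep pattern to achieve the same result:

grep -E "tag|tga|taa" -o S000444351.fa | wc -l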

Or:

grep "tag\|tga\|taa" -o S000444351.fa | wc -l

These commands are functionally equivalent, and the output will show the total number of occurrences of all specified motifs.

❖ Challenge: We can use character classes with square brackets [] in grep to search for variations in patterns. But why would the following command not work as expected?

grep -e "t[ag][ag]" -o S000444351.fa | wc -l

Insights

The search term would look for tag, tga, taa, but also include tgg, which we're not interested in.
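
A one-line toy sequence makes the difference visible (toy_class.fa is made up and contains one tgg and one taa):

```shell
printf '>toy\nttggtaa\n' > toy_class.fa

grep -o "t[ag][ag]" toy_class.fa       # prints tgg and taa
grep -oE "tag|tga|taa" toy_class.fa    # prints taa only
```

The character classes match every combination of a and g in the second and third position, so the explicit list of alternatives is the safer choice here.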

Searching for Primer Sites

In this session, we’ll extend our exploration of nucleotide sequences by searching for potential primer binding sites within a FASTA file. This is a crucial step in many molecular biology workflows, such as PCR or sequencing, where primers are designed to bind specific regions of DNA.

In the previous session, we focused on short 3-nucleotide motifs within a single sequence. Now, we will search for a longer, more specific primer sequence across all sequences in a multi-sequence FASTA file.

(1) Before searching for the primer site, it’s important to establish how many sequences are present in the file. This gives us a baseline for evaluating how many sequences contain the primer.

grep -c ">" RDP_16S_Archaea_Subset.fasta

The output indicates that there are 129 sequences in the FASTA file. Our goal is to determine how many of these sequences contain a full-length primer binding site without any mismatches.

(2) Let's search for a potential primer site 5'-GGCGTTAGTGCCCATCTAGT-3' to see how many sequences in the file contain this exact sequence. We will use grep to search the file:

grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset.fasta

No matches were found. Why? The reason is that grep is case-sensitive by default, and nucleotide sequences in FASTA files are often represented in lowercase. So, unless the sequence in the file exactly matches the case of the primer sequence, it won’t be counted.
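
The effect can be reproduced on a toy file (toy_case.fa is made up and holds the primer site in lowercase). The sketch also shows grep's -i option, which ignores case; it is mentioned here only as an illustration, the course itself continues with a converted file:

```shell
printf '>toy\naaggcgttagtgcccatctagtaa\n' > toy_case.fa

# Uppercase pattern against lowercase sequence: prints 0.
# grep exits with a non-zero status when nothing matches, hence "|| true".
grep -c "GGCGTTAGTGCCCATCTAGT" toy_case.fa || true

# -i makes the comparison case-insensitive: prints 1.
grep -ci "GGCGTTAGTGCCCATCTAGT" toy_case.fa
```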

(3) To fix this, we need to make the entire file either uppercase or lowercase so that the case-sensitivity issue is no longer a factor. We can use the tr command to convert all lowercase characters to uppercase:

cat RDP_16S_Archaea_Subset.fasta | tr a-z A-Z > RDP_16S_Archaea_Subset_Caps.fasta

Text Manipulation

There are many command-line utilities for text manipulation, and tr (translate) is one of them. It is part of the core utilities and available in all Linux distributions. The tr command reads a byte stream from standard input (stdin), translates or deletes characters, and then writes the result to standard output (stdout).

## Change / Remove with tr
echo "AUG" | tr U T  # change Us (mRNA) into Ts (cDNA)

We can also use built-in character set aliases for the translation:

echo "TAGCT ATCTT"    | tr '[:space:]' '\n' # replace whitespace with newlines
echo "ATCGA TAGAA"    | tr '[:space:]' '\t' # replace whitespace with tabs
echo "Tue 16.06.2020" | tr '[:punct:]' '/'  # replace punctuation (here .) with /
echo "Tue 16/06/2020" | tr -d '[:alpha:]'   # remove letters

Tip: Have a look at the manual page (man tr) to get more details.

We can then check the first two lines of the modified file to confirm the change:

head -n 2 RDP_16S_Archaea_Subset_Caps.fasta

(4) With all sequences converted to uppercase, we can now re-run the search:

grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta

We now get 27 hits. However, what if some sequences are broken across multiple lines? Since grep searches line by line, any occurrences of the primer site that span multiple lines will be missed.

(5) To address this, we need to reformat the FASTA file so that all sequence data is on a single line for each record. We can use a script to do this:

curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/SingleFasta.sh
chmod a+x SingleFasta.sh
bash SingleFasta.sh RDP_16S_Archaea_Subset_Caps.fasta > RDP_16S_Archaea_Subset_Caps_Single.fa

Now, compare the first two lines of both the original and reformatted files:

head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps_Single.fa

(6) With the file now in single-line format, we can re-run the primer search:

grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa

We now have 29 hits, which is slightly more than before, indicating that some primer sites were previously missed because they spanned multiple lines.
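
The content of SingleFasta.sh is not shown here, so the following is only a stand-in that illustrates the idea with awk on a made-up file (toy_wrapped.fa), in which the primer site is split across two lines:

```shell
# Made-up record whose primer site is split across two lines.
cat > toy_wrapped.fa <<'EOF'
>seq1
AAGGCGTTAGTG
CCCATCTAGTAA
>seq2
ACGTACGT
EOF

# Join all sequence lines of a record into one line; headers pass through unchanged.
awk '/^>/ { if (seq != "") print seq; seq = ""; print; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' toy_wrapped.fa > toy_single.fa

# Wrapped file: prints 0 (grep exits non-zero on no match, hence "|| true").
grep -c "GGCGTTAGTGCCCATCTAGT" toy_wrapped.fa || true
# Single-line file: prints 1.
grep -c "GGCGTTAGTGCCCATCTAGT" toy_single.fa
```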

In many cases, it’s useful to allow for some variability in the primer sequence. For example, some positions in the primer may tolerate different nucleotides. We can introduce "wobble bases" by using square brackets [] to specify alternatives.

(7) Let's modify the search to account for a wobble base at the 9th position, allowing for either T or G:

grep "GGCGTTAG[TG]GCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa

By relaxing the search criteria slightly, we now find 32 sequences that contain the primer site when either T or G is accepted at the wobble position.
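
A toy file with three variants of the primer site shows what the bracket expression accepts (toy_wobble.fa and its record names are made up):

```shell
# Three made-up records: exact site, G variant, and A variant at the wobble position.
cat > toy_wobble.fa <<'EOF'
>exact
AAGGCGTTAGTGCCCATCTAGTAA
>variant_G
AAGGCGTTAGGGCCCATCTAGTAA
>variant_A
AAGGCGTTAGAGCCCATCTAGTAA
EOF

grep -c "GGCGTTAGTGCCCATCTAGT" toy_wobble.fa      # prints 1: exact site only
grep -c "GGCGTTAG[TG]GCCCATCTAGT" toy_wobble.fa   # prints 2: T or G accepted, A is not
```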

Using Primer3 for Finding PCR Primers

In this section, we reverse our approach by seeking out potential primer sequences rather than matching existing ones. We use Primer3 to identify these primer sequences.

(1) Start by downloading the Primer3 settings template:

curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/PCR.settings.template
cat PCR.settings.template

(2) Create a file with your query sequence formatted for Primer3:

echo -n "SEQUENCE_TEMPLATE="      > add.tmp           # Add ID tag
bash SingleFasta.sh S000444351.fa > S000444351.fasta  # Remove eols
grep -v ">" S000444351.fasta      >> add.tmp          # Extract sequence without fasta header
echo "=" >> add.tmp                                   # Ensure the file ends with an equal sign

(3) Add your query sequence to the Primer3 settings:

cat PCR.settings.template add.tmp > PCR.settings
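
Primer3 reads its input as TAG=VALUE lines, and a line containing only = ends the record. The sketch below builds such a sequence entry from a made-up toy sequence; toy_query.fa and toy_add.tmp are hypothetical names, and tr -d is used as a stand-in for SingleFasta.sh:

```shell
printf '>toy\nacgtacgt\nttgacc\n' > toy_query.fa

printf 'SEQUENCE_TEMPLATE=' > toy_add.tmp               # tag, no newline yet
grep -v ">" toy_query.fa | tr -d '\n' >> toy_add.tmp    # sequence on a single line
printf '\n=\n' >> toy_add.tmp                           # end-of-record marker
cat toy_add.tmp
```

The result is a two-line fragment: the SEQUENCE_TEMPLATE line followed by the closing = line.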

(4) Execute Primer3 to find PCR primers:

primer3_core --default_version=1 --format_output --output primers.txt < PCR.settings

(5) Check the output file for primer sequences:

less primers.txt
# Use [q] to quit the less viewer

Using infoseq for Sequence Information

The infoseq tool from the EMBOSS package provides detailed information about sequence data.

(1) Ensure infoseq is available on your system and display basic usage information:

which infoseq
infoseq --help

(2) Report length of sequences:

infoseq -only -length -nocolumns RDP_16S_Archaea_Subset.fasta > L.tmp
less L.tmp
# Note: "-only -length" restricts the output to sequence lengths

(3) Length range:

sort -n L.tmp | head # lower range
sort -n L.tmp | tail # upper range
csvtk plot box L.tmp > box_size.png
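
If you only need a few summary numbers, sort and awk are enough. A small sketch on made-up values (toy_lengths.tmp stands in for a file like L.tmp, assumed to hold one length per line):

```shell
# Made-up sequence lengths, one per line.
printf '1450\n980\n1502\n1210\n' > toy_lengths.tmp

sort -n toy_lengths.tmp | head -n 1   # shortest: prints 980
sort -n toy_lengths.tmp | tail -n 1   # longest: prints 1502

# Mean length: sum all values and divide by the number of lines (prints 1285.5).
awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }' toy_lengths.tmp
```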

❖ Challenge: Find the range of %GC content for the RDP sequences.

Insights

## Calculate %GC Content
infoseq -only -pgc -nocolumns RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp
# -pgc: Calculate percent GC content (see infoseq --help for details)
# -n: Sort numerically

## Find Max and Min %GC
head -n 2 pGC.tmp  # Display the smallest %GC content
tail -n 1 pGC.tmp  # Display the largest %GC content
csvtk plot box pGC.tmp > box_pgc.png

Fun Time

The Terminal doesn't have to be all work and no play! Here are some fun tricks and tweaks to break up the routine and make your command line experience more enjoyable.

One-Liner Tweak

Do you love Linux?

yes "I love Linux!"

Roll a die:

shuf -i 1-6 -n 1

Prime factors of the numbers 10, 50, and 100:

factor 10 50 100

Progress Bar:

echo "Some long task in progress..." | pv -qL 5

Take a string and reverse the order of its characters:

echo "123456789" | rev
echo "a man, a plan, a canal, panama" | tr -d '[:punct:]' | tr -d ' ' | rev

Reverse and swap the nucleotide bases (A ↔ T, G ↔ C):

echo "ATGCAT" | rev | tr 'ATGC' 'TACG'
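
ATGCAT happens to be its own reverse complement, so the output above looks identical to the input. A sketch with a non-palindromic sequence, wrapped in a small helper function (the name revcomp is made up) that also handles lowercase bases:

```shell
# Reverse complement that handles upper- and lowercase input.
revcomp() {
    rev | tr 'ATGCatgc' 'TACGtacg'
}

echo "ATGCCC" | revcomp   # prints GGGCAT
echo "atgccc" | revcomp   # prints gggcat
```

Characters outside the listed bases, such as N, are passed through unchanged.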

Tick .. tick

# Print the current time every 3 seconds in an infinite loop.
while true; do echo "$(date '+%T')"; sleep 3; done
# Stop it by pressing [Ctrl] + [C].

Count Down

for i in {60..1}; do echo "$i seconds remaining"; sleep 1; done; echo "Time is up"

A place by the fire

aafire

Weather Check

# Get the current weather for a city.
curl wttr.in/zurich

Fun fact: There are several cities named Zurich worldwide, like Zurich in Rooks County, Kansas! If you don’t specify, you might see the weather for a different Zurich depending on your location.

Steam Engine

Everyone knows 'ls' lists files, but have you tried 'sl' for a fun surprise?

sl  # Try typing 'sl' by mistake and see what happens!

What is the cow saying?

Cowsay generates ASCII art of a cow with your message in a speech bubble.

# Give a direct message to the cow.
cowsay "Hello there!"

# Make the cow say text from a file.
echo "tfjzf uif npp-nfou!" | tr 'b-x' 'a-w' > moo.txt
cowsay "$(cat moo.txt)"

# Customize the cow! Meet your tuxedo cow.
ls | cowsay -f tux  # There are many more cow characters to explore!

# Get a fortune from a cow.
clear ; fortune | cowsay -f eyes

# See all the different cows available.
for i in $(cowsay -l | tail -n +2); do cowsay -f "$i" "$i"; done  # tail skips the "Cow files in ..." header line

## Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay

Snake Game

# A classic snake game.
nsnake
