Ssh - Genetic Diversity Centre (GDC)

Learning Objectives

Main
◇ Understand how and why we would connect to a (remote) server.
◇ Manage the transfer of data files to and from a server, including validation.
Minor
◇ Understand and learn how to work with archive files.

Lecture Notes

⬇︎ Remote Terminal

Remote Server

To handle the data of NGS projects, we often reach limits with the private computer. What we need is access to servers that have more processors, more storage space, and more memory. Using the Secure Shell (SSH) protocol, you can access a remote server to perform shell commands.

Guest Rules

For the remote terminal tutorial, you will be a guest on the GDC vm server. Please follow a few simple rules!

GDC VM SERVER RULES

➽ Do not share your login detail with anyone.
➽ Only use the guest account that has been assigned to you.
➽ Do not use your account for anything other than your course assignments.
➽ Not to store any important or confidential data.
➽ If you have any problems or need help, ask!

Note: Your student account will expire at the end of the course.

Access Remote (VM) Server

Login to GDC VM (virual machine) server at ETH using ssh:

ssh -X guest??@gdc-vserver.ethz.ch # replace ?? with your guest number

After you run the ssh command, the server will ask you for a password. For security reasons, you will not see what you type. You can copy and paste your password to avoid typos.

Once connected, repeat some of the terminal commands you have already learnt. Here are some ideas:

pwd                    # You should be in your home
echo ${HOME}           # This is the path to your home  
echo ${USER}           # Username of guest account
mkdir TestDir          # Create Folder
touch TestDir/test.tmp # Create a file

Terminate Remote Access

When you are done, use exit to disconnect from the server. Closing the terminal would also work but it would be less nice.

Explore Remote Server

Have a quick look around:

## New Resources
df -h               # Disk usage
lscpu | head -n 9   # System information e.g. number of CPUs
free -m             # Information about the available memory
cat /etc/os-release | head -n 2 # Linux OS

You might have new and/or different commands available:

## New command `tree`
mkdir -p ${HOME}/TEST1/TEST2/TEST3/TEST4
tree

You are also not alone anymore:

## Monitor traffic
users # Who is all here?
top   # press [q] to quit top
htop  # simlar to top - press [q] to quit htop.

File Exchange

Open a new local terminal. Yes, you can have multiple terminals open at the same time. Create a local file and send it to your remote home directory on the vm server. Check your remote directory if the file arrived.

## Create a text files - on your computer (local terminal)
echo "Let me see the world" > go.txt

## Send the file to the server
scp go.txt guest??@gdc-vserver.ethz.ch:/home/guest??
# Do not forget to change <guest??> accordingly

Go back to the remote terminal, find the file, and change it.

## Find the file (remote terminal)
ls -al ${HOME}
cat ${HOME}/go.txt
echo "I visited the GDC server and I liked it" >> ${HOME}/go.txt

Next, we download the uploaded file from the GDC server to our local working directory but change the file name. This is important to avoid the original file from being overwritten.

## Get the file back but rename it (local terminal) 
scp guest??@gdc-vserver.ethz.ch:/home/guest??/go.txt back.txt

## Let us have a look
cat go.txt back.txt

(S)FTP Client

A convenient GUI based alternative to up- or download (exchange) files from or to a remote server is via a (S)FTP client like Cyberduck.

Applied Sequence Examples

We use curl (a tool to transfer data from or to a server) to download the multi-nucleotide sequence file RDP_16S_Archaea_Subset.fasta and explore the file.

## (Remote) Working Directory
mkdir -p ${HOME}/ssh
cd ${HOME}/ssh

## Download the fasta file:
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/RDP_16S_Archaea_Subset.fasta
ls -lh RDP_16S_Archaea_Subset.fasta

## Have a look at the fasta file:
head -n 10 RDP_16S_Archaea_Subset.fasta
tail -n 10 RDP_16S_Archaea_Subset.fasta

## How big is the file
ls -lh RDP_16S_Archaea_Subset.fasta

## Some statistics
wc -lmL RDP_16S_Archaea_Subset.fasta
# Wonder what you see? Try: man wc

The sequence file is a text file in fasta format. A header starts with > and ends with end of line. There is one header line per sequences but the number and/or length of the sequence is not fixed.

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
atccacgtctcgtnnta

❖ Challenge #1: We can use the grep command to extract all the header lines:

grep ">" RDP_16S_Archaea_Subset.fasta

The command grep will find and print lines matching the specified search pattern. In our case, we grabbed all fasta headers. Now, can you find a way to count the number of sequences in the multi-fasta file using grep?

Suggestion #1

Remember, there is never just one solution to a problem.


    ## 1a - Extract header and count the lines
    grep ">" RDP_16S_Archaea_Subset.fasta | wc -l
    ## 1b - Count with grep
    grep ">" -c RDP_16S_Archaea_Subset.fasta

❖ Challenge #2: Merge all the fasta sequences from RDP_16S_Archaea_Subset.fasta into one fasta sequence.

What we have:

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA

What we need:

>Sequence
ATCGACGTCCCGT
ATCCACGTCTCGTTTTACTG
AACATCAC
ATCCACGTCTCGTNNTA

Suggestion #2


    # We need a header for the new sequence
    echo ">Sequence" > Sequences_Merged.fa
    grep ">" -v RDP_16S_Archaea_Subset.fasta >> Sequences_Merged.fa
    # -v (--invert-match) select non-matching lines

Fasta Manipulations

Following some fasta specific manipulations. They are a bit more advanced but also more applied.

## -----------------------------
## Find Forward Primer Site
## -----------------------------

## Count the number of sequences **starting with ">"** 
grep -c "^>" RDP_16S_Archaea_Subset.fasta
# ⇨ N: 129 | There are 129 sequences in the fasta file
# Note: The caret (^) specifies the search and finds only lines that start with the search pattern. 

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset.fasta
# ⇨ N: 0 | That is not possible 
# Problem: The search is case sensitive and we also have small (lower case) letters in the sequence that will be igonred.

We can fix the problem by making all sequences either lowercase or uppercase.

There are many command line utilities for text manipulation at your disposal and tr (translate) is one of them. It is a member of the core utilities and available in all Linux distros. The tr command reads a byte stream from standard input (stdin), translates or deletes characters, then writes the result to the standard output (stdout).

## Change / Remove with tr
echo "AUG" | tr U T  # change Us (mRNA) into Ts (cDNA)

We can also use built-in character set aliases for the translation:

echo "TAGCT ATCTT"    | tr [:space:] '\n' # replace space with newline
echo "ATCGA TAGAA"    | tr [:space:] '\t' # replace space with tabs
echo "Tue 16.06.2020" | tr [:punct:] '/'  # replace . with /
echo "Tue 16/06/2020" | tr -d [:alpha:]   # remove letters

Note: Have a look at the manual page (man tr) to get more details.

Since tr does not read a file directly, we need to pipe the file content to tr or redirect the file to stdin if we want to apply tr to a text file.

## All CAPS (actg -< ACTG) with piping
cat RDP_16S_Archaea_Subset.fasta | tr a-z A-Z > RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta

## All CAPS (actg -> ACTG) with redirecting
tr a-z A-Z < RDP_16S_Archaea_Subset.fasta > RDP_16S_Archaea_Subset_Caps2.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps2.fasta

# Note:
# Do not forget to redirect "<" the file. The tr command transforms a string or 
# deletes characters from a string. It works on the content but not the file itself.  
# A valid alternative would be: cat RDP_16S_Archaea_Subset.fa | tr a-z A-Z

Again back to the actual problem of finding sequence motifs.

## Let us count again - Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 27 | Better but what about multiple-line sequences?
# Limit: Grep will only find matches per line.
#        It is possible some primer sites are interrupted by a new line (wrapped). 

## Convert to a single-line fasta 
# get the script:
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/SingleFasta.sh
# make it executable
chmod a+x SingleFasta.sh
# run the script
bash SingleFasta.sh RDP_16S_Archaea_Subset_Caps.fasta > RDP_16S_Archaea_Subset_Caps_Single.fa 
# SingleFasta.sh is simple script to remove new lines at the end of sequences
# in fasta files.

## Compare
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps_Single.fa

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa
# ⇨ N: 29 sequences carry the primer site.

## What about degenerated primer sites?
grep "GGCGTTAG[TG]GCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 32 sequences carry primer site with one wobble base

# Note: [] is a container with options, in this examples it means
#       it is either a T or a G

## -----------------------------
## Find Sequence Motif
## -----------------------------

## Extract a specific sequence
grep "S003477566" RDP_16S_Archaea_Subset_Caps_Single.fa -A 1 > S003477566.fa
# Grab header and one extra line below

## Find stop codon
grep "TAG" --color S003477566.fa

## Find multiple stop codons
grep -e "TAG" -e "TGA" -e "TAA" --color S003477566.fa

# Note: The container works on single charater but not on strings.
#       We have to use multiple search terms. 

## Alternative solution for multiple string searches:

## Exdended grep (if installed)
egrep "TAG|TGA|TAA" --color S003477566.fa
# Meaning: "TAG" OR "TGA" OR "TAA"

## Grep but with escape characters 
grep "TAG|TAG|TAA" --color S003477566.fa
# Meaning: "TAG|TAG|TAA"
grep "TAG\|TAG\|TAA" --color S003477566.fa
# Meaning: "TAG" OR "TGA" OR "TAA"

## -----------------------------
## Find PCR Primer 
## -----------------------------

# We use PRIMER3 to find PCR primer sites
# primer3_core -help
# Source: https://primer3.org

# 1. Download PRIMER3 Settings
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/PCR.settings.template
cat PCR.settings.template

# 2. Create Query-Sequence Entry
echo -n "SEQUENCE_TEMPLATE=" >> add.tmp # ID tag
grep    ">" -v S003477566.fa >> add.tmp # Add only sequence without fasta header
echo    "="                  >> add.tmp # Important - file has to end with the equal sign

# 3. Add Sequence to Setting
cat PCR.settings.template add.tmp > PCR.settings

# 4. Run PRIMER3
primer3_core -default_version=1 -output primers.txt < PCR.settings

# 5. Pick Primers
less primers.txt
# remember: quit less with [q]

## -----------------------------
## Sequences info
## -----------------------------

## We will use infoseq from the EMBOSS package
which infoseq

## Basic information about infoseq
infoseq --help

## Report Sequence Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | less

# Note: Be default infoseq would print a lot of infomation 
#       with "only and length" we restrict it to sequence length only. 

## Sort the Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n | less

# for help: infoseq -help
which infoseq # infoseq is part of the EMBOSS package

Challenge 3: Can you determine the %GC content range of the RDP sequences?

Suggestion #3


    ## Again, there is not just one solution but many. Here is one:

    ## Get %GC Content
    infoseq -only -pgc RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp
    # pgc - percent GC content (see infoseq --help)
    # n - we sort the infoseq output numerically

    ## Max/Min %GC
    head -n 2 pGC.tmp; tail -n 1 pGC.tmp
    # Note: Because we sorted the output, min and max are at the top or the bottom

Graphics

Agreed the terminal is not the best friend if it comes to graphics but with a bit of help it might work.

## Sequence Length Sorted
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n > L.tmp

## Remove first Line 
grep "Length" -v L.tmp > L_clean.tmp

## Now we can use this file to plot some simple histograms
bash textHistogram.sh L_clean.tmp

# Note: textHistogram.sh is a simple bash script installed on our servers.
#       Help: bash textHistogram.sh

Challenge 4: Create a text histogram for %GC?

Suggestion #4


    # Remove Header
    grep "%GC" -v pGC.tmp > pGC_clean.tmp
    # Histogram
    bash textHistogram.sh pGC_clean.tmp

Fun Time

Terminal is not all boring!

## ----------------------
## One Liner Tweak
## ----------------------

## Good question
[ where is my brain?

## ----------------------
## Standard Funny
## ----------------------

## Reverse cat
echo -e "1\n2\n3" > 123.tmp
cat 123.tmp
tac 123.tmp

## Reverse string
echo "123456789" | rev
## Reverse complement sequence
echo "ATGCAT" | rev | tr [ATGC] [TACG]

## Prime factors of a number
factor 10 50 100

## Tick .. tick
while true; do echo "$(date '+%T')"; sleep 5; done
# stop it with [ctrl] + [c]  

## Weather
curl wttr.in/zurich

# Fun fact: There a different cities in the world called Zurich (e.g. city in Rooks County, Kansas). You might get a different one depending on your location.

## ----------------------
## Extra Funny
## ----------------------

## We know now the command ls (list) but what about sl (steam locomotive)?
sl

## What does the cow say?

# provide text directly
cowsay "Helloooo"

# Text from file
echo "tfjzf uif npp-nfou!" | tr 'b-x' 'a-y'> moo.txt
cowsay `cat moo.txt`

# Your Tux
ls | cowsay -f tux

# Get a fortune
clear ; fortune | cowsay -f eyes

# Show all cowfiles
for i in $(cowsay -l); do cowsay -f $i "$i"; done

# Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay