Remote Terminal

Lecture Notes

To process NGS projects, the own computer is often not sufficient. We depend on compute and data servers with more power, larger storage space and secure data management. With an access to a server we extend and improve the data processing. We use SSH (secure shell) to connect to a server via the terminal. SSH is a protocol through which you can access a remote server and run shell commands. SSH is encrypted with Secure Sockets Layer (SSL), which makes it difficult for these communications to be intercepted and read.

For the remote terminal tutorial you will be a guest on the GDC compute server. Please respect a few simple rules!

GDC SERVER RULES

➽ Do not share your login information with anybody.
➽ Use only the student guest account assigned to you.
➽ To not use your account for anything else then your course assignments.
➽ Do not store any important or confidential data.
➽ If you have problems or need help, ask!

Note: Your student account will expire by the end of the course.

Access Remote Server GDCSRV2

Login to GDC compute server at ETH using ssh:

ssh -X student??@gdcsrv2.ethz.ch # replace ?? with your student number

After you executed the ssh command the server will ask for a password. For security reasons, you will not see what you type. You can copy & paste your password to prevent typos.

Once you are connected, repeat a few terminal commands you have already learned:

pwd                    # You should be in your home
echo ${HOME}           # This is the path to your home  
echo ${USER}           # Username of guest account
mkdir TestDir          # Create Folder
touch TestDir/test.tmp # Create a file

Remote Server

Have a quick look around:

## New Resources
df -h              # Disk usage
lscpu              # System information e.g. number of CPUs
free -m            # Information about the available memory
ll /usr/local/bin  # Installed applications

You might have new and/or different commands available:

## New command `tree`
mkdir -p TEST1/TEST2/TEST3/TEST4
tree
## Colorful disk usage summary (https://github.com/muesli/duf)
duf  # Disk usage (https://github.com/muesli/duf)

You are also not alone anymore:

## Monitor traffic
top   # press q to quit top

File Exchange

Open a new local terminal. Yes, you can have multiple terminals open at the same time. Create a local file and send it to your remote home directory on the gdcsrv2 server. Check your remote directory if the file arrived.

## Create a text files - on your computer (local terminal)
echo "Let me see the world" > go.txt

## Send the file to the server
scp go.txt student??@gdcsrv2.ethz.ch:/gdc_home/student??
# Do not forget to change <student??> accordingly

Go back to the remote terminal, find the file, and change it.

## Find the file (remote terminal)
ls -al ${HOME}
cat ${HOME}/go.txt
echo "I visited the GDC server and I liked it" >> ${HOME}/go.txt

Next, we download the uploaded file from the GDC server to our local working directory but change the file name. This is important to avoid the original file from being overwritten.

## Get the file back but rename it (local terminal) 
scp student01@gdcsrv2.ethz.ch:/gdc_home/student01/go.txt back.txt

## Let us have a look
cat go.txt back.txt

(S)FTP Client

A convenient GUI based alternative to up- or download (exchange) files from or to a remote server is via a (S)FTP client like Cyberduck.

More Terminal

We use curl (a tool to transfer data from or to a server) to download the multi-nucleotide sequence file RDP_16S_Archaea_Subset.fasta and explore the file.

## Working Directory
mkdir -p ${HOME}/GDA/ssh
cd ${HOME}/GDA/ssh

## Download the fasta file:
curl -O http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/data/RDP_16S_Archaea_Subset.fasta
ls -lh RDP_16S_Archaea_Subset.fasta

## Have a look at the fasta file:
head -n 10 RDP_16S_Archaea_Subset.fasta
tail -n 10 RDP_16S_Archaea_Subset.fasta

## How big is the file
ls -lh RDP_16S_Archaea_Subset.fasta

## Some statistics
wc -lmL RDP_16S_Archaea_Subset.fasta
# Wonder what you see? Try: man wc

The sequence file is a text file in fasta format. A header starts with > and ends with end of line. There is one header line per sequences but the number and/or length of the sequence is not fixed.

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
atccacgtctcgtnnta

❖ Challenge #1: We can use the grep command to extract all the header lines:

grep ">" RDP_16S_Archaea_Subset.fasta

The command grep will find and print lines matching the specified search pattern. In our case, we grabbed all fasta headers. Now, can you find a way to count the number of sequences in the multi-fasta file using grep?

Suggestion #1

Remember, there is never just one solution to a problem.


    ## 1a - Extract header and count the lines
    grep ">" RDP_16S_Archaea_Subset.fasta | wc -l
    ## 1b - Count with grep
    grep ">" -c RDP_16S_Archaea_Subset.fasta

❖ Challenge #2: Merge all the fasta sequences from RDP_16S_Archaea_Subset.fasta into one fasta sequence.

What we have:

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA

What we need:

>Sequence
ATCGACGTCCCGT
ATCCACGTCTCGTTTTACTG
AACATCAC
ATCCACGTCTCGTNNTA

Suggestion #2


    # We need a header for the new sequence
    echo ">Sequence" > Sequences_Merged.fa
    grep ">" -v RDP_16S_Archaea_Subset.fasta >> Sequences_Merged.fa
    # -v (--invert-match) select non-matching lines

Fasta Manipulations

Following some fasta specific manipulations. They are a bit more advanced but also more applied.

## -----------------------------
## Find Forward Primer Site
## -----------------------------

## Count the number of sequences **starting with ">"** 
grep -c "^>" RDP_16S_Archaea_Subset.fasta
# ⇨ N: 129 | There are 129 sequences in the fasta file
# Note: The caret (^) specifies the search and finds only lines that start with the search pattern. 

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset.fasta
# ⇨ N: 0 | That is not possible 
# Problem: The search is case sensitive and we have small letters in the sequence.  

## Change / Remove with tr
# A few examples:
echo "AUG" | tr U T                       # change Us (mRNA) into Ts (cDNA)
echo "TAGCT ATCTT"    | tr [:space:] '\n' # replace space with newline
echo "ATCGA TAGAA"    | tr [:space:] '\t' # replace space with tabs
echo "Tue 16.06.2020" | tr [:punct:] '/'  # replace . with /
echo "Tue 16/06/2020" | tr -d [:alpha:]   # remove letters

## All CAPS (actg -> ACTG)
tr a-z A-Z < RDP_16S_Archaea_Subset.fasta > RDP_16S_Archaea_Subset_Caps.fasta
head RDP_16S_Archaea_Subset_Caps.fasta -n 2

# Note:
# Do not forget to redirect "<" the file. The tr command transforms a string or 
# deletes characters from a string. It works on the content but not the file itself.  
# A valid alternative would be: cat RDP_16S_Archaea_Subset.fa | tr a-z A-Z

## Let us count again - Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 27 | Better but what about multiple-line sequences?
# Limit: Grep will only find matches per line.
#        It is possible some primer sites are interrupted by a new line. 

## Convert to a single-line fasta 
bash SingleFasta.sh RDP_16S_Archaea_Subset_Caps.fasta > RDP_16S_Archaea_Subset_Caps_Single.fa 
# SingleFasta.sh is simple script to remove new lines at the end of sequences
# in fasta files.

## Compare
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps_Single.fa

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa
# ⇨ N: 29 sequences carry the primer site.

## What about degenerated primer sites?
grep "GGCGTTAG[TG]GCCCATCTA" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 34 sequences carry primer site with one wobble base

# Note: [] is a container with options, in this examples it means
#       it is either a T or a G

## -----------------------------
## Find Sequence Motif
## -----------------------------

## Extract a specific sequence
grep "S003477566" RDP_16S_Archaea_Subset_Caps_Single.fa -A 1 > S003477566.fa
# Grab header and one extra line below

## Find stop codon
grep "TAG" --color S003477566.fa

## Find multiple stop codons
grep -e "TAG" -e "TGA" -e "TAA" --color S003477566.fa

# Note: The container works on single charater but not on strings.
#       We have to use multiple search terms. 

## Alternative solution for multiple string searches:

## Exdended grep (if installed)
egrep "TAG|TGA|TAA" --color S003477566.fa
# Meaning: "TAG" OR "TGA" OR "TAA"

## Grep but with escape characters 
grep "TAG|TAG|TAA" --color S003477566.fa
# Meaning: "TAG|TAG|TAA"
grep "TAG\|TAG\|TAA" --color S003477566.fa
# Meaning: "TAG" OR "TGA" OR "TAA"

## -----------------------------
## Find PCR Primer 
## -----------------------------

# We use PRIMER3 to find PCR primer sites
# primer3_core -help

# 1. Download PRIMER3 Settings
curl -O http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/PCR.settings.template
cat PCR.settings.template

# 2. Create Query-Sequence Entry
echo -n "SEQUENCE_TEMPLATE=" >> add.tmp # ID tag
grep    ">" -v S003477566.fa >> add.tmp # Add only sequence without fasta header
echo    "="                  >> add.tmp # Important - file has to end with the equal sign

# 3. Add Sequence to Setting
cat PCR.settings.template add.tmp > PCR.settings

# 4. Run PRIMER3
primer3_core -default_version=1 -output primers.txt < PCR.settings

# 5. Pick Primers
less primers.txt

## -----------------------------
## Sequences info
## -----------------------------

## We will use infoseq from the EMBOSS package
which infoseq

## Basic information about infoseq
infoseq --help

## Report Sequence Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | less

# Note: Be default infoseq would print a lot of infomation 
#       with "only and length" we restrict it to sequence length only. 

## Sort the Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n | less

# for help: infoseq -help
which infoseq # infoseq is part of the EMBOSS package

Challenge 3: Can you determine the %GC content range of the RDP sequences?

Suggestion #3


    ## Again, there is not just one solution but many. Here is one:

    ## Get %GC Content
    infoseq -only -pgc RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp
    # pgc - percent GC content (see infoseq --help)
    # n - we sort the infoseq output numerically

    ## Max/Min %GC
    head -n 2 pGC.tmp; tail -n 1 pGC.tmp
    # Note: Because we sorted the output, min and max are at the top or the bottom

Graphics

Agreed the terminal is not the best friend if it comes to graphics but with a bit of help it might work.

## Sequence Length Sorted
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n > L.tmp

## Remove first Line 
grep "Length" -v L.tmp > L_clean.tmp

## Now we can use this file to plot some simple histograms
textHistogram -binSize=50 -maxBinCount=100 L_clean.tmp

# Note: textHistogram is a nice little script installed on our servers.
#       Help: /bin/textHistogram

Challenge 4: Create a text histogram for %GC?

Suggestion #4


    # Remove Header
    grep "%GC" -v pGC.tmp > pGC_clean.tmp
    # Histogram
    textHistogram -binSize=2 -maxBinCount=40 pGC_clean.tmp

Fun Time

Terminal is not all boring!

## ----------------------
## One Liner Tweak
## ----------------------

## Good question
[ where is my brain?

## ----------------------
## Standard Funny
## ----------------------

## Reverse cat
echo -e "1\n2\n3" > 123.tmp
cat 123.tmp
tac 123.tmp

## Reverse string
echo "123456789" | rev
## Reverse complement sequence
echo "ATGCAT" | rev | tr [ATGC] [TACG]

## Prime factors of a number
factor 10 50 100

## Tick .. tick
while true; do echo "$(date '+%T')"; sleep 5; done
# stop it with [ctrl] + [c]  

## Weather
curl wttr.in/zurich

## ----------------------
## Extra Funny
## ----------------------

## We know now the command ls (list) but what about sl (steam locomotive)?
sl

## What does the cow say?

# provide text directly
cowsay "Helloooo"

# Text from file
cowsay `cat text.txt`

# Your Tux
ls | cowsay -f tux

# Get a fortune
clear ; fortune | cowsay -f eyes

# Show all cowfiles
for i in $(cowsay -l); do cowsay -f $i "$i"; done

# Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay