Remote Terminal

Lecture Notes

⬇︎ SSH

SSH (Secure Shell) is a protocol through which you can access a remote server and run shell commands. The connection is encrypted by the SSH protocol itself (it does not rely on SSL/TLS), which makes it difficult for these communications to be intercepted and read.

For the SSH tutorial you will be a guest on the GDC compute server. Please respect a few simple rules!

GDC SERVER RULES

➽ Do not share your login information with anybody.
➽ Use only the student guest account assigned to you.
➽ Do not use your account for anything other than your course assignments.
➽ Do not save any important or confidential data on the server.
➽ If you have problems or need help, ask!

Note: Your student account will expire by the end of the course.

Access Remote Server GDCSRV2

Log in to the GDC compute server at ETH using ssh:

ssh -X student??@gdcsrv2.ethz.ch # replace ?? with your student number

After you execute the ssh command, the server will ask for a password. For security reasons, nothing is shown on screen while you type it. You can copy & paste your password to prevent typos.

Once you are connected, try a few terminal commands you learned before:

pwd              # You should be in your home directory
echo ${HOME}     # This is the path to your home directory
echo ${USER}     # Username of guest account

Have A Look Around

## New Resources
df -h              # Report disk space usage
lscpu; free -m     # CPU architecture (lscpu) and memory usage in MiB (free -m)
ll /usr/local/bin  # Installed applications (ll is a common alias for ls -l)

## New Commands
mkdir -p TEST1/TEST2/TEST3/TEST4
tree

## More Traffic
top   # press q to quit top

File Exchange

Open a new local terminal; you can have multiple terminals open at the same time. Create a local file and send it to your remote home directory on the gdcsrv2 server. Then check in your remote directory whether the file arrived.

## Create a text file - on your (local) computer
echo "Let me see the world" > go.txt

## Send the file to the server
scp go.txt student01@gdcsrv2.ethz.ch:/gdc_home/student01
# Do not forget to change <student01> accordingly

Next, we download the uploaded file from the remote server to our local working directory but change the file name. This is important to prevent the original file from being overwritten.

# Get the file back but rename it 
scp student01@gdcsrv2.ethz.ch:/gdc_home/student01/go.txt back.txt

# Let us have a look
cat back.txt
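
After a round trip like this, it can be reassuring to check that the file came back unchanged. The following is a minimal sketch that runs without a server: the cp line is only a local stand-in for the two scp commands, and transfer_status is a variable name introduced for this illustration.

```shell
# Local stand-in for the round trip so the sketch runs without a server
echo "Let me see the world" > go.txt
cp go.txt back.txt        # on a real setup, use the two scp commands instead

# cmp -s prints nothing and only sets the exit status: 0 means identical
if cmp -s go.txt back.txt; then
    transfer_status="identical"
else
    transfer_status="different"
fi
echo "go.txt and back.txt are ${transfer_status}"
```

For larger files, comparing checksums on both sides would serve the same purpose.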

(S)FTP Client

A convenient alternative for exchanging files with a remote server (upload or download) is an (S)FTP client like Cyberduck.

More Terminal

Use curl again to download the multi-nucleotide sequence file RDP_16S_Archaea_Subset.fasta and explore the file.

## Working Directory
mkdir -p ${HOME}/GDA20/ssh
cd ${HOME}/GDA20/ssh

## Download the fasta file:
curl -O http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/RDP_16S_Archaea_Subset.fasta
ls -lh RDP_16S_Archaea_Subset.fasta

## Have a look at the fasta file:
head -n 10 RDP_16S_Archaea_Subset.fasta
tail -n 10 RDP_16S_Archaea_Subset.fasta

## How big is the file
ls -lh RDP_16S_Archaea_Subset.fasta

## Some statistics
wc -lmL RDP_16S_Archaea_Subset.fasta
# Wonder what you see? Try: man wc
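
To see what the three wc options report, a tiny file with known content helps. This sketch assumes GNU wc (the -L option is a GNU extension), and wc_demo.fasta is a hypothetical file name used only for this illustration.

```shell
# Three lines: a 5-character header and two sequence lines
printf '>Seq1\nATCG\nAT\n' > wc_demo.fasta

lines=$(wc -l < wc_demo.fasta)    # -l: number of newline characters
chars=$(wc -m < wc_demo.fasta)    # -m: number of characters, newlines included
longest=$(wc -L < wc_demo.fasta)  # -L: length of the longest line (GNU only)

echo "lines=${lines} chars=${chars} longest=${longest}"
```

Here the counts are 3 lines, 14 characters (11 visible characters plus 3 newlines) and a longest line of 5 characters.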

The sequence file is a text file in fasta format. A header starts with > and ends with the end of the line (\n). There is exactly one header line per sequence, but the number and the length of the sequence lines that follow are not fixed.

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
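
Because a sequence may span several lines, its length is the sum over all lines up to the next header. The sketch below is one possible way to show this with awk on the three records above; toy.fasta and the variable lengths are names made up for this illustration.

```shell
# Toy file with the three example records
cat > toy.fasta << 'EOF'
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
EOF

# For each record, print the header and the summed length of its sequence lines
lengths=$(awk '/^>/ { if (name != "") print name "\t" len; name = $0; len = 0; next }
               { len += length($0) }
               END { if (name != "") print name "\t" len }' toy.fasta)
echo "$lengths"
```

The second record is reported with a length of 28, the sum of its two sequence lines (20 + 8).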

❖ Challenge #1: We can use the grep command to extract all the header lines:

grep ">" RDP_16S_Archaea_Subset.fasta

The command grep will find and print lines matching the specified search pattern. In our case, we grabbed all fasta headers. Now, can you find a way to count the number of sequences in the multi-fasta file using grep?

Suggestion #1

Remember, there is never just one solution to a problem.


    ## 1a - Extract header and count the lines
    grep ">" RDP_16S_Archaea_Subset.fasta | wc -l
    ## 1b - Count with grep
    grep ">" -c RDP_16S_Archaea_Subset.fasta

❖ Challenge #2: Merge all the fasta sequences from RDP_16S_Archaea_Subset.fasta into one fasta sequence.

What we have:

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA

What we need:

>Sequence
ATCGACGTCCCGT
ATCCACGTCTCGTTTTACTG
AACATCAC
ATCCACGTCTCGTNNTA

Suggestion #2


    # We need a header for the new sequence
    echo ">Sequence" > Sequences_Merged.fa
    grep ">" -v RDP_16S_Archaea_Subset.fasta >> Sequences_Merged.fa
    # -v (--invert-match) select non-matching lines
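
A quick sanity check for such a merge is that the result has exactly one header and still contains the same number of bases. The following sketch tries this on a small made-up file; merge_demo.fasta and the variable names are assumptions for this illustration only.

```shell
cat > merge_demo.fasta << 'EOF'
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
EOF

# Same recipe as above: new header first, then all non-header lines
echo ">Sequence" > merge_demo_merged.fa
grep -v "^>" merge_demo.fasta >> merge_demo_merged.fa

# Count headers, and count bases with the newlines stripped
headers=$(grep -c "^>" merge_demo_merged.fa)
bases_before=$(grep -v "^>" merge_demo.fasta | tr -d '\n' | wc -c)
bases_after=$(grep -v "^>" merge_demo_merged.fa | tr -d '\n' | wc -c)
echo "headers=${headers} before=${bases_before} after=${bases_after}"
```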

Fasta Manipulations

The following are some fasta-specific manipulations. They are a bit more advanced but also more applied.

## -----------------------------
## Find Forward Primer Site
## -----------------------------

## Count the number of sequences
grep -c "^>" RDP_16S_Archaea_Subset.fasta
# ⇨ N: 129 | There are 129 sequences in the fasta file

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset.fasta
# ⇨ N: 0 | That is not possible 
# Problem: The search is case-sensitive and the sequences contain lowercase letters.

## Change / Remove with tr
# A few examples:
echo "AUG" | tr U T                         # change Us (mRNA) into Ts (cDNA)
echo "TAGCT ATCTT"    | tr '[:space:]' '\n' # replace whitespace with newlines
echo "ATCGA TAGAA"    | tr '[:space:]' '\t' # replace whitespace with tabs
echo "Tue 16.06.2020" | tr '[:punct:]' '/'  # replace punctuation (here: .) with /
echo "Tue 16/06/2020" | tr -d '[:alpha:]'   # remove letters
# Note: Quote the character classes so the shell cannot expand them as file name patterns.

## All CAPS (actg -> ACTG)
tr a-z A-Z < RDP_16S_Archaea_Subset.fasta > RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta

# Note:
# Do not forget to redirect the file with "<". The tr command translates or
# deletes characters in its input. It works on the content, not on the file itself.
# A valid alternative would be: cat RDP_16S_Archaea_Subset.fasta | tr a-z A-Z

## Let us count again - Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 27 | Better but what about multiple-line sequences?
# Limit: Grep will only find matches per line.
#        It is possible some primer sites are interrupted by a new line. 

## Convert to a single-line fasta 
bash SingleFasta.sh RDP_16S_Archaea_Subset_Caps.fasta > RDP_16S_Archaea_Subset_Caps_Single.fa 
# SingleFasta.sh is a simple script that removes the line breaks within sequences
# in fasta files, so that each sequence sits on a single line.

## Compare
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps_Single.fa

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa
# ⇨ N: 29 sequences carry the primer site.

## What about degenerated primer sites?
grep "GGCGTTAG[TG]GCCCATCTA" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 34 sequences carry primer site with one wobble base

# Note: [] is a bracket expression that matches exactly one character from the
#       listed options; in this example it matches either a T or a G.
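
SingleFasta.sh itself is not shown in these notes. As a rough idea of what such a conversion could look like, here is a minimal awk sketch (an assumption, not the actual script) run on a made-up file in which the motif ACGTAC is interrupted by a line break in the second record.

```shell
# Toy input: in seq2 the motif ACGTAC is split across two lines
cat > wrap_demo.fasta << 'EOF'
>seq1
TTACGTACTT
>seq2
TTACG
TACTT
EOF

# Print headers unchanged and glue all sequence lines of a record together
awk '/^>/ { if (NR > 1) printf "\n"; print; next }
     { printf "%s", $0 }
     END { printf "\n" }' wrap_demo.fasta > wrap_demo_single.fa

hits_wrapped=$(grep -c "ACGTAC" wrap_demo.fasta)
hits_single=$(grep -c "ACGTAC" wrap_demo_single.fa)
echo "wrapped: ${hits_wrapped}, single-line: ${hits_single}"
```

In the wrapped file grep finds the motif in one line only; after joining the lines it is found in both records.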

## -----------------------------
## Find Sequence Motif
## -----------------------------

## Extract a specific sequence
grep "S003477566" RDP_16S_Archaea_Subset_Caps_Single.fa -A 1 > S003477566.fa
# Grab header and one extra line below

## Find stop codon
grep "TAG" --color S003477566.fa

## Find multiple stop codons
grep -e "TAG" -e "TGA" -e "TAA" --color S003477566.fa

# Note: A bracket expression works on single characters but not on strings.
#       For whole strings we have to use multiple search terms.

## Alternative solution for multiple string searches:

## Extended grep (if installed; egrep is the older name for grep -E)
egrep "TAG|TGA|TAA" --color S003477566.fa
# Meaning: "TAG" OR "TGA" OR "TAA"

## Grep but with escape characters
grep "TAG|TGA|TAA" --color S003477566.fa
# Meaning: the literal string "TAG|TGA|TAA"
grep "TAG\|TGA\|TAA" --color S003477566.fa
# Meaning: "TAG" OR "TGA" OR "TAA"
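
The difference between these variants is easiest to see by counting matching lines on a tiny input. The following sketch uses a made-up three-line file (stop_demo.txt) and assumes GNU grep, where \| is accepted as OR in basic regular expressions.

```shell
# Line 1 contains TAG, line 2 contains TGA, line 3 contains no stop codon
printf 'AAATAGCCC\nGGGTGATTT\nCCCGGGCCC\n' > stop_demo.txt

# grep -c exits with status 1 when the count is 0, hence the "|| true"
n_literal=$(grep -c "TAG|TGA|TAA" stop_demo.txt || true)      # basic regex: | is a literal character
n_escaped=$(grep -c "TAG\|TGA\|TAA" stop_demo.txt || true)    # GNU grep: \| means OR
n_extended=$(grep -c -E "TAG|TGA|TAA" stop_demo.txt || true)  # extended regex: | means OR
n_multi=$(grep -c -e "TAG" -e "TGA" -e "TAA" stop_demo.txt || true)

echo "literal=${n_literal} escaped=${n_escaped} extended=${n_extended} multi=${n_multi}"
```

Only the unescaped pattern without -E finds nothing; the other three forms are equivalent and match the first two lines.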

## -----------------------------
## Find PCR Primer 
## -----------------------------

# We use PRIMER3 to find PCR primer sites
# primer3_core -help

# 1. Download PRIMER3 Settings
curl -O http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/PCR.settings.template
cat PCR.settings.template

# 2. Create Query-Sequence Entry
echo -n "SEQUENCE_TEMPLATE=" >  add.tmp # ID tag (">" starts a fresh file on every run)
grep    ">" -v S003477566.fa >> add.tmp # Add only sequence without fasta header
echo    "="                  >> add.tmp # Important - file has to end with the equal sign

# 3. Add Sequence to Setting
cat PCR.settings.template add.tmp > PCR.settings

# 4. Run PRIMER3
primer3_core -default_version=1 -output primers.txt < PCR.settings

# 5. Pick Primers
less primers.txt
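
The input for primer3_core is a list of TAG=VALUE lines, with the whole sequence on a single line and a line containing only = at the end of the record. The sketch below assembles such a record from made-up stand-ins: demo.settings.template and demo.fa are hypothetical files, and the two settings tags are only illustrative, since the content of the real PCR.settings.template is not shown here.

```shell
# Hypothetical stand-ins for PCR.settings.template and S003477566.fa
printf 'PRIMER_TASK=generic\nPRIMER_NUM_RETURN=5\n' > demo.settings.template
printf '>demo\nACGTACGT\nACGTACGT\n' > demo.fa

# Build the sequence entry: tag, sequence without header on ONE line, terminator
{
    printf 'SEQUENCE_TEMPLATE='
    grep -v ">" demo.fa | tr -d '\n'
    printf '\n=\n'
} > demo_add.tmp

# Settings first, then the sequence entry with the closing "="
cat demo.settings.template demo_add.tmp > demo.settings
last_line=$(tail -n 1 demo.settings)
cat demo.settings
```

Stripping the newlines with tr keeps the sequence on one line even if the fasta file is wrapped; writing the entry with a single redirection also means a rerun does not append to an old add file.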

## -----------------------------
## Sequences info
## -----------------------------

## We will use infoseq from the EMBOSS package
which infoseq

## Basic information about infoseq
infoseq --help

## Report Sequence Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | less

# Note: By default infoseq prints a lot of information;
#       with "-only -length" we restrict the output to the sequence length.

## Sort the Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n | less

# for help: infoseq -help
which infoseq # infoseq is part of the EMBOSS package

❖ Challenge #3: Can you determine the %GC content range of the RDP sequences?

Suggestion #3


    ## Again, there is not just one solution but many. Here is one:

    ## Get %GC Content
    infoseq -only -pgc RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp
    # pgc - percent GC content (see infoseq --help)
    # n - we sort the infoseq output numerically

    ## Max/Min %GC
    head -n 2 pGC.tmp; tail -n 1 pGC.tmp
    # Note: Because we sorted the output, the minimum is at the top and the maximum at the bottom.
    #       The header line sorts first, which is why we ask head for two lines.
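
Where infoseq is not installed, the same idea can be tried with plain awk. This is only a sketch on a made-up file (gc_demo.fasta) and not a replacement for infoseq: it counts G and C per record, divides by the record length, and sorts the result numerically.

```shell
cat > gc_demo.fasta << 'EOF'
>s1
GGCC
>s2
ATGC
AT
>s3
AATT
EOF

# Per record: sum the length and the number of G/C characters (gsub returns
# the number of replacements), then print ID and %GC; sort by the %GC column
awk '/^>/ { if (id != "") printf "%s\t%.1f\n", id, (len ? 100 * gc / len : 0)
            id = substr($0, 2); gc = 0; len = 0; next }
     { len += length($0); gc += gsub(/[GCgc]/, "") }
     END { if (id != "") printf "%s\t%.1f\n", id, (len ? 100 * gc / len : 0) }' gc_demo.fasta |
    sort -k2,2n > gc_demo.tsv

head -n 1 gc_demo.tsv   # lowest %GC
tail -n 1 gc_demo.tsv   # highest %GC
```

There is no header line in this output, so head -n 1 is enough to get the minimum. The sketch assumes sequence IDs without spaces.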

Graphics

Admittedly, the terminal is not your best friend when it comes to graphics, but with a bit of help it can work.

## Sequence Length Sorted
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n > L.tmp

## Remove first Line 
grep "Length" -v L.tmp > L_clean.tmp

## Now we can use this file to plot some simple histograms
textHistogram -binSize=50 -maxBinCount=100 L_clean.tmp

# Note: textHistogram is a nice little script installed on our servers.
#       Help: /bin/textHistogram
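
textHistogram may not be available outside our servers. The binning idea behind a text histogram can be sketched with awk; this is an illustration under that assumption and not a description of how textHistogram works internally. The file names and the values are made up.

```shell
# Seven made-up sequence lengths
printf '%s\n' 12 48 51 75 99 101 149 > hist_demo.tmp

# Assign each value to the lower edge of its bin and draw one * per value
awk -v bin=50 '{ b = int($1 / bin) * bin; n[b]++ }
     END { for (b in n) {
               bar = ""
               for (i = 0; i < n[b]; i++) bar = bar "*"
               print b "\t" bar
           } }' hist_demo.tmp | sort -n > hist_demo.txt

cat hist_demo.txt
```

With a bin size of 50, the values fall into the bins starting at 0, 50 and 100 with 2, 3 and 2 entries.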

❖ Challenge #4: Create a text histogram for %GC.

Suggestion #4


    # Remove Header
    grep "%GC" -v pGC.tmp > pGC_clean.tmp
    # Histogram
    textHistogram -binSize=2 -maxBinCount=40 pGC_clean.tmp

Fun Time

Terminal is not all boring!

## ----------------------
## One Liner Tweak
## ----------------------

## Good question
[ where is my brain?

## ----------------------
## Standard Funny
## ----------------------

## Reverse cat
echo -e "1\n2\n3" > 123.tmp
cat 123.tmp
tac 123.tmp

## Reverse string
echo "123456789" | rev
## Reverse complement sequence
echo "ATGCAT" | rev | tr ATGC TACG
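
ATGCAT happens to be its own reverse complement, so the one-liner returns the input unchanged. A sequence that is not palindromic makes the effect visible. The sketch below wraps the same rev and tr steps in a small function (revcomp is a name made up here) and also maps lowercase bases.

```shell
# Reverse the string, then swap A<->T and C<->G (upper and lower case)
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

rc=$(revcomp "AAGCT")
echo "$rc"    # AGCTT
```

Characters outside the two sets, such as N, pass through tr unchanged.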

## Prime factors of a number
factor 10 50 100

## Tick .. tick
while true; do echo "$(date '+%T')"; sleep 5; done
# stop it with [ctrl] + [c]  

## Weather
curl wttr.in/zurich

## ----------------------
## Extra Funny
## ----------------------

## We know now the command ls (list) but what about sl?
sl

## What does the cow say?

# provide text directly
cowsay "Helloooo"

# Text from file
cowsay "$(cat text.txt)"   # assumes a file text.txt exists in the current directory

# Your Tux
ls | cowsay -f tux

# Get a fortune
clear ; fortune | cowsay -f eyes

# Show all cowfiles
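for i in $(cowsay -l | tail -n +2); do cowsay -f "$i" "$i"; done
# tail -n +2 skips the first line of cowsay -l, which is a title and not a cowfile name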

# Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay