Remote Terminal

You will be a guest on the GDC compute server. Please respect a few simple rules!

  • Your student account will expire by the end of the month.
  • Do not share your login information with anybody.
  • Use only the student guest account assigned to you.
  • To not use your account for anything else than your course assignments.
  • If you have problems or need help, ask!

SSH Info

SSH is a protocol through which you can access a remote server and run shell commands. SSH is encrypted with Secure Sockets Layer (SSL), which makes it difficult for these communications to be intercepted and read.

Remote Server GDCSRV2

Login to GDC compute server at ETH using ssh:

ssh -X student??@gdcsrv2.ethz.ch # replace ?? with your student number

After you executed the ssh command the server will ask for a password. You will not see what you type for security reasons. You could copy & paste your password to prevent typos.

Once you are in, try a few commands you learned before:

echo $USER
echo $PWD
mkdir TEST

You will notice that you have other and more applications availble as before.

infoseq -help

Now, download the RDP_16S_Archaea_Subset.fasta file again and run infoseq:

## Download the fasta file:
curl -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/RDP_16S_Archaea_Subset.fasta

## Have a look at the fasta file:
head RDP_16S_Archaea_Subset.fasta
tail RDP_16S_Archaea_Subset.fasta

## How big is the file
ls -ahl RDP_16S_Archaea_Subset.fasta

## Get some statistics:
wc -lmL RDP_16S_Archaea_Subset.fasta
# Wonder what you see? Try: man wc

Exercise #1: RDP_16S_Archaea_Subset.fasta is a multi nucletide sequence file in fasta format. It is fasta format because each sequence starts with a header line follow by the nucleotide sequence.

>Header
Sequence

Example for a multi-sequence file in fasta format:

>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA

A header starts with a > and ends with a end of line (\n). There is one header line per sequences but the number and length of the sequence can vary.

head -n 20 RDP_16S_Archaea_Subset.fasta

How many sequences are in the file?

We can use the command grep to extract all the header lines:

grep ">" RDP_16S_Archaea_Subset.fasta

Now, can you find a way to count the number of sequences in the multi-fasta sequence file?

Solution #1

### There is not one solution to the probem.

## (a) WordCount
# get header lines
grep ">" RDP_16S_Archaea_Subset.fasta > header_lines.txt
# count number of lines
wc -l header_lines.txt

## (b) Alternaitve WordCount
grep ">" RDP_16S_Archaea_Subset.fasta | wc -l

## (c) Grep Count
grep ">" -c RDP_16S_Archaea_Subset.fasta

Bash for fasta files

## Count the number of sequences
grep -c "^>" RDP_16S_Archaea_Subset.fasta

## Simplify NCBI headers
awk '{print $1}' RDP_16S_Archaea_Subset.fasta > RDP_16S_Archaea_Subset.fa

## Add label/ID to header
sed 's/>.*/&_NewID/' RDP_16S_Archaea_Subset.fa > RDP_16S_Archaea_Subset_ID.fa

## Extract headers
grep -o -E "^>\w+" RDP_16S_Archaea_Subset_ID.fa | tr -d ">" > RDP_16S_Archaea_Subset_ID.list

## Linearize your sequences - first line header, second line sequence, ...
sed -e 's/\(^>.*$\)/#\1#/' RDP_16S_Archaea_Subset_ID.fa | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d' > RDP_16S_Archaea_Subset_ID_HS.fa

## All caps
tr a-z A-Z < RDP_16S_Archaea_Subset_ID_HS.fa > RDP_16S_Archaea_Subset_ID_HS_caps.fa

## Extract a specific sequence
grep "S003477566" RDP_16S_Archaea_Subset_ID_HS_caps.fa -A 1 > S003477566.fa

## Find ATG stop codon
cat RDP_16S_Archaea_Subset_ID_HS_caps.fa | sed -e 's/ATG/*/g'

## Find stop codons TAA, TGA, TAG
cat RDP_16S_Archaea_Subset_ID_HS_caps.fa | sed -e 's/TAA/*/g' -e 's/TGA/*/g' -e 's/TAG/*/g'

Infoseq - An Emboss Tool

Info about the sequences in the fasta file:

infoseq -only -length -pgc RDP_16S_Archaea_Subset.fasta
# for help try: infoseq -help again

Play with the output

infoseq -only -length -pgc RDP_16S_Archaea_Subset.fasta

Exercise 2: Can you determine the length and %GC content range?

Solution #2

## Again, there is not just one solution but many. Here is one:

infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n > L.tmp
infoseq -only -pgc RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp

# Max/Min Length
head -n 2 L.tmp; tail -n 1 L.tmp

# Max/Min %GC
head -n 2 pGC.tmp; tail -n 1 pGC.tmp

Text Histograms - Graphics with bash

## Remove first Line 

# Both Infoseq output files start with the header line.
# We remove this first line
tail -n +2 L.tmp > L_no_header.tmp
tail -n +2 pGC.tmp > pGC_no_header.tmp

## Now we can use this file to plot some simple histograms
textHistogram -binSize=50 -maxBinCount=100 L_no_header.tmp
textHistogram -binSize=1 -maxBinCount=80 pGC_no_header.tmp

Fun with shell

## Tick .. tick
while true; do echo "$(date '+%T')"; sleep 1; done
# stop it with [ctrl] + [c]  

## Weather
curl wttr.in/basel

## We know the command ls (list) but what about sl (steamlock) 
sl

## reverse every string
echo "123456789" | rev

## some mathematics
factor 500

## What does the cow say # works only if installed

# provide text directly
cowsay "Helloooo"

# Text from file
cowsay `cat text.txt`

# Your Tux
ls | cowsay -f tux

# Get a fortune
fortune | cowsay

# show all cowfiles
for i in $(cowsay -l); do cowsay -f $i "$i"; done

# Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay

File Sharing

Example Setting (Cyberduck)