Remote Terminal
Lecture Notes¶
⬇︎ SSH
SSH (Secure Shell) is a protocol through which you can access a remote server and run shell commands. The connection is encrypted by SSH's own transport layer (not by SSL/TLS), which makes it difficult for these communications to be intercepted and read.
For the SSH tutorial you will be a guest on the GDC compute server. Please respect a few simple rules!
GDC SERVER RULES
➽ Do not share your login information with anybody.
➽ Use only the student guest account assigned to you.
➽ Do not use your account for anything other than your course assignments.
➽ Do not save any important or confidential data on the server.
➽ If you have problems or need help, ask!
Note: Your student account will expire by the end of the course.
Access Remote Server GDCSRV2¶
Log in to the GDC compute server at ETH using ssh:
ssh -X student??@gdcsrv2.ethz.ch # replace ?? with your student number
After you execute the ssh command, the server will ask for a password. For security reasons, you will not see what you type. You can copy and paste your password to prevent typos.
Once you are connected, try a few terminal commands you learned before:
pwd # You should be in your home
echo ${HOME} # This is the path to your home
echo ${USER} # Username of guest account
Have A Look Around¶
## New Resources
df -h # Report disk space usage
lscpu; free -m # Information about the CPU architecture and the memory usage
ll /usr/local/bin # Installed applications

## New Commands
mkdir -p TEST1/TEST2/TEST3/TEST4
tree

## More Traffic
top # press q to quit top
File Exchange¶
Open a new local terminal; you can have multiple terminals open at the same time. Create a local file and send it to your remote home directory on the gdcsrv2 server. Then check your remote directory to see whether the file arrived.
## Create a text file - on your (local) computer
echo "Let me see the world" > go.txt

## Send the file to the server
scp go.txt student01@gdcsrv2.ethz.ch:/gdc_home/student01
# Do not forget to change student01 accordingly
Next, we download the uploaded file from the remote server to our local working directory but change the file name. This is important to prevent the original file from being overwritten.
# Get the file back but rename it
scp student01@gdcsrv2.ethz.ch:/gdc_home/student01/go.txt back.txt

# Let us have a look
cat back.txt
(S)FTP Client
A convenient alternative for exchanging files with a remote server (upload or download) is an (S)FTP client like Cyberduck.
More Terminal¶
Use curl again to download the multi-nucleotide sequence file RDP_16S_Archaea_Subset.fasta and explore the file.
## Working Directory
mkdir -p ${HOME}/GDA20/ssh
cd ${HOME}/GDA20/ssh

## Download the fasta file:
curl -O http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/RDP_16S_Archaea_Subset.fasta
ls -lh RDP_16S_Archaea_Subset.fasta

## Have a look at the fasta file:
head -n 10 RDP_16S_Archaea_Subset.fasta
tail -n 10 RDP_16S_Archaea_Subset.fasta

## How big is the file
ls -lh RDP_16S_Archaea_Subset.fasta

## Some statistics
wc -lmL RDP_16S_Archaea_Subset.fasta
# Wonder what you see? Try: man wc
The sequence file is a text file in fasta format. A header starts with > and ends with the end of the line (\n). There is one header line per sequence, but the number and length of the sequence lines that follow are not fixed.
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
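To see the record structure in practice, here is a minimal sketch that writes the three example records to a hypothetical file called toy.fasta and reports, for each header, how many sequence lines follow and how long the sequence is in total. The file names and the awk one-liner are illustrative assumptions, not part of the course material.

```shell
# Create a small example file (hypothetical name: toy.fasta)
cat > toy.fasta << 'EOF'
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
EOF

# Per record: header, number of sequence lines, total sequence length (tab separated)
awk '
  /^>/ { if (name != "") print name "\t" n "\t" len; name = $0; n = 0; len = 0; next }
       { n++; len += length($0) }
  END  { if (name != "") print name "\t" n "\t" len }
' toy.fasta > toy_summary.tsv

cat toy_summary.tsv
```

The second record is the interesting one: its sequence is wrapped over two lines, so line counts and sequence counts are not the same thing.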
❖ Challenge #1: We can use the grep command to extract all the header lines:
grep ">" RDP_16S_Archaea_Subset.fasta
The command grep will find and print lines matching the specified search pattern. In our case, we grabbed all fasta headers. Now, can you find a way to count the number of sequences in the multi-fasta file using grep?
Suggestion #1
Remember, there is never just one solution to a problem.
## 1a - Extract header and count the lines
grep ">" RDP_16S_Archaea_Subset.fasta | wc -l
## 1b - Count with grep
grep ">" -c RDP_16S_Archaea_Subset.fasta
❖ Challenge #2: Merge all the fasta sequences from RDP_16S_Archaea_Subset.fasta into one fasta sequence.
What we have:
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
What we need:
>Sequence
ATCGACGTCCCGT
ATCCACGTCTCGTTTTACTG
AACATCAC
ATCCACGTCTCGTNNTA
Suggestion #2
# We need a header for the new sequence
echo ">Sequence" > Sequences_Merged.fa
grep ">" -v RDP_16S_Archaea_Subset.fasta >> Sequences_Merged.fa
# -v (--invert-match) select non-matching lines
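A quick way to convince yourself that the merge lost nothing is to try it on a small file first. The sketch below repeats the two commands on a hypothetical toy.fasta and then checks that there is exactly one header left and that the number of residues is unchanged. File names are made up for illustration.

```shell
# Toy input with three records (hypothetical file name)
cat > toy.fasta << 'EOF'
>Sequence #1
ATCGACGTCCCGT
>Sequence #2
ATCCACGTCTCGTTTTACTG
AACATCAC
>Sequence #3
ATCCACGTCTCGTNNTA
EOF

# Same idea as above: new header, then all non-header lines
echo ">Sequence" > toy_merged.fa
grep -v ">" toy.fasta >> toy_merged.fa

# Check 1: number of headers in the merged file
grep -c ">" toy_merged.fa

# Check 2: residues before and after (newlines removed before counting)
before=$(grep -v ">" toy.fasta     | tr -d '\n' | wc -c | tr -d ' ')
after=$(grep -v ">" toy_merged.fa | tr -d '\n' | wc -c | tr -d ' ')
echo "residues before: ${before}, after: ${after}"
```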
Fasta Manipulations¶
The following fasta-specific manipulations are a bit more advanced but also more applied.
## -----------------------------
## Find Forward Primer Site
## -----------------------------

## Count the number of sequences
grep -c "^>" RDP_16S_Archaea_Subset.fasta
# ⇨ N: 129 | There are 129 sequences in the fasta file

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset.fasta
# ⇨ N: 0 | That is not possible
# Problem: The search is case sensitive and we have lowercase letters in the sequence.

## Change / Remove with tr
# A few examples:
echo "AUG" | tr U T # change Us (mRNA) into Ts (cDNA)
echo "TAGCT ATCTT" | tr '[:space:]' '\n' # replace whitespace with newlines
echo "ATCGA TAGAA" | tr '[:space:]' '\t' # replace whitespace with tabs
echo "Tue 16.06.2020" | tr '[:punct:]' '/' # replace . with /
echo "Tue 16/06/2020" | tr -d '[:alpha:]' # remove letters

## All CAPS (actg -> ACTG)
tr a-z A-Z < RDP_16S_Archaea_Subset.fasta > RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
# Note:
# Do not forget to redirect "<" the file. The tr command transforms a string or
# deletes characters from a string. It works on the content but not the file itself.
# A valid alternative would be: cat RDP_16S_Archaea_Subset.fasta | tr a-z A-Z

## Let us count again - Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 27 | Better but what about multiple-line sequences?
# Limit: grep only finds matches within a single line.
# It is possible that some primer sites are interrupted by a newline.

## Convert to a single-line fasta
bash SingleFasta.sh RDP_16S_Archaea_Subset_Caps.fasta > RDP_16S_Archaea_Subset_Caps_Single.fa
# SingleFasta.sh is a simple script that removes the newlines at the end of
# sequence lines in fasta files.

## Compare
head -n 2 RDP_16S_Archaea_Subset_Caps.fasta
head -n 2 RDP_16S_Archaea_Subset_Caps_Single.fa

## Number of sequences with primer sites
grep "GGCGTTAGTGCCCATCTAGT" -c RDP_16S_Archaea_Subset_Caps_Single.fa
# ⇨ N: 29 sequences carry the primer site.

## What about degenerate primer sites?
grep "GGCGTTAG[TG]GCCCATCTA" -c RDP_16S_Archaea_Subset_Caps.fasta
# ⇨ N: 34 sequences carry primer site with one wobble base
# Note: [] is a container with options; in this example it means
# it is either a T or a G

## -----------------------------
## Find Sequence Motif
## -----------------------------

## Extract a specific sequence
grep "S003477566" RDP_16S_Archaea_Subset_Caps_Single.fa -A 1 > S003477566.fa
# Grab header and one extra line below

## Find stop codon
grep "TAG" --color S003477566.fa

## Find multiple stop codons
grep -e "TAG" -e "TGA" -e "TAA" --color S003477566.fa
# Note: The container works on single characters but not on strings.
# We have to use multiple search terms.

## Alternative solution for multiple string searches:
## Extended grep (if installed)
egrep "TAG|TGA|TAA" --color S003477566.fa # Meaning: "TAG" OR "TGA" OR "TAA"

## Grep but with escape characters
grep "TAG|TGA|TAA" --color S003477566.fa # Meaning: the literal string "TAG|TGA|TAA"
grep "TAG\|TGA\|TAA" --color S003477566.fa # Meaning: "TAG" OR "TGA" OR "TAA"

## -----------------------------
## Find PCR Primer
## -----------------------------

# We use PRIMER3 to find PCR primer sites
# primer3_core -help

# 1. Download PRIMER3 Settings
curl -O http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/PCR.settings.template
cat PCR.settings.template

# 2. Create Query-Sequence Entry
echo -n "SEQUENCE_TEMPLATE=" > add.tmp # ID tag (">" starts a fresh file)
grep ">" -v S003477566.fa >> add.tmp # Add only sequence without fasta header
echo "=" >> add.tmp # Important - file has to end with the equal sign

# 3. Add Sequence to Setting
cat PCR.settings.template add.tmp > PCR.settings

# 4. Run PRIMER3
primer3_core -default_version=1 -output primers.txt < PCR.settings

# 5. Pick Primers
less primers.txt

## -----------------------------
## Sequences info
## -----------------------------

## We will use infoseq from the EMBOSS package
which infoseq

## Basic information about infoseq
infoseq --help

## Report Sequence Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | less
# Note: By default infoseq would print a lot of information;
# with "-only -length" we restrict it to sequence length only.

## Sort the Length
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n | less

# for help: infoseq -help
which infoseq # infoseq is part of the EMBOSS package
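The script SingleFasta.sh is not listed in these notes. As a rough idea of what such a conversion involves, here is a minimal awk sketch that joins the wrapped lines of each record. It is an assumption about the approach, not the course script, and the toy file and motif are made up so that the motif straddles a line break.

```shell
# Toy input: the motif GGCGTTAG is split across two lines in Seq1
cat > toy_wrapped.fasta << 'EOF'
>Seq1
ACGTGGCG
TTAGTGCC
>Seq2
ACGTACGT
EOF

# Headers stay on their own line; sequence lines are printed without a newline
awk '
  /^>/ { if (NR > 1) printf "\n"; print; next }
       { printf "%s", $0 }
  END  { printf "\n" }
' toy_wrapped.fasta > toy_single.fa

cat toy_single.fa

# grep works line by line, so the wrapped file hides the motif
grep -c "GGCGTTAG" toy_wrapped.fasta || true   # grep exits non-zero when nothing matches
grep -c "GGCGTTAG" toy_single.fa
```

This is the same effect seen above, where the count of primer sites went up after converting to a single-line fasta.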
❖ Challenge #3: Can you determine the %GC content range of the RDP sequences?
Suggestion #3
## Again, there is not just one solution but many. Here is one:
## Get %GC Content
infoseq -only -pgc RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp
# pgc - percent GC content (see infoseq --help)
# n - we sort the infoseq output numerically
## Max/Min %GC
head -n 2 pGC.tmp; tail -n 1 pGC.tmp
# Note: Because we sorted the output, the minimum is at the top (right below the header line) and the maximum is at the bottom
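The head/tail approach relies on where the header line ends up after sorting. As an alternative, here is a minimal sketch that skips the header and tracks the smallest and largest value in a single pass with awk. The file toy_pGC.tmp and its values are made up for illustration and only mimic a header line followed by one number per line.

```shell
# Toy stand-in for the %GC table (values are invented)
printf '%s\n' "%GC" "56.60" "48.25" "61.02" "52.10" > toy_pGC.tmp

# Skip line 1 (header), then keep the running minimum and maximum
range=$(awk '
  NR > 1 {
    v = $1 + 0                       # force a numeric value
    if (NR == 2 || v < min) min = v
    if (NR == 2 || v > max) max = v
  }
  END { print "min=" min, "max=" max }
' toy_pGC.tmp)

echo "$range"
```

No sorting is needed here, so the order of the input lines does not matter.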
Graphics¶
Admittedly, the terminal is not your best friend when it comes to graphics, but with a bit of help it can work.
## Sequence Length Sorted
infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n > L.tmp

## Remove first Line
grep "Length" -v L.tmp > L_clean.tmp

## Now we can use this file to plot some simple histograms
textHistogram -binSize=50 -maxBinCount=100 L_clean.tmp
# Note: textHistogram is a nice little script installed on our servers.
# Help: /bin/textHistogram
❖ Challenge #4: Can you create a text histogram for %GC?
Suggestion #4
# Remove Header
grep "%GC" -v pGC.tmp > pGC_clean.tmp
# Histogram
textHistogram -binSize=2 -maxBinCount=40 pGC_clean.tmp
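On a machine without textHistogram, the binning idea can be sketched with awk and sort. This is only a rough stand-in, not a reimplementation of textHistogram: the input file toy_values.tmp and its numbers are invented, and the bin width w=2 simply mirrors the -binSize=2 used above.

```shell
# Toy input: one (invented) value per line
printf '%s\n' 48.2 49.9 50.1 51.7 51.9 52.0 55.3 > toy_values.tmp

# Each value falls into the bin that starts at int(value / w) * w
awk -v w=2 '
  { bin = int($1 / w) * w; count[bin]++ }
  END {
    for (b in count) {
      bar = ""
      for (i = 0; i < count[b]; i++) bar = bar "*"
      print b "\t" bar
    }
  }
' toy_values.tmp | sort -n > toy_hist.txt

cat toy_hist.txt
```

The final sort -n is needed because awk does not guarantee any order when looping over an array.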
Fun Time¶
The terminal is not all boring!
## ----------------------
## One Liner Tweak
## ----------------------

## Good question
[ where is my brain?

## ----------------------
## Standard Funny
## ----------------------

## Reverse cat
echo -e "1\n2\n3" > 123.tmp
cat 123.tmp
tac 123.tmp

## Reverse string
echo "123456789" | rev

## Reverse complement sequence
echo "ATGCAT" | rev | tr ATGC TACG

## Prime factors of a number
factor 10 50 100

## Tick .. tick
while true; do echo "$(date '+%T')"; sleep 5; done
# stop it with [ctrl] + [c]

## Weather
curl wttr.in/zurich

## ----------------------
## Extra Funny
## ----------------------

## We know now the command ls (list) but what about sl?
sl

## What does the cow say?
# provide text directly
cowsay "Helloooo"

# Text from file
cowsay "$(cat text.txt)"

# Your Tux
ls | cowsay -f tux

# Get a fortune
clear ; fortune | cowsay -f eyes

# Show all cowfiles
for i in $(cowsay -l); do cowsay -f $i "$i"; done

# Source:
# https://github.com/tnalpgge/rank-amateur-cowsay
# https://en.wikipedia.org/wiki/Cowsay
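One detail about the reverse-complement one-liner: ATGCAT happens to be its own reverse complement, so the output looks identical to the input. A minimal sketch with a non-palindromic sequence (chosen here only for illustration) makes the effect visible:

```shell
# Reverse the string, then swap each base for its complement
seq="ATGCCG"
revcomp=$(echo "$seq" | rev | tr ATGC TACG)
echo "$revcomp"   # CGGCAT
```

The two sets given to tr are matched by position: A becomes T, T becomes A, G becomes C and C becomes G.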