Remote Terminal
You will be a guest on the GDC compute server. Please respect a few simple rules!
- Your student account will expire by the end of the month.
- Do not share your login information with anybody.
- Use only the student guest account assigned to you.
- To not use your account for anything else than your course assignments.
- If you have problems or need help, ask!
SSH Info¶
SSH is a protocol through which you can access a remote server and run shell commands. SSH is encrypted with Secure Sockets Layer (SSL), which makes it difficult for these communications to be intercepted and read.
Remote Server GDCSRV2¶
Login to GDC compute server at ETH using ssh:
ssh -X student??@gdcsrv2.ethz.ch # replace ?? with your student number
After you executed the ssh command the server will ask for a password. You will not see what you type for security reasons. You could copy & paste your password to prevent typos.
Once you are in, try a few commands you learned before:
echo $USER echo $PWD mkdir TEST
You will notice that you have other and more applications availble as before.
infoseq -help
Now, download the RDP_16S_Archaea_Subset.fasta file again and run infoseq:
## Download the fasta file: curl -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/RDP_16S_Archaea_Subset.fasta ## Have a look at the fasta file: head RDP_16S_Archaea_Subset.fasta tail RDP_16S_Archaea_Subset.fasta ## How big is the file ls -ahl RDP_16S_Archaea_Subset.fasta ## Get some statistics: wc -lmL RDP_16S_Archaea_Subset.fasta # Wonder what you see? Try: man wc
Exercise #1: RDP_16S_Archaea_Subset.fasta is a multi nucletide sequence file in fasta format. It is fasta format because each sequence starts with a header line follow by the nucleotide sequence.
>Header Sequence
Example for a multi-sequence file in fasta format:
>Sequence #1 ATCGACGTCCCGT >Sequence #2 ATCCACGTCTCGTTTTACTG AACATCAC >Sequence #3 ATCCACGTCTCGTNNTA
A header starts with a > and ends with a end of line (\n). There is one header line per sequences but the number and length of the sequence can vary.
head -n 20 RDP_16S_Archaea_Subset.fasta
How many sequences are in the file?
We can use the command grep to extract all the header lines:
grep ">" RDP_16S_Archaea_Subset.fasta
Now, can you find a way to count the number of sequences in the multi-fasta sequence file?
Solution #1
### There is not one solution to the probem. ## (a) WordCount # get header lines grep ">" RDP_16S_Archaea_Subset.fasta > header_lines.txt # count number of lines wc -l header_lines.txt ## (b) Alternaitve WordCount grep ">" RDP_16S_Archaea_Subset.fasta | wc -l ## (c) Grep Count grep ">" -c RDP_16S_Archaea_Subset.fasta
Bash for fasta files¶
## Count the number of sequences grep -c "^>" RDP_16S_Archaea_Subset.fasta ## Simplify NCBI headers awk '{print $1}' RDP_16S_Archaea_Subset.fasta > RDP_16S_Archaea_Subset.fa ## Add label/ID to header sed 's/>.*/&_NewID/' RDP_16S_Archaea_Subset.fa > RDP_16S_Archaea_Subset_ID.fa ## Extract headers grep -o -E "^>\w+" RDP_16S_Archaea_Subset_ID.fa | tr -d ">" > RDP_16S_Archaea_Subset_ID.list ## Linearize your sequences - first line header, second line sequence, ... sed -e 's/\(^>.*$\)/#\1#/' RDP_16S_Archaea_Subset_ID.fa | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d' > RDP_16S_Archaea_Subset_ID_HS.fa ## All caps tr a-z A-Z < RDP_16S_Archaea_Subset_ID_HS.fa > RDP_16S_Archaea_Subset_ID_HS_caps.fa ## Extract a specific sequence grep "S003477566" RDP_16S_Archaea_Subset_ID_HS_caps.fa -A 1 > S003477566.fa ## Find ATG stop codon cat RDP_16S_Archaea_Subset_ID_HS_caps.fa | sed -e 's/ATG/*/g' ## Find stop codons TAA, TGA, TAG cat RDP_16S_Archaea_Subset_ID_HS_caps.fa | sed -e 's/TAA/*/g' -e 's/TGA/*/g' -e 's/TAG/*/g'
Infoseq - An Emboss Tool¶
Info about the sequences in the fasta file:
infoseq -only -length -pgc RDP_16S_Archaea_Subset.fasta
# for help try: infoseq -help again
Play with the output
infoseq -only -length -pgc RDP_16S_Archaea_Subset.fasta
Exercise 2: Can you determine the length and %GC content range?
Solution #2
## Again, there is not just one solution but many. Here is one: infoseq -only -length RDP_16S_Archaea_Subset.fasta | sort -n > L.tmp infoseq -only -pgc RDP_16S_Archaea_Subset.fasta | sort -n > pGC.tmp # Max/Min Length head -n 2 L.tmp; tail -n 1 L.tmp # Max/Min %GC head -n 2 pGC.tmp; tail -n 1 pGC.tmp
Text Histograms - Graphics with bash¶
## Remove first Line # Both Infoseq output files start with the header line. # We remove this first line tail -n +2 L.tmp > L_no_header.tmp tail -n +2 pGC.tmp > pGC_no_header.tmp ## Now we can use this file to plot some simple histograms textHistogram -binSize=50 -maxBinCount=100 L_no_header.tmp textHistogram -binSize=1 -maxBinCount=80 pGC_no_header.tmp
Fun with shell¶
## Tick .. tick while true; do echo "$(date '+%T')"; sleep 1; done # stop it with [ctrl] + [c] ## Weather curl wttr.in/basel ## We know the command ls (list) but what about sl (steamlock) sl ## reverse every string echo "123456789" | rev ## some mathematics factor 500 ## What does the cow say # works only if installed # provide text directly cowsay "Helloooo" # Text from file cowsay `cat text.txt` # Your Tux ls | cowsay -f tux # Get a fortune fortune | cowsay # show all cowfiles for i in $(cowsay -l); do cowsay -f $i "$i"; done # Source: # https://github.com/tnalpgge/rank-amateur-cowsay # https://en.wikipedia.org/wiki/Cowsay