BLAST
BLAST (Basic Local Alignment Search Tool) command line applications developed at the National Center for Biotechnology Information (NCBI).
For your BLAST searches, you can either use the web BLAST provided by NCBI or nucleotide similarity search at European Bioinformatics Institute (EMBL-EBI). Alternatively, you can instal BLAST+ (a suite of command-line tools to run BLAST provided by NCBI) locally. This allows you to perform BLAST searches on your own server without size, volume, and database restrictions. For the following exercises: we use the standalone BLAST installed on the GDC server.
Get Ready¶
Login to the server and create a working directory.
## Login to gdcsrv2 ssh studentXX@gdcsrv2.ethz.ch ## Prepare Directories mkdir -p ${HOME}/BLAST/db # Folder for sequences and index files wd="${HOME}/BLAST" # My working directory cd ${wd}/db # Start here
BLAST @ gdcsrv2¶
It is always a good idea to get familiar with the application and look at the various options.
## Help - Nucleotide-Nucleotide BLAST blastn -h ## Help - Protein-Protein BLAST blastp -h ## What version are we using? blastn -version ## List active databases blastdbcmd -list $BLASTDB ## Statistical information about a BLAST database blastdbcmd -db refseq_genomic -info
Online Help¶
Run a BLAST¶
We need a query (input) sequence in fasta format. For this reason, we create a random 29nt long nucleotide sequence:
## Create a random fasta file cd ${wd} rm MyQuery.fa touch MyQuery.fa echo ">RandomSequence 29nt" >> MyQuery.fa cat /dev/urandom | tr -dc 'ACTG' | fold -w 29 | head -n 1 >> MyQuery.fa cat MyQuery.fa
Now, we add the following sequence to the file.
## Create a fasta file echo ">MysterySequence 29nt" >> MyQuery.fa echo "AATGATACGGCGACCACCGAGATCTACAC" >> MyQuery.fa cat MyQuery.fa
Question #1: How many different 29nt-long sequences can we generate? In other words, what is the probability that we generate two identical random 29nt-long sequences?
Solution #1
Number of possibilities: 4^29 = 288,230,376,151,711,744 (2.88 * 10^17)
For our first BLAST, we are using the NCBI nt (nucleotide) database, 4 threads, and filter the results using a e-value threshold of 0.0001. The output will be a table. The output format is explained below and more output options a listed further below.
## Run a BLAST blastn -db nt -evalue 0.0001 -query MyQuery.fa -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out MyQuery.blast
This might take a while but should finish under two minutes.
head -n 5 MyQuery.blast
Output table format: format 6
parameter | meaning |
---|---|
qseqid | Query (input) sequence |
sacc | Subject Accession number |
qlen | Sequence length of query |
slen | Sequence length of subject |
pident | %-identity between S and Q |
length | Alignment length |
qstart-qstart | Query alignment start and end |
sstart-send | Subject alignment start and end |
evalue | E-values (the smaller the better) |
bitscoure | Bit Score (the higher the better) |
For more help and options see: Display BLAST search results with custom output format
Question #2: First, check if the random sequence produced any hits.
Solution #2
## We expect to have none or only a few hits: grep "RandomSequence" -c MyQuery.blast # N = 0 in my case
Question #3: What about the Mystery sequences? Are there any good hits?
Solution #3
## If it is a non-random sequence we can expact hits: grep "MysterySequence" -c MyQuery.blast # N = 330 (in my case) ## Hits for Mystery Sequence grep "MysterySequence" MyQuery.blast | less # => %-identity is 100% in all cases. # => The local alignment ranges from nt 1 to 29 in most hits (full length). # => E-values are very low. The lower the e-vaules the better the hit. # => Bit score are around 50 which is not high but considering the query length not bad. # => In summaty, most of the hits are good hits! It is not a random sequence. But what is it?
Question #4: What are these hits?
Solution #4
## We need the NCBI accession numbers cut -f2 -d$'\t' < MyQuery.blast | sort -u > MyQuery_BlastHits_Acc.txt ## Download the file either using FileZilla or command line
## !!! Open a new local terminal and execute the following:
scp -r studentXX@gdcsrv2.ethz.ch:/gdc_home/studentXX/BLAST/MyQuery_BlastHits_Acc.txt.
The table output format might not be to your liking. Here a summary of the possible output formats to play with.
Illumina Adaptor Contaminations¶
We have to add adaptors to the end of the DNA fragments prior to Illumina sequencing. We call this step library preparation. Different library prep kits with different adaptors are available. The adaptors can be part of the sequenced fragment. We have to check and remove or trim sequences with adaptor sequences before we further process the data (e.g., assembly).
## Get the next Mystery sequence curl -O https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/adaptor.fa ## Explore the fasta file grep ">" -c adaptor.fa grep ">" adaptor.fa ## Run a BLAST blastn -db nt -evalue 0.0001 -query adaptor.fa -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out adaptor.blast ## Explore the BLAST results less adaptor.blast # quit with [q] ## Total hits wc -l adaptor.blast ## Unique hists cut -f2 -d$'\t' < adaptor.blast | sort -u | wc -l ## Have a look at the Hits # -> See recipe above for help! # Top Organisms # Cyprinus carpio (271) # Dibothriocephalus latus (65) # Nippostrongylus brasiliensis (13) # Fasciola hepatica (7) # Camelus ferus (2) # Dicrocoelium dendriticum (2) # Trichobilharzia regenti (1) # Angiostrongylus cantonensis (1) # Strongyloides stercoralis (1) # Tityus serrulatus (1) # Mytilus galloprovincialis (1) # Triticum aestivum (1) # uncultured bacterium (1) # Streptococcus suis (1) # Methanobacterium formicicum (1) # Cloning vector pFosill-3 (1) # Cloning vector pFosill-4 (1) # Cloning vector pFosill-1 (1) # Cloning vector pFosill-2 (1)
BLAST output format options¶
# Formatting options # -outfmt <String> # alignment view options: # 0 = pairwise, # 1 = query-anchored showing identities, # 2 = query-anchored no identities, # 3 = flat query-anchored, show identities, # 4 = flat query-anchored, no identities, # 5 = XML Blast output, # 6 = tabular, # 7 = tabular with comment lines, # 8 = Text ASN.1, # 9 = Binary ASN.1, # 10 = Comma-separated values, # 11 = BLAST archive format (ASN.1)
Create your own BLAST database¶
## Download Drosophila melanogaster genome wget http://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/drosoph.nt.gz # wget is a non-interactive network downloader # The file is no longer directly available from the NCBI ftp-site (07.01.2020) # wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz ## Unzip the DB gunzip drosoph.nt.gz ## Have a look at the fasta file ls -lh drosoph.nt head drosoph.nt grep ">" -c drosoph.nt ## Index fasta file #makeblastdb -h makeblastdb -dbtype nucl -in ${wd}/db/drosoph.nt -out ${wd}/db/drosoph.nt -parse_seqids -title "Drosophila melanogaster Genome" -logfile drosoph.log ## DMEL database info cd ${wd} blastdbcmd -db ${wd}/db/drosoph.nt -info ## Get the next Mystery sequence curl -O http://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/Mystery2.fasta # As an alternaitve to wget (see above) we could use `curl` to tranfer files # [For a more detail comparison](https://daniel.haxx.se/docs/curl-vs-wget.html) ## Inspect the file ls -ah Mystery2.fasta grep ">" -c Mystery2.fasta head Mystery2.fasta
Compare BLAST searches¶
We will run two BLAST searches using the same query and the same parameters but two different databases. We place the command time
in front of the blast command to get an idea about the speed of the search.
# (a) NCBI nt database time blastn -db nt -evalue 0.0001 -query Mystery2.fasta -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out Mystery2_NCBInt.blast # (b) The new Drosophila melanogaster genome database time blastn -db ${wd}/db/drosoph.nt -evalue 0.0001 -query Mystery2.fasta -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out Mystery2_myDB.blast
Question #5: Compare the two BLAST results. What are the differences?
Solution #5
## Number of hits wc -l Mystery2_*.blast # => The BLAST search against the Dmel database resulted in less hits. ## Top 5 Hits head -n 5 Mystery2_NCBInt.blast Mystery2_myDB.blast
# NCBInt # AE014297: Drosophila melanogaster chromosome 3R # DMEL # AE003693: New AE014297 - Drosophila melanogaster chromosome 3R # AE003696: New AE014297 - Drosophila melanogaster chromosome 3R # AE003746: New AE014297 - Drosophila melanogaster chromosome 3R # AE003698: New AE014297 - Drosophila melanogaster chromosome 3R
Interpretation¶
Interpreting BLAST output tables is not a trivial task. A clear question can help to optimse the search and help with the interpretation. Lets have a look at the top 20 hits of the sequence against the NCBI nt database.
head -n 20 Mystery2_NCBInt.blast Mystery2_myDB.blast
No | Query | Subject | qlen | slen | pident | len | qs | qe | sstart | send | e-value | bit |
---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | NM_169441.2 | AE014297 | 2382 | 32079331 | 100.000 | 2382 | 1 | 2382 | 11956544 | 11954163 | 0.0 | 4399 |
02 | NM_169441.2 | AE014297 | 2382 | 32079331 | 99.874 | 2384 | 1 | 2382 | 11958534 | 11960917 | 0.0 | 4385 |
03 | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.708 | 2190 | 2 | 2191 | 12505791 | 12507967 | 0.0 | 3511 |
04 | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.614 | 2189 | 2 | 2190 | 12509074 | 12511249 | 0.0 | 3498 |
05 | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.571 | 2190 | 2 | 2191 | 12502508 | 12504684 | 0.0 | 3494 |
06 | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.121 | 2193 | 2 | 2194 | 12467794 | 12465622 | 0.0 | 3439 |
07 | NM_169441.2 | AE014297 | 2382 | 32079331 | 78.755 | 1831 | 256 | 2078 | 24057104 | 24055285 | 0.0 | 1208 |
08 | NM_169441.2 | AE014297 | 2382 | 32079331 | 75.104 | 1438 | 417 | 1838 | 13045581 | 13047002 | 3.33e-180 | 643 |
09 | NM_169441.2 | NM_169441 | 2382 | 2382 | 100.000 | 2382 | 1 | 2382 | 1 | 2382 | 0.0 | 4399 |
10 | NM_169441.2 | AC007889 | 2382 | 182182 | 100.000 | 2382 | 1 | 2382 | 52295 | 49914 | 0.0 | 4399 |
11 | NM_169441.2 | AC007889 | 2382 | 182182 | 99.874 | 2384 | 1 | 2382 | 54285 | 56668 | 0.0 | 4385 |
12 | NM_169441.2 | AC007725 | 2382 | 144152 | 100.000 | 2382 | 1 | 2382 | 133101 | 130720 | 0.0 | 4399 |
13 | NM_169441.2 | AC007725 | 2382 | 144152 | 99.874 | 2384 | 1 | 2382 | 135091 | 137474 | 0.0 | 4385 |
14 | NM_169441.2 | NM_080059 | 2382 | 2375 | 99.958 | 2375 | 2 | 2376 | 1 | 2375 | 0.0 | 4381 |
15 | NM_169441.2 | AF295941 | 2382 | 2561 | 99.790 | 2384 | 1 | 2382 | 129 | 2512 | 0.0 | 4373 |
16 | NM_169441.2 | BT021339 | 2382 | 2381 | 100.000 | 2363 | 20 | 2382 | 1 | 2363 | 0.0 | 4364 |
17 | NM_169441.2 | AF295937 | 2382 | 2606 | 99.664 | 2382 | 1 | 2382 | 189 | 2570 | 0.0 | 4355 |
18 | NM_169441.2 | AF350457 | 2382 | 2389 | 99.915 | 2362 | 21 | 2382 | 1 | 2362 | 0.0 | 4351 |
19 | NM_169441.2 | BT126347 | 2382 | 2371 | 99.872 | 2349 | 36 | 2382 | 1 | 2349 | 0.0 | 4320 |
20 | NM_169441.2 | AF295940 | 2382 | 2549 | 99.371 | 2386 | 1 | 2382 | 129 | 2514 | 0.0 | 4320 |
Question #6: Any idea what the mystery sequence could be?
Solution #6
It needs time and experience to understand it full meaning. In our example, we have a perfect best blast hit with chromosome 3R of the *Drosophila melanogaster* genome. * NM_169441: *Drosophila melanogaster* **Heat-shock-protein-70Aa (Hsp70Aa)** * NM_080059: *Drosophila melanogaster* **Heat-shock-protein-70Ab (Hsp70Ab)**
Question #7: Is there anything else you extract from the BLAST results?
Solution #6
There are multiple copies of the Hsp70 gene on chromosome 3R. Have a look at the UCSC Genome Browser (https://genome.ucsc.edu) or FlyBase (https://flybase.org)
Hsp70Bc at chr3R:12509076-12511461 - (NM_141952) Heat-shock-protein-70Bc Hsp70Bb at chr3R:12505793-12508139 - (NM_080188) Heat-shock-protein-70Bb Hsp70Ba at chr3R:12465453-12467776 - (NM_169469) Heat-shock-protein-70Ba Hsp70Ab at chr3R:11958535-11960909 - (NM_080059) Heat-shock-protein-70Ab Hsp70Ab at chr3R:11954169-11956543 - (NM_080059) Heat-shock-protein-70Ab Hsp70Aa at chr3R:11958534-11960907 - (NM_169441) Heat-shock-protein-70Aa Hsp70Aa at chr3R:11954163-11956544 - (NM_169441) Heat-shock-protein-70Aa chr3R: [<<<Hsp70Aa<<<]..[>>>Hsp70Ab>>>]...[>>>Hsp70Bbb>>>]..[>>>Hsp70Bb>>>]..[>>>Hsp70Bc>>>]
Speed is an important factor, but it is not the only one. Play with other parameters to increase speed and/or accuracy.
- word_size: use 28 and 128 * num_threads: increase from 4 to 8 (please not more)
- evalue: default versus 10
V190917 - Jean-Claude Walser for GDC, ETHZ