BLAST

BLAST (Basic Local Alignment Search Tool) command line applications developed at the National Center for Biotechnology Information (NCBI).

For your BLAST searches, you can either use the web BLAST provided by NCBI or nucleotide similarity search at European Bioinformatics Institute (EMBL-EBI). Alternatively, you can instal BLAST+ (a suite of command-line tools to run BLAST provided by NCBI) locally. This allows you to perform BLAST searches on your own server without size, volume, and database restrictions. For the following exercises: we use the standalone BLAST installed on the GDC server.

Get Ready

Login to the server and create a working directory.

## Login to gdcsrv2
ssh studentXX@gdcsrv2.ethz.ch

## Prepare Directories
mkdir -p ${HOME}/BLAST/db   # Folder for sequences and index files
wd="${HOME}/BLAST"          # My working directory
cd ${wd}/db                 # Start here

BLAST @ gdcsrv2

It is always a good idea to get familiar with the application and look at the various options.

## Help - Nucleotide-Nucleotide BLAST
blastn -h

## Help - Protein-Protein BLAST
blastp -h

## What version are we using?
blastn -version

## List active databases
blastdbcmd -list $BLASTDB

## Statistical information about a BLAST database
blastdbcmd -db refseq_genomic -info

Online Help

Run a BLAST

We need a query (input) sequence in fasta format. For this reason, we create a random 29nt long nucleotide sequence:

## Create a random fasta file
cd ${wd}
rm MyQuery.fa
touch MyQuery.fa
echo ">RandomSequence 29nt" >> MyQuery.fa
cat /dev/urandom | tr -dc 'ACTG' | fold -w 29 | head -n 1 >> MyQuery.fa
cat MyQuery.fa

Now, we add the following sequence to the file.

## Create a fasta file
echo ">MysterySequence 29nt" >> MyQuery.fa
echo "AATGATACGGCGACCACCGAGATCTACAC" >> MyQuery.fa
cat MyQuery.fa

Question #1: How many different 29nt-long sequences can we generate? In other words, what is the probability that we generate two identical random 29nt-long sequences?

Solution #1

Number of possibilities: 4^29 = 288,230,376,151,711,744 (2.88 * 10^17)

For our first BLAST, we are using the NCBI nt (nucleotide) database, 4 threads, and filter the results using a e-value threshold of 0.0001. The output will be a table. The output format is explained below and more output options a listed further below.

## Run a BLAST
blastn -db nt -evalue 0.0001 -query MyQuery.fa -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out MyQuery.blast

This might take a while but should finish under two minutes.

head -n 5 MyQuery.blast

Output table format: format 6

parameter meaning
qseqid Query (input) sequence
sacc Subject Accession number
qlen Sequence length of query
slen Sequence length of subject
pident %-identity between S and Q
length Alignment length
qstart-qstart Query alignment start and end
sstart-send Subject alignment start and end
evalue E-values (the smaller the better)
bitscoure Bit Score (the higher the better)

For more help and options see: Display BLAST search results with custom output format

Question #2: First, check if the random sequence produced any hits.

Solution #2

## We expect to have none or only a few hits:
grep "RandomSequence" -c MyQuery.blast
# N = 0 in my case

Question #3: What about the Mystery sequences? Are there any good hits?

Solution #3

## If it is a non-random sequence we can expact hits:
grep "MysterySequence" -c MyQuery.blast
# N = 330 (in my case)

## Hits for Mystery Sequence
grep "MysterySequence" MyQuery.blast | less

# => %-identity is 100% in all cases.
# => The local alignment ranges from nt 1 to 29 in most hits (full length).
# => E-values are very low. The lower the e-vaules the better the hit.
# => Bit score are around 50 which is not high but considering the query length not bad.
# => In summaty, most of the hits are good hits! It is not a random sequence. But what is it?

Question #4: What are these hits?

Solution #4

## We need the NCBI accession numbers
cut -f2 -d$'\t' < MyQuery.blast | sort -u > MyQuery_BlastHits_Acc.txt
## Download the file either using FileZilla or command line
Command line solution:
## !!! Open a new local terminal and execute the following:   
scp -r studentXX@gdcsrv2.ethz.ch:/gdc_home/studentXX/BLAST/MyQuery_BlastHits_Acc.txt.
Now open your Browser and submit the list to (Batch Entrez)[https://www.ncbi.nlm.nih.gov/sites/batchentrez]. What to we see: => Hits with many *removed records* => Top organism: *Cyprinus carpio*

The table output format might not be to your liking. Here a summary of the possible output formats to play with.

Illumina Adaptor Contaminations

We have to add adaptors to the end of the DNA fragments prior to Illumina sequencing. We call this step library preparation. Different library prep kits with different adaptors are available. The adaptors can be part of the sequenced fragment. We have to check and remove or trim sequences with adaptor sequences before we further process the data (e.g., assembly).

## Get the next Mystery sequence
curl -O https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/adaptor.fa

## Explore the fasta file
grep ">" -c adaptor.fa
grep ">" adaptor.fa

## Run a BLAST
blastn -db nt -evalue 0.0001 -query adaptor.fa -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out adaptor.blast

## Explore the BLAST results
less adaptor.blast # quit with [q]
## Total hits
wc -l adaptor.blast
## Unique hists
cut -f2 -d$'\t' < adaptor.blast | sort -u | wc -l
## Have a look at the Hits
# -> See recipe above for help!
# Top Organisms
# Cyprinus carpio (271)
# Dibothriocephalus latus (65)
# Nippostrongylus brasiliensis (13)
# Fasciola hepatica (7)
# Camelus ferus (2)
# Dicrocoelium dendriticum (2)
# Trichobilharzia regenti (1)
# Angiostrongylus cantonensis (1)
# Strongyloides stercoralis (1)
# Tityus serrulatus (1)
# Mytilus galloprovincialis (1)
# Triticum aestivum (1)
# uncultured bacterium (1)
# Streptococcus suis (1)
# Methanobacterium formicicum (1)
# Cloning vector pFosill-3 (1)
# Cloning vector pFosill-4 (1)
# Cloning vector pFosill-1 (1)
# Cloning vector pFosill-2 (1)

BLAST output format options

# Formatting options
# -outfmt <String>
#   alignment view options:
#     0 = pairwise,
#     1 = query-anchored showing identities,
#     2 = query-anchored no identities,
#     3 = flat query-anchored, show identities,
#     4 = flat query-anchored, no identities,
#     5 = XML Blast output,
#     6 = tabular,
#     7 = tabular with comment lines,
#     8 = Text ASN.1,
#     9 = Binary ASN.1,
#    10 = Comma-separated values,
#    11 = BLAST archive format (ASN.1) 

Create your own BLAST database

## Download Drosophila melanogaster genome
wget http://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/drosoph.nt.gz
# wget is a non-interactive network downloader

# The file is no longer directly available from the NCBI ftp-site (07.01.2020) 
# wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz

## Unzip the DB
gunzip drosoph.nt.gz

## Have a look at the fasta file
ls -lh  drosoph.nt
head drosoph.nt
grep ">" -c drosoph.nt

## Index fasta file
#makeblastdb -h
makeblastdb -dbtype nucl -in ${wd}/db/drosoph.nt -out ${wd}/db/drosoph.nt -parse_seqids -title "Drosophila melanogaster Genome" -logfile drosoph.log

## DMEL database info
cd ${wd}
blastdbcmd -db ${wd}/db/drosoph.nt -info

## Get the next Mystery sequence
curl -O http://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/Mystery2.fasta
# As an alternaitve to wget (see above) we could use `curl` to tranfer files
# [For a more detail comparison](https://daniel.haxx.se/docs/curl-vs-wget.html) 

## Inspect the file 
ls -ah Mystery2.fasta
grep ">" -c Mystery2.fasta
head Mystery2.fasta

Compare BLAST searches

We will run two BLAST searches using the same query and the same parameters but two different databases. We place the command time in front of the blast command to get an idea about the speed of the search.

# (a) NCBI nt database
time blastn -db nt -evalue 0.0001 -query Mystery2.fasta -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out Mystery2_NCBInt.blast

# (b) The new Drosophila melanogaster genome database
time blastn -db ${wd}/db/drosoph.nt -evalue 0.0001 -query Mystery2.fasta -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out Mystery2_myDB.blast

Question #5: Compare the two BLAST results. What are the differences?

Solution #5

## Number of hits
wc -l Mystery2_*.blast 
# => The BLAST search against the Dmel database resulted in less hits.

## Top 5 Hits
head -n 5 Mystery2_NCBInt.blast Mystery2_myDB.blast
Mystery2_**NCBInt**.blast | Query | Subject | qlen | slen | pident | len | qs | qe | sstart | send | e-value | bit | |-------------------------------|----------|------|----------|---------|------|------|------|----------|----------|-----------|------| | NM_169441.2 | AE014297 | 2382 | 32079331 | 100.000 | 2382 | 1 | 2382 | 11956544 | 11954163 | 0.0 | 4399 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 99.874 | 2384 | 1 | 2382 | 11958534 | 11960917 | 0.0 | 4385 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.708 | 2190 | 2 | 2191 | 12505791 | 12507967 | 0.0 | 3511 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.614 | 2189 | 2 | 2190 | 12509074 | 12511249 | 0.0 | 3498 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.571 | 2190 | 2 | 2191 | 12502508 | 12504684 | 0.0 | 3494 | Mystery2_**myDB**.blast | Query | Subject | qlen | slen | pident | len | qs | qe | sstart | send | e-value | bit | |-------------------------------|----------|------|----------|---------|------|------|------|----------|----------|-----------|------| | NM_169441.2 | AE003693 | 2382 | 226483 | 97.649 | 2382 | 2 | 2382 | 205195 | 202824 | 0.0 | 4078 | | NM_169441.2 | AE003693 | 2382 | 226483 | 99.420 | 345 | 2040 | 2382 | 209191 | 209535 | 1.70e-177 | 625 | | NM_169441.2 | AE003696 | 2382 | 228143 | 95.535 | 2195 | 2 | 2194 | 38368 | 36187 | 0.0 | 3496 | | NM_169441.2 | AE003746 | 2382 | 218097 | 78.755 | 1831 | 256 | 2078 | 25899 | 24080 | 0.0 | 1208 | | NM_169441.2 | AE003698 | 2382 | 225827 | 75.104 | 1438 | 417 | 1838 | 124811 | 126232 | 0.0 | 643 | => The top 5 hits for both searches are good. Meaning high (>300) bit score values. => The top 5 hits are all for accession AE014297 with the nt search. => The top 5 hits show 4 different accession (AE003693, AE003696, AE00374, and AE00369). => Careful: The alignment length for myDB hit #2 is short!
# NCBInt
# AE014297: Drosophila melanogaster chromosome 3R
# DMEL
# AE003693: New AE014297 - Drosophila melanogaster chromosome 3R
# AE003696: New AE014297 - Drosophila melanogaster chromosome 3R
# AE003746: New AE014297 - Drosophila melanogaster chromosome 3R
# AE003698: New AE014297 - Drosophila melanogaster chromosome 3R

Interpretation

Interpreting BLAST output tables is not a trivial task. A clear question can help to optimse the search and help with the interpretation. Lets have a look at the top 20 hits of the sequence against the NCBI nt database.

head -n 20 Mystery2_NCBInt.blast Mystery2_myDB.blast
No Query Subject qlen slen pident len qs qe sstart send e-value bit
01 NM_169441.2 AE014297 2382 32079331 100.000 2382 1 2382 11956544 11954163 0.0 4399
02 NM_169441.2 AE014297 2382 32079331 99.874 2384 1 2382 11958534 11960917 0.0 4385
03 NM_169441.2 AE014297 2382 32079331 95.708 2190 2 2191 12505791 12507967 0.0 3511
04 NM_169441.2 AE014297 2382 32079331 95.614 2189 2 2190 12509074 12511249 0.0 3498
05 NM_169441.2 AE014297 2382 32079331 95.571 2190 2 2191 12502508 12504684 0.0 3494
06 NM_169441.2 AE014297 2382 32079331 95.121 2193 2 2194 12467794 12465622 0.0 3439
07 NM_169441.2 AE014297 2382 32079331 78.755 1831 256 2078 24057104 24055285 0.0 1208
08 NM_169441.2 AE014297 2382 32079331 75.104 1438 417 1838 13045581 13047002 3.33e-180 643
09 NM_169441.2 NM_169441 2382 2382 100.000 2382 1 2382 1 2382 0.0 4399
10 NM_169441.2 AC007889 2382 182182 100.000 2382 1 2382 52295 49914 0.0 4399
11 NM_169441.2 AC007889 2382 182182 99.874 2384 1 2382 54285 56668 0.0 4385
12 NM_169441.2 AC007725 2382 144152 100.000 2382 1 2382 133101 130720 0.0 4399
13 NM_169441.2 AC007725 2382 144152 99.874 2384 1 2382 135091 137474 0.0 4385
14 NM_169441.2 NM_080059 2382 2375 99.958 2375 2 2376 1 2375 0.0 4381
15 NM_169441.2 AF295941 2382 2561 99.790 2384 1 2382 129 2512 0.0 4373
16 NM_169441.2 BT021339 2382 2381 100.000 2363 20 2382 1 2363 0.0 4364
17 NM_169441.2 AF295937 2382 2606 99.664 2382 1 2382 189 2570 0.0 4355
18 NM_169441.2 AF350457 2382 2389 99.915 2362 21 2382 1 2362 0.0 4351
19 NM_169441.2 BT126347 2382 2371 99.872 2349 36 2382 1 2349 0.0 4320
20 NM_169441.2 AF295940 2382 2549 99.371 2386 1 2382 129 2514 0.0 4320

Question #6: Any idea what the mystery sequence could be?

Solution #6

It needs time and experience to understand it full meaning. In our example, we have a perfect best blast hit with chromosome 3R of the *Drosophila melanogaster* genome. * NM_169441: *Drosophila melanogaster* **Heat-shock-protein-70Aa (Hsp70Aa)** * NM_080059: *Drosophila melanogaster* **Heat-shock-protein-70Ab (Hsp70Ab)**

Question #7: Is there anything else you extract from the BLAST results?

Solution #6

There are multiple copies of the Hsp70 gene on chromosome 3R. Have a look at the UCSC Genome Browser (https://genome.ucsc.edu) or FlyBase (https://flybase.org)

Hsp70Bc at chr3R:12509076-12511461 - (NM_141952) Heat-shock-protein-70Bc
Hsp70Bb at chr3R:12505793-12508139 - (NM_080188) Heat-shock-protein-70Bb
Hsp70Ba at chr3R:12465453-12467776 - (NM_169469) Heat-shock-protein-70Ba
Hsp70Ab at chr3R:11958535-11960909 - (NM_080059) Heat-shock-protein-70Ab
Hsp70Ab at chr3R:11954169-11956543 - (NM_080059) Heat-shock-protein-70Ab
Hsp70Aa at chr3R:11958534-11960907 - (NM_169441) Heat-shock-protein-70Aa
Hsp70Aa at chr3R:11954163-11956544 - (NM_169441) Heat-shock-protein-70Aa

chr3R: [<<<Hsp70Aa<<<]..[>>>Hsp70Ab>>>]...[>>>Hsp70Bbb>>>]..[>>>Hsp70Bb>>>]..[>>>Hsp70Bc>>>]

Speed is an important factor, but it is not the only one. Play with other parameters to increase speed and/or accuracy. 

  • word_size: use 28 and 128 * num_threads: increase from 4 to 8 (please not more)  
  • evalue: default versus 10

V190917 - Jean-Claude Walser for GDC, ETHZ