BLAST

BLAST (Basic Local Alignment Search Tool) command line applications developed at the National Center for Biotechnology Information (NCBI).

For your BLAST searches, you can either use the web BLAST provided by NCBI or nucleotide similarity search at European Bioinformatics Institute (EMBL-EBI). Alternatively, you can instal BLAST+ (a suite of command-line tools to run BLAST provided by NCBI) locally. This allows you to perform BLAST searches on your own server without size, volume, and database restrictions. For the following exercises: we use the standalone BLAST installed on the GDC server.

Get Ready¶

Login to the server and create a working directory.

## Login to gdcsrv2
ssh studentXX@gdcsrv2.ethz.ch

## Prepare Directories
mkdir -p ${HOME}/BLAST/db   # Folder for sequences and index files
wd="${HOME}/BLAST"          # My working directory
cd ${wd}/db                 # Start here

BLAST @ gdcsrv2¶

It is always a good idea to get familiar with the application and look at the various options.

## Help - Nucleotide-Nucleotide BLAST
blastn -h

## Help - Protein-Protein BLAST
blastp -h

## What version are we using?
blastn -version

## List active databases
blastdbcmd -list $BLASTDB

## Statistical information about a BLAST database
blastdbcmd -db refseq_genomic -info

Online Help¶

Run a BLAST¶

We need a query (input) sequence in fasta format. For this reason, we create a random 29nt long nucleotide sequence:

## Create a random fasta file
cd ${wd}
rm MyQuery.fa
touch MyQuery.fa
echo ">RandomSequence 29nt" >> MyQuery.fa
cat /dev/urandom | tr -dc 'ACTG' | fold -w 29 | head -n 1 >> MyQuery.fa
cat MyQuery.fa

Now, we add the following sequence to the file.

## Create a fasta file
echo ">MysterySequence 29nt" >> MyQuery.fa
echo "AATGATACGGCGACCACCGAGATCTACAC" >> MyQuery.fa
cat MyQuery.fa

Question #1: How many different 29nt-long sequences can we generate? In other words, what is the probability that we generate two identical random 29nt-long sequences?

Solution #1

Number of possibilities: 4^29 = 288,230,376,151,711,744 (2.88 * 10^17)

For our first BLAST, we are using the NCBI nt (nucleotide) database, 4 threads, and filter the results using a e-value threshold of 0.0001. The output will be a table. The output format is explained below and more output options a listed further below.

## Run a BLAST
blastn -db nt -evalue 0.0001 -query MyQuery.fa -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out MyQuery.blast

This might take a while but should finish under two minutes.

head -n 5 MyQuery.blast

Output table format: format 6

parameter	meaning
qseqid	Query (input) sequence
sacc	Subject Accession number
qlen	Sequence length of query
slen	Sequence length of subject
pident	%-identity between S and Q
length	Alignment length
qstart-qstart	Query alignment start and end
sstart-send	Subject alignment start and end
evalue	E-values (the smaller the better)
bitscoure	Bit Score (the higher the better)

For more help and options see: Display BLAST search results with custom output format

Question #2: First, check if the random sequence produced any hits.

Solution #2

## We expect to have none or only a few hits:
grep "RandomSequence" -c MyQuery.blast
# N = 0 in my case

Question #3: What about the Mystery sequences? Are there any good hits?

Solution #3

## If it is a non-random sequence we can expact hits:
grep "MysterySequence" -c MyQuery.blast
# N = 330 (in my case)

## Hits for Mystery Sequence
grep "MysterySequence" MyQuery.blast | less

# => %-identity is 100% in all cases.
# => The local alignment ranges from nt 1 to 29 in most hits (full length).
# => E-values are very low. The lower the e-vaules the better the hit.
# => Bit score are around 50 which is not high but considering the query length not bad.
# => In summaty, most of the hits are good hits! It is not a random sequence. But what is it?

Question #4: What are these hits?

Solution #4

## We need the NCBI accession numbers
cut -f2 -d$'\t' < MyQuery.blast | sort -u > MyQuery_BlastHits_Acc.txt
## Download the file either using FileZilla or command line

Command line solution:

## !!! Open a new local terminal and execute the following:   
scp -r studentXX@gdcsrv2.ethz.ch:/gdc_home/studentXX/BLAST/MyQuery_BlastHits_Acc.txt.

Now open your Browser and submit the list to (Batch Entrez)[https://www.ncbi.nlm.nih.gov/sites/batchentrez]. What to we see: => Hits with many *removed records* => Top organism: *Cyprinus carpio*

The table output format might not be to your liking. Here a summary of the possible output formats to play with.

Illumina Adaptor Contaminations¶

We have to add adaptors to the end of the DNA fragments prior to Illumina sequencing. We call this step library preparation. Different library prep kits with different adaptors are available. The adaptors can be part of the sequenced fragment. We have to check and remove or trim sequences with adaptor sequences before we further process the data (e.g., assembly).

## Get the next Mystery sequence
curl -O https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/adaptor.fa

## Explore the fasta file
grep ">" -c adaptor.fa
grep ">" adaptor.fa

## Run a BLAST
blastn -db nt -evalue 0.0001 -query adaptor.fa -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out adaptor.blast

## Explore the BLAST results
less adaptor.blast # quit with [q]
## Total hits
wc -l adaptor.blast
## Unique hists
cut -f2 -d$'\t' < adaptor.blast | sort -u | wc -l
## Have a look at the Hits
# -> See recipe above for help!
# Top Organisms
# Cyprinus carpio (271)
# Dibothriocephalus latus (65)
# Nippostrongylus brasiliensis (13)
# Fasciola hepatica (7)
# Camelus ferus (2)
# Dicrocoelium dendriticum (2)
# Trichobilharzia regenti (1)
# Angiostrongylus cantonensis (1)
# Strongyloides stercoralis (1)
# Tityus serrulatus (1)
# Mytilus galloprovincialis (1)
# Triticum aestivum (1)
# uncultured bacterium (1)
# Streptococcus suis (1)
# Methanobacterium formicicum (1)
# Cloning vector pFosill-3 (1)
# Cloning vector pFosill-4 (1)
# Cloning vector pFosill-1 (1)
# Cloning vector pFosill-2 (1)

BLAST output format options¶

# Formatting options
# -outfmt <String>
#   alignment view options:
#     0 = pairwise,
#     1 = query-anchored showing identities,
#     2 = query-anchored no identities,
#     3 = flat query-anchored, show identities,
#     4 = flat query-anchored, no identities,
#     5 = XML Blast output,
#     6 = tabular,
#     7 = tabular with comment lines,
#     8 = Text ASN.1,
#     9 = Binary ASN.1,
#    10 = Comma-separated values,
#    11 = BLAST archive format (ASN.1)

Create your own BLAST database¶

## Download Drosophila melanogaster genome
wget http://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/drosoph.nt.gz
# wget is a non-interactive network downloader

# The file is no longer directly available from the NCBI ftp-site (07.01.2020) 
# wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz

## Unzip the DB
gunzip drosoph.nt.gz

## Have a look at the fasta file
ls -lh  drosoph.nt
head drosoph.nt
grep ">" -c drosoph.nt

## Index fasta file
#makeblastdb -h
makeblastdb -dbtype nucl -in ${wd}/db/drosoph.nt -out ${wd}/db/drosoph.nt -parse_seqids -title "Drosophila melanogaster Genome" -logfile drosoph.log

## DMEL database info
cd ${wd}
blastdbcmd -db ${wd}/db/drosoph.nt -info

## Get the next Mystery sequence
curl -O http://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/Mystery2.fasta
# As an alternaitve to wget (see above) we could use `curl` to tranfer files
# [For a more detail comparison](https://daniel.haxx.se/docs/curl-vs-wget.html) 

## Inspect the file 
ls -ah Mystery2.fasta
grep ">" -c Mystery2.fasta
head Mystery2.fasta

Compare BLAST searches¶

We will run two BLAST searches using the same query and the same parameters but two different databases. We place the command time in front of the blast command to get an idea about the speed of the search.

# (a) NCBI nt database
time blastn -db nt -evalue 0.0001 -query Mystery2.fasta -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out Mystery2_NCBInt.blast

# (b) The new Drosophila melanogaster genome database
time blastn -db ${wd}/db/drosoph.nt -evalue 0.0001 -query Mystery2.fasta -num_threads 4 -outfmt "6 qseqid sacc qlen slen pident length qstart qend sstart send evalue bitscore" -out Mystery2_myDB.blast

Question #5: Compare the two BLAST results. What are the differences?

Solution #5

## Number of hits
wc -l Mystery2_*.blast 
# => The BLAST search against the Dmel database resulted in less hits.

## Top 5 Hits
head -n 5 Mystery2_NCBInt.blast Mystery2_myDB.blast

Mystery2_**NCBInt**.blast | Query | Subject | qlen | slen | pident | len | qs | qe | sstart | send | e-value | bit | |-------------------------------|----------|------|----------|---------|------|------|------|----------|----------|-----------|------| | NM_169441.2 | AE014297 | 2382 | 32079331 | 100.000 | 2382 | 1 | 2382 | 11956544 | 11954163 | 0.0 | 4399 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 99.874 | 2384 | 1 | 2382 | 11958534 | 11960917 | 0.0 | 4385 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.708 | 2190 | 2 | 2191 | 12505791 | 12507967 | 0.0 | 3511 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.614 | 2189 | 2 | 2190 | 12509074 | 12511249 | 0.0 | 3498 | | NM_169441.2 | AE014297 | 2382 | 32079331 | 95.571 | 2190 | 2 | 2191 | 12502508 | 12504684 | 0.0 | 3494 | Mystery2_**myDB**.blast | Query | Subject | qlen | slen | pident | len | qs | qe | sstart | send | e-value | bit | |-------------------------------|----------|------|----------|---------|------|------|------|----------|----------|-----------|------| | NM_169441.2 | AE003693 | 2382 | 226483 | 97.649 | 2382 | 2 | 2382 | 205195 | 202824 | 0.0 | 4078 | | NM_169441.2 | AE003693 | 2382 | 226483 | 99.420 | 345 | 2040 | 2382 | 209191 | 209535 | 1.70e-177 | 625 | | NM_169441.2 | AE003696 | 2382 | 228143 | 95.535 | 2195 | 2 | 2194 | 38368 | 36187 | 0.0 | 3496 | | NM_169441.2 | AE003746 | 2382 | 218097 | 78.755 | 1831 | 256 | 2078 | 25899 | 24080 | 0.0 | 1208 | | NM_169441.2 | AE003698 | 2382 | 225827 | 75.104 | 1438 | 417 | 1838 | 124811 | 126232 | 0.0 | 643 | => The top 5 hits for both searches are good. Meaning high (>300) bit score values. => The top 5 hits are all for accession AE014297 with the nt search. => The top 5 hits show 4 different accession (AE003693, AE003696, AE00374, and AE00369). => Careful: The alignment length for myDB hit #2 is short!

# NCBInt
# AE014297: Drosophila melanogaster chromosome 3R
# DMEL
# AE003693: New AE014297 - Drosophila melanogaster chromosome 3R
# AE003696: New AE014297 - Drosophila melanogaster chromosome 3R
# AE003746: New AE014297 - Drosophila melanogaster chromosome 3R
# AE003698: New AE014297 - Drosophila melanogaster chromosome 3R

Interpretation¶

Interpreting BLAST output tables is not a trivial task. A clear question can help to optimse the search and help with the interpretation. Lets have a look at the top 20 hits of the sequence against the NCBI nt database.

head -n 20 Mystery2_NCBInt.blast Mystery2_myDB.blast

No	Query	Subject	qlen	slen	pident	len	qs	qe	sstart	send	e-value	bit
01	NM_169441.2	AE014297	2382	32079331	100.000	2382	1	2382	11956544	11954163	0.0	4399
02	NM_169441.2	AE014297	2382	32079331	99.874	2384	1	2382	11958534	11960917	0.0	4385
03	NM_169441.2	AE014297	2382	32079331	95.708	2190	2	2191	12505791	12507967	0.0	3511
04	NM_169441.2	AE014297	2382	32079331	95.614	2189	2	2190	12509074	12511249	0.0	3498
05	NM_169441.2	AE014297	2382	32079331	95.571	2190	2	2191	12502508	12504684	0.0	3494
06	NM_169441.2	AE014297	2382	32079331	95.121	2193	2	2194	12467794	12465622	0.0	3439
07	NM_169441.2	AE014297	2382	32079331	78.755	1831	256	2078	24057104	24055285	0.0	1208
08	NM_169441.2	AE014297	2382	32079331	75.104	1438	417	1838	13045581	13047002	3.33e-180	643
09	NM_169441.2	NM_169441	2382	2382	100.000	2382	1	2382	1	2382	0.0	4399
10	NM_169441.2	AC007889	2382	182182	100.000	2382	1	2382	52295	49914	0.0	4399
11	NM_169441.2	AC007889	2382	182182	99.874	2384	1	2382	54285	56668	0.0	4385
12	NM_169441.2	AC007725	2382	144152	100.000	2382	1	2382	133101	130720	0.0	4399
13	NM_169441.2	AC007725	2382	144152	99.874	2384	1	2382	135091	137474	0.0	4385
14	NM_169441.2	NM_080059	2382	2375	99.958	2375	2	2376	1	2375	0.0	4381
15	NM_169441.2	AF295941	2382	2561	99.790	2384	1	2382	129	2512	0.0	4373
16	NM_169441.2	BT021339	2382	2381	100.000	2363	20	2382	1	2363	0.0	4364
17	NM_169441.2	AF295937	2382	2606	99.664	2382	1	2382	189	2570	0.0	4355
18	NM_169441.2	AF350457	2382	2389	99.915	2362	21	2382	1	2362	0.0	4351
19	NM_169441.2	BT126347	2382	2371	99.872	2349	36	2382	1	2349	0.0	4320
20	NM_169441.2	AF295940	2382	2549	99.371	2386	1	2382	129	2514	0.0	4320

Question #6: Any idea what the mystery sequence could be?

Solution #6

It needs time and experience to understand it full meaning. In our example, we have a perfect best blast hit with chromosome 3R of the *Drosophila melanogaster* genome. * NM_169441: *Drosophila melanogaster* **Heat-shock-protein-70Aa (Hsp70Aa)** * NM_080059: *Drosophila melanogaster* **Heat-shock-protein-70Ab (Hsp70Ab)**

Question #7: Is there anything else you extract from the BLAST results?

Solution #6

There are multiple copies of the Hsp70 gene on chromosome 3R. Have a look at the UCSC Genome Browser (https://genome.ucsc.edu) or FlyBase (https://flybase.org)

Hsp70Bc at chr3R:12509076-12511461 - (NM_141952) Heat-shock-protein-70Bc
Hsp70Bb at chr3R:12505793-12508139 - (NM_080188) Heat-shock-protein-70Bb
Hsp70Ba at chr3R:12465453-12467776 - (NM_169469) Heat-shock-protein-70Ba
Hsp70Ab at chr3R:11958535-11960909 - (NM_080059) Heat-shock-protein-70Ab
Hsp70Ab at chr3R:11954169-11956543 - (NM_080059) Heat-shock-protein-70Ab
Hsp70Aa at chr3R:11958534-11960907 - (NM_169441) Heat-shock-protein-70Aa
Hsp70Aa at chr3R:11954163-11956544 - (NM_169441) Heat-shock-protein-70Aa

chr3R: [<<<Hsp70Aa<<<]..[>>>Hsp70Ab>>>]...[>>>Hsp70Bbb>>>]..[>>>Hsp70Bb>>>]..[>>>Hsp70Bc>>>]

Speed is an important factor, but it is not the only one. Play with other parameters to increase speed and/or accuracy.

word_size: use 28 and 128 * num_threads: increase from 4 to 8 (please not more)
evalue: default versus 10

V190917 - Jean-Claude Walser for GDC, ETHZ