Here is an example of a bash project for the terminal to calculate each nucleotide (A, T, C, G) in a fasta sequence.
Dummy data
To test the code, we need a dummy sequence in fasta format. You can create these sequences by hand or use the script below:
cat dummy.fa
>RandomSequence
ATCGATCGAT
Script to generate a random fasta sequence
Here is a bash script to generate a random fasta sequence of a given length.
## Download the script(s)
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/generate_random_fasta.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/generate_random_fasta_sequences.sh
## Make the script executable
chmod a+x generate_random_fasta*.sh
## Usage
bash generate_random_fasta.sh 10
bash generate_random_fasta_sequences.sh 10 10
Count the nucleotides
The first step is to separate the header from the sequence. Next, we need a method to count each base in the sequence string.
tail -n 1 dummy.fa
We have learnt to search for oligos in fasta files using grep
, but we have also seen that grep -c
would count the number of lines in which the search term occurs, but not the number of occurrences.
tail -n 1 dummy.fa | grep "C" -c dummy.fa
tail -n 1 dummy.fa | grep "C" --color dummy.fa
But we could use the -o
option to print the matching part and then count the number of lines.
tail -n 1 dummy.fa | grep -o "C" dummy.fa
tail -n 1 dummy.fa | grep -o "C" dummy.fa | wc -l
Now we need to repeat this for the three remaining nucleotides. The easiest way to do this is to use a loop.
for nuc in A C T G
do
tail -n 1 dummy.fa | grep -o "${nuc}" dummy.fa | wc -l
done
Improve output
It works, but the output is not very nice. We add the nucleotode letter next to the count to improve readability.
for nuc in A C T G
do
count=`tail -n 1 dummy.fa | grep -o "${nuc}" | wc -l`
echo "${nuc}: ${count}"
done
Frequency instead of counts
Better, but let us change the count to frequency, for which we need floating-point arithmetic (e.g. bc
).
for nuc in A C T G
do
seq=`tail -n 1 dummy.fa` # sequence only
seqlen=${#seq} # sequence length
count=`echo -n ${seq} | grep "${nuc}" -o | wc -l` # count nucledites
echo -n "${nuc}:" # print
echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done
Enhanced usability
The code works for our dummy file. What happens if the sequence is on more than one line?
cat dummy2.fa
>RandomSequence
ATCGA
TCGAT
We would need to make a few adjustments in order to make it work:
- We need to replace
tail -n 1 dummy2.fa
withgrep -v "^>" dummay2.fa
- We have to corret ${#seq} because it would count the end of lines too.
seq=`grep -v "^>" dummy2.fa`
echo ${#seq} # n=11
echo ${#seq}-1 | bc # n=10
With these two changes, the code would also work for fasta sequences with more than two lines.
for nuc in A C T G
do
seq=`grep -v "^>" dummy2.fa` # sequence only
seqlen=`echo ${#seq}-1 | bc` # sequence length
count=`echo -n ${seq} | grep "${nuc}" -o | wc -l` # count nucledites
echo -n "${nuc}:" # print
echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done
Multiple sequences
Would it be possible to use this idea to process FASTA files with multiple sequences? Of course it is, but it would require a bit of bash knowledge. Here are two suggestions to calculate the nucleotide frequencies for each FASTA sequence or just the average over all sequences.
## Download bash scripts
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/calculate_nucleotide_percentage.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/calculate_mean_nucleotide_frequency.sh
## Make the scripts executable
chmod a*x *.sh
## Run the analysis
bash generate_random_fasta_sequences.sh 10 10 > dummy3.fa
bash calculate_nucleotide_percentage.sh dummy3.fa
bash calculate_mean_nucleotide_frequency.sh dummy3.fa