Introduction & Objective: Counting Nucleotides with Bash
In this first example, we use only standard command-line tools (grep
, wc
, and bc
) to demonstrate how to write a simple Bash script that reads a FASTA file and calculates the frequency of each nucleotide (A
, T
, C
, and G
).
Preparing Sample FASTA File
To test our Bash commands, we’ll need a small example (dummy) FASTA file. You can either create the sequence manually or use one of the two provided Bash scripts to generate a dummy sequence. In either case, save the result to a file called dummy.fa:
cat dummy.fa
>RandomSequence
ATCGATCGAT
Generating Random FASTA Sequence (Optional)
Script to generate a random fasta sequence
Here is a bash script to generate a random fasta sequence of a given length.
# Download the script(s)
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/generate_random_fasta.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/generate_random_fasta_sequences.sh
# Have a closer look at the scripts
cat generate_random_fasta.sh
cat generate_random_fasta_sequences.sh
# Make the script executable
chmod a+x generate_random_fasta*.sh
# Usage
bash generate_random_fasta.sh 10 # create one fasta sequence with length 10
bash generate_random_fasta_sequences.sh 10 10 # create 10 fasta sequences with length 10
Step 1: Extract Base-Only Sequence
The first step is to separate the header from the sequence. Next, we need a method to count each base in the sequence string.
tail -n 1 dummy.fa
We have learnt to search for oligos in fasta files using grep
, but we have also seen that grep -c
would count the number of lines in which the search term occurs, but not the number of occurrences.
tail -n 1 dummy.fa | grep "C" -c dummy.fa
tail -n 1 dummy.fa | grep "C" --color dummy.fa
But we could use the -o
option to print the matching part and then count the number of lines.
tail -n 1 dummy.fa | grep -o "C" dummy.fa
tail -n 1 dummy.fa | grep -o "C" dummy.fa | wc -l
Step 2: Looping Through A, T, C, G
Now we need to repeat this for the three remaining nucleotides. The easiest way to do this is to use a loop.
for nuc in A C T G
do
tail -n 1 dummy.fa | grep -o "${nuc}" dummy.fa | wc -l
done
Step 3: Pretty‑Printing Base Counts
It works, but the output is not very nice. We add the nucleotode letter next to the count to improve readability.
for nuc in A C T G
do
count=`tail -n 1 dummy.fa | grep -o "${nuc}" | wc -l`
echo "${nuc}: ${count}"
done
Step 4: Converting Counts to Percentages
Better, but let us change the count to frequency, for which we need floating-point arithmetic (e.g. bc
).
for nuc in A C T G
do
seq=`tail -n 1 dummy.fa` # sequence only
seqlen=${#seq} # sequence length
count=`echo -n ${seq} | grep "${nuc}" -o | wc -l` # count nucledites
echo -n "${nuc}:" # print
echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done
Step 5: Support Multi-Line FASTA Entries
The code works for our dummy file. What happens if the sequence is on more than one line?
cat dummy2.fa
>RandomSequence
ATCGA
TCGAT
We would need to make a few adjustments in order to make it work:
- We need to replace
tail -n 1 dummy2.fa
withgrep -v "^>" dummay2.fa
- We have to corret ${#seq} because it would count the end of lines too.
seq=`grep -v "^>" dummy2.fa`
echo ${#seq} # n=11
echo ${#seq}-1 | bc # n=10
With these two changes, the code would also work for fasta sequences with more than two lines.
for nuc in A C T G
do
seq=`grep -v "^>" dummy2.fa` # sequence only
seqlen=`echo ${#seq}-1 | bc` # sequence length
count=`echo -n ${seq} | grep "${nuc}" -o | wc -l` # count nucledites
echo -n "${nuc}:" # print
echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done
Extension: Analyzing Multi‑Sequence FASTA Files
Can we apply the same idea to FASTA files with multiple sequences? Absolutely—but it requires a bit more Bash know-how. Below are two example scripts: one calculates nucleotide frequencies for each individual sequence, and the other computes the average across all sequences.
# Download the Bash scripts
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/calculate_nucleotide_percentage.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/calculate_mean_nucleotide_frequency.sh
# Make the scripts executable
chmod a+x *.sh
# Generate a dummy FASTA file with 10 sequences of 10 nucleotides each
bash generate_random_fasta_sequences.sh 10 10 > dummy3.fa
# Run the analyses
bash calculate_nucleotide_percentage.sh dummy3.fa
bash calculate_mean_nucleotide_frequency.sh dummy3.fa
This example has shown how basic command-line tools can be combined to analyze FASTA files and calculate nucleotide frequencies. With a bit of scripting, these ideas can easily be extended to handle larger and more complex datasets.