Linux 103 - Example 1 - Genetic Diversity Centre (GDC)

Introduction & Objective: Counting Nucleotides with Bash

In this first example, we use only standard command-line tools (grep, wc, and bc) to demonstrate how to write a simple Bash script that reads a FASTA file and calculates the frequency of each nucleotide (A, T, C, and G).

Preparing Sample FASTA File

To test our Bash commands, we’ll need a small example (dummy) FASTA file. You can either create the sequence manually or use one of the two provided Bash scripts to generate a dummy sequence. In either case, save the result to a file called dummy.fa:

cat dummy.fa

>RandomSequence
ATCGATCGAT

Generating Random FASTA Sequence (Optional)

Script to generate a random fasta sequence

Here is a bash script to generate a random fasta sequence of a given length.


# Download the script(s)
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/generate_random_fasta.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/generate_random_fasta_sequences.sh

# Have a closer look at the scripts
cat generate_random_fasta.sh
cat generate_random_fasta_sequences.sh

# Make the script executable
chmod a+x generate_random_fasta*.sh

# Usage
bash generate_random_fasta.sh 10              # create one fasta sequence with length 10
bash generate_random_fasta_sequences.sh 10 10 # create 10 fasta sequences with length 10

Step 1: Extract Base-Only Sequence

The first step is to separate the header from the sequence. Next, we need a method to count each base in the sequence string.

tail -n 1 dummy.fa

We have learnt to search for oligos in fasta files using grep, but we have also seen that grep -c would count the number of lines in which the search term occurs, but not the number of occurrences.

tail -n 1 dummy.fa | grep "C" -c dummy.fa
tail -n 1 dummy.fa | grep "C" --color dummy.fa

But we could use the -o option to print the matching part and then count the number of lines.

tail -n 1 dummy.fa | grep -o "C" dummy.fa
tail -n 1 dummy.fa | grep -o "C" dummy.fa | wc -l

Step 2: Looping Through A, T, C, G

Now we need to repeat this for the three remaining nucleotides. The easiest way to do this is to use a loop.

for nuc in A C T G
do
  tail -n 1 dummy.fa | grep -o "${nuc}" dummy.fa | wc -l
done

Step 3: Pretty‑Printing Base Counts

It works, but the output is not very nice. We add the nucleotode letter next to the count to improve readability.

for nuc in A C T G
do
  count=`tail -n 1 dummy.fa | grep -o "${nuc}" | wc -l`
  echo "${nuc}: ${count}"
done

Step 4: Converting Counts to Percentages

Better, but let us change the count to frequency, for which we need floating-point arithmetic (e.g. bc).

for nuc in A C T G
do
  seq=`tail -n 1 dummy.fa`                           # sequence only
  seqlen=${#seq}                                     # sequence length
  count=`echo -n ${seq} | grep "${nuc}" -o | wc -l`  # count nucledites
  echo -n "${nuc}:"                                  # print
  echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done

Step 5: Support Multi-Line FASTA Entries

The code works for our dummy file. What happens if the sequence is on more than one line?

cat dummy2.fa

>RandomSequence
ATCGA
TCGAT

We would need to make a few adjustments in order to make it work:

We need to replace tail -n 1 dummy2.fa with grep -v "^>" dummy2.fa
We have to corret ${#seq} because it would count the end of lines too.

seq=`grep -v "^>" dummy2.fa`
echo ${#seq}         # n=11
echo ${#seq}-1 | bc  # n=10

With these two changes, the code would also work for fasta sequences with more than two lines.

for nuc in A C T G
do
  seq=`grep -v "^>" dummy2.fa`                       # sequence only
  seqlen=`echo ${#seq}-1 | bc`                       # sequence length
  count=`echo -n ${seq} | grep "${nuc}" -o | wc -l`  # count nucledites
  echo -n "${nuc}:"                                  # print
  echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done

Extension: Analyzing Multi‑Sequence FASTA Files

Can we apply the same idea to FASTA files with multiple sequences? Absolutely—but it requires a bit more Bash know-how. Below are two example scripts: one calculates nucleotide frequencies for each individual sequence, and the other computes the average across all sequences.

# Download the Bash scripts
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/calculate_nucleotide_percentage.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/calculate_mean_nucleotide_frequency.sh

# Make the scripts executable
chmod a+x *.sh

# Generate a dummy FASTA file with 10 sequences of 10 nucleotides each
bash generate_random_fasta_sequences.sh 10 10 > dummy3.fa

# Run the analyses
bash calculate_nucleotide_percentage.sh dummy3.fa
bash calculate_mean_nucleotide_frequency.sh dummy3.fa

This example has shown how basic command-line tools can be combined to analyze FASTA files and calculate nucleotide frequencies. With a bit of scripting, these ideas can easily be extended to handle larger and more complex datasets.