Terminal Example 1 - UniBas - Evolutionary Genetics

Here is an example of a bash project for the terminal to calculate each nucleotide (A, T, C, G) in a fasta sequence.

Dummy data

To test the code, we need a dummy sequence in fasta format. You can create these sequences by hand or use the script below:

cat dummy.fa

>RandomSequence
ATCGATCGAT

Script to generate a random fasta sequence

Here is a bash script to generate a random fasta sequence of a given length.


## Download the script(s)
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/generate_random_fasta.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/generate_random_fasta_sequences.sh

## Make the script executable
chmod a+x generate_random_fasta*.sh

## Usage
bash generate_random_fasta.sh 10
bash generate_random_fasta_sequences.sh 10 10

Count the nucleotides

The first step is to separate the header from the sequence. Next, we need a method to count each base in the sequence string.

tail -n 1 dummy.fa

We have learnt to search for oligos in fasta files using grep, but we have also seen that grep -c would count the number of lines in which the search term occurs, but not the number of occurrences.

tail -n 1 dummy.fa | grep "C" -c dummy.fa
tail -n 1 dummy.fa | grep "C" --color dummy.fa

But we could use the -o option to print the matching part and then count the number of lines.

tail -n 1 dummy.fa | grep -o "C" dummy.fa
tail -n 1 dummy.fa | grep -o "C" dummy.fa | wc -l

Now we need to repeat this for the three remaining nucleotides. The easiest way to do this is to use a loop.

for nuc in A C T G
do
  tail -n 1 dummy.fa | grep -o "${nuc}" dummy.fa | wc -l
done

Improve output

It works, but the output is not very nice. We add the nucleotode letter next to the count to improve readability.

for nuc in A C T G
do
  count=`tail -n 1 dummy.fa | grep -o "${nuc}" | wc -l`
  echo "${nuc}: ${count}"
done

Frequency instead of counts

Better, but let us change the count to frequency, for which we need floating-point arithmetic (e.g. bc).

for nuc in A C T G
do
  seq=`tail -n 1 dummy.fa`                           # sequence only
  seqlen=${#seq}                                     # sequence length
  count=`echo -n ${seq} | grep "${nuc}" -o | wc -l`  # count nucledites
  echo -n "${nuc}:"                                  # print
  echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done

Enhanced usability

The code works for our dummy file. What happens if the sequence is on more than one line?

cat dummy2.fa

>RandomSequence
ATCGA
TCGAT

We would need to make a few adjustments in order to make it work:

We need to replace tail -n 1 dummy2.fa with grep -v "^>" dummay2.fa
We have to corret ${#seq} because it would count the end of lines too.

seq=`grep -v "^>" dummy2.fa`
echo ${#seq}         # n=11
echo ${#seq}-1 | bc  # n=10

With these two changes, the code would also work for fasta sequences with more than two lines.

for nuc in A C T G
do
  seq=`grep -v "^>" dummy2.fa`                       # sequence only
  seqlen=`echo ${#seq}-1 | bc`                       # sequence length
  count=`echo -n ${seq} | grep "${nuc}" -o | wc -l`  # count nucledites
  echo -n "${nuc}:"                                  # print
  echo "scale=2; ${count}/ ${seqlen} * 100" | bc
done

Multiple sequences

Would it be possible to use this idea to process FASTA files with multiple sequences? Of course it is, but it would require a bit of bash knowledge. Here are two suggestions to calculate the nucleotide frequencies for each FASTA sequence or just the average over all sequences.

## Download bash scripts
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/calculate_nucleotide_percentage.sh
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/scripts/calculate_mean_nucleotide_frequency.sh

## Make the scripts executable
chmod a*x *.sh

## Run the analysis
bash generate_random_fasta_sequences.sh 10 10 > dummy3.fa
bash calculate_nucleotide_percentage.sh dummy3.fa
bash calculate_mean_nucleotide_frequency.sh dummy3.fa