Terminal Example 2 - UniBas - Evolutionary Genetics

Here is another example of a bash project for the terminal. It is about detecting bad records in a fastq sequence file. You could look for such a tool on the internet or write one yourself. I think both have their pros and cons, which we can discuss. Here is an example of a possible home-made solution with different improvement steps.

Project Description and Aim

The objective is to develop a bash script that can read a FASTQ file and identify the following issues:

Invalid header format: Ensures that each header line begins with '@'.
Empty sequence: Verifies that the sequence line is empty.
Empty quality score: Determines if the quality score row is empty.
Sequence and Quality Score length mismatch: Assesses if the length of the sequence and quality score lines are the same.

Because we often work with many very large files, it is important that the processing is very fast too. I am not saying that speed is everything, but it is an important aspect of working with high-throughput sequence data.

Dummy Data

To test possible solutions, we need one or more test or dummy files. In our case, we need a correct fastq file to test speed, and test files with different but known errors to test accuracy and speed.

The correct (dummy) file contains 1 million sequences in fastq format:

1M.fq -> fastq file with 1 million sequences, 4 million lines.

The defect files are now created based on the correct dummy.

1M_error_1.fq -> Invalid header at position 250'001 (@ missing).
1M_error_2.fq -> Empty sequences line at position 250'002.
1M_error_3.fq -> Empty qualityscore line at position 250'004.
1M_error_4.fq -> Quality score at position 250'004 is 5nt shorter than sequence length at position 250'002.

Here is the download link for the files:

curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/1M.zip

First Version

The following is a preliminary version of a working script. It reads the input file (FASTQ) line by line, keeping a record of the line and record numbers. It checks each record for the specified problems (invalid header format, empty sequence, empty quality score, and length mismatch) and generates an error message if a problem is found. If no errors or incomplete sequences are found, it prints the number of records processed along with a success message.

#!/bin/bash

# Name: check_fastq_v1.sh
# Aim: Check fastq Files
# Version: 17.01.19 JCW
#          23.03.19 JCW record lines

# Input
fastq_file="$1"

# Help
if [ -z "$fastq_file" ]; then
    echo "Usage: $0 <fastq_file>"
    exit 1
fi

# Initialize
line_num=0
record_num=0

# Read input file line by line
while IFS= read -r line; do
    ((line_num++))
    record_line=$((line_num % 4))

    if [ $record_line -eq 1 ]; then
        # Header line
        if [[ ! "$line" =~ ^@ ]]; then
            echo "Error: Line $line_num - Invalid header format"
            exit 1
        fi
    elif [ $record_line -eq 2 ]; then
        # Sequence line
        if [[ -z "$line" ]]; then
            echo "Error: Line $line_num - Empty sequence"
            exit 1
        fi
        sequence_length=${#line}
    elif [ $record_line -eq 0 ]; then
        # Quality score line
        if [[ -z "$line" ]]; then
            echo "Error: Line $line_num - Empty quality score"
            exit 1
        elif [[ ${#line} -ne $sequence_length ]]; then
            echo "Error: Line $line_num - Sequence and quality score lengths do not match"
            exit 1
        fi
        ((record_num++))
    fi
done < "$fastq_file"

echo "Processed $record_num FASTQ records. No defects or incomplete sequences detected."

The script is subjected to testing using the dummy datasets.

./check_fastq_v1.sh 1M.fq
# Processed 1000000 FASTQ records. No defects or incomplete sequences detected.
# Time: 4m08s

./check_fastq_v1.sh 1M_error_1.fq
# Error: Line 250001 - Invalid header format
# Time: 0m15s

./check_fastq_v1.sh 1M_error_2.fq
# Error: Line 250002 - Empty sequence
# Time: 0m15s

./check_fastq_v1.sh 1M_error_3.fq
# Error: Line 250004 - Empty quality score
# Time: 0m15s

./check_fastq_v1.sh 1M_error_4.fq
# Error: Line 250004 - Sequence and quality score lengths do not match
# Time: 0m15s

./check_fastq_v1.sh 1M_error_5.fq
# Error: Line 250004 - Sequence and quality score lengths do not match
# Time: 0m15s

Conclusion: The script does what it is supposed to do, but it is a bit slow (4 mio reads / min). It notifies you when it finds an error and reports the line where the error occurred.

Version 2

We can optimise the script for better performance. One way to improve speed is to reduce the number of operations performed within the loop. In the original script, we calculate the length of the sequence line each time we encounter a quality score line. This can be inefficient, especially for large files. Instead, we can calculate and store the sequence line length once per record, avoiding redundant calculations.

#!/bin/bash

# Name: check_fastq_v1.sh
# Aim: Check fastq Files
# Version: 06.05.20 JCW

# Input
fastq_file="$1"

# Help
if [ -z "$fastq_file" ]; then
    echo "Usage: $0 <fastq_file>"
    exit 1
fi


while IFS= read -r header && IFS= read -r sequence && IFS= read -r plus && IFS= read -r quality; do
    if [[ ! "$header" =~ ^@ ]]; then
        echo "Error: Invalid header format"
        exit 1
    fi

    if [[ -z "$sequence" ]]; then
        echo "Error: Empty sequence"
        exit 1
    fi

    sequence_length=${#sequence}

    if [[ -z "$quality" ]]; then
        echo "Error: Empty quality score"
        exit 1
    fi

    if [[ ${#quality} -ne $sequence_length ]]; then
        echo "Error: Sequence and quality score lengths do not match"
        exit 1
    fi

    ((record_num++))
done < "$fastq_file"

echo "Processed $record_num FASTQ records. No defects or incomplete sequences detected."

In this optimised version, we read each record with a single read command that reads all four lines of the record at once. We then perform all the necessary checks on each record. This reduces the number of calculations and comparisons performed in the loop, leading to improved performance, especially for large files.

./check_fastq_v2.sh 1M.fq
# Processed 1000000 FASTQ records. No defects or incomplete sequences detected.
# Time: 2m53s (~70%)

./check_fastq_v2.sh 1M_error_1.fq
# Error: Invalid header format
# Time 0m11s

./check_fastq_v2.sh 1M_error_2.fq
# Error: Empty sequence
# Time 0m11s

./check_fastq_v2.sh 1M_error_3.fq
# Error: Empty quality score
# Time 0m11s

./check_fastq_v2.sh 1M_error_4.fq
# Error: Sequence and quality score lengths do not match
# Time 0m11s

Version 3

To further improve the performance of the script, especially for very large FASTQ files, we can consider using awk, which is very efficient for text processing tasks.

#!/bin/bash

# Input
fastq_file="$1"

# Help
if [ -z "$fastq_file" ]; then
    echo "Usage: $0 <fastq_file>"
    exit 1
fi

awk '{
    if (NR % 4 == 1) {
        if ($0 !~ /^@/) {
            print "Error: Invalid header format"
            exit 1
        }
    } else if (NR % 4 == 2) {
        if ($0 == "") {
            print "Error: Empty sequence"
            exit 1
        } else {
            seq_length = length($0)
        }
    } else if (NR % 4 == 0) {
        if ($0 == "") {
            print "Error: Empty quality score"
            exit 1
        } else if (length($0) != seq_length) {
            print "Error: Sequence and quality score lengths do not match"
            exit 1
        }
        record_num++
    }
} END {
    print "Processed " record_num " FASTQ records. No defects or incomplete sequences detected."
}' "$fastq_file"

This script uses awk to efficiently process each line of the FASTQ file. It checks each record for errors or incomplete sequences, and prints an error message if a problem is found. At the end of processing, it prints the number of records processed and a success message.

./check_fastq_v3.sh 1M.fq
# Processed 1000000 FASTQ records. No defects or incomplete sequences detected.
# Time 0m03s

./check_fastq_v3.sh 1M_error_1.fq 
# Error: Invalid header format
# Processed 62500 FASTQ records. No defects or incomplete sequences detected.
# Time: <1s

./check_fastq_v3.sh 1M_error_2.fq 
# Error: Empty sequence
# Processed 62500 FASTQ records. No defects or incomplete sequences detected.
# Time: <1s

./check_fastq_v3.sh 1M_error_3.fq 
# Error: Empty quality score
# Processed 62500 FASTQ records. No defects or incomplete sequences detected.
# Time: <1s

./check_fastq_v3.sh 1M_error_4.fq 
# Error: Sequence and quality score lengths do not match
# Processed 62500 FASTQ records. No defects or incomplete sequences detected.
# Time: <1s

No tool is perfect

It's important that users are aware of any limitations. For example, the scripts above stop running when they encounter an error. This makes it easy to identify problematic files, as the script will show where the first error occurred. Note, however, that the script will only report the first error it finds, not all errors in the file.