Biocomputing

Reproducible Research (phdcomics)

As a "modern" biologist, you need computing skills. Biologists are often self-taught and therefore lack basic knowledge. Applications are only as good as the user using them. It is essential to understand an application and to know its limits. The best application produces only garbage if not used properly. Writing scripts is not that difficult and writing great scripts does not need much extra if you know how. It is not important to know numerous applications and be fluent in multiple (script) languages. It is important that you know your tools, know how to use them best, and explain their usage well.

Naming

You are free to name your files and folder that way you like despite all the well-meant rules. There is one rule that you might consider because it prevents headaches:

A friendly word of advice

Avoid special characters and spaces and use capital and underscore instead.

Find better solutions for the following file names.

my new file

⇒ MyNewFile.txt
Note: Avoid spaces and use a suffix to indicate the file format.

Sample-1-ph5-temp-high.fa

⇒ Sample01_pH5_High.fa
Note: Avoid special characters and keep it short. If you have numbered files (e.g. sample1, sample2, ...) use leading zeros to keep the sample order.

Final Report: Irène Müller.md

⇒ FinalReport_Irene_Mueller.md Note: You might be Irène Müller in life but on the terminal your are Irene_Mueller

File Formats

FASTA

The FASTA format is a text-based format for representing nucleotide or protein sequences. The first line (header) in a FASTA file starts with a greater-than symbol followed by a unique description of the sequences (e.g. IG Number). After the initial line comes the actual sequence as a character string spread over one or more lines.

>Sequence_001
ATGGGCGTCACGTCCACGTTCACCGTGTTA
TGGAATGCGTCACGTCAACTGGGGT

FASTQ

Each entry (read/sequence) in a FASTQ file consists of 4 lines:

1: @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
2: GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGTGCGTAAAGGGAACGC...
3: +
4: >>1>>11>11>>1EC?E?CFBFAGFC0GB/CG1EACFE/BFE///AEG1DF122A/B///21//0BEG...
1: Sequence identifier with information about the sequencing run and the cluster.
2: The sequence (the base calls; A, C, T, G and N).
3: A separator, which is simply a plus (+) sign.
4: The base call quality scores. These are Phred +33 encoded, 
   using ASCII characters to represent the numerical quality scores.

SAM / BAM

Sequence Alignment Map (SAM) is a text-based format used for storing data.

fast5

The fast5 format is the native container for ONT data. It contains the raw electrical signal levels measured by the nanopores.

Encoding

Code is a system of rules to convert information. For example, the genetic code is a sequence of codons corresponding to a sequence of amino acids. In other words, a gene can encode a protein. Character encoding in textual data is used to represent a repertoire of characters. Common examples of character encoding systems include Morse code. We use Morse code to encode text characters as standardised sequences of dots and dashes.

--. -.. -.-.

ASCII or Unicode (UTF) are character encoding standards for electronic communication.

binary code ASCII code Letter
01000111 (64 + 4 + 2 + 1 =) 071 G
01000100 (64 + 4 =) 068 D
01000011 (64 + 2 + 1 =) 067 C

Plain text files need the corresponding encoding to decode the message correctly. You can use the terminal command file to determine the file type:

file -bi file.txt

Note: Unicode is a superset of ASCII (7 bits). The main difference is in the way they encode the character and the number of bits used.

Especially the ambiguity of control characters are notorious for causing troubles. A good example is the newline (end of line (EOL), line breaks) encoding.

OS EOL
Linux LF \n
macOS LF \n
MS Windows CRLF \r\n

A friendly word of advice

Problems with file import? Make sure encoding is correct!

Compress Sequence Files

Sequencing centres provide compressed sequence files for good reasons. Compressed files are smaller and safer to download. Keep the data compressed because many applications can handle compressed files directly. If you have to decompress data, consider using scratch space if available. In general, compress both fastq and fasta sequence files to (i) safe disk space and (ii) for a more secure data transfer. 

Linux includes various commands for compressing and de-compressing files. We have a closer look at gzp and tar.

## Working Directory
mkdir -p ${HOME}/MDA20/qc
cd ${HOME}/MDA20/qc

## Toy-Data (paired-end fastq)
wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/C-1_R1.fastq.gz" 
wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/C-1_R2.fastq.gz" 

There are terminal commands able to handle commpressed files. You used cat to print the content of text files either to the screen or a file. The command zcat is doing the same but works for compressed file.

## Create a test file - subset
zcat C-1_R1.fastq.gz | head -n 40 > subset_C-1_R1.fq
head -n 4 subset_C-1_R1.fq
zcat C-1_R2.fastq.gz | head -n 40 > subset_C-1_R2.fq
tail -n 4 subset_C-1_R2.fq

# This is not a perfect subset because it take the top sequences and
#  not a random subset. A better alternative provides `seqtk`:
# seqtk sample -s1505 C-1_R1.fastq.gz 10 > random_subset_C-1_R1.fq
# seqtk sample -s1505 C-1_R2.fastq.gz 10 > random_subset_C-1_R2.fq

GNU Zip

## ----------------------
## GNU zip - compress
## ----------------------

## Zip file
gzip subset_C-1_R1.fq

## Compression info
file subset_C-1_R1.fq.gz

## Compression level
gzip -l subset_C-1_R1.fq.gz

## ----------------------
## GNU zip - un-compress
## ----------------------

## Un-zip file
gunzip subset_C-1_R1.fq.gz

## ----------------------
## GNU zip - Compression level
## ----------------------

# You can specify the compression levels from 1 (fast) - 9 (best).
# The default level is 6.

## Speed-Test

# Test data
cp C-1_R1.fastq.gz test.fq.gz
gunzip test.fq.gz

# Level 1 (fast)
time gzip -1 test.fq
# t = 1.781s to zip at level 1 with 65.2% comprssion rate
time gunzip test.fq.gz
# t = 1.099s

# Level 6 (default)
time gzip -6 test.fq
# t = 8.459s to zip at level 6 with 74.4% comprssion rate
time gunzip test.fq.gz
# t = 0.864s

# Level 9 (best compression)
time gzip -9 test.fq
# t = 32.012s to zip at level 9 with 75.6% compression rate
time gunzip test.fq.gz
# t = 0.982s

# Clean what we do not need any longer
rm test.fq

# ⇒ Default works best for our example in terms of speed and comression.

## ----------------------
## GNU zip - Help
## ----------------------

man gzip
gzip --help

TAR Archive

## ----------------------------
## TAR - uncompressed archive
## ----------------------------

## tar (create archive)
tar cfv subset_C-1_R12.fq.tar subset_C-1_R1.fq subset_C-1_R2.fq

## Change in size
du -sch subset_C-1_R12.fq.tar subset_C-1_R1.fq subset_C-1_R2.fq
# The files are archvied together but not compressed:
# 20K   subset_C-1_R12.fq.tar
# 8.0K  subset_C-1_R1.fq
# 8.0K  subset_C-1_R2.fq

## un-tar (extract) 
tar xfv subset_C-1_R12.fq.tar

## ----------------------------
## TAR - compressed archive
## ----------------------------

## tar with compression (option z)
tar cfvz subset_C-1_R12.fq.tar.gz subset_C-1_R1.fq subset_C-1_R2.fq

du -sch subset_C-1_R12.fq.tar.gz subset_C-1_R1.fq subset_C-1_R2.fq
# The files are archived and compressed 
# 8.0K  subset_C-1_R12.fq.tar.gz
# 8.0K  subset_C-1_R1.fq
# 8.0K  subset_C-1_R2.fq

## un-tar and de-compress
tar xfvz subset_C-1_R12.fq.tar.gz

## ----------------------------
## TAR - Help
## ----------------------------

man tar
tar --help
info tar

Challenge What terminal command(s) can you use to peek inside a compressed sequence file without decompressing the file?

Suggestions

## Solution 1a 
zcat C-1_R1.fastq.gz | head -n 4

## Solution 1b 
gzip -cd C-1_R1.fastq.gz | head -n 4

# d decompress file but
# c keep original files unchange

## Solution 1c
tar -tf C-1_R1.fastq.tar.gz

# t list the contents of an archive
# f from an archive file
# v do it 

Watch Disk Space

Be careful with unzipping file(s). Make sure there is enough disk space (e.g. df -h). Considering using disk space that is not backed up (e.g. scratch). Zip large files and tar small files to kept your project folder tidy and manageable.

File Integrity

Every time you transfer data there is a chance that the file gets corrupted. To check the integrity of a file we can use md5sum. The command calculates a message-digest fingerprint (checksum) for a file.

md5sum C-1_R[12].fastq.gz
# 4a2e8876742302fad5bd24cba76c3cc6  C-1_R1.fastq.gz
# e0ce4ae266e8b2cedbc9580a3025712a  C-1_R2.fastq.gz

The name does not play a role:

cp C-1_R1.fastq.gz test.fq.gz
md5sum C-1_R1.fastq.gz test.fq.gz
# 4a2e8876742302fad5bd24cba76c3cc6  C-1_R1.fastq.gz
# 4a2e8876742302fad5bd24cba76c3cc6  test.fq.gz
rm test.fq.gz

Changing the file content, changes the key:

gunzip -c C-1_R1.fastq.gz > test.fq
gzip test.fq
md5sum C-1_R1.fastq.gz test.fq.gz
# 4a2e8876742302fad5bd24cba76c3cc6  C-1_R1.fastq.gz
# ea5c830f3a871b5662d63bc857ede2c2  test.fq.gz
rm test.fq.gz

Did we really change the content? Yes, because we used a different (default) compression levels to re-create the archive file.

gzip -l C-1_R1.fastq.gz
#         compressed        uncompressed  ratio uncompressed_name
#           26042547            74702669  65.1% C-1_R1.fastq
gzip -l test.fq.gz
#         compressed        uncompressed  ratio uncompressed_name
#           19157129            74702669  74.4% test.fq

Data Transfer

Files should be compressed (zip) or at least archived (tar) before copied or moved. It is important to verify data after they were moved.

Freeware

We are willing to spend large sums of money for sequencing but not much for data analysis or software. We prefer free open source solution because we need to check the code. Except the university pays for the site licence and then we do not care anymore.

Free does not mean free of bugs or limits. The more users the more likely it is to find bugs. Even a free program without a proper manual and regular maintenance is of limited use. Many applications are maintain until the authors have published it afterwards often not anymore. The papers always show that the new program outperform previous ones. Choose your tools carefully.

Remember, because it is free does not mean there are no restrictions. For example, most applications are free for academic use only. The code is available but not for the take. There are freeware licence agreements!

A friendly word of advice

Acknowledge the hard work of developers and donate if you want to keep it alive.

Applications

You are looking for applications to help you with processing your data. Before you start downloading frantically what should you consider?

What do I want?

* Copyrights / licence type * Input and output format * Understand the limitations * Check dependencies / resource requirements * Documentation

What should you do before you start processing your data?

What do I need to know?

* Control installation / check the logs / understand warnings * Run an example or a simplified test * Make use of the verbose option (-v or -verbose) * Evaluate performance (e.g. speed, memory consumption)

Where can you get help?

Where can I find help?

* Developers webpage * Manual / vignettes * Help option (`-help`) * ReadMe files * User community / forum * Developer

What information do you need to provide to make it reproducible?

What should I provide?

* Scripts / command lines (well commented) * Version (-v, -version) * Parameters / options used * Reference

Scripts

If you write code do it with style and be generous with comments!