Data Download - GDC First Aid Kit

Once the MiSeq run is complete, you will need to obtain your sequences. At the GDC, your data set will be uploaded to the server, where it will either be processed for you or made available for download.

Regardless of where you get your data from, it is important that you download the data as soon as it is available and check the quality of the run. Here are some simple but important tips.

Check-List

Download data if possible via terminal (e.g. sftp)
Verify file integrity (e.g. md5sum)
Verify data, e.g. N(files) = 2 x N(samples) for paired-end runs
Get a few random reads and BLAST them
Check fastq headers (How many runs?)
Read basecall and/or QC report(s)
Check read length distribution
Check for possible contamination (e.g. PhiX)
Archive a copy of your raw data
Upload the raw data (e.g. ENA)

Terminal - We recommend that you use the terminal (e.g. sftp, scp, or wget) to download your sequences whenever possible. It is often faster, more reliable, and easier for batch downloads.
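
As a hedged illustration of a batch download, the sketch below writes an sftp batch file from a list of remote paths. The helper name make_sftp_batch, the file names, and the host in the usage comment are hypothetical; the only thing taken as given is that OpenSSH's sftp accepts a batch file via -b.

```shell
#!/usr/bin/env bash
# Sketch: turn a list of remote paths (one per line on stdin) into an sftp batch file.
# make_sftp_batch is a hypothetical helper name, not a GDC tool.
make_sftp_batch() {
    local batch_file="$1" remote_path
    : > "$batch_file"                          # start with an empty batch file
    while IFS= read -r remote_path; do
        [ -z "$remote_path" ] && continue      # skip empty lines
        printf 'get "%s"\n' "$remote_path" >> "$batch_file"
    done
}

# Usage (hypothetical host and file names):
#   make_sftp_batch batch.txt < remote_files.txt
#   sftp -b batch.txt user@server.example.org
```

Keeping the list of files in a plain text file makes it easy to repeat the same download later or to compare it with what actually arrived.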

md5sum - You can use the md5sum Linux terminal command to check the integrity of the files being transferred. Any change to a file changes its MD5 hash, so a corrupted transfer can be detected by comparing the hash computed before the transfer with the one computed afterwards.

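## Create md5sum checksums for all the files
## (p*_run*_*_md5sum.txt is a placeholder: replace it with your own project/run name)
find . -name "*_R[12]*.fastq.gz" |\
 while IFS= read -r file ; do md5sum "$file"; done > p*_run*_*_md5sum.txt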
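
Once a checksum list exists, the files can be checked against it after the transfer. The following is a minimal sketch, assuming GNU md5sum; verify_md5 is a hypothetical helper name and the checksum file name in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# Sketch: re-check files against a previously written checksum list.
# verify_md5 is a hypothetical helper name; "md5sum -c" is the standard check mode.
# Paths in the list are resolved relative to the current directory.
verify_md5() {
    local checksum_file="$1"
    # --quiet prints only files that fail; the exit status is non-zero on any mismatch
    if md5sum -c --quiet "$checksum_file"; then
        echo "OK: all files match"
    else
        echo "FAILED: at least one file differs or is missing" >&2
        return 1
    fi
}

# Usage (example file name):
#   verify_md5 my_run_md5sum.txt
```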

Terminal Examples - Here are some simple terminal commands to help spot problems.

## Count R1 and R2 files
ls -al a_data/gz/*_R1*.f*q.gz | wc -l
ls -al a_data/gz/*_R2*.f*q.gz | wc -l
# n(R1)=n(R2)
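
If the two counts differ, it helps to know which file is missing its mate. A small sketch of such a check is shown here; check_pairs is a hypothetical helper, and it assumes that the mate of a file is found by swapping the first _R1 in the file name for _R2 (and vice versa).

```shell
#!/usr/bin/env bash
# Sketch: report R1 files without a matching R2 file, and R2 files without R1.
# check_pairs is a hypothetical helper; it assumes file names containing _R1 / _R2.
check_pairs() {
    local dir="$1" missing=0 f base
    for f in "$dir"/*_R1*.f*q.gz; do
        [ -e "$f" ] || continue                  # glob matched nothing
        base="${f##*/}"                          # file name without directory
        if [ ! -e "$dir/${base/_R1/_R2}" ]; then
            echo "missing R2 for: $base"
            missing=$((missing + 1))
        fi
    done
    for f in "$dir"/*_R2*.f*q.gz; do
        [ -e "$f" ] || continue
        base="${f##*/}"
        if [ ! -e "$dir/${base/_R2/_R1}" ]; then
            echo "missing R1 for: $base"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]                         # non-zero exit status if anything is unpaired
}

# Usage:
#   check_pairs a_data/gz
```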

## Get fastq header of the first read
zcat a_data/gz/*.f*q.gz | head -n 1
# Example @M01761:234:000000000-B32NW:1:2107:10522:1813 2:Y:0:CCTAAGAC+TAGCCTTA
#         @M01761:234 <- ID (instrument:run number)
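
To make the header easier to read, it can be split into named fields. The sketch below assumes the common Illumina layout (instrument:run:flowcell:lane:tile:x:y, then read:filtered:control:index after the space); parse_fastq_header is a hypothetical helper name, and headers from other platforms or older pipelines will not fit this pattern.

```shell
#!/usr/bin/env bash
# Sketch: split an Illumina-style FASTQ header into named fields.
# parse_fastq_header is a hypothetical helper name; it reads one header line from stdin.
parse_fastq_header() {
    awk '{
        sub(/^@/, "", $1)          # drop the leading "@"
        split($1, id, ":")         # instrument:run:flowcell:lane:tile:x:y
        split($2, info, ":")       # read:filtered:control:index
        printf "instrument=%s\nrun=%s\nflowcell=%s\nlane=%s\n", id[1], id[2], id[3], id[4]
        printf "read=%s\nfiltered=%s\nindex=%s\n", info[1], info[2], info[4]
    }'
}

# Usage:
#   zcat a_data/gz/*.f*q.gz | head -n 1 | parse_fastq_header
```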

## Count number of reads per file (R1 and R2 of a sample are listed separately)
zgrep -c "^+$" a_data/gz/*_R[12]*.f*q.gz

## Count number of total reads
zcat a_data/gz/*_R[12]*.f*q.gz | grep -c "^+$"

## Count number of undetermined (not demultiplexed) reads
zgrep -c "^+$" a_data/gz/Undetermined*_R[12]*.f*q.gz
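
The raw counts are easier to judge as a percentage. The following sketch relates the undetermined R1 reads to all R1 reads in a directory; count_reads and undetermined_fraction are hypothetical helper names, and the code assumes four-line FASTQ records and at least one Undetermined*_R1* file in the directory.

```shell
#!/usr/bin/env bash
# Sketch: percentage of reads that could not be assigned to a sample.
# count_reads and undetermined_fraction are hypothetical helper names.
count_reads() {
    # A FASTQ record has 4 lines, so reads = lines / 4 (assumes unwrapped records).
    # "gzip -dc" is equivalent to zcat.
    gzip -dc "$@" | awk 'END { printf "%d\n", NR / 4 }'
}

undetermined_fraction() {
    local dir="$1" total undet
    total=$(count_reads "$dir"/*_R1*.f*q.gz)               # includes the Undetermined file
    undet=$(count_reads "$dir"/Undetermined*_R1*.f*q.gz)
    awk -v u="$undet" -v t="$total" \
        'BEGIN { printf "%.2f\n", (t > 0 ? 100 * u / t : 0) }'
}

# Usage:
#   undetermined_fraction a_data/gz
```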

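## Check if all reads are from the same run / platform
## (prints each distinct instrument:run ID once; more than one line means mixed runs)
zcat a_data/gz/*.f*q.gz | awk 'NR % 4 == 1' | cut -d : -f 1,2 | sort -u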

## Read-length distribution (count, length), sorted numerically by length
zcat a_data/gz/sample_R1.f*q.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c
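
The check-list also suggests taking a few random reads and running BLAST on them. One possible way to pull such reads out as FASTA is sketched below; sample_reads_fasta is a hypothetical helper name, shuf is assumed to be available (GNU coreutils), and the code assumes four-line FASTQ records.

```shell
#!/usr/bin/env bash
# Sketch: draw N random reads from a gzipped FASTQ file and print them as FASTA.
# sample_reads_fasta is a hypothetical helper name; shuf is part of GNU coreutils.
sample_reads_fasta() {
    local fastq_gz="$1" n="${2:-5}"      # default: 5 reads
    gzip -dc "$fastq_gz" |
        paste - - - - |                  # one record per line: header, sequence, +, quality
        shuf -n "$n" |                   # pick n records at random
        awk -F '\t' '{ print ">" substr($1, 2); print $2 }'
}

# Usage:
#   sample_reads_fasta a_data/gz/sample_R1.fastq.gz 10 > random_reads.fasta
```

The resulting FASTA file can then be pasted into a BLAST search to see whether the hits match the organism you expect.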

QC Report - The data is often accompanied by a quality report. FastQC in combination with MultiQC is the preferred choice. Both applications come with detailed manuals. Read them carefully and try to understand the limitations. Remember that QC results are context dependent! A warning sign for some sequencing projects (e.g. a high duplication rate in genome sequencing) may be perfectly normal for others (e.g. a high duplication rate in RADseq data).

Basecaller Report - The basecall report is particularly interesting for PacBio data. The report is not easy to understand at first, but with a little experience and perhaps some help, it is useful in many ways.