Skip to content

GDC: Microbiota Data Analysis Workshop Data Download

Data Download¶

After days (or weeks) of waiting, you get the anticipate e-mail from the sequencing facility telling you that your data is ready for downloading. Once recovered, you might want to verify your data, archive it for safekeeping, and submit your data to a read archive. Below a few tips help for the data verification.

Check-List¶

Tip

Download data if possible via terminal (e.g. sftp)
Verify file integrity (md5sum)
Verify data - N(samples) = 2 x N(Files)
Get a few random reads and blast them
Check fastq headers - how many runs?
Run a quality control (e.g. USEARCH or FastQC)
Read size distribution (e.g. EMBOSS infoseq)
Check for PhiX contamination (e.g. USEARCH)
Have a closer look at your negative controls
Archive a copy of you raw data (somewhere safe)
Submit the raw data to a read archive (e.g. ENA)

FTP Download¶

## SFTP-Server (OpenBis FGCZ)
cd ${wd}
lftp sftp://jwalser@openbis-dsu.ethz.ch:2222
mirror -c -v -p —log=p007_run191117_16S.log --parallel=10 p007_run191117_16S

Copy Data¶

## SSH
scp -r data/*_R[12]*.f*q.gz jwalser@euler.ethz.ch:${pd}/p007/run191117_16S/a_data/

md5sum¶

find . -name "*_R[12]*.f*q.gz" | while read file ; do md5sum $file; done > p007_run191117_16S_md5sum.txt

Data format: fastq¶

FASTQ format is a text-based format for biological sequence (usually nucleotide sequence) and its corresponding quality scores.

Verify Data (Illumina paired-end)¶

## Count forward (R1) and reverse (R2) read files
ls -al *_R1*.f*q.gz | wc -l
ls -al *_R2*.f*q.gz | wc -l

## Get fastq header of first reads
zcat *.f*q.gz | head -n 1
# Example @M01761:234:000000000-B32NW:1:2107:10522:1813 2:Y:0:CCTAAGAC+TAGCCTTA
#         @M01761:234 <- ID

## Specific - count total number of reads 
zcat *_R[12]*.f*q.gz | grep -c "^@M01761"

## Generic - count total number 
zcat *_R[12]*.f*q.gz | echo $((`wc -l`/4))

## Generic and faster
parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.f*q.gz

## Check instrument ID 
zcat *_R[12]*.f*q.gz | grep "^@" | sort -u

## Quick Blast
# FASTQ.GZ to FASTA
zcat *R1*.f*q.gz | head -n 40 | sed '/^@/!d;s//>/;N' > blast_query.fa

Links¶