Data Download¶
After days (or weeks) of waiting, you get the anticipate e-mail from the sequencing facility telling you that your data is ready for downloading. Once recovered, you might want to verify your data, archive it for safekeeping, and submit your data to a read archive. Below a few tips help for the data verification.
Check-List¶
Tip
- Download data if possible via terminal (e.g. sftp)
- Verify file integrity (md5sum)
- Verify data - N(samples) = 2 x N(Files)
- Get a few random reads and blast them
- Check fastq headers - how many runs?
- Run a quality control (e.g. USEARCH or FastQC)
- Read size distribution (e.g. EMBOSS infoseq)
- Check for PhiX contamination (e.g. USEARCH)
- Have a closer look at your negative controls
- Archive a copy of you raw data (somewhere safe)
- Submit the raw data to a read archive (e.g. ENA)
FTP Download¶
## SFTP-Server (OpenBis FGCZ) cd ${wd} lftp sftp://jwalser@openbis-dsu.ethz.ch:2222 mirror -c -v -p —log=p007_run191117_16S.log --parallel=10 p007_run191117_16S
Copy Data¶
## SSH scp -r data/*_R[12]*.f*q.gz jwalser@euler.ethz.ch:${pd}/p007/run191117_16S/a_data/
md5sum¶
find . -name "*_R[12]*.f*q.gz" | while read file ; do md5sum $file; done > p007_run191117_16S_md5sum.txt
Data format: fastq¶
FASTQ format is a text-based format for biological sequence (usually nucleotide sequence) and its corresponding quality scores.
Verify Data (Illumina paired-end)¶
## Count forward (R1) and reverse (R2) read files ls -al *_R1*.f*q.gz | wc -l ls -al *_R2*.f*q.gz | wc -l ## Get fastq header of first reads zcat *.f*q.gz | head -n 1 # Example @M01761:234:000000000-B32NW:1:2107:10522:1813 2:Y:0:CCTAAGAC+TAGCCTTA # @M01761:234 <- ID ## Specific - count total number of reads zcat *_R[12]*.f*q.gz | grep -c "^@M01761" ## Generic - count total number zcat *_R[12]*.f*q.gz | echo $((`wc -l`/4)) ## Generic and faster parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.f*q.gz ## Check instrument ID zcat *_R[12]*.f*q.gz | grep "^@" | sort -u ## Quick Blast # FASTQ.GZ to FASTA zcat *R1*.f*q.gz | head -n 40 | sed '/^@/!d;s//>/;N' > blast_query.fa