RNA-Seq

Lecture notes¶

BLAST Example¶

Exercise #1

BLAST (basic local alignment search tool) can be used to identify (annotate) unknown sequences. Below are two short nucleotide sequences. The first sequence is a random sequence with a GC content of 50%. Please replace the first random fasta sequence with anything you like better as a random sequence. You might use the random DNA sequence generator or do it by hand. The second sequence is the unknown sequence we are interested in annotating. Now, blast both sequences against the NCBI nucleotide collection (default) by copy and paste the 29nt long sequences below into the query sequence field and click [BLAST].

(1) Open the webbrowser and go to the NCBI BLAST page:

NCBI BLAST

(2) Copy & paste

>randomSequence
ACCATATACTTACGCTGGATCTTCTCCCG
>shortUCE_LOL 29nt
AATGATACGGCGACCACCGAGATCTACAC

Hint

If you have problem to understand the BLAST results you might have a look at the Illumina Adpater document (link below).

Statistical Power of RNA-seq Experiments¶

First, we run a few sample-size power simulation in R using either RNASeqPower or PROPER. Both are Bioconductor packages and can be installed via the BiocManager.

In the short manual of RNASeqPower Steven Hart and Terry Therneau do a wonderful job describing the problems of the experimental design of an RNS-seq experiment.

For the next exercise we we download an R script and use RStudio. There are two ways to get the prepared R script. Either you are using the command line to download the R script to your local hard drive and open it with RStudio or you download the script directly with R. Choose the one you prefer or try both to compare.

## Get R script via terminal
wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/scripts/PowerAnalysis.R"
## Get R script via R console
# You might have to install HelpersMG package first?
# install.packages("HelpersMG")
library("HelpersMG")
wget("http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/scripts/PowerAnalysis.R")
# Where did the script go?
getwd()
# Open file
file.edit("PowerAnalysis.R")

Exercise #2

Open the R script you just downloaded in RStudio and try to understand the command lines. Adjust, correct, extend the R commands to your liking! You might what to convert the short R script into a R Notebook?

RNA-Seq Data Analysis - Example for DESeq2¶

Special thanks to Heidi E.L. Tschanz-Lischer

Exercise #3

Following an step-by-step exercise in R where we try to find the genes showing differential expression between newborns (N), juvenil (J) and adults (A) mice. For both life-stages males and females were sequenced, which will be used as biological replicates. Copy and paste the R code junks but think about the meaning of each step.

A. Have a look at the read-count table.

wget("http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/data/RNAseq_Mus.txt")

B. Import the read-count table:

countTable <- read.delim("RNAseq_Mus.txt", header=TRUE, row.names=1)

Have a look at the first few rows of the table:

head(countTable)

C. Create a dataframe with the conditions of the samples. Each row should correspond to a sample with sample names as row names:

group <- data.frame(condition=c("N","N","J","J","A","A"),
  sex= c("f","m","f","m","f","m"),
  row.names=names(countTable))
group

D. Load the DESeq2 library and set up the count data set, with the two conditions as factor levels:

## DESeq2 is a bioconductor package and can be installed as follow:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
# > This might take a while since DESeq2 depends on other libraries.  
library(DESeq2)
ddsMF <- DESeqDataSetFromMatrix(
  countData = countTable,
  colData = group,
  design = ~sex + condition)

E. The standard differential expression analysis steps are wrapped into a single function: DESeq. To get differentially expressed genes between newborns and adults you just need to execute following commands:

ddsMF <- DESeq(ddsMF)
res <- results(ddsMF , contrast=c("condition","N","A"))
head(res)

To get a description of the columns in res, execute following command:

mcols(res)$description

The DESeq function performs following steps, which can also be run manually:

# Normalize the count data set by estimating the size factor:
ddsMF <- estimateSizeFactors(ddsMF)
#If each column of the count table is divided by the size factor for this column, the count values are brought to a common scale. To view the size Factors you can execute following command:
sizeFactors(ddsMF)
# Estimate variance (dispersion) of gene expression between biological replicates. If the gene expression differs between replicates by 20%, then the gene’s dispersion is 0.2^2=0.04.
ddsMF <- estimateDispersions(ddsMF)
# To plot the extimated variance (red line= fitted dispersion):
plotDispEsts(ddsMF)
# Get differentially expressed genes:
ddsMF <- nbinomWaldTest(ddsMF)

F. Plot a PCA of the log2 transformed normalized counts:

rld <- rlog(ddsMF, blind=FALSE)
plotPCA(rld)

G. Get a summary of the differential expressed genes at an adjusted p-value cutoff (FDR) < 5%:

summary(res, alpha=0.05)

H. Plot results with all genes differentially expressed (red: FDR < 5% in red):

plotMA(res, alpha=0.05)

I. Plot count values of the most significant differential expressed gene:

plotCounts(ddsMF, gene=which.min(res$padj))

J. Order results according strength of differential expression:

resOrdered <- res[order(res$padj), ]

K. Get list of over-expressed genes in newborns/adults at FDR 5%:

overN <- resOrdered[resOrdered$padj < 0.05 & !is.na(resOrdered$padj) & resOrdered$log2FoldChange>0, ] 
# Plot
nrow(overN)
 overA <- resOrdered[resOrdered$padj < 0.05 & !
 is.na(resOrdered$padj) & resOrdered$log2FoldChange<0, ] 
nrow(overA)

Write differential expression table in a text file:

write.table(resOrdered, "diffExp_N-A.txt", sep="\t", row.names=TRUE)

Question

How many genes are differentially expressed at a FDR of 1%?

Solution

Recycle the R code from above but change the threshold accordingly.

Links¶

RNA-Seq News / Infos

Evaluation of RNA-seq data quality

RSeQC: An RNA-seq Quality Control Package

Read-Mappers:

Super-Fast alternative to conventional Read-Mappers:

Mapping-Viewer

Integrative Genomics Viewer (IGV)

Illumina Adapter Sequences

Illumina Adapters