RNA-Seq¶

DEGs

Lecture Notes¶

RNA-sequencing analyzes the transcriptome. Some of the most popular techniques are transcriptional (expression) profiling, differential gene expression analysis, RNA editing and even SNP identification. We will focus on the analysis to identify differential expressed genes (DEGs). There are many methods but no consensus about the most appropriate workflow to identifying differentially expressed genes from RNA-Seq data. Have a look at Costa-Silva et. al (2017) for am extended review on the topic.

Workflow

Quality Control¶

Content Bias - Depending on the library preparation method, QC of RNA-Seq data sets often show a pronounced sequence composition bias at the beginning (first ~12 bases) of the reads. The origin of this bias is related to the random priming step in the library production. The priming should be equal efficiency for all hexamer primers, which is not the case. Trimming the reads would remove the (visual) bias, but it would be wrong and not solve the problem. The QC warning is correct, but there is not much you can / should do.

Workflow

Contaminations - Getting high-quality intact total RNA is often the first step in many biological applications aimed at measuring gene expression. A tough task and it is not uncommon to sequences short fragments. Keep an eye open for adaptor contamination.

With the careful extraction of RNA we are not only isolating mRNA but other RNA molecules. Some library preparation workflows include a ribosomal RNA depletion step.

Duplicate Sequences - Nothing to worry about. We expect high(er) level of duplication in RNA-Seq datasets. Specific enrichments is a characteristic of RNA-Seq and does not indicating over-sequencing. A de-duplication step would remove the expression signal.

RNA Integrity¶

A QC like provided with FastQC does not reflect the true quality of RNA-Seq data. You might consider transcript integrity number (TIN) score PCA plot to infer the true quality (Son et al. 2018).

Data Filtering¶

With the careful extraction of RNA we are not only isolating mRNA but other RNA molecules. Some library preparation workflows include a ribosomal RNA depletion step.

Approximately 80 percent of the total RNA in rapidly growing mammalian cells (e.g., cultured HeLa cells) is rRNA, and 15 percent is tRNA; protein-coding mRNA thus constitutes only a small portion of the total RNA. The primary transcripts produced from most rRNA genes and from tRNA genes, like pre-mRNAs, are extensively processed to yield the mature, functional forms of these RNAs. < Molecular Cell Biology>

Therefore, it is important to screen (and filter) the sequence reads for non mRNA related data. A convenient screen application would be FastQ Screen

Statistical Power of RNA-seq Experiments¶

First, we run a few sample-size power simulation in R using either RNASeqPower or PROPER. Both are Bioconductor packages and can be installed via the BiocManager.

In the short manual of RNASeqPower Steven Hart and Terry Therneau do a wonderful job describing the problems of the experimental design of an RNA-seq experiment.

Exercise: Power Analysis¶

For the RNA-Seq exercise, we focus on the data analysis in R. First you have to download the R script first. You have two choices. Either you are using the wget command to download the R script to your local hard drive and open it with RStudio or you download and open the script directly within RStudio. Choose the one you prefer or try both to compare.

Terminal¶

## Working Directory
mkdir ${HOME}/RNASeq
cd ${HOME}/RNASeq

## Get R script via terminal
wget "http://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/scripts/PowerAnalysis.R"

RStudio¶

## --------------------------------
## Get R script via R console
## --------------------------------

# We use wget from the HelpersMG package. 
# Check if you have the package already installed:
# installed.packages(HelpersMG)
# If not, you have to install the package:
# install.packages("HelpersMG")

library("HelpersMG")

## Working Directory
getwd()                # current directory
#setwd()               # set directroy if needed
dir.create("RNASeq")   # new working directory
setwd("RNASeq")        # ready to go

## Downlaod R script
wget("http://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/scripts/PowerAnalysis.R")

## Open the script
file.edit("PowerAnalysis.R")

The benefit of power analysis might debatable, but these simulations help to understand the importance of biological replicates. Therefore, it is essential that you are not just copy and paste the code but understand it.

Tip

RNA-Seq experiments are powerful if well designed but limited if not. The within sample variations has to be low or the number of replicates (very) high if you want to trust in the results of your differential expression study.

RNA-Seq Data Analysis¶

Exercise: DESeq2¶

◇ Special thanks to Heidi E.L. Tschanz-Lischer

The next exercise is a step-by-step RNA-Seq analysis to find differential expression genes in newborn (N), juvenile (J) and adult (A) mice. Male and females from all life-stages were sequenced.

## Download the Count Table  
library(HelpersMG)
wget("http://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/RNAseq_Mus.txt")

## Import Count Table
countTable <- read.delim("RNAseq_Mus.txt", header = TRUE, row.names = 1)

## Data format
head(countTable)
# ENSMUSG - Ensemble Mouse (Mus musculus) Genome
# Example:
# ENSMUSG00000020333 > Acsl6 (acyl-CoA synthetase long-chain family member 6)
# This gene has 15 transcripts (splice variants), 161 orthologues, and 13 paralogues.
# f/m = female/male; N/J/A: newborn/juvenile/adults

We need a sample-design table:

group <- data.frame(condition = c("N","N","J","J","A","A"),
                    sex = c("f","m","f","m","f","m"),
                    row.names = names(countTable))
group

We are using the Bioconductor package DESeq2 to find DEGs.

## Installed DESeq2 
# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

BiocManager::install("DESeq2")
# > This might take a while since DESeq2 depends on other libraries.  

## Load Library
library(DESeq2); packageVersion("DESeq2")

## DESeq Data Set
dds <- DESeqDataSetFromMatrix(
                 countData = countTable,
                 colData   = group,
                 design    = ~sex + condition)

Plot a PCA & Clusters of log2 transformed normalized counts to check for outliers.

## Regularized Log Transformation
rld   <- rlog(dds, blind = FALSE)
dist  <- dist(t(assay(rld)))
hc    <- hclust(dist)

## PCA Plot
plotPCA(rld)

## Cluster
plot(hc)

⟹  No outliers, m & f (replicates) of the same age group together.         
⟹  Newborn and juvenile are more similar.

The standard differential expression analysis steps are all wrapped into a single function: DESeq. To get differentially expressed genes between e.g. newborns and adults you need to execute following commands:

# Standard Analysis
deg    <- DESeq(dds)
res.NA <- results(deg, contrast = c("condition", "N", "A"))
res.JA <- results(deg, contrast = c("condition", "J", "A"))
res.NJ <- results(deg, contrast = c("condition", "N", "J"))

# Results
head(res.NA)
# Meaning
mcols(res.NA)$description

# => e.g. Acsl6 (ENSMUSG00000020333) is not significantly different expressed between newborn and adults.
#                                    is significantly different expressed between juvenile and adults.
#                                    is not significantly different expressed between newborn and juvenile.

The DESeq function performs following steps, which can be run manually:

# Normalize the count data set by estimating the size factor:
esf <- estimateSizeFactors(dds)

# If each column of the count table is divided by the size factor for this column,
# the count values are brought to a common scale. 
# To view the size Factors you can execute following command:
sizeFactors(esf)

# Estimate variance (dispersion) of gene expression between biological replicates.
# We use M and F as biological replicates

# If the gene expression differs between replicates by 20%, then the gene’s dispersion is 0.2^2 = 0.04.
disp <- estimateDispersions(esf)

# Plot extimated variance (red line= fitted dispersion):
plotDispEsts(disp)

# Get differentially expressed genes:
deg.manual <- nbinomWaldTest(disp)

DEGs with adjusted p-value cutoff (FDR) < 5%:

## Results
summary(res.NA, alpha = 0.05)
summary(res.JA, alpha = 0.05)
summary(res.NJ, alpha = 0.05)

# Plot all DEGs (FDR < 5% in red):

par(mfrow = c(3,1))
 plotMA(res.NA, alpha = 0.05, main = "Newborn-Adult")
 plotMA(res.JA, alpha = 0.05, main = "Juvenile-Adult")
 plotMA(res.NJ, alpha = 0.05, main = "Newborn-Juvenile")
par(mfrow = c(1,1))

Individual DEGs:

# Plot normalized counts for a single gene
plotCounts(dds, gene = "ENSMUSG00000020333")

# Plot count of the most significant DEG for N-A:
plotCounts(dds, gene = which.min(res.NA$padj))
# res.NA[which.min(res.NA$padj),]
# ⟹  ENSMUSG00000054932: Afp (alpha-foetoprotein - plasma protein)

# Plot count of the most significant DEG for J-A:
plotCounts(dds, gene = which.min(res.JA$padj))
# res.JA[which.min(res.JA$padj),]
# ⟹  ENSMUSG00000005547: Cyp2a5 (cytochrome P450)

# Plot count of the most significant DEG for N-J:
plotCounts(dds, gene = which.min(res.NJ$padj))
# res.NJ[which.min(res.NJ$padj),]
# ⟹  ENSMUSG00000032517: Mobp (myelin-associated oligodendrocytic basic protein)

# Order results according strength of differential expression:
res.NA.ordered <- res.NA[order(res.NA$padj), ]

Get list of over-expressed genes in newborns/adults at FDR 5%:

## over-expressed genes in N (bigger 0)
overN <- res.NA.ordered[res.NA.ordered$padj < 0.05 & !
is.na(res.NA.ordered$padj) & res.NA.ordered$log2FoldChange > 0, ] 
nrow(overN)

## over-expressed genes in A (smaller 0) 
overA <- res.NA.ordered[res.NA.ordered$padj < 0.05 & !
is.na(res.NA.ordered$padj) & res.NA.ordered$log2FoldChange < 0, ] 
nrow(overA)

Write differential expression table in a text file:

write.table(res.NA.ordered, "DEGs_Newborn_Adults.txt", sep = "\t", row.names = TRUE)

The Bioconductor BiomaRt R package is a quick, easy and powerful way to access BioMart right from your R software terminal.

# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install("biomaRt")

library(biomaRt)

## Available Datasets
# mart = useEnsembl('ENSEMBL_MART_ENSEMBL')
# listDatasets(mart)

ensembl = useEnsembl(biomart="ensembl", dataset="mmusculus_gene_ensembl")

## Available Attributes
# listAttributes(ensembl)

## Dataset for Mouse Genome
genes <- getBM(attributes=c('ensembl_gene_id',
                            'ensembl_transcript_id',
                            'description',
                            'external_gene_name'
                            'chromosome_name',
                            'start_position',
                            'end_position'),
                            filters = 'chromosome_name', values = c(1:20), mart = ensembl)

# Gene Name
unique(genes$external_gene_name[genes$ensembl_gene_id == "ENSMUSG00000032517"])
# Gene description
unique(genes$description[genes$ensembl_gene_id == "ENSMUSG00000032517"])
# Chromosome
unique(genes$chromosome_name[genes$ensembl_gene_id == "ENSMUSG00000032517"])

RNA-Seq¶

Lecture Notes¶

Quality Control¶

RNA Integrity¶

Data Filtering¶

Statistical Power of RNA-seq Experiments¶

Exercise: Power Analysis¶

Terminal¶

RStudio¶

RNA-Seq Data Analysis¶

Exercise: DESeq2¶

Additional Information¶

Analysis Workflows¶

RNA-Seq News / Infos¶

Evaluation of RNA-Seq data quality¶

Read-Mappers¶

Super-Fast alternative to conventional Read-Mappers:¶

Mapping-Viewer¶

Literature¶