Reproducible Research

Lecture Notes

Challenges

Post your comments or questions directly on the white board.

Markdown

Most of the primary data analysis is done in the Linux environment, where you run many different commands (saved, for example, in your bash history). To make your logs and scripts more reproducible, we are going to use a markdown editor. Download and install Haroopad or MacDown and explore the editor yourself. Try to generate titles of different sizes and add plain text, code, and pictures, using either the features in the insert menu or the cheat sheet. Try to recycle code snippets from the terminal challenge and add comments. Then export your file and save it as HTML or PDF. What is the advantage of the PDF version? If you would like to present your results in a table, you can use web tools like TableConvert to reformat text files to markdown.
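A markdown log can also be written straight from the terminal. The following is only a sketch: the file name analysis_log.md and the logged grep command are made-up examples, not part of the course material.

```shell
# Write a small markdown log; file name and logged command are illustrative only.
log=analysis_log.md
fence=$(printf '\140\140\140')   # three backticks, built from octal escapes

{
  printf '# Terminal challenge log\n\n'
  printf 'Count the sequences in a FASTA file.\n\n'
  printf '%sbash\n' "$fence"                    # open a fenced code block
  printf '%s\n' "grep -c '>' Rep_regex.fas"     # the command being documented
  printf '%s\n' "$fence"                        # close the fenced code block
} > "$log"

cat "$log"
```

The resulting file can be opened in Haroopad or MacDown and exported like any other markdown document.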

RegEx

Quote

A regular expression is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation. The best-known special character is *: as a shell wildcard it matches any string, whereas in a regular expression it means "zero or more of the preceding element", so .* is the pattern that matches everything.

RegEx can be used in different languages (R, bash, python, ...), but the syntax often differs slightly between them. We are going to explore them in an editor. Advanced users can use bash or any other language as well.
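For those trying bash, here is a small sketch of how * behaves in a shell pattern compared with a regular expression. The helper names is_fasta and matches_regex and the test strings are hypothetical and chosen only for illustration.

```shell
# Shell glob: * on its own stands for any string of characters.
is_fasta() {
  case "$1" in
    *.fas) echo "yes" ;;
    *)     echo "no" ;;
  esac
}
is_fasta "Rep_regex.fas"
is_fasta "notes.txt"

# Regular expression: * repeats the element before it (zero or more times),
# so ab*c accepts ac, abc and abbc, but not adc.
matches_regex() {
  printf '%s\n' "$@" | grep -E '^ab*c$'
}
matches_regex ac abc abbc adc
```

The first function prints yes for the .fas file name and no for the other; the second prints only the three strings that fit the pattern.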

Download the table from the course website here and open it with the Atom editor. Go to the find and replace section and activate the regex mode.

Remove the ending (_L*) from each fasta header. The cheat sheet might be helpful.

Atom suggestion

>([A-Z0-9]+)_L[0-9]+

>$1

Bash suggestion

sed -E 's|>([A-Z0-9]+)_L[0-9]+|>\1|g' Rep_regex.fas

You often need to make sure that your pattern covers all possible cases, and finding the correct RegEx can be challenging. Testers are available for this and are often very helpful.
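One way to check a pattern without a web tester is to run it against a few hand-written lines first. The sketch below assumes headers made of uppercase letters and digits followed by _L and a number; the sample headers and the function name strip_suffix are invented for illustration and may not match the real file.

```shell
# Remove a trailing _L<digits> from FASTA header lines read on stdin.
strip_suffix() {
  sed -E 's/^>([A-Z0-9]+)_L[0-9]+$/>\1/'
}

# Made-up test lines: two headers that should change, one sequence line,
# and one lowercase header that the pattern deliberately does not cover.
printf '%s\n' '>AB123_L001' 'ACGT' '>XY9_L12' '>ab7_L3' | strip_suffix
```

The lowercase header passes through unchanged, which is the kind of overlooked case that such a quick test is meant to reveal.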

RMarkdown

To conduct the analysis and generate a report at the same time, we are going to use RMarkdown. Once you have reduced the data in the primary analysis, RMarkdown offers you a powerful tool for reproducible research.

Set up a markdown document using RStudio. Explore the markdown editor yourself using the example.

We have used the code from the R challenge below to reformat a table. The aim is to use RMarkdown to write reproducible R code (hint: use styler and add comments).

If you would like to learn more about RMarkdown, here is a nice tutorial.

## base-version
sex = read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"));sex$Genus <- gsub("silene", "Silene", sex$Genus)
sex$Sexual_System = gsub("gynodioecy","non_hermaphrodite",sex$Sexual_System);sex$Sexual_System<-gsub("trioecy", "non_hermaphrodite",sex$Sexual_System);sex$Sexual_System<-gsub("dioecy", "non_hermaphrodite",sex$Sexual_System)
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

## tidy-version 
library(tidyverse);sex <- read_csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>% mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%mutate_if(is.character, str_replace_all, "silene", "Silene") %>%arrange(Sexual_system, Species)

Additional Information

R Code with Style

Style Guides

Picture source: xkcd webcomic, https://xkcd.com

Style Packages

styler - install.packages("styler")
lintr - devtools::install_github("jimhester/lintr")
formatR / Help with formatR - install.packages("formatR")
broom / Overview - install.packages("broom")

Code Folding

RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.

To insert a new code section you can use the Code > Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.

## Setup ----

## clean/reset environment
rm(list = ls())

## R and Bioconductor libraries
library(ggplot2)
library(phyloseq) # provides import_qiime() used below

## Data Import ----

otufile <- "ZOTU_c99_Count_Sintax.txt"
mapfile <- "MapFile.txt"

## Import into Phyloseq
d.ZOTU <- import_qiime(otufilename = otufile, mapfilename = mapfile)
d.ZOTU

Tabs in rmarkdown

## data {.tabset}

### summary

```{r}
summary(iris)
```

### scatter plot
```{r}
plot(iris$Sepal.Length,iris$Species)
```

### boxplot
```{r}
boxplot(iris$Sepal.Length~iris$Species)
```