Reproducible Research


Challenges

Post your comments or questions directly in the Google group.

Markdown

Most of the primary data analysis is done in the Linux environment, where you will use many different commands that are recorded, for example, in your bash history. To make your logs and scripts more reproducible, we are going to use a markdown editor. Download and install Haroopad or MacDown and explore the editor yourself. Try to generate titles of different sizes and to add plain text, code, and pictures, using either the features in the Insert menu or the cheat sheet. Try to recycle code snippets from the terminal challenge and add comments. Finally, export your file and save it as HTML or PDF.

RegEx

Quote

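A regular expression is a sequence of characters that defines a search pattern. Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation. The best-known special character is probably the asterisk *. As a shell wildcard it matches everything; in a regular expression it means "zero or more of the preceding element", so the match-everything pattern is .* instead.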
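
The three uses named in the quote (find, find and replace, input validation) can be sketched in base R. The example strings are made up for illustration and are not part of the course data; `grepl()` and `sub()` are standard base R functions.

```r
# Made-up example strings, not course data
ids <- c("sample_01", "sample_2", "control_10")

# Find: which strings contain "sample_" followed by at least one digit?
has_sample <- grepl("sample_[0-9]+", ids)

# Find and replace: strip the trailing "_<digits>" suffix
stripped <- sub("_[0-9]+$", "", ids)

# Input validation: the whole string must be lowercase letters,
# an underscore, and exactly two digits
valid <- grepl("^[a-z]+_[0-9]{2}$", ids)

has_sample
stripped
valid
```

The anchors `^` and `$` in the validation pattern force the whole string to match, which is why `sample_2` fails that check even though it passes the looser "find" pattern.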

Download the table from the course website here and open it with the Atom editor. Open the find-and-replace panel and activate regex mode.

Remove the ending (_L*) from each FASTA header. The cheat sheet might be helpful.

Suggestion

>([A-Z0-9]+)_L[0-9]+

>$1
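
The same replacement can be tried outside the editor. The following is a minimal sketch in base R, assuming headers made of uppercase letters and digits followed by `_L` and a number; the two header strings are hypothetical and the real file may look different. Note that R writes the backreference as `\\1` where Atom uses `$1`, and that the anchors `^` and `$` are an addition here to make the match explicit about covering the whole header.

```r
# Hypothetical FASTA headers; the real file may use other identifiers
headers <- c(">SEQ001_L150", ">AB12_L2048")

# Capture the ID after ">", drop the "_L<digits>" ending,
# and write back ">" plus the captured ID
clean <- sub("^>([A-Z0-9]+)_L[0-9]+$", ">\\1", headers)

clean
```

Headers that do not fit the pattern are returned unchanged by `sub()`, so comparing input and output is a quick way to spot cases the pattern misses.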

Making sure that a pattern covers all possible cases can be challenging, and finding the correct RegEx often takes a few attempts. Online RegEx testers are available and are often very helpful.

RMarkdown

To conduct the analysis and generate a report at the same time, we are going to use RMarkdown. Once the primary analysis has reduced the data to a manageable size, RMarkdown is a powerful tool for reproducible research.

Set up a markdown document using RStudio. Explore the markdown editor yourself using the example.

The code below, taken from the R challenge, reformats a table. The aim is to use RMarkdown to turn it into reproducible R code (hint: use styler and add comments).

If you would like to learn more about RMarkdown, here is a nice tutorial.

## base-version
sex = read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"));sex$Genus <- gsub("silene", "Silene", sex$Genus)
sex$Sexual_System = gsub("gynodioecy","non_hermaphrodite",sex$Sexual_System);sex$Sexual_System<-gsub("trioecy", "non_hermaphrodite",sex$Sexual_System);sex$Sexual_System<-gsub("dioecy", "non_hermaphrodite",sex$Sexual_System)
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]
## tidy-version 
library(tidyverse);sex <- read_csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>% mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%mutate_if(is.character, str_replace_all, "silene", "Silene") %>%arrange(Sexual_system, Species)
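
As one possible outcome of the restyling, here is a sketch of the base version with one statement per line, consistent `<-` assignment, and comments. To keep it self-contained it runs on a small toy data frame instead of the course file; the rows (`sp_a` to `sp_d`) are made up and only mimic the three columns. Merging the three `gsub()` calls into a single alternation pattern is a choice made for this sketch, not something the original code does.

```r
# Toy stand-in for the course table (made-up rows, not the real data)
sex <- data.frame(
  Genus = c("silene", "Silene", "silene", "Silene"),
  Species = c("sp_c", "sp_a", "sp_d", "sp_b"),
  Sexual_System = c("dioecy", "hermaphrodite", "gynodioecy", "trioecy"),
  stringsAsFactors = FALSE
)

# Harmonise the genus spelling
sex$Genus <- gsub("silene", "Silene", sex$Genus)

# Collapse all non-hermaphroditic systems into one category
sex$Sexual_System <- gsub(
  "gynodioecy|trioecy|dioecy", "non_hermaphrodite", sex$Sexual_System
)

# Sort by sexual system first, then by species
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

sex_sort
```

Placed in an RMarkdown chunk, each commented step can also be split into its own chunk with explanatory text in between.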

Additional Information

R Code with Style

Style Guides

Picture source: https://xkcd.com (xkcd webcomic)

Style Packages

Code Folding

RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.

To insert a new code section you can use the Code > Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.

## Setup ----

## clean/reset environment
rm(list = ls())

## R and Bioconductor libraries
library(ggplot2)
library(phyloseq) # provides import_qiime()

## Data Import ----

otufile <- "ZOTU_c99_Count_Sintax.txt"
mapfile <- "MapFile.txt"

## Import into Phyloseq
d.ZOTU <- import_qiime(otufilename = otufile, mapfilename = mapfile)
d.ZOTU