R - UniBas - Evolutionary Genetics

Learning Objectives

◇ Learn the fundamental syntax of R, including working with variables, assignments, and basic operators.
◇ Learn how to install, load, and use R packages from different sources to extend R's functionality.
◇ Adopt coding best practices, including consistent code style, effective commenting, and clear documentation.

Knowing R is crucial for bioinformaticians, not only for its statistical capabilities but also for data visualization, graphics, and general computing. R is widely used in bioinformatics for analyzing complex datasets, and its visualization tools, like ggplot2, produce high-quality plots essential for research communication.

Beyond statistics, R is versatile for data manipulation, workflow automation, and integrating with other tools. Learning R properly, with a focus on best practices and style guides, ensures code is efficient, readable, and reusable. This gives students a strong foundation, making them more effective and versatile in research and data-driven fields.

Repositories for R Packages

Repositories in R serve as centralised locations where users can find, install and share R packages. CRAN (Comprehensive R Archive Network) is the primary repository, hosting thousands of packages that meet strict quality standards. It is the most trusted source for general-purpose R packages. Bioconductor specialises in bioinformatics and computational biology packages, providing tools for genomic data analysis. GitHub is a more flexible platform where developers share cutting-edge or experimental packages, allowing users to install packages directly from repositories under development, often before they are officially released on CRAN.

CRAN - official R repository
Bioconductor - topic specific repository
GitHub - most popular hosting platform for open-source projects, but not R-specific

Package Installation

The commands for installing R packages depend on the repository.

### CRAN repository

install.packages("package")
install.packages(c("packageA", "packageB"))

### Bioconductor

## R version 3.6+
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("package")

## Older R versions (legacy: biocLite() has since been replaced by BiocManager)
source("https://bioconductor.org/biocLite.R")
biocLite("package")

### GitHub
install.packages("devtools")
devtools::install_github("<author>/<package>")

pak - A Fresh Approach to R Package Installation

pak installs R packages from CRAN, Bioconductor, GitHub, URLs, git repositories, local files and directories. It is an alternative to install.packages() and devtools::install_github(). pak is fast, safe and convenient.
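As a rough sketch of what this looks like in practice, the calls below use pak::pkg_install() with pak's package-reference syntax. The function name and the `bioc::` and `<author>/<package>` reference forms are taken from pak's own documentation rather than from this page, and the package names are only examples. The calls are wrapped in a hypothetical helper so that nothing is installed until you run it.

```r
# Install pak itself once from CRAN:
# install.packages("pak")

# Hypothetical helper: install packages from three sources with pak
install_with_pak <- function() {
  pak::pkg_install("dplyr")             # CRAN package
  pak::pkg_install("bioc::phyloseq")    # Bioconductor package
  pak::pkg_install("tidyverse/dplyr")   # GitHub repository (<author>/<package>)
}

# Run this line when you actually want to install:
# install_with_pak()
```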

Package Info

It is always a good idea to check the basic information about a package, ideally before installing it and certainly afterwards.

packageDescription("package")
help(package = "package")
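To pull out individual pieces of that information, the following minimal sketch reads a few DESCRIPTION fields and lists exported functions. It uses the stats package only because it ships with every R installation; substitute any installed package name.

```r
pkg <- "stats"  # example only; any installed package works

# Read selected fields from the package's DESCRIPTION file
desc <- packageDescription(pkg)
cat("Package:", desc$Package, "\n")
cat("Version:", desc$Version, "\n")
cat("License:", desc$License, "\n")

# List a few of the functions the package exports
exported <- sort(getNamespaceExports(pkg))
head(exported)
```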

Manage Packages

# List all installed packages
installed.packages()
# Get Package version
packageVersion("fun")
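# Update a package (the first argument of update.packages() is a library path,
# so the package name must be passed via oldPkgs)
update.packages(oldPkgs = "fun")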
# Load a package
library("fun")
# Unload a package
detach("package:fun", unload = TRUE)
# Remove a package
remove.packages("fun")
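Before loading or updating, it can help to check whether a package is present and recent enough. The sketch below is one possible approach; `is_installed()` is a hypothetical helper name, and the second package name is a placeholder assumed not to exist.

```r
# Hypothetical helper: TRUE if a package is installed (without loading it)
is_installed <- function(pkg) {
  nzchar(system.file(package = pkg))
}

is_installed("stats")                 # ships with R
is_installed("notARealPackage123")    # placeholder name, assumed not installed

# Compare an installed version against a minimum requirement
packageVersion("stats") >= "3.6.0"
```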

Example(s)

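Install dplyr as part of the tidyverse (a collection of data science packages):

install.packages("tidyverse")
library(tidyverse)  # attach the core tidyverse packages, including dplyr
search()            # list the attached packages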

Alternatively, install just dplyr:

install.packages("dplyr")

Or the development version from GitHub:

install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

Load multiple CRAN packages

# Package list
package.list <- c("ggplot2", "RColorBrewer", "ggpubr")

# Load each package; install it first if it is missing
package.manager <- lapply(
  package.list,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

Avoiding Conflicts

Using specific functions from an R package without loading the entire package can be advantageous for several reasons:

Minimize Namespace Conflicts - When you load an entire package with library(package), all its exported functions are attached to the search path, which can lead to conflicts (masking) if other attached packages have functions with the same name. Using functions with package::function() syntax avoids these conflicts by specifying exactly which package the function comes from.
Improve Code Readability - Explicitly specifying the package can make your code clearer and more understandable, as it indicates the source of each function used.
Reduce Overhead - package::function() loads the package's namespace but does not attach it, so none of its other functions are added to the search path. The saving in memory is small, because the namespace itself is still loaded.

Suppose you want to use the filter() function from the dplyr package but don’t need the other functions provided by dplyr. Instead of loading the entire package, you can use:

# Use the filter function from dplyr without attaching the package
# (no library(dplyr) call is needed; dplyr only has to be installed)

# Example data frame
data <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

# Filtering rows where value is greater than 20
filtered_data <- dplyr::filter(data, value > 20)
print(filtered_data)

In this example, dplyr::filter() specifies that the filter function from the dplyr package should be used. This method avoids possible naming conflicts with filter functions from other packages, such as stats::filter(), which is attached by default, and makes it clear where the function comes from.
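To see such a conflict without installing anything, the sketch below uses only base R. It relies on the stats package being attached by default, which holds for a standard R session.

```r
# Which attached packages define an object called "filter"?
find("filter")

# With dplyr attached, a bare filter() call would use dplyr's version.
# The prefix always selects the intended one, here the time-series filter:
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(moving_avg)

# List every name that is currently masked on the search path
conflicts(detail = TRUE)
```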

R Code with Style

First and Last Lines of any R-Script

To ensure your R code is clean and efficient, it’s important to adopt good habits from the start. Here are two key practices for writing scripts:

Start with a clean workspace - Before starting a new project, it's important to clean up your environment of any leftover variables or objects from previous sessions, as these can unintentionally affect your current work. In RStudio, the 'broom' icon does this visually, but in R scripts you should use the following command:

## clean/reset environment
rm(list = ls())

Log Your Session Information - Since different users might work with various R versions, packages, or settings, sharing a session log at the end of your script helps avoid conflicts and aids troubleshooting. This log captures important details about your environment, including R version, platform, and loaded packages. Add one of these to the end of your script:

## Version 1: Session-Log
writeLines(capture.output(sessionInfo()), "SessionInfo.txt")

## Version 2: Session-Log
sink("SessionInfo.txt")
  print(sessionInfo())  # print() is needed when the script is run via source()
sink()

Style Guides

In programming, a style guide is a set of conventions for writing clean, consistent, and readable code. In R, using a style guide ensures that your code is easy to understand, both for yourself and for others. It promotes consistency in naming conventions, formatting, indentation, and commenting, which reduces errors and makes collaboration easier.

Style guides are important because they help make code more maintainable and scalable. By following one, you ensure that your work adheres to best practices, which improves the readability, efficiency, and reusability of your code. Adopting a style guide early builds habits that lead to professional-quality code.

Here are five key style guide recommendations for R, each with an example:

(1) Use Descriptive and Consistent Variable Names - Choose meaningful names that convey the purpose of the variable, and be consistent in your naming conventions (e.g., use snake_case or camelCase, but not both).

Example:

total_counts <- 100  # clear and descriptive
tc <- 100            # unclear and vague

(2) Keep Lines of Code Short - Limit the length of your code lines to around 80 characters to improve readability.

Example:

# Bad: Too long and hard to read
summary_stats <- data %>% group_by(group_var) %>% summarise(mean = mean(value), sd = sd(value))

# Good: Broken into multiple lines
summary_stats <- data %>%
  group_by(group_var) %>%
  summarise(mean = mean(value), sd = sd(value))

(3) Use Spaces Around Operators - Always add spaces around operators like =, +, -, and <- to make the code more readable.

Example:

x <- 5 * 3 + 2   # Good: Readable
x<-5*3+2         # Bad: Harder to read

(4) Write Clear and Concise Comments - Add comments to explain complex or non-obvious parts of your code. Keep them concise and relevant.

Example:

# Calculate the mean and standard deviation for each group
summary_stats <- data %>%
  group_by(group_var) %>%
  summarise(mean = mean(value), sd = sd(value))

(5) Use Proper Indentation - Use consistent indentation (typically 2 or 4 spaces) to make the structure of your code clear, especially within functions and control structures.

Example:

if (x > 10) {
  print("x is greater than 10")
} else {
  print("x is 10 or less")
}

Following these 5 basic recommendations will already help you write cleaner, more professional R code that's easier to maintain and share with others.

Style Packages

Several R packages can help you maintain clean and consistent code style:

These tools ensure your code is tidy, readable, and compliant with standard conventions, enhancing both collaboration and maintainability.
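As one illustration, the sketch below assumes the styler package, which is a widely used code formatter but is not named on this page; treat the choice as an assumption. The lintr package is a comparable tool that reports style problems instead of rewriting the code. The guard lets the example run even when styler is not installed.

```r
messy_code <- "x<-5*3+2"

# styler reformats code according to the tidyverse style guide
# (assumption: styler is installed; otherwise this step is skipped)
if (requireNamespace("styler", quietly = TRUE)) {
  print(styler::style_text(messy_code))
}
```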

Code Folding

Code folding is a feature in many code editors that allows you to collapse or expand sections of code, such as functions or loops, to make large files more manageable. It helps improve readability by hiding unnecessary details and focusing on the relevant parts of the code, making it easier to navigate and edit complex scripts.

RStudio supports both automatic and user-defined code folding. This feature allows you to easily show or hide blocks of code, helping you navigate your source file more efficiently and focus on your coding tasks.

To create a new code section, use the Code > Insert Section command. Alternatively, a comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) will automatically create a code section.

Example:

## Setup ----

## clean/reset environment
rm(list = ls())

## R and Bioconductor libraries
library(ggplot2)
library(phyloseq)  # provides import_qiime()

## Data Import ----

otufile <- "count_table_zotu.txt"
mapfile <- "map_file.txt"

## Import into Phyloseq
d_zotu <- import_qiime(otufilename = otufile, mapfilename = mapfile)
d_zotu

Functions

Writing and using R functions enables modular, reusable and organised code. Functions encapsulate specific tasks or calculations, making your code more efficient and easier to maintain. They reduce redundancy by allowing you to define a process once and use it many times, and they improve readability by breaking complex problems into manageable parts.

function_name <- function(<arguments>) {
  ## Do something
}

Here are two R function examples to convert temperature units.

c2f <- function(celsius) {
  return(9 / 5 * celsius + 32)
}

f2c <- function(fahrenheit) {
  return((fahrenheit - 32) * 5 / 9)
}

We can also combine the two temperature converter functions into one function with multiple options. The function prints a message if no value is provided (x = NULL), and it defines a default conversion (option = "f2c").

hot <- function(x = NULL, option = "f2c") {
  # Check if the input value is missing
  if (is.null(x)) {
    cat("Missing value!\n")  # Inform the user that no input value was provided
  } else {
    # Convert temperature based on the specified option
    result <- switch(option,
                     f2c = (x - 32) * 5 / 9,  # Fahrenheit to Celsius
                     c2f = 9 / 5 * x + 32,     # Celsius to Fahrenheit
                     stop("Invalid option. Use 'f2c' or 'c2f'."))  # Handle invalid options
    return(result)  # Return the computed result
  }
}
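A variation on the same idea, offered as a sketch and not as the course's reference solution: `convert_temp()` is a hypothetical name, and it swaps the printed message for stop() and lets match.arg() validate the option, so invalid calls fail loudly instead of returning nothing.

```r
# Hypothetical variant of the converter with stricter input checks
convert_temp <- function(x, option = c("f2c", "c2f")) {
  if (missing(x)) {
    stop("Missing value: please supply a temperature.")
  }
  stopifnot(is.numeric(x))
  option <- match.arg(option)  # defaults to "f2c", rejects unknown options
  switch(option,
         f2c = (x - 32) * 5 / 9,   # Fahrenheit to Celsius
         c2f = 9 / 5 * x + 32)     # Celsius to Fahrenheit
}

convert_temp(212)                  # Fahrenheit to Celsius (default)
convert_temp(100, option = "c2f")  # Celsius to Fahrenheit
```

Because the default is the first element of the option vector, the function signature itself documents which conversions are allowed.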

You can create a collection of functions tailored to specific projects by storing them in simple text files. These files can be imported directly into your R environment, making it easy to use the functions. For example:

source("https://www.gdc-docs.ethz.ch/UniBS/EvolutionaryGenetics/BioInf/script/MyFunctions.R")

This command downloads and executes the R script from the provided URL, making all the functions in that script available for use in your current R session.
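The same mechanism works with local files. The sketch below writes a one-function script to a temporary file and sources it; the file and the function `km_to_miles()` are hypothetical stand-ins for your own function collection.

```r
# Create a small script file (stand-in for your own function collection)
helper_file <- tempfile(fileext = ".R")
writeLines("km_to_miles <- function(km) km * 0.621371", helper_file)

# Execute the file; its functions become available in the session
source(helper_file)
km_to_miles(10)
```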

Some of the sourced and now available functions:

# Basic Stats
calc.stats(x = 1:10, digit = 2)
# Distance Converter (Kilometer > Miles)
K2M(1)
# Calculate BMI
BMI(w = 120, h = 1.8)

Writing / Reading Data

Managing data in R involves saving objects to files and loading them back. This functionality is essential for preserving your work and reusing data across sessions. R provides several methods for working with data:

Single objects - Use saveRDS() and readRDS() to save and restore individual R objects. This approach is useful for handling single data sets or results.
Multiple objects - The save() and load() functions allow you to work with multiple objects simultaneously and store them in a single RData file.
Entire workspace - To save or restore the entire environment, including all objects, use save.image() and load(). This ensures that all your work is preserved and can be easily reloaded in future sessions or shared.

Examples:

Single Objects

# Save a Single R Object
saveRDS(object, file = "myobject.rds")

# Restore the Object
readRDS(file = "myobject.rds")

# Rename the Import
new.name <- readRDS(file = "myobject.rds")

Multiple Objects

# Save Multiple R Objects
save(object1, object2, file = "objects.RData")

# Restore Multiple R Objects
load("objects.RData")

Entire Workspace

# Save the Entire Workspace
save.image(file = "myworkspace.RData")

# Restore the Workspace
load("myworkspace.RData")
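The following sketch runs a complete round trip with made-up example objects. It writes to temporary files so that nothing in your project folder is touched, and it loads the RData file into a separate environment, which is one way to avoid overwriting objects that already exist.

```r
# Example objects (made up for illustration)
counts <- c(a = 10, b = 20, c = 30)
samples <- data.frame(id = 1:3, group = c("x", "y", "x"))

rds_file <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

# Single object: readRDS() returns the object, so you choose its name
saveRDS(counts, file = rds_file)
counts_copy <- readRDS(rds_file)
identical(counts, counts_copy)

# Multiple objects: load() restores them under their original names
save(counts, samples, file = rdata_file)
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
loaded_names
```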

Coding Mistakes

Coding mistakes such as overwriting important variables and using poor coding practices can lead to significant issues. Here are three common coding mistakes in R and ways to prevent them:

(1) Overwriting Base Functions or Common Variable Names

Mistake: Users sometimes accidentally overwrite base R functions (e.g., c, mean) or use common variable names that conflict with these functions.
Prevention: Avoid using names for your variables or functions that conflict with base R functions. Use descriptive and unique names for your variables, and consider using a naming convention (e.g., my_data_frame instead of df).

(2) Not Managing Object Scope and Conflicts

Mistake: Variables and functions defined in one part of the code can unintentionally affect other parts of the script, leading to confusion and bugs.
Prevention: Use local environments for functions to limit the scope of variables. Be mindful of the workspace and use functions or scripts to compartmentalize code. Additionally, consider using the rm() function to clean up unused variables.

(3) Failing to Handle Errors and Warnings

Mistake: Users often ignore errors and warnings produced during code execution, which can lead to incorrect results or unexpected behavior.
Prevention: Always check for errors and warnings after running your code. Use tryCatch() to handle errors gracefully and include meaningful error messages. Regularly test your code with different inputs to ensure robustness.
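A minimal sketch of the tryCatch() pattern mentioned in (3): `safe_log()` is a hypothetical helper that reports the problem and returns NA instead of stopping the script or silently producing NaN.

```r
# Hypothetical helper: take a log, but recover from bad input
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      message("Warning caught: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      message("Error caught: ", conditionMessage(e))
      NA_real_
    }
  )
}

safe_log(10)      # regular result
safe_log(-1)      # log(-1) raises a warning, handled above
safe_log("ten")   # non-numeric input raises an error, handled above
```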

