Best Practices
Why is it important to avoid special characters, spaces, and umlauts when labeling files or folders?
It's good practice to use alphanumeric characters, underscores and hyphens in file and folder names. This improves compatibility and readability, and reduces the risk of problems when working across different systems and platforms.
Many operating systems and applications handle special characters and diacritics inconsistently. For example, on the command line in Windows, macOS or Linux, characters such as spaces, *, & or ; have a special meaning to the shell and can be interpreted as separators or commands, causing errors. Special characters and umlauts can also cause problems during file transfers or when character encodings differ between systems, potentially resulting in unreadable file names or data loss.
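A minimal sketch of the problem in a POSIX-style shell (bash or sh is assumed; the file names and the helper count_args are made up for illustration). An unquoted name containing a space is split into two arguments, while a name built only from letters, digits, underscores and hyphens needs no special care:

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

risky="my data.txt"   # contains a space
safe="my_data.txt"    # letters, underscore and dot only

unquoted=$(count_args $risky)   # the shell splits this into "my" and "data.txt"
quoted=$(count_args "$risky")   # quoting keeps it as one argument
plain=$(count_args $safe)       # nothing to split, quoting is not required

echo "unquoted: $unquoted, quoted: $quoted, safe name: $plain"
```

With the default word-splitting settings this prints "unquoted: 2, quoted: 1, safe name: 1". Any command that receives the unquoted name would look for two files that do not exist.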
Why is it important to use file extensions?
Using file extensions is important because they indicate the file type, allowing the operating system and applications to recognize and open files correctly. This helps prevent errors and ensures compatibility with the appropriate software.
Using specific file extensions, like .tmp, helps organize and manage projects by clearly identifying temporary files. This makes it easier to clean up unwanted files with commands like rm *.tmp (current directory only) or find . -type f -name "*.tmp" -delete (current directory and all subdirectories), ensuring you only delete what you intend to remove and keeping your workspace tidy.
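The two clean-up commands can be tried safely in a scratch directory. The sketch below assumes a find implementation that supports -delete (GNU and BSD find do); the directory layout and file names are invented for the example:

```shell
# Build a small throwaway project with temporary and regular files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p results/run1
touch notes.txt scratch.tmp results/run1/cache.tmp results/run1/table.csv

# Preview first: list what would be deleted, without deleting anything.
find . -type f -name "*.tmp"

# rm with a glob only matches files in the current directory.
rm -- *.tmp

# find also descends into subdirectories.
find . -type f -name "*.tmp" -delete

remaining_tmp=$(find . -type f -name "*.tmp" | wc -l | tr -d ' ')
remaining_all=$(find . -type f | wc -l | tr -d ' ')
echo "tmp files left: $remaining_tmp, other files left: $remaining_all"
```

Running find without -delete first is a cheap way to check that the pattern matches only the files you expect.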
Safety First
When you see a sign for Free Wi-Fi, what's the first thing you should do?
Turning on a VPN (Virtual Private Network) when using free public Wi-Fi is a smart practice for several reasons:
- Security: A VPN encrypts your internet connection, making it much harder for hackers to intercept your data.
- Privacy: It hides your IP address and encrypts your traffic, ensuring greater privacy and anonymity.
- Protection from Eavesdropping: Even if someone intercepts your data, a VPN's encryption makes it unreadable.
- Bypassing Restrictions: VPNs allow you to circumvent network restrictions by masking your location.
- Preventing Location Tracking: VPNs obscure your actual location, adding another layer of privacy.
- Data Protection: They also secure sensitive activities like email and messaging, protecting confidential information.
- Peace of Mind: Using a VPN reduces the risk of data breaches and cyber-attacks by adding a layer of security.
Note: While VPNs enhance security, they're not foolproof. Be sure to also use strong passwords, keep software updated, and avoid unsafe websites.
Why is it important to work with a copy rather than the original files, and why is it recommended to archive the original files in a safe place?
Working with a copy rather than the original files is important because it protects the integrity of the original data. This practice prevents accidental modification or deletion during analysis or experimentation, and ensures that the original data remains intact for reference or future use. Archiving original files in a secure location is recommended to provide a safe backup that can be restored in the event of data loss, corruption or processing errors. This ensures that you can always return to the original state of your data if required.
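One possible way to put this into practice on the command line is sketched below. The directory names raw and work and the file samples.csv are assumptions made for the example, not a prescribed layout:

```shell
# Throwaway directory for the demonstration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical layout: raw/ holds the originals, work/ holds the copy.
mkdir raw work
printf 'id,value\n1,42\n' > raw/samples.csv

# Make the original read-only to guard against accidental edits.
chmod a-w raw/samples.csv

# Work on a copy; restore write permission on the copy only.
cp raw/samples.csv work/samples.csv
chmod u+w work/samples.csv
printf '2,17\n' >> work/samples.csv

raw_lines=$(wc -l < raw/samples.csv | tr -d ' ')
work_lines=$(wc -l < work/samples.csv | tr -d ' ')
echo "original: $raw_lines lines, working copy: $work_lines lines"
```

The original keeps its two lines while the working copy grows to three. A read-only flag is only a guard against slips, so a separate archived backup is still needed.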
Bioinformatic Applications
What are the key things to consider when using scientific software?
Firstly, it's important to recognize that open-source software doesn't mean you own it, and you should ensure the software is suitable for your specific needs.
Key considerations include:
- Import and Export Formats: Ensure the software can handle the file formats you need.
- Know (and test) Limitations: Be aware of the software's capabilities and limitations.
- Test with a Dummy/Subset Dataset: Always test the software with a small or dummy dataset first.
- Understand the Parameters: Make sure you know what the various parameters and settings mean.
- Don't Ignore Warnings: Pay attention to any warnings and understand their implications.
- Seek Help: If you're unsure, consult the manual or ask questions.
Why do servers often have different versions of the same application instead of just the latest version?
There are several important reasons why multiple versions of an application might be kept on a server rather than only maintaining the latest version:
- Rollback: Older versions allow users to revert in case bugs or issues are found in the latest release.
- Compatibility: Some workflows may rely on older versions and may not work properly with the latest update.
- Testing: Keeping older versions allows users to compare new updates with previous versions before fully upgrading.
- Reproducibility: To replicate past experiments or conditions, it's essential to have access to the exact software version used at that time.
Writing Code / Scripts
What is the purpose of comments in code?
Comments in an R script serve several important purposes:
- Documentation: They explain the code's purpose and functionality, making it easier to understand, especially when revisiting the code or sharing it with others.
- Explanation: Comments clarify complex or non-obvious parts of the code, such as algorithms or data manipulations.
- Debugging: They can mark areas of the code that may need review or attention during the debugging process.
- Collaboration: Comments help convey intentions, instructions, and considerations to collaborators.
- Readability: They help organize and enhance the clarity of the code.
What is the purpose of a style guide?
Using a style guide for coding offers several advantages:
- Consistency: It ensures uniformity in coding style and formatting across a codebase, making it easier to read and understand, regardless of the author.
- Readability: Style guides establish rules for naming variables, functions, and other elements, improving code legibility and helping identify issues more easily.
- Maintainability: Consistent code structure simplifies updates, bug fixing, and feature additions, making long-term maintenance easier.
- Collaboration: Following a style guide facilitates sharing code with others, making it easier for collaborators to understand and contribute to the project.
Terminal
What is the difference between the two terminal commands?
The cp (copy) command creates a copy of file.txt in the specified destination (folder/file.txt), leaving the original file unchanged in its original location. After running the command, you’ll have two identical files: one in the original location and one in the folder.
The mv (move) command, on the other hand, moves file.txt to the specified destination (folder/file.txt), removing it from its original location. After running the command, the file only exists in the folder, not in its original place.
In short, cp duplicates the file, while mv transfers it to a new location.
cp file.txt folder/file.txt
mv file.txt folder/file.txt
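The difference can be checked in a scratch directory. In this sketch the two destination folders folder_cp and folder_mv are hypothetical names, used so that both commands can run on the same source file:

```shell
# Throwaway directory with one source file and two destination folders.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir folder_cp folder_mv
echo "hello" > file.txt

# cp: the source file stays where it was.
cp file.txt folder_cp/file.txt
if [ -e file.txt ]; then source_after_cp="present"; else source_after_cp="gone"; fi

# mv: the source file disappears from its original location.
mv file.txt folder_mv/file.txt
if [ -e file.txt ]; then source_after_mv="present"; else source_after_mv="gone"; fi

echo "after cp: $source_after_cp, after mv: $source_after_mv"
```

This prints "after cp: present, after mv: gone", and both destination folders contain a file.txt.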
What is the difference between the two terminal commands?
The first command, cat file.txt > folder/file.txt, redirects the contents of file.txt to folder/file.txt, overwriting any existing contents in folder/file.txt. If the target file doesn't exist, it will be created.
The second command, cat file.txt >> folder/file.txt, appends the contents of file.txt to the end of folder/file.txt without overwriting the existing contents. If folder/file.txt doesn't exist, this command will also create it.
In both cases the directory folder must already exist.
In short, > overwrites the file, while >> appends to it.
cat file.txt > folder/file.txt
cat file.txt >> folder/file.txt
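A small sketch in a scratch directory shows both behaviours. The file contents are made up, and the append is run first so that there is existing content to overwrite afterwards:

```shell
# Throwaway directory with a source file and an existing target file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir folder
echo "new line" > file.txt
echo "old line" > folder/file.txt

# >> appends: the old line is kept and the new one is added below it.
cat file.txt >> folder/file.txt
lines_after_append=$(wc -l < folder/file.txt | tr -d ' ')

# > overwrites: only the content of file.txt remains.
cat file.txt > folder/file.txt
lines_after_overwrite=$(wc -l < folder/file.txt | tr -d ' ')

echo "after >>: $lines_after_append lines, after >: $lines_after_overwrite line(s)"
```

The target has two lines after the append and one line after the overwrite.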
R
What is the difference between a vector and a list in R?
In R, an (atomic) vector is a one-dimensional structure that can only contain elements of the same type, such as numeric or character values (e.g., c(1, 2, 3)). In contrast, a list can hold elements of different types, including numbers, characters, and even other lists (e.g., list(1, "apple", TRUE)). Vector elements are accessed with single square brackets (e.g., vec[1]), while individual list elements are extracted with double square brackets (e.g., lst[[1]]); single brackets on a list (lst[1]) return a shorter list rather than the element itself. Thus, use vectors for uniform data and lists for mixed or complex data structures.
Vector (homogeneous):
vec <- c(1, 2, 3) # c() combines values into a vector
str(vec)
# num [1:3] 1 2 3
List (heterogeneous):
lst <- list(1, "apple", TRUE)
str(lst)
# List of 3
# $ : num 1
# $ : chr "apple"
# $ : logi TRUE
Is the following object lst a vector or a list?
lst <- c(1, "A"): This creates a vector called lst. In R, a vector can only contain elements of the same type. Since this vector includes both a numeric value (1) and a character value ("A"), R will coerce the numeric value into a character to maintain uniformity. Therefore, lst will be a character vector containing c("1", "A").
What is the difference between these assignments?
- 12 = a: Error in 12 = a : invalid (do_set) left-hand side to assignment
- 12 <- a: Error in 12 <- a : invalid (do_set) left-hand side to assignment
- 12 -> a: Assigns the value 12 to the variable a.
- 12 == a: Comparison operator that tests whether 12 is equal to the value(s) of a and returns TRUE / FALSE.
12 = a
12 <- a
12 -> a
12 == a
What is the meaning and purpose of this line of R code?
This line of R code removes all objects listed by ls(), that is, all visible objects in the current R workspace (the global environment). Its main purpose is to clean up the workspace so that leftovers from previous operations cannot interfere with the code that follows. It's often used at the beginning of a script or analysis to start from an empty workspace before (re-)running code. Note that it does not detach loaded packages, reset options, or remove objects whose names start with a dot, so it is not equivalent to restarting the R session.
rm(list = ls())
Why is it not recommended to attach all the packages at the beginning of larger R projects?
While it’s important to load packages at the start of your R script, attaching all packages at the beginning of larger projects may not be ideal for several reasons:
- Conflicts: When a package is attached (using library()), its exported functions and objects are placed on the search path. If multiple packages export functions or objects with the same name, the package attached last masks the others, which can cause unexpected behavior and make it difficult to trace where a function comes from. The risk of conflicts increases in larger projects that use many packages.
- Performance: Loading all packages at once can slow down your program, as R must load and initialize every package, even if only a few are needed for specific tasks.
What is the difference between the following two lines of R code?
The difference between the two lines of R code lies in the use of the set.seed() function in the second line:
- Line 1: sample(letters, 3) generates a random sample of three letters from the alphabet. Each time you run this line, you may get different results because the random number generator's state is not fixed.
- Line 2: set.seed(123); sample(letters, 3) first sets a seed for the random number generator using set.seed(123). This ensures that the random sample generated by sample(letters, 3) will be reproducible. Every time you run this line, you will get the same three letters, which is crucial for reproducibility in statistical analysis.
In summary, set.seed() allows reproducibility in random processes, which is essential for debugging and testing statistical methods. With the same seed and the same random number generator settings, the same random results are obtained each time the code is run, making it easier for others to replicate results or for the original author to revisit their analysis.
1: sample(letters, 3)
2: set.seed(123); sample(letters, 3)
What is the result of the following three calls (A,B and C) to the function fx1, defined below?
A: 4 (x=1, y=3)
B: 4 5 (x=c(1,2); y=3)
C: 7 (x=3; y=4)
fx1 <- function(x=2, y=3) { x + y }
A: fx1(1)
B: fx1(1:2)
C: fx1(6/2,2*2)
What is the result of the following five calls (A-E) to the function fx2, defined below?
A: 7 (x=1, y=2, z=3 -> 1+(2*3)=7)
B: numeric(0) (x=2 y=3 z=NULL)
C: numeric(0) (x=1,2,3 y=3 z=NULL)
D: Error in fx2(1, 2, 3, 4) : unused argument (4)
E: 2 6 12 (x=y=z=c(1,2,3) -> x + y * z => c(1,2,3) + c(1,4,9) = c(2,6,12))
fx2 <- function(x=2, y=3, z=NULL) { x + y * z}
A: fx2(1,2,3)
B: fx2()
C: fx2(1:3)
D: fx2(1,2,3,4)
E: fx2(1:3,1:3,1:3)
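What is the correct result of calling the function lb2kg, defined below?
lb2kg(2.2046): solution C, no output.
The function in question is the last of the three definitions below, the one whose body ends with the assignment kg <- (lb / 2.2046). In R, if a function does not explicitly return a value, the value of the last evaluated expression is returned implicitly. Here the last expression is an assignment, and an assignment returns its value invisibly. The call lb2kg(2.2046) therefore returns 1, but nothing is printed in the console unless you explicitly print the result, for example with print(lb2kg(2.2046)).
You have to modify the function to get an output. The first two definitions below show two ways to do this: end the function with the value itself, or return it explicitly with return().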
lb2kg <- function(lb) {
# Convert pounds to kg
lb / 2.2046
}
lb2kg <- function(lb) {
# Convert pounds to kg
kg = (lb / 2.2046)
return(kg)
}
lb2kg <- function(lb) {
# Convert pounds to kg
kg <- (lb / 2.2046)
}
A: lb2kg(2.2046): 1
B: lb2kg(2.2046): error
C: lb2kg(2.2046):
D: lb2kg(2.2046): 2.2046
E: lb2kg(2.2046): kg
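What is the correct output of this hot function?
hot(32): 0: solution C is correct. Because type defaults to "f2c", the call converts 32 degrees Fahrenheit to Celsius: (32 - 32) * 5 / 9 = 0.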
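hot <- function(x, type = "f2c") {
  switch(type,
    f2c = ((x - 32) * 5 / 9), # Fahrenheit to Celsius
    c2f = (9 / 5 * x + 32),   # Celsius to Fahrenheit
    "Error")                  # unnamed last argument is the default in switch()
}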
A: hot(32): Not working because type is missing.
B: hot(32): 32
C: hot(32): 0
D: hot(32): 89.6
E: hot(32): Error
Reproducible Science
Why is reproducibility considered a cornerstone of scientific research, and what are the key practices that promote reproducible science? List three.
Reproducibility is fundamental to scientific research because it ensures that findings can be independently verified and validated, thereby increasing the credibility and reliability of scientific knowledge. When research is reproducible, other scientists can replicate the experiments or analyses, confirming and building on the results. This process is crucial for the advancement of knowledge and for fostering confidence in scientific findings. Key practices that promote reproducible science include:
- Researchers should provide detailed descriptions of their methods, data collection processes, and analysis techniques. This includes sharing data sources, statistical methods, and any software used.
- Making code and datasets publicly available allows others to verify results and perform their own analyses, which promotes collaboration and transparency in research.
- Using tools such as RMarkdown or Jupyter Notebooks, which combine code, output, and narrative text, can help present research in a clear and reproducible format.
- Using clear naming conventions and consistent labeling for data sets and variables reduces confusion and increases the ability to replicate results.
- The use of version control systems (e.g., Git) helps track changes to scripts, documents, and data, making it easier to manage and share work over time.
By adopting these practices, researchers can significantly improve the reproducibility of their work and contribute to the overall integrity of the scientific process.
What is the relationship between commenting a script and reproducible research?
Commenting your script plays a crucial role in reproducible research by improving the clarity and understanding of the code. Well-placed comments help explain the purpose and logic behind certain sections, making it easier for others (and yourself at a later date) to follow the workflow and replicate the analysis. They provide context for complex functionality, highlight assumptions, and document important decisions made during the coding process (e.g., default settings). By facilitating better communication and understanding, effective commenting promotes transparency in research and ensures that results can be reproduced accurately and efficiently.