Exam Examples¶
Terminal¶
Q1 What is the meaning and what is the difference of `>` and `>>`?

```
wc -l fileA.txt > fileB.txt    # A
wc -l fileA.txt >> fileB.txt   # B
```
S1
The command `wc` with the option `-l` displays the number of lines in the input file(s).
(A) Redirects the output to a new file. If the file already exists, it is overwritten.
(B) Redirects the output and appends it to the file if it exists. If it does not exist, the file is created.
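A minimal sketch to see the difference yourself (the file names are only placeholders):

```
printf 'one\ntwo\nthree\n' > fileA.txt   # create a small test file
wc -l fileA.txt > counts.txt             # > : creates or overwrites counts.txt
wc -l fileA.txt >> counts.txt            # >>: appends a second line
cat counts.txt                           # the same count now appears twice
```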
Q2 What is the difference between the two following command lines?

```
## A - Pipe
zcat sequence.fa.gz | grep ">" -c

## B - Semicolon
zcat sequence.fa.gz ; grep ">" -c sequence.fa
```
S2
(A) The output of the first part `zcat` will be the input of the second part `grep`. No intermediate file will be created.
(B) These are two independent commands on a single line.
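The same idea spelled out with comments; sequence.fa.gz and sequence.fa are placeholders for your own files:

```
zcat sequence.fa.gz | grep -c ">"               # A: grep counts the headers streamed by zcat
zcat sequence.fa.gz ; grep -c ">" sequence.fa   # B: zcat prints to the screen, then grep
                                                #    reads the uncompressed sequence.fa itself
```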
Q3 What is the output of the following two command lines?

```
## A
echo HOME

## B
echo $HOME
```
S3
(A) Will show the word HOME.
(B) The dollar sign ($) is used to call a variable. `$HOME` is a built-in variable and the command will show your home directory.
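For example (the exact path depends on your account):

```
echo HOME     # prints the literal word:  HOME
echo $HOME    # expands the variable, e.g. /home/username
```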
Q4 The following two command lines are problematic. Why?

```
cat file.txt > file.txt
cat file.txt >> file.txt
```
S4
(A) Input and output are identical. The shell truncates file.txt before `cat` even reads it, so the command will execute but the file will be empty afterwards.
(B) Do not try it! The content of the file is appended to the very same file, `cat` then reads what it has just appended, appends it again, and so on. You create a never-ending loop, and if you do not stop it (Ctrl+C) the file will grow until your free disk space vanishes.
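If you really need to modify a file "in place", a safer pattern is to write to a temporary file first and replace the original afterwards (`sort` is just an example command here):

```
sort file.txt > file.txt.tmp && mv file.txt.tmp file.txt
```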
Q5 What is the reason for running commands (scripts) in verbose mode?
```
curl --verbose -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/RDP_16S_Archaea_Subset.fasta
```
S5
It provides additional details as to what the application is doing. This level of detail can be very helpful for troubleshooting problems. Errors and warnings are shown, which helps with debugging.
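For your own shell scripts a comparable effect (not part of the original example) can be achieved with bash's trace mode:

```
bash -x my_script.sh   # my_script.sh is a placeholder; every command is printed before it runs
```

Alternatively, `set -x` inside the script turns tracing on from that point onwards.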
Q6 Do you have an idea why the output of the two commands is different?

```
grep ">" -c sequence.fa    # Output: 10
grep "^>" -c sequence.fa   # Output: 9
```
S6
(A) All lines that contain a `>` sign will be counted.
(B) Only lines that start (`^`) with a `>` sign will be counted.
Something is wrong with this FASTA file. It is possible that one of the FASTA headers is missing the `>` at the beginning, or that a sequence line contains a `>`.
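One way to track down the suspicious line is to print the line numbers of every `>` that is not at the start of a line:

```
grep -n '.>' sequence.fa   # -n prints line numbers; '.>' matches a '>' preceded by any character
```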
Q7 What is the difference between the two commands?

```
# Remove version A
rm file1.tmp file2.tmp file3.tmp

# Remove version B
rm *.tmp
```
S7
(A) The command will remove the three listed files if present.
(B) The command will delete all files with the suffix .tmp. You might first check with `ls *.tmp` whether you really want to remove all tmp files.
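A cautious workflow could look like this:

```
ls *.tmp      # review what the wildcard actually expands to
rm -i *.tmp   # -i asks for confirmation before each file is removed
```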
Reproducible Science¶
Q1 What is the difference between repeatability and reproducibility?
S1
Repeatability is a measure of the likelihood that, having produced one result from an experiment, you can try the same experiment, with the same setup, and produce that same result. It is a way for researchers to verify that their own results are true and are not just chance artefacts.
The reproducibility of data is a measure of whether a different research team can attain results published in a paper using the same methods. This shows that the results are not artefacts of the unique setup in one research lab. It is easy to see why reproducibility is desirable, as it reinforces findings and protects against rare cases of fraud, or less rare cases of human error, in the production of significant results.
Q2 What can you do to improve reproducibility?
S2
- Be organised!
- Avoid click-applications with a GUI if possible.
- Write detailed descriptions of your workflow(s), including versions and parameters of the applications used (see the sketch after this list).
- Provide *polished* scripts.
- Comment your scripts and follow a code style.
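As a small illustration of the last three points, a documented script might start like this (tool names, versions and file patterns are only placeholders):

```
#!/usr/bin/env bash
# count_headers.sh - count FASTA records in all *.fa.gz files of a directory (illustration only)
# Tested with: bash 5.x, gzip 1.x        <- record the versions you actually used
# Usage: ./count_headers.sh <input_directory>

set -euo pipefail                  # stop on errors and on undefined variables

INDIR="$1"                         # input directory, given as the first argument

for f in "$INDIR"/*.fa.gz; do
    n=$(zcat "$f" | grep -c ">" || true)   # number of FASTA headers ("|| true" keeps 0-hit files from aborting)
    printf '%s\t%s\n' "$f" "$n"
done
```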
RegEX¶
Q1 What find and replace can you use to convert A to B?
A: A123;B2232;C4532
B: ABC-123,2232,4532
S1
Find: (\w)(\d+);(\w)(\d+);(\w)(\d+)
Replace: $1$3$5-$2,$4,$6
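The same substitution can be done on the command line with sed (using explicit character classes instead of \w, since support for \w varies between sed versions):

```
echo 'A123;B2232;C4532' |
    sed -E 's/([A-Z])([0-9]+);([A-Z])([0-9]+);([A-Z])([0-9]+)/\1\3\5-\2,\4,\6/'
# ABC-123,2232,4532
```

Note that in sed the backreferences in the replacement are written \1 ... \6 rather than $1 ... $6.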
Q2 What regex can you use to find files with the suffix
- fa
- fasta
- fq
- fastq
but avoid files with the suffix
- afa
- pfa
- txt
- vcf
- fastqc
S2
Find: *.f*[aq]
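Strictly speaking `*.f*[aq]` is a shell wildcard (glob) rather than a regular expression; a quick way to test it is to create a set of dummy files and list them (the file names below are arbitrary):

```
touch a.fa b.fasta c.fq d.fastq e.afa f.pfa g.txt h.vcf i.fastqc
ls *.f*[aq]    # lists only a.fa b.fasta c.fq d.fastq
```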
Q3 Use find and replace to create a sequence fasta file.
Primer_COI-F 22nt acgcttgcacgtctgcgacgtc
S3
Find: (\w+)-\w \w+ (\w+)
Replace: >$1\n$2
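With GNU sed the same find and replace could look like this (BSD/macOS sed treats `\n` in the replacement differently):

```
echo 'Primer_COI-F 22nt acgcttgcacgtctgcgacgtc' |
    sed -E 's/([A-Za-z0-9_]+)-[A-Za-z] [A-Za-z0-9]+ ([a-z]+)/>\1\n\2/'
# >Primer_COI
# acgcttgcacgtctgcgacgtc
```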
R¶
Q1 What is the outcome of `sum(FunX(2,4) == FunX(4,2))` using the following function:

```
FunX <- function(x, y) {
  if (x > y) {
    RT <- (y / x) * 100
  } else {
    RT <- (x / y) * 100
  }
}
```
S1
Both function calls have the same output:
FunX(2,4) = 50
FunX(4,2) = 50
Therefore the comparison evaluates to `TRUE`:

```
FunX(2,4) == FunX(4,2)
[1] TRUE
```

There is only one `TRUE`, and since `TRUE` is coerced to 1, the result is

```
[1] 1
```
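To double-check this from the command line, a quick one-liner with Rscript (using a slightly simplified but equivalent version of the function) would be:

```
Rscript -e 'FunX <- function(x, y) { if (x > y) (y / x) * 100 else (x / y) * 100 }; print(sum(FunX(2, 4) == FunX(4, 2)))'
# [1] 1
```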
Q2 What is the outcome of the following R commands using the function further below:
A: FunXYZ(1:3)
B: FunXYZ(c(1,2,3))
C: x <- 1; y <- 0; z <- 1; FunXYZ(x,y,z)
```
FunXYZ <- function(x, y, z) {
  if (x > 0 & y > 0 & z > 0) {
    xyz <- c(x, y, z)
  } else if (x == 0 | y == 0 | z == 0) {
    xyz <- "We have Zeros"
  } else {
    xyz <- "We have Sub-Zeros"
  }
  print(xyz)
}
```
S2
- Q2-A:
Error in FunXYZ(1:3) : argument "y" is missing, with no default
- Q2-B:
Error in FunXYZ(c(1, 2, 3)) : argument "y" is missing, with no default
- Q2-C:
"We have Zeros"

In A and B the whole vector is passed to `x` only, so `y` (and `z`) have no value and the condition `y > 0` cannot be evaluated. In C the three values are passed as separate arguments; because `y` is 0, the `else if` branch is taken.
BLAST¶
Q1 You blasted two query sequences (A and B) and got the following output table:
# | Query | Subject | E-value |
---|---|---|---|
1 | A | S1 | 1.00E-16 |
2 | B | S1 | 5.00E-16 |
Which hit is better: (1) A with S1 or (2) B with S1?
S1
The e-value for (1) is lower, but we do not have enough information to compare the BLAST results. E-values are not the best indicators for BLAST hits. Other factors (e.g. alignment length) play an important role.
Q2 You blasted (blastn) the same query sequence against a draft genome.
# | Query | Subject | %Identity | E-value | Bit_Score | Q-length | S-length | Alignment-Length |
---|---|---|---|---|---|---|---|---|
1 | A | S1 | 99 | 1.00E-06 | 256 | 100 | 5000 | 100 |
2 | A | S2 | 95 | 5.00E-06 | 244 | 100 | 6000 | 99 |
3 | A | S3 | 100 | 2.00E-07 | 80 | 100 | 80 | 80 |
4 | A | S4 | 100 | 1.00E-06 | 35 | 100 | 4000 | 20 |
What does it mean? Explain the output table with a special focus on reliable hits.
S2
I would only consider the top two hits (bit-score > 200) reliable. S1 is an almost perfect hit with high similarity and 100% coverage (query length == alignment length). S2 is also a very good hit with a high bit-score, good similarity and coverage. S3 is very short (80 nt) and might not really be part of the genome but rather an artefact of the assembly. S4 is a much larger subject, but the alignment is short and might indicate a conserved region (adaptor?) rather than a real hit.
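A table like this can be produced with the tabular BLAST output format; a sketch with placeholder file names (the column keywords are standard BLAST+ format specifiers):

```
blastn -query query.fa -subject draft_genome.fa \
       -outfmt '6 qseqid sseqid pident evalue bitscore qlen slen length'
```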
Q3 Can you use a BLAST search to find multiple copies of the same gene (e.g. AQP7, which transports water and sugary compounds into cells)? What would you need and how would you do it?
S3
The better the quality of the genome, the better the chance to find duplicated genes / regions. A draft genome with many contigs is more difficult to use, and the flanking regions of possible hits need to be taken into consideration. For well-assembled genomes, all hits with high bit-scores should be evaluated more closely. Copies of a gene at different locations in the genome might evolve independently and become different in sequence composition and even in function over time. Similarity and genome location (start and end) are the important criteria.
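A possible starting point on the command line (file names, the bit-score cut-off of 200 and the choice of columns are assumptions, not a fixed recipe):

```
# Search the gene against the assembled genome, keep only strong hits,
# then sort by subject and position to see where the candidate copies lie.
blastn -query AQP7.fa -subject genome.fa \
       -outfmt '6 qseqid sseqid pident length bitscore sstart send' |
    awk '$5 > 200' |
    sort -k2,2 -k6,6n
```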