All contents are licensed under CC BY-NC-ND 4.0.

1 Objectives of control structures.

‘Automation’ of the repetition of structurally identical commands.

  • Repetition of a command – with objects remaining the same, or changing – with a predetermined or flexible number of repetitions.
  • Conditional execution of various tasks.
  • Generalization of tasks by defining functions.

2 Logical comparisons.

Command    TRUE if:
==         Equality
!=         Inequality
>, >=      Left side greater than (or equal to) the right side
<, <=      Left side less than (or equal to) the right side
%in%       Is the left side contained in the vector on the right side?
  • all() returns TRUE if all elements of the vector are TRUE.
  • any() returns TRUE if at least one element of the vector is TRUE.
  • is.na() and is.null() return TRUE if the respective object (e.g. element of a vector) is NA or NULL.
  • A logical value can be negated with a preceding ! (e.g. !TRUE is FALSE)
  • which() returns the indices (as an integer vector) of those elements for which the logical comparison resulted in TRUE.
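The operators and helper functions above can be tried out on a small vector. The following is a minimal sketch; the vector x and its values are made up for illustration only.

```r
x <- c(3, 7, NA, 10)            # hypothetical example vector

cmp_eq <- x == 7                # FALSE  TRUE    NA FALSE
cmp_in <- x %in% c(3, 10)       #  TRUE FALSE FALSE  TRUE
miss   <- is.na(x)              # FALSE FALSE  TRUE FALSE
not_na <- !is.na(x)             # negation of the line above

any_gt5 <- any(x > 5, na.rm = TRUE)   # TRUE
all_gt5 <- all(x > 5, na.rm = TRUE)   # FALSE, because 3 is not > 5
idx_gt5 <- which(x > 5)               # 2 4 (NA is skipped)
```

Note that a comparison with NA yields NA, which is why any() and all() are called with na.rm = TRUE here.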

3 Conditional execution

3.2 for-loops

for loops often offer a simple and pragmatic way to complete steps in data management / preparation.

Usage:

  • A new object index runs through all elements of vector.
  • index remains constant during ... index ...
  • index jumps to the next (if available) value of vector after running through ... index ....
  • index takes each value of vector once.
  • The number of iterations of ... index ... is determined by the length of vector.
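The points above can be sketched with a tiny loop. The names index and vector follow the text; the values and the object squares are assumptions made up for this illustration.

```r
vector  <- c(2, 5, 9)         # hypothetical values to loop over
squares <- numeric(0)         # container for the results

for (index in vector) {
  # '... index ...': any commands that use the current value of index
  squares <- c(squares, index^2)
}

squares                       # 4 25 81
```

The loop body runs three times, once for each element of vector.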

3.3 Example of a for loop with if

The goal of this example is to find out on which day in May a young Douglas fir was most often in the first development stage.

As a preparation, we need to set up a data frame as the object that will carry the result:
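A minimal sketch of such a loop, under stated assumptions: the observations in obs_days are invented (they are not the course data), and the column name n_stage1 is hypothetical. Only the names res and days_since_may1st are taken from the text.

```r
# Hypothetical observations (not the course data): days since May 1st
# on which a tree was recorded in the first development stage
obs_days <- c(3, 3, 5, 10, 10, 10, 17)

# Result object: one row per day in May, counter initialised with 0
res <- data.frame(days_since_may1st = 0:30, n_stage1 = 0)

for (index in seq_len(nrow(res))) {
  # number of records that fall on the day stored in row 'index'
  hits <- sum(obs_days == res$days_since_may1st[index])
  if (hits > 0) {
    res$n_stage1[index] <- hits
  }
}

# Row with the most frequent day
res[which.max(res$n_stage1), ]
```

With these invented observations, the selected row is the one with days_since_may1st equal to 10.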

3.3.1 Illustrating the loop index

The for loop will run through our result data frame res, row by row. We can illustrate this with the following graph, where the x-axis carries the values of the loop index, and the y-axis the value of the days_since_may1st variable that is used in each of the loop’s ... index ... cycles.

3.4 while-loops.

while loops are used less often in data management / preparation, but are more likely to be found in computationally intensive applications (e.g. for optimization).

Usage:

  • The commands that ‘...’ stands for, and the following line, are repeated as long as the condition is TRUE (i.e. here as long as index \(<\) K).
  • The number of repetitions is flexible.
  • The condition is checked before each iteration: the loop stops as soon as the condition (index < K in the above usage example) is no longer met, i.e. is FALSE for the first time.
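A minimal sketch of a while loop of this form. It assumes a condition index < K and a line that increments index at the end of the body; the object total and all values are made up for illustration.

```r
K     <- 5
index <- 1
total <- 0

while (index < K) {
  total <- total + index   # stands in for '...'
  index <- index + 1       # the 'following line': without it the loop never ends
}

index   # 5
total   # 10 (= 1 + 2 + 3 + 4)
```

The body is executed for index = 1, 2, 3, 4; when index reaches 5, the condition is FALSE and the loop ends.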

The following two examples are two applications of a while loop that came to my mind. They might distract a bit too much from the goals of ‘Introduction to R’, so feel completely free to skip them …

3.4.1 Example 1

This example does Bayesian inference for a simple one-parameter model: the unknown quantity is a proportion between \(0\) and \(1\). Values are proposed from the prior, data are simulated for each proposal, and only those proposals are kept whose simulated data equal the observed data sample. The likelihood works as a kind of sieve here.

## frost$n_frost > 0.5
## FALSE  TRUE 
##    63    12
## [1] 1000
## [1] 105721
## [1] 0.009458859
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06021 0.13146 0.16123 0.16335 0.19122 0.31806
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.08147 0.49555 0.49895 0.91785 1.00000
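The sieve idea can be sketched as follows. This is not the code that produced the output above: the counts (12 of 75) are read off the table above, but the uniform prior, the seed, and all object names are assumptions, so the numbers will differ.

```r
set.seed(42)                 # assumption: arbitrary seed for reproducibility

n_obs  <- 75                 # number of observations (63 + 12 in the table above)
y_obs  <- 12                 # number of 'TRUE' observations
n_keep <- 1000               # number of accepted proposals we want

posterior <- numeric(n_keep) # accepted proposals
accepted  <- 0
proposals <- 0

while (accepted < n_keep) {
  theta <- runif(1)                                 # proposal from a uniform prior (assumption)
  y_sim <- rbinom(1, size = n_obs, prob = theta)    # simulate data given the proposal
  proposals <- proposals + 1
  if (y_sim == y_obs) {                             # the sieve: keep only exact matches
    accepted <- accepted + 1
    posterior[accepted] <- theta
  }
}

accepted / proposals         # acceptance rate
summary(posterior)           # summary of the accepted proposals
```

A while loop fits here because the number of proposals needed to collect 1000 accepted values is not known in advance.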

3.4.2 Example 2

This example implements a very primitive component-wise ‘\(L_2\)-loss descent’ boosting (comparable to what the add-on package mboost implements for a normally distributed response).

## component
##   1   2   3 
## 155  81  10
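A rough sketch of what such a component-wise boosting loop could look like. It does not use mboost and is not the code behind the output above; the simulated data, the step length nu, the stopping tolerance, and all object names are assumptions.

```r
set.seed(1)                                   # assumption: arbitrary seed
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))   # simulated predictors
y <- 2 * X[, "x1"] + 1 * X[, "x2"] + rnorm(n)             # x3 has no effect

X   <- sweep(X, 2, colMeans(X))               # center predictors
y_c <- y - mean(y)                            # center response

nu        <- 0.1                              # step length
beta      <- setNames(numeric(ncol(X)), colnames(X))
component <- integer(0)                       # which component was updated when
resid     <- y_c
rss_old   <- Inf
rss_new   <- sum(resid^2)
max_iter  <- 1000                             # safety net against endless loops

while (rss_old - rss_new > 1e-6 && length(component) < max_iter) {
  # least-squares slope of each single predictor on the current residuals
  slopes <- colSums(X * resid) / colSums(X^2)
  # residual sum of squares each single-component fit would leave
  rss_each <- sapply(seq_len(ncol(X)), function(j) {
    sum((resid - slopes[j] * X[, j])^2)
  })
  best <- which.min(rss_each)                 # best-fitting component
  beta[best] <- beta[best] + nu * slopes[best]
  resid      <- resid - nu * slopes[best] * X[, best]
  component  <- c(component, best)
  rss_old <- rss_new
  rss_new <- sum(resid^2)
}

table(component)   # how often each component was selected
beta               # boosted coefficients
```

The number of iterations depends on how quickly the loss stops improving, which is why the loop is written with while rather than for.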

3.5 apply-commands

An apply command applies the same function to each of the elements of a data object. Typical uses are taking the sum, the arithmetic mean, or quantiles of the columns or rows of a matrix. There are different, but actually very similar, versions of apply.

Usage:

  • apply applies function (specified by FUN) to each element of the respective dimension (defined with argument MARGIN) of X.
  • MARGIN equals 1 for row-wise, and 2 for column-wise execution.
  • ... stands for further arguments to FUN (the same for every element of X!).
  • For lists X, MARGIN cannot be selected because lists only have one dimension.
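A short sketch of these points. The matrix X and the list L are made up; lapply() and sapply() are the base R versions of apply for lists.

```r
X <- matrix(1:6, nrow = 2)                  # hypothetical 2 x 3 matrix

row_sums  <- apply(X, MARGIN = 1, FUN = sum)    # row-wise:    9 12
col_means <- apply(X, MARGIN = 2, FUN = mean)   # column-wise: 1.5 3.5 5.5

# 'probs' is not an argument of apply(); it is passed on to quantile() via ...
col_med <- apply(X, MARGIN = 2, FUN = quantile, probs = 0.5)

L <- list(a = 1:3, b = c(10, 20))           # a list has no MARGIN
l_means <- lapply(L, FUN = mean)            # returns a list
s_means <- sapply(L, FUN = mean)            # simplified to a named vector: a = 2, b = 15
```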

3.6 plyr::ddply: ‘split–apply–combine’

‘split–apply–combine’ refers to a sequence of actions that is often used in the analysis of data:

  • split: Split the data set according to the characteristics of one or a combination of several categorical variables,
  • apply: Apply statistical methods (or functions like mean(), length(), …) to each of these partial data sets,
  • combine: Manage all results in a common result object.
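The three steps can be written out one by one in base R. This is only a sketch with a hypothetical data frame trees (columns species and h are invented); it is meant to show the sequence of steps, not a recommended workflow.

```r
# Hypothetical data: tree heights (h) for two species
trees <- data.frame(
  species = c("Beech", "Beech", "Spruce", "Spruce", "Spruce"),
  h       = c(4, 6, 10, 12, 14)
)

parts <- split(trees, trees$species)               # split: one data frame per species

results <- lapply(parts, function(d) {             # apply: summarise each part
  data.frame(species = d$species[1], n = nrow(d), h_mean = mean(d$h))
})

combined <- do.call(rbind, results)                # combine: one result object
combined
```

With these invented values, the result has two rows: Beech with n = 2 and mean height 5, Spruce with n = 3 and mean height 12.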

‘split–apply–combine’ with the function ddply from the package plyr (Wickham 2011):

  • takes a data frame (the first d in the function’s name)
  • returns a data frame (the second d in the function’s name)

Alternative: base R aggregate.

Usage:

##       d_cut  n h_min h_q25   h_mean h_q75 h_max
## 1 [1.5,3.1] 86   1.9 2.900 3.389535 3.800   6.8
## 2   (3.1,4] 91   2.1 3.600 4.410989 5.050   7.5
## 3   (4,4.8] 86   3.2 4.200 4.995349 5.800   7.9
## 4 (4.8,5.4] 79   3.1 4.550 5.401266 6.050  10.1
## 5 (5.4,6.2] 93   3.4 5.200 6.311828 7.100  15.5
## 6 (6.2,6.9] 74   3.8 5.125 6.532432 7.775   9.8
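A call that produces a table of this kind might look like the following sketch. The data frame trees, its columns d and h, and the class breaks are invented; only the output column names are modelled on the table above. The ddply call is only run if the add-on package plyr is installed, and the aggregate call shows the base R alternative.

```r
# Hypothetical data: diameter (d) and height (h) of six trees
trees <- data.frame(d = c(2, 3, 4, 5, 6, 7), h = c(3, 4, 4, 6, 7, 9))
trees$d_cut <- cut(trees$d, breaks = c(1, 4, 7))   # two diameter classes

# split by d_cut, apply the summaries, combine into one data frame
if (requireNamespace("plyr", quietly = TRUE)) {
  res_plyr <- plyr::ddply(trees, "d_cut", plyr::summarise,
                          n      = length(h),
                          h_min  = min(h),
                          h_mean = mean(h),
                          h_max  = max(h))
  print(res_plyr)
}

# base R alternative: one summary function per call
res_base <- aggregate(h ~ d_cut, data = trees, FUN = mean)
res_base
```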

3.7 Pragmatic Programming.

The primary aim of your R Code is that it does what you need it to do – without errors!

Faulty conclusions in your data analysis as a consequence of data handling errors are one of the worst things that can happen to you as a researcher.

Copy-paste sequences such as:

are one of the main sources of error for R users / ‘beginners’ who don’t rely on ‘programming techniques’.
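To make the point concrete, here is a hypothetical copy-paste chain next to a loop that does the same job. The data frame dat and its column names are made up.

```r
# Hypothetical data frame with three measurement columns
dat <- data.frame(h_2019 = c(1.2, 1.5),
                  h_2020 = c(1.8, 2.1),
                  h_2021 = c(2.5, 2.9))

# Copy-paste chain: every line has to be adapted by hand in two places
m_2019 <- mean(dat$h_2019)
m_2020 <- mean(dat$h_2020)
m_2021 <- mean(dat$h_2021)

# The same task as a loop: the column name changes in one place only
m <- numeric(0)
for (nam in names(dat)) {
  m[nam] <- mean(dat[[nam]])
}
m
```

Forgetting to change one of the two year labels in a copied line gives a wrong result without any error message; the loop cannot go wrong in this way.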

Loops have a somewhat bad reputation, but whatever way of programming gets you towards error-free handling of your data is perfect!

Therefore:

  • Use loops as often as possible (‘upwards’: wherever you can replace long copy-paste chains with an error-free loop), but avoid loops as often as necessary (‘downwards’): very roughly speaking, a loop reads from and writes to main memory in each iteration, whereas vectorized code reads and writes only once. Many functions take vectors as arguments and are therefore (often) faster.
  • Use an apply command if you want the function to do the same thing for every element.
  • But: Loops are simple and pragmatic and whoever masters them is already a king: It is better if R code gets something done slowly but correctly than quickly but wrongly!
  • Loops cannot be avoided in iterative processes, but this is something you will rarely need!
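The three options from this list, side by side on a made-up vector (a sketch; x and the squaring task are only for illustration):

```r
x <- c(1.5, 2.0, 3.5, 4.0)          # hypothetical measurements

# Loop version: one element per iteration
x_sq_loop <- numeric(length(x))
for (i in seq_along(x)) {
  x_sq_loop[i] <- x[i]^2
}

# Vectorized version: the operator works on the whole vector at once
x_sq_vec <- x^2

# apply-type version: the same function for every element
x_sq_apply <- sapply(x, function(el) el^2)

identical(x_sq_loop, x_sq_vec)      # TRUE
identical(x_sq_loop, x_sq_apply)    # TRUE
```

All three give the same result; they differ only in how much code has to be written and how fast they run on large objects.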

And for making base R graphics, especially in sampling-based Bayesian statistical modeling, loops are completely ok and very often a convenient way to get to your graphics.

(ggplot might get you to an analogous graphic without loops!)

4 Define your own functions.

Why should I be able to define my own functions?

  • Functions generalize command sequences and make it easier to try something out under many different argument values / data / ….
  • Functions keep the workspace clean (see next section on environments).
  • Functions facilitate the reproducibility of analyses.
  • Functions make it easier for other users to access your work.
  • As can be seen from the apply() examples, it is very often necessary to be able to write your own little helper functions. Also for your own orientation: Always comment on the processes and steps in your code and in your functions to make it easier to understand the motivation and ideas behind it later.
  • The general rules for naming objects also apply to function arguments.
  • Arguments can have preset values (here arg3 and arg4).
  • The last argument ... (optional) is a special argument and can be used to pass unspecified arguments on to function calls.
  • Arguments changed by content, and objects created by it, live in the function’s own local environment.
  • The result is returned to the calling environment (usually the global environment) with return(result).
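A function definition with these ingredients might look like the following sketch. The function my_fun, its body, and the object name intermediate are hypothetical; only the argument names arg3 and arg4 and the names content and result are borrowed from the text.

```r
# Hypothetical function: my_fun and arg1 ... arg4 are placeholders
my_fun <- function(arg1, arg2, arg3 = 2, arg4 = TRUE, ...) {
  # content: objects created here are local to the function
  intermediate <- (arg1 + arg2) * arg3
  if (arg4) {
    intermediate <- round(intermediate, ...)   # '...' is passed on to round()
  }
  result <- intermediate
  return(result)
}

r1 <- my_fun(arg1 = 1.234, arg2 = 2)               # 6   (defaults for arg3, arg4)
r2 <- my_fun(arg1 = 1.234, arg2 = 2, digits = 1)   # 6.5 (digits goes to round())
r3 <- my_fun(arg1 = 1.234, arg2 = 2, arg4 = FALSE) # 6.468 (no rounding)
```

The argument digits is not named in the definition of my_fun; it reaches round() only through the ... argument.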

4.1 Naming conventions for arguments.

Argument name   Content
data            Data frame
x, y, z         Vectors (most often with numerical elements)
n               Sample size
formula         Formula object
  • Use function and argument names that are based on existing R functions.
  • Make arguments as self-explanatory as possible by name.

4.2 content and result.

The content block:

  • Should make it possible to carry out many similar, but different, calculations. It should therefore fix as few objects as possible to ‘fixed values’; where a fixed value seems natural, define it as an argument with a default value instead.
  • Falls back on the higher-level environment (or environments, if necessary) if it cannot find an object in the local environment (this is known as scoping).

The result object:

  • Can be of any possible R object class (vector, list, data frame, function (a function returned by another function is called a closure), …).
  • Is generated by calling the function and, if assigned to a name, stored in the calling environment (usually the global environment).
  • All other objects are no longer ‘visible’ from the global environment.
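Scoping and the visibility of local objects can be checked with a small sketch. All names here (a, add_a, helper_local, make_power, square) are made up for illustration.

```r
a <- 10                          # lives in the global environment

add_a <- function(x) {
  helper_local <- x * a          # 'a' is not local, so R looks it up one level up
  result <- helper_local + 1
  return(result)
}

out <- add_a(x = 2)              # 21
exists("helper_local")           # FALSE: helper_local is gone after the call

# A function that returns a function; the returned function remembers p
make_power <- function(p) {
  function(x) x^p
}
square <- make_power(2)
square(4)                        # 16
```

Relying on objects from the global environment, as add_a does with a, works but makes a function harder to reuse; passing a as an argument is usually the safer choice.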

4.4 Real-world helpers

4.4.1 drop_ghosts

This function drops ghosts, i.e. it removes levels of a factor variable for which the absolute frequency in the data is \(0\).

Validation:

##  [1] Beech  Beech  Beech  Beech  Beech  Spruce Spruce Spruce Spruce Spruce
## Levels: Beech Spruce
## [1] Beech Beech Beech Beech Beech
## Levels: Beech Spruce
## [1] Beech Beech Beech Beech Beech
## Levels: Beech
## Error in drop_ghosts(x = sub$species, lev = c("Beech", "Oak")): Please provide levels of correct length!
## [1] Beech Oak  
## Levels: Beech Oak Spruce
## [1] Beech Oak  
## Levels: Beech Oak
## [1] Beech Oak  
## Levels: Oak Beech
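One possible implementation that is consistent with the validation output above is sketched below. It is a reconstruction, not the author's code: the name drop_ghosts_sketch, the default lev = NULL, and the internal steps are assumptions; only the argument names x and lev and the error message are taken from the output.

```r
# Hypothetical reconstruction, not the original drop_ghosts
drop_ghosts_sketch <- function(x, lev = NULL) {
  stopifnot(is.factor(x))
  # levels that actually occur in the data
  present <- levels(x)[as.vector(table(x)) > 0]
  if (is.null(lev)) {
    lev <- present                     # default: keep the original order
  }
  if (length(lev) != length(present)) {
    stop("Please provide levels of correct length!")
  }
  factor(as.character(x), levels = lev)
}

species <- factor(rep(c("Beech", "Spruce"), each = 5))
sub     <- species[species == "Beech"]
levels(sub)                            # "Beech" "Spruce": Spruce is a ghost
levels(drop_ghosts_sketch(sub))        # "Beech"

# Too many levels for the data at hand: an error is raised
res_err <- try(drop_ghosts_sketch(sub, lev = c("Beech", "Oak")), silent = TRUE)
inherits(res_err, "try-error")         # TRUE

# lev can also be used to reorder the remaining levels
y <- factor(c("Beech", "Oak"), levels = c("Beech", "Oak", "Spruce"))
levels(drop_ghosts_sketch(y, lev = c("Oak", "Beech")))   # "Oak" "Beech"
```

Base R offers droplevels() for the plain case without reordering.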

4.4.2 overlap_seq

This function generates an overlapping sequence, i.e. it takes a numeric variable x and a numeric step length delta and calculates a sequence at multiples of delta that has a minimum below or exactly at the minimum of x, and a maximum above or exactly at the maximum of x.

Validation:

## [1]   0 100 200 300
## [1]   0 100 200 300
## [1]   0 100 200 300
## [1] -100    0  100  200  300  400
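A sketch of how such a function could be written, based only on the description above. The name overlap_seq_sketch and the implementation are assumptions; the original function may take further arguments (the last line of the validation output suggests so), which are not reconstructed here.

```r
# Hypothetical reconstruction, not the original overlap_seq
overlap_seq_sketch <- function(x, delta) {
  stopifnot(is.numeric(x), is.numeric(delta), length(delta) == 1, delta > 0)
  # largest multiple of delta that is <= min(x)
  lower <- floor(min(x, na.rm = TRUE) / delta) * delta
  # smallest multiple of delta that is >= max(x)
  upper <- ceiling(max(x, na.rm = TRUE) / delta) * delta
  seq(from = lower, to = upper, by = delta)
}

s1 <- overlap_seq_sketch(x = c(12, 250), delta = 100)   # 0 100 200 300
s2 <- overlap_seq_sketch(x = c(0, 300),  delta = 100)   # 0 100 200 300
```

In both calls the sequence covers the full range of x, whether or not the extremes of x fall exactly on a multiple of delta.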

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.


  1. Private webpage: uncertaintree.github.io