Introduction to R: Session 01
May 16, 2024 (Version 0.6)
All contents are licensed under CC BY-NC-ND 4.0.
1 (Int)R(o)
1.1 What is R?
‘A Language for Data Analysis and Graphics’
A first example: Scatter plot, data-management and descriptive analysis
## A scatter-plot ...
library("ggplot2")
library("colorspace")
ggplot(data = drought,
aes(x = elev, y = bair, color = species, shape = species)) +
geom_point(size = 2) +
geom_vline(xintercept = 1000, linetype = "dashed") +
geom_hline(yintercept = 1, linetype = "dashed") +
scale_color_discrete_qualitative(pal = "Dark") +
labs(x = "Elevation [m]", y = "Basal area increment ratio [%]")
## [1] 335 1540
## [1] 1 6
br <- seq(tmp[1] * 250, (tmp[2] + 1) * 250, by = 250)
(drought$elev_cut <- cut(drought$elev, breaks = br))
## [1] (250,500] (250,500] (250,500] (500,750]
## [5] (500,750] (500,750] (500,750] (500,750]
## [9] (500,750] (750,1e+03] (750,1e+03] (750,1e+03]
## [13] (1e+03,1.25e+03] (1e+03,1.25e+03] (1e+03,1.25e+03] (1e+03,1.25e+03]
## [17] (1e+03,1.25e+03] (1e+03,1.25e+03] (1e+03,1.25e+03] (1.25e+03,1.5e+03]
## [21] (1.25e+03,1.5e+03] (1.25e+03,1.5e+03] (1.5e+03,1.75e+03] (250,500]
## [25] (250,500] (500,750] (500,750] (500,750]
## [29] (750,1e+03] (750,1e+03] (1e+03,1.25e+03] (1e+03,1.25e+03]
## [33] (1e+03,1.25e+03] (1e+03,1.25e+03]
## 6 Levels: (250,500] (500,750] (750,1e+03] ... (1.5e+03,1.75e+03]
## ... and then some descriptive statistics:
library("plyr")
ddply(.data = drought, .variables = c("species", "elev_cut"),
summarize, .drop = F,
n = length(bair),
mean_bair = mean(bair))
## species elev_cut n mean_bair
## 1 Beech (250,500] 2 0.6515000
## 2 Beech (500,750] 3 0.6733333
## 3 Beech (750,1e+03] 2 0.8570000
## 4 Beech (1e+03,1.25e+03] 4 0.9827500
## 5 Beech (1.25e+03,1.5e+03] 0 NaN
## 6 Beech (1.5e+03,1.75e+03] 0 NaN
## 7 Spruce (250,500] 3 0.5586667
## 8 Spruce (500,750] 6 0.5205000
## 9 Spruce (750,1e+03] 3 0.6103333
## 10 Spruce (1e+03,1.25e+03] 7 0.7952857
## 11 Spruce (1.25e+03,1.5e+03] 3 1.1050000
## 12 Spruce (1.5e+03,1.75e+03] 1 0.9830000
1.2 Why useR?
Because of packages such as mgcv
!
… and because we can organize our whole working-with-data-process reproducibly (markdown!) in one place!
(Note: the following figure is much too overloaded!)
## library("mgcv")
m <- mgcv::gam(bud_burst_days_since_may1st ~ te(year), data = frost)
nd <- data.frame(year = seq(min(frost$year), max(frost$year), by = .5))
tmp <- mgcv::gam.mh(m, thin = 5, ns = 2000, rw.scale = 2)$bs
coda::effectiveSize(coda::as.mcmc(tmp))
tmp <- predict(m, newdata = nd, type = "lpmatrix") %*% t(tmp)
tmp <- apply(X = tmp, MAR = 1 , FUN = quantile,
probs = seq(.01, .99, by = .01))
tmp <- data.frame(q = as.numeric(tmp),
year = rep(nd$year, each = nrow(tmp)),
p = rep(seq(.01, .99, by = .01), ncol(tmp)))
library("ggrepel")
ggplot(data = tmp, aes(x = year, y = q)) +
geom_line(aes(group = p, color = p)) +
scale_color_continuous_sequential(pal = "Light Grays") +
geom_point(data = frost, aes(x = year, y = bud_burst_days_since_may1st)) +
geom_text_repel(data = frost, aes(x = year, y = bud_burst_days_since_may1st,
label = end_1st_dev_stage - bud_burst),
min.segment.length = 0, box.padding = .5) +
labs(x = "Year", y = "Bud burst [Days since May, 1st]",
title = "At least one frost event during 1st dev. stage",
subtitle = "Labels give time [days] between bud burst and end of 1st development stage",
color = "Posterior probability of conditional expectation:") +
theme_bw() +
theme(legend.position = "bottom")
1.3 Literature
1.3.1 Books…
- Everitt, Hothorn (2006): A Handbook of Statistical Analyses using R. Chapman and Hall.
- Ligges (2008): Programmieren mit R. Springer Verlag.
- Venables, Smith (2002): An introduction to R. Network Theory Verlag.
- Verzani (2005): Using R for Introductory Statistics. Chapman and Hall.
- Wickham (2016): Advanced R.
1.3.2 … and the wide web:
1.4 R and RStudio
R:
- https://r-project.org
- Freely available
- Supported by all major operating system: Windows, Linux/Unix, Mac OS, …
RStudio and others editors for working with R:
- RStudio for Windows, Mac OS and Linux:
- RStudio website
- RStudio cheat sheet
- Previously, I presented a list of alternatives to RStudio, but in 2024, that doesn’t seem suitable anymore!?
1.5 R is a language
‘To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.’ (John M. Chambers)
Interpreted language: R interprets and evaluates your function calls without a compilation step.
Simple syntax with clear similarities to mathematical notation.
1.6 R is an environment for statistical computing
- Rich toolbox for doing (graphical) statistics with an ever growing list of ever improving ‘add-on’ packages.
- R is freely available for anyone.
- R code is the product: it is transparent and allows you to reproduce any single step of your analysis.
1.7 A short history of R
- 1976: Development of programming language S at Bell Laboratories
- 1988: Software S-PLUS released (for purchase)
- 1992: Ross Ihaka and Robert Gentleman start project R
- 1995: R available for free under GPL
- 1998: Comprehensive R archive network (CRAN) is founded
- 2000: First ‘complete’ version R-1.0.0 released
- 2004: First conference on R (useR!) takes place
- 2021: Current version is R-4.4.0 (May 2024)
2 Basic vector types
In R, there is no such thing as a single number, since in R, a single number is just a vector of length 1:
## [1] 1
R works on six basic vector types, of which four are used in our daily work as empirical researchers:
logical
(TRUE
andFALSE
),integer
(numbers we know as …,-1
,0
,1
,2
, …; explicitly created in R with suffixL
),numeric
(extending integers to ‘any’ value such as1/3
), andcharacter
('R'
,'Beech'
, …).
## [1] "logical"
## [1] "integer"
## [1] "numeric"
## [1] "character"
Again, the difference between integer and numeric (real number) is not important for our applied work, so just think of integer
and numeric
as classes that represent numerical values in a computer.
The following section introduces some manipulations we can do with them, logical
and character
will be treated later.
3 Basic mathematical commands
Mathematical operation | Command |
---|---|
Addition | + |
Subtraction | - |
Multiplication | * |
Division | \ |
Exponentiation | ^ |
Rest of integer division (Modulo) | %% |
Integer division | %\% |
Brackets | () |
3.1 Exercises
Run the following lines.
Note:
First and third line define objects a
and b
by using an indexing command on data-object drought
. You don’t need to fully understand what this doeas at this point of the course, just think that objects a
and b
each represent a single number with some applied meaning (Basal area increment ratio at 1st, and 2nd plot, respectively).
a <- drought$bair[1] ## Basal area increment ratio at 1st plot
a ## ... growth is about a half in 2003 in comparison to growth in 2002
b <- drought$bair[2] ## Basal area increment ratio at 2nd plot
b ## ... growth is about two thirds in 2003 in comparison to growth in 2002
a - b ## Whats the difference of bair?
a / b ## Whats the ratio of bair?
b %/% a ## How often is 'b' contained in 'a'?
b %% a ## If 'b' is contained once in 'a', what's remaining?
(a + b)/2 ## Arithmetic mean of 'a' and 'b'
Did we just calculate an arithmetic mean ‘by hand’?
## [1] 0.5765
4 Symbols and values
Objective | Call |
---|---|
Decimal sign | . |
List and seperate objects, arguments, … | , |
Several R-calls in one line (not recommended) | ; |
‘from-to’ operator for integer sequences | : |
Comments and help | # , ? |
Number \(\pi\) | pi |
A concept called infinity | Inf |
Base 10 exponential notation (eg. \(10^{-3}=0.001\)) | 1e-3 |
Integer value | L |
Empty object | NULL |
Missing value (not available) | NA |
Non-defined value (not a number) | NaN |
Comparison | > , >= , == , != , <= , < |
Boolean | TRUE , T , FALSE , F |
Negation (not) | ! |
Conjunction (and) | & |
Disjunction (or) | | |
Remember, everything in R is an object or a function, so we even can’t use ,
without a function call such as:
## [1] 0.505 0.648
4.1 Exercises
Run the following lines and ask yourself What is the applied question behind each line?.
5 Mathematical functions
Mathematical function | Call |
---|---|
Exponential function with basis \(e\) | exp() |
Natural logarithm | log() |
Square root | sqrt() |
Absolute value | abs() |
Trigonometric functions | sin() , cos() , tan() |
Sum and product | sum() , prod() |
Round (up and down) | round() (floor() , ceiling() ) |
Maximum and minimum value | max() , min() |
Factorial \(n! = 1\cdot 2\cdot 3\cdot\ldots\cdot n\) | factorial() |
Binomial coefficient \(n \choose k\) | choose() |
… for further functions, see ?Special
and ?groupGeneric
.
Also note: There are only countable many numbers that a computer is able to represent. Therefore, computers can also only represent some real numbers, and as a consequence, ‘rounding’ is more complicated than we might think! Here are some explanations why R results in 0 2 2 4 4 6 6
instead of 0 1 2 3 4 5 6
when it does round(seq(0.5, 6.5))
. Many thanks to Jan Schick for pointing me towards this ‘rounding’ standard in R, as well as providing me this nice example. Session 05 will also briefly touch upon R and computers and the ‘density’ of real numbers.
5.1 Exercises
Run the following lines.
a <- drought$bair[1]; a; b <- drought$bair[2]; b ## Basal are increment ratio values of first two stands
exp(x = a)
sqrt(x = a)
a - b
abs(x = a - b)
sum(a, b)
max(a, b)
Computers construct real values!
## [1] 1.224647e-16
## [1] 0
## [1] -1
6 Functions basics
(We will have a more detailed focus on functions in the third session.)
Basic properties of functions in R:
- Function call of function
f
using brackets:f()
- Documentation (‘help-page’) using preceded question mark (
?f
), or callinghelp(f)
- Search for documentation of
f
by:help.search('f')
- Syntax:
- Arguments often have default objects, eg. for
pch
inplot()
:
- Providing argument names is not optional in any case:
## [1] 1
## [1] 1
## Error in plot.xy(xy, type, ...): unzulässiger Plottyp '1'
7 Documentation
Structure:
- Description: In brief, what is this function about?
- Usage: How to call the function? What are mandatory arguments?
- Arguments: Explaining each argument
- Details: Some text about implementation, scope, …
- Value: Explaining the resulting object
- Authors and References and See also and …
- Examples: Always at the bottom, always scroll down there, you will never be disappointed!
7.1 Exercises
Go to the documentation for lm
.
- What is this function about?
- Copy-paste and run the example.
8 Objects
Basic Conventions:
- Objects store values, results or algorithms, ie. functions.
- Assignment of contents by
<-
or=
(or very seldomly->
). - Type of contents is described by object classes.
## [1] 0.505
## [1] 0.505
## [1] -0.4338646
## [1] 0.648
## [1] 0.648
8.1 Objekt names
There are few hard technical restrictions on how to name objects in R, and a few soft rules that make life much simpler:
- Object names are requared to begin with a lower or upper case letter, no numbers or any other signs are allowed.
- For any trailing character, numbers and some signs (restrict yourself to underscore
_
and.
) are allowed - Don’t use
?
,$
,%
,^
,&
,*
,(
,)
,-
,#
,?
,,
,<
,>
,/
,|
,\
,[
,]
,{
, and@
- As short as possible, as long as needed!
- This trade-off is simple, but (hopefully) still very useful: long object names mean more typing (RStudio weakens this point by auto-complete), but leave less room for questions on content (What was
a
again?). - So try to give ‘telling names’, ie. contents (and units!) should be clearly visible from the object’s name:
first_bair_measurement
is better thana
,elev_m
is better thanelev
. - Ask yourself: Will I be able to immediately recall the content of an object by the object name, even after two weeks no work on this project?
We get an error message:
## Error: <text>:1:2: unerwartetes Symbol
## 1: 3values
## ^
8.2 Save and load
Where will my files be stored? See section Preparatory Work in R in Session 04.
8.3 Litter service
- Names of objects in current session:
ls()
- Litter service using
rm()
dput()
comes as a helper
## [1] "a" "b" "br" "df" "drought"
## [6] "frost" "m" "nd" "three_values" "tmp"
## c("a", "b", "br", "df", "drought", "frost", "m", "nd", "three_values",
## "tmp")
## c("a", "b", "br", "m", "nd", "three_values", "tmp")
## [1] "df" "drought" "frost"
9 Dataobject classes
9.1 Vector
A vector is a one-dimensional combination of length 1 vectors of the same basic vector type.
9.1.1 Basics
Objective | Call |
---|---|
Combination of elements | c(A, B) |
Indexing | c(A, B)[1] |
Length of vector | length(x) |
Integer sequence | A:B |
Flexible sequence | seq(A, B, length = N) , seq(A, B, by = K) |
Repeat | rep(A, times = N) , rep(c(A, B), each = K) |
Sort elements | sort(c(A, B), decreasing = FALSE) |
Order of elements | order(c(A, B)) |
9.1.2 Exercises
Copy-paste and run each line: describe what it does and what it results in.
a <- drought$elev[1]; b <- drought$elev[2]
(x <- c(a, b))
c(x, a)
c(x, x)
x[2] <- drought$elev[3]; x
length(x)
a:b
seq(a, b, by = .5)
seq(a, b, by = 5)
seq(a, b, length = 10) ## Sequenz
rep(x, times = 3)
rep(x, each = 3)
rep(x, length = 7)
sort(x, decreasing = TRUE)
order(c(x, 100))
a <- as.character(drought$species[1]); b <- as.character(drought$species[2])
(x <- c(a, b))
c(x, "Beech")
c(x, 2)
9.2 Calculations
- Basic calculations operate independently on all elements
- Function calls work (usually) with vectorized arguments
- Summary of the content:
summary()
- Frequency table:
table()
(see Session 04 for much more detail)
9.3 Matrix
A vector is a two-dimensional combination of single-element objects of the same class.
9.3.1 Basics
P
\(\times\)P
matrix with content content
:
Fill matrix column- (default) or line-wise using argument byrow = TRUE
:
- Indexing of one element:
A[1, 1]
- Indexing of first row (result is vector):
A[1, ]
- Indexing of several rows (result is matrix):
A[1:3, ]
- Indexing of first column (result is vector):
A[, 1]
- Indexing of several columns (result is matrix):
A[, 1:3]
- Indexing of several rows and columns (result is matrix):
A[1:3, 1:3]
9.3.2 Exercises
9.4 List
A list is a general ‘container’ object.
9.4.1 Basics
- Lists may contain elements storing objects of varying classes, lengths, …. (
character
,numeric
,integer
,factor
, …) - Construct a list using
list(...)
- Indexing:
x[[1]]
orx[['name']]
orx$name
str
gives the structure- For lists of ‘consistent’ classes (or generic functions):
9.5 Synthesis and indexing
- We may provide names for elements of vectors, matrices and lists:
dimnames(A)
,rownames(A)
,colnames(A)
for matrices- Select a named element by using the
$
sign - Select elements by
[[]]
(for lists),[]
(for vectors), and[, ]
(for matrices).
9.6 Dataframe
Based on their object properties, dataframes can be classified between matrices and lists, whereby the columns (the so-called ‘variables’) of the dataframe correspond to the elements of a list:
- Dataframes are more rigid than lists because all columns must have the same length,
- Dataframes are more flexible than matrices, since all columns can contain different contents (numbers, strings, …).
- Dataframes are generated with
data.frame ()
- Indexing:
- Rows and columns can be indexed in the same way as the elements of a matrix,
- Columns can be indexed in the same way as the elements of a list (we have already used this by
drought$elev
).
9.6.1 Exercises
weather <- data.frame(day = c("Monday", "Tuesday", "Wednesday"),
daily_mean_temperature_C = c(12, 14, 11),
precipitation_sum_mm = c(5, 9, 25),
site = "Goettingen")
weather
weather[1:2, ]
weather$day[1:2]
weather[["daily_mean_temperature_C"]][3]
weather[1:2, c("daily_mean_temperature_C", "precipitation_sum_mm")]
weather[1:2, c("day", "precipitation_sum_mm")]
9.7 Functions on dataframes
summary
: Summary of each of the variables in the dataframestr
: overview of the structure of the dataframedim
: Dimension of the dataframe (number of rows and columns)- The first
N
lines of a dataframedf
are extracted withhead(df, n = N)
, and - the last
N
lines of a dataframedf
are extracted withtail (d, n = N)
.
10 Functions for character strings
If we want to change character strings (such as variable names of a dataframe), the following functions can be helpful:
tolower
und toupper
Replaces uppercase with lowercase letters (and vice versa).
## [1] "BEECH" "BEECH" "BEECH"
## [1] "beech" "beech" "beech"
gsub
Replace a pattern:
## [1] "day" "daily_mean_temperature_C"
## [3] "precipitation_sum_mm" "site"
names(weather) <- gsub(names(weather), pattern = "_mm", replacement = "__mm", fixed = T)
names(weather) <- gsub(names(weather), pattern = "_C", replacement = "__C", fixed = T)
names(weather)
## [1] "day" "daily_mean_temperature__C"
## [3] "precipitation_sum__mm" "site"
substring
Substrings from first
to last
.
## [1] "day" "daily" "preci" "site"
strsplit
Splits strings according to a certain pattern (strsplit
always returns a list).
names(weather) <- gsub(names(weather), pattern = "__", replacement = "_", fixed = T)
(tmp <- strsplit(names(weather), split = "_", fixed = T))
## [[1]]
## [1] "day"
##
## [[2]]
## [1] "daily" "mean" "temperature" "C"
##
## [[3]]
## [1] "precipitation" "sum" "mm"
##
## [[4]]
## [1] "site"
11 Factors
For the analysis of qualitative characteristics, it makes sense to represent variables with character strings as factors.
A factor comes with three ingredients:
- data vector
x
levels
labels
We can generate a factor without specification of levels and labels:
In this case, factor levels (levels
) are generated (in alphabetical order), numbered internally (see this numeric expression using as.numeric
), and represented externally by the original values of the character string.
## [1] Beech Spruce Beech
## Levels: Beech Spruce
## $levels
## [1] "Beech" "Spruce"
##
## $class
## [1] "factor"
## [1] "Beech" "Spruce"
## [1] 1 2 1
## [1] "Beech" "Spruce" "Beech"
We can provide levels, to which the numerical representation will refer to:
## [1] Beech Spruce Beech
## Levels: Spruce Beech
## $levels
## [1] "Spruce" "Beech"
##
## $class
## [1] "factor"
## [1] "Spruce" "Beech"
## [1] 2 1 2
## [1] "Beech" "Spruce" "Beech"
Using labels, the values represented externally are the result of the mapping from levels
to labels
(tmp <- factor(x = c("Beech", "Spruce", "Beech"),
levels = c("Beech", "Spruce"),
labels = c("Fagus sylvatica", "Picea abies")))
## [1] Fagus sylvatica Picea abies Fagus sylvatica
## Levels: Fagus sylvatica Picea abies
## $levels
## [1] "Fagus sylvatica" "Picea abies"
##
## $class
## [1] "factor"
## [1] "Fagus sylvatica" "Picea abies"
## [1] 1 2 1
## [1] "Fagus sylvatica" "Picea abies" "Fagus sylvatica"
11.1 Exercises
tmp <- factor(x = c("Beech", "Spruce", "Beech", "Oak"),
levels = c("Beech", "Spruce"),
labels = c("Fagus sylvatica", "Picea abies"))
attributes(tmp)
levels(tmp)
as.numeric(tmp)
as.character(tmp)
tmp <- factor(x = c("Beech", "Spruce", "Beech", "Oak"),
levels = c("Beech", "Oak", "Spruce"),
labels = c("Deciduous", "Deciduous", "Conifer"))
attributes(tmp)
levels(tmp)
as.numeric(tmp)
as.character(tmp)
References
Private webpage: uncertaintree.github.io↩︎