R (https://www.r-project.org) is a free, open-source software environment for statistical computing and graphics, with many excellent packages for importing and manipulating data, statistical modeling, and visualization. R can be downloaded and installed from CRAN, the Comprehensive R Archive Network (https://cran.rstudio.com). We recommend running R through RStudio, an integrated development environment (IDE) that makes it easier to interact with R and can be downloaded for free from https://www.rstudio.com/products/rstudio/download.
To install R, go to the home website of R www.r-project.org and
To install IDE RStudio, go to www.rstudio.com and
The figure below shows a snapshot of the RStudio IDE with the following four panes:
Typing commands in the R console
Writing code in script files
source()
Type the equation in the command window after the > symbol.
10^2 + 36
## [1] 136
# Variable assignment
x <- 5 + 5
x
## [1] 10
If we use brackets and forget to add the closing bracket, the > on the command line changes into a +. The + can also mean that R is still busy with some heavy computation. If we want R to quit what it was doing and give back the >, press ESC.
We can store our commands in scripts. These scripts have file names with the extension .R. We can open an editor window to edit these files by clicking File and New or Open file…
We can run the whole script with the console command source(), e.g. for the script in the file foo.R
source("foo.R")
The working directory is the folder on our computer in which we are currently working.
# returns path for the current working directory
getwd()
# set the working directory to a specified directory
setwd("path/of/directory")
Within RStudio we can also go to Session - Set working directory - Choose directory.
With the standard installation of R, most common packages are installed. If we need additional functionality, we can also install R packages. The Comprehensive R Archive Network (CRAN) is the main repository for R packages.
To install a package from CRAN we type:
install.packages("packagename")
To attach a package to start using it we need to type:
library(packagename)
We can see a list of all installed packages in the Packages tab of RStudio, or by typing installed.packages() or library().
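Installing a package that is already present wastes time, so a common pattern is to check first. The following is a minimal sketch, not a required workflow: `is_installed()` is a hypothetical helper name, and `sf` is used only as an illustrative package name. It relies on `requireNamespace()`, which checks whether a package can be loaded without attaching it.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this is always available
is_installed("stats")

# Install only when missing (commented out so the sketch has no side effects);
# "sf" is an illustrative package name
# if (!is_installed("sf")) install.packages("sf")
# library(sf)
```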
ls() lists the objects in the environment. They can also be seen in the Environment tab of RStudio.
data() lists all available datasets.
rm(x, mx) removes the objects x and mx.
rm(list = ls()) removes all objects from R's memory.
We can get help on a specific function by typing ?functionname or help(functionname).
??functionname (short for help.search("functionname")) searches the documentation of the installed packages and lists the matching help pages together with the package that contains each one.
help.start() opens the HTML-based general help.
We can also search for answers in Google or on Stack Overflow (https://stackoverflow.com/).
System and user information can be retrieved with
Sys.info()
Version information about R, the operating system and attached packages
sessionInfo()
Package version
packageVersion("sf")
Vector of numbers
# vector consisting of 1, 5, and 10
# c() is the 'combine' function
c(1, 5, 10)
## [1] 1 5 10
# vector of integers between 1 and 10
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
# assign vector of integers between 1 and 10 to variable x
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
Regular sequences
# sequence of numbers from 1 to 21 by increments of 2
seq(from = 1, to = 21, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21
# sequence of 3 equally spaced numbers from 1 to 31
seq(1, 31, length.out = 3)
## [1] 1 16 31
Repeated sequences
# replicates x a specified number of times
rep(x = 1:4, times = 2)
## [1] 1 2 3 4 1 2 3 4
# each element of x is repeated each times
rep(x = 1:4, each = 2)
## [1] 1 1 2 2 3 3 4 4
Random numbers
n <- 10
# generate n random numbers between the default values of 0 and 1
runif(n)
## [1] 0.80825190 0.86440488 0.83792299 0.63020129 0.77438091 0.69731244
## [7] 0.04682129 0.79331743 0.35917790 0.39893421
# generate n random numbers between 0 and 25
runif(n, min = 0, max = 25)
## [1] 2.557973 19.409298 16.673220 8.037745 22.873311 23.103512 18.655786
## [8] 23.798240 12.469154 13.033757
# draw n random integers between 0 and 25 (with replacement)
sample(0:25, n, replace = TRUE)
## [1] 10 5 13 10 2 3 10 8 17 3
# draw n random integers between 0 and 25 (without replacement)
sample(0:25, n, replace = FALSE)
## [1] 13 0 12 15 2 5 24 18 19 14
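The random numbers shown above change on every run. When results need to be reproducible, the generator can be seeded with `set.seed()`. The sketch below uses an arbitrary seed value (123) chosen only for illustration.

```r
# Setting the same seed before each draw reproduces the same numbers
set.seed(123)
a <- runif(5)

set.seed(123)
b <- runif(5)

identical(a, b)   # TRUE: both draws are the same
```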
Rounding numeric values
x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75, 3.1, 3.45, 3.8, 4.15, 4.5, 4.85, 5.2, 5.55, 5.9)
# Round to the nearest integer
round(x)
## [1] 1 1 2 2 2 3 3 3 4 4 4 5 5 6 6
# Round up
ceiling(x)
## [1] 1 2 2 3 3 3 4 4 4 5 5 5 6 6 6
# Round down
floor(x)
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
# Round to a specified decimal
round(x, digits = 1)
## [1] 1.0 1.4 1.7 2.0 2.4 2.8 3.1 3.5 3.8 4.2 4.5 4.8 5.2 5.6 5.9
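Two details of rounding are easy to miss. First, round() sends values that lie exactly halfway between two integers to the even one, which is why round(4.5) above gives 4. Second, trunc() and signif() cover cases that round(), ceiling() and floor() do not. The values below are made up for illustration.

```r
y <- c(-2.7, 0.5, 1.5, 2.5, 1234.567)

# halves go to the nearest even integer
round(c(0.5, 1.5, 2.5))   # 0 2 2

# trunc() drops the fractional part, i.e. rounds toward zero
trunc(y)

# signif() keeps a given number of significant digits
signif(y, digits = 2)
```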
Counting elements
length(1:10)
## [1] 10
Sort a vector
sort(10:1)
## [1] 1 2 3 4 5 6 7 8 9 10
Unique elements
x <- c(1:3, 2:5)
# which elements are duplicated
duplicated(x)
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
# unique elements (duplicated deleted)
unique(x)
## [1] 1 2 3 4 5
x[!duplicated(x)]
## [1] 1 2 3 4 5
Character strings need to be in quotation marks and can have spaces
a <- "learning to create"    # create string a
b <- "character strings"     # create string b
paste() concatenates strings
# paste together strings a and b
paste(a, b)
## [1] "learning to create character strings"
# paste character and number (converts numbers to character class)
paste("The variable x is equal to", pi)
## [1] "The variable x is equal to 3.14159265358979"
# paste multiple strings with a separating character
paste("a", "b", "c", sep = "-")
## [1] "a-b-c"
# use paste0() to paste without spaces between characters
paste0("a", "b", "c", "R")
## [1] "abcR"
# paste objects with different lengths
paste("R", 1:5, sep = " v1.")
## [1] "R v1.1" "R v1.2" "R v1.3" "R v1.4" "R v1.5"
Case conversion
x <- "Learning To MANIPULATE strinGS in R"
tolower(x)
## [1] "learning to manipulate strings in r"
toupper(x)
## [1] "LEARNING TO MANIPULATE STRINGS IN R"
Extract or replace substrings in a character vector
alphabet <- paste(LETTERS, collapse = "")
# extract 18-24th characters in string
substr(alphabet, start = 18, stop = 24)
## [1] "RSTUVWX"
# replace 19th-24th characters with `R`
substr(alphabet, start = 19, stop = 24) <- "RRRRRR"
alphabet
## [1] "ABCDEFGHIJKLMNOPQRRRRRRRYZ"
strsplit() splits the string x into substrings at each occurrence of the substring given in split
strsplit(x = "aa-bb", split = "-")
## [[1]]
## [1] "aa" "bb"
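strsplit() always returns a list, with one character vector per input string, which is why the output above starts with [[1]]. The following sketch, using made-up date strings, shows how to reach into that list and how to collect one piece from every string.

```r
dates <- c("2015-07-01", "2016-12-25")

# a list with one character vector per input string
parts <- strsplit(dates, split = "-")

# pieces of the first string
parts[[1]]

# first piece (the year) of every string
years <- sapply(parts, function(p) p[1])
years
```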
Comparing numeric values
x <- 9
y <- 10
x < y     # is x less than y
## [1] TRUE
x > y     # is x greater than y
## [1] FALSE
x <= y    # is x less than or equal to y
## [1] TRUE
x >= y    # is x greater than or equal to y
## [1] FALSE
x == y    # is x equal to y
## [1] FALSE
x != y    # is x not equal to y
## [1] TRUE
Comparing a vector of several elements with a single value results in a logical vector
x <- c(5, 14, 10, 22)
x > 13
## [1] FALSE TRUE FALSE TRUE
%in% tests for group membership
3 %in% 1:10
## [1] TRUE
Logical vectors used in ordinary arithmetic are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1
x <- c(5, 14, 10, 22)
# how many elements in x are greater than 13?
sum(x > 13)
## [1] 2
Positions of the elements of a vector that are equal to a given value
which(3 == 1:10)
## [1] 3
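Because logical vectors are coerced to 0 and 1, several summaries follow directly from a comparison. A short sketch reusing the same example values:

```r
x <- c(5, 14, 10, 22)

any(x > 20)     # TRUE: at least one element exceeds 20
all(x > 3)      # TRUE: every element exceeds 3
mean(x > 13)    # 0.5: proportion of elements greater than 13
which(x > 13)   # 2 4: positions of those elements
```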
Factors are used to represent categorical data and can be unordered or ordered.
Creating a factor
gender <- factor(c("male", "female", "female", "male", "female"))
gender
## [1] male female female male female
## Levels: female male
class(gender)
## [1] "factor"
levels(gender)
## [1] "female" "male"
summary(gender)
## female male
## 3 2
Convert from characters to factors
group <- c("Group1", "Group2", "Group2", "Group1", "Group1")
str(group)
## chr [1:5] "Group1" "Group2" "Group2" "Group1" "Group1"
as.factor(group)
## [1] Group1 Group2 Group2 Group1 Group1
## Levels: Group1 Group2
Ordering levels
# when not specified, the levels are ordered alphabetically
gender <- factor(c("male", "female", "female", "male", "female"))
gender
## [1] male female female male female
## Levels: female male
# specifying order
gender <- factor(c("male", "female", "female", "male", "female"),
                 levels = c("male", "female"))
gender
## [1] male female female male female
## Levels: male female
Drop levels
gender <- gender[gender != "male"]
# let's say we have no observations in one level
summary(gender)
## male female
## 0 3
# we can drop that level if desired
droplevels(gender)
## [1] female female female
## Levels: female
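The examples above use unordered factors. For categories with a natural ranking, an ordered factor can be created with `ordered = TRUE`. The sketch below uses made-up size categories; the variable name `size` is only illustrative.

```r
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size

# comparisons follow the order of the levels
size < "large"

# the highest level present in the data
max(size)

# counts per level, in level order
table(size)
```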
Getting current date and time
Sys.timezone()
## [1] "Asia/Riyadh"
Sys.Date()
## [1] "2023-03-16"
Sys.time()
## [1] "2023-03-16 12:23:01 +03"
Converting strings to dates using as.Date(). The default date format is YYYY-MM-DD. For a complete list of formatting codes, type ?strftime
x <- c("2015-07-01", "2015-08-01", "2015-09-01")
as.Date(x)
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
y <- c("07/01/2015", "07/01/2015", "07/01/2015")
as.Date(y, format = "%m/%d/%Y")
## [1] "2015-07-01" "2015-07-01" "2015-07-01"
Creating date sequences
seq(as.Date("2010-1-1"), as.Date("2015-1-1"), by = "years")
## [1] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
## [6] "2015-01-01"
seq(as.Date('2015-09-15'), as.Date('2015-09-30'), by = "2 days")
## [1] "2015-09-15" "2015-09-17" "2015-09-19" "2015-09-21" "2015-09-23"
## [6] "2015-09-25" "2015-09-27" "2015-09-29"
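Once strings are converted to Date objects they support arithmetic and custom output formats. A brief sketch with two of the dates used above:

```r
d1 <- as.Date("2015-09-15")
d2 <- as.Date("2015-09-30")

d2 - d1                    # time difference of 15 days
as.numeric(d2 - d1)        # 15, as a plain number
d1 + 7                     # one week later: "2015-09-22"
format(d1, "%d/%m/%Y")     # "15/09/2015"
```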
The basic structure in R is the vector. A vector is a sequence of data elements of the same basic type: numeric, character, logical, factors, or dates.
Creating a vector using :, c(), seq() or rep()
# integer vector
w <- 8:17
w
## [1] 8 9 10 11 12 13 14 15 16 17
# numeric (double) vector
x <- c(0.5, 0.6, 0.2)
x
## [1] 0.5 0.6 0.2
# logical vector
y <- c(TRUE, FALSE, FALSE)
y
## [1] TRUE FALSE FALSE
# character vector
z <- c("a", "b", "c")
z
## [1] "a" "b" "c"
Coercing vectors. All elements of a vector must be of the same type. When we combine elements of different types (e.g. character and numeric), they are coerced to the most flexible of those types.
# numerics are turned to characters
str(c("a", "b", "c", 1, 2, 3))
## chr [1:6] "a" "b" "c" "1" "2" "3"
# logicals are turned to numerics...
str(c(1, 2, 3, TRUE, FALSE))
## num [1:5] 1 2 3 1 0
# or characters
str(c("A", "B", "C", TRUE, FALSE))
## chr [1:5] "A" "B" "C" "TRUE" "FALSE"
Often it is best to coerce explicitly with as.character(), as.double(), as.integer(), or as.logical().
class(state.region)
## [1] "factor"
state.region2 <- as.character(state.region)
class(state.region2)
## [1] "character"
Adding additional elements to a pre-existing vector
v1 <- 8:17
c(v1, 18:22)
## [1] 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Subsetting vectors
Subsetting with positive integers
v1
## [1] 8 9 10 11 12 13 14 15 16 17
v1[2]
## [1] 9
v1[2:4]
## [1] 9 10 11
v1[c(2, 4, 6, 8)]
## [1] 9 11 13 15
# note that we can duplicate index positions
v1[c(2, 2, 4)]
## [1] 9 9 11
Subsetting with negative integers will omit the elements at the specified positions
v1[-1]
## [1] 9 10 11 12 13 14 15 16 17
v1[-c(2, 4, 6, 8)]
## [1] 8 10 12 14 16 17
Subsetting with logical values will select the elements where the corresponding logical value is TRUE
v1[c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)]
## [1] 8 10 12 13 14 17
v1[v1 < 12]
## [1] 8 9 10 11
v1[v1 < 12 | v1 > 15]
## [1] 8 9 10 11 16 17
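The same bracket notation also works with element names and on the left-hand side of an assignment, which replaces the selected elements. A small sketch with a made-up named vector `v`:

```r
v <- c(a = 8, b = 9, c = 10, d = 11)

# select by name
v[c("a", "c")]

# replace the elements that satisfy a condition
v[v > 9] <- 0

# replace a single element by position
v[2] <- 99
v
```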
A list is an R structure that allows us to combine elements of different types
Creating Lists
l <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.5, 4.2))
str(l)
## List of 4
## $ : int [1:3] 1 2 3
## $ : chr "a"
## $ : logi [1:3] TRUE FALSE TRUE
## $ : num [1:2] 2.5 4.2
Adding additional list components to a list
We can add a new list component by using the $ sign and naming the new item
l1 <- list(1:3, "a", c(TRUE, FALSE, TRUE))
str(l1)
## List of 3
## $ : int [1:3] 1 2 3
## $ : chr "a"
## $ : logi [1:3] TRUE FALSE TRUE
l1$item4 <- "new list item"
str(l1)
## List of 4
## $ : int [1:3] 1 2 3
## $ : chr "a"
## $ : logi [1:3] TRUE FALSE TRUE
## $ item4: chr "new list item"
To add additional values to a list item, we subset that specific list item and then use the c() function to add the additional elements to it
l1[[1]] <- c(l1[[1]], 4:6)
str(l1)
## List of 4
## $ : int [1:6] 1 2 3 4 5 6
## $ : chr "a"
## $ : logi [1:3] TRUE FALSE TRUE
## $ item4: chr "new list item"
l1[[2]] <- c(l1[[2]], c("dding", "to a", "list"))
str(l1)
## List of 4
## $ : int [1:6] 1 2 3 4 5 6
## $ : chr [1:4] "a" "dding" "to a" "list"
## $ : logi [1:3] TRUE FALSE TRUE
## $ item4: chr "new list item"
Adding names to a pre-existing list
# name the first three items; item4 keeps its existing name
names(l1)[1:3] <- c("item1", "item2", "item3")
Subsetting lists
Subset list and preserve output as a list
# extract first list item
l1[1]
## $item1
## [1] 1 2 3 4 5 6
l1["item1"]
## $item1
## [1] 1 2 3 4 5 6
# extract multiple list items
l1[c(1, 3)]
## $item1
## [1] 1 2 3 4 5 6
##
## $item3
## [1] TRUE FALSE TRUE
l1[c("item1", "item3")]
## $item1
## [1] 1 2 3 4 5 6
##
## $item3
## [1] TRUE FALSE TRUE
Subset list and simplify output
# extract first list item and simplify to a vector
l1[[1]]
## [1] 1 2 3 4 5 6
l1[["item1"]]
## [1] 1 2 3 4 5 6
l1$item1
## [1] 1 2 3 4 5 6
Extract individual elements out of a specific list item
# extract third element from the second list item
l1[[2]][3]
## [1] "to a"
l1[["item2"]][3]
## [1] "to a"
l1$item2[3]
## [1] "to a"
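Beyond extracting items one at a time, it is often useful to apply a function to every item of a list. The sketch below builds a small list (with made-up contents) and uses the base functions lengths(), sapply() and lapply().

```r
l <- list(item1 = 1:6,
          item2 = c("a", "b"),
          item3 = c(TRUE, FALSE, TRUE))

# number of elements in each item
lengths(l)

# class of each item, simplified to a named character vector
sapply(l, class)

# a list with each item reversed
lapply(l, rev)
```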
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. In R, the elements that make up a matrix must be of a consistent mode (i.e. all elements must be numeric, or character, etc.).
Creating Matrices. Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m1
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
m2 <- matrix(letters[1:6], nrow = 2, ncol = 3)
m2
## [,1] [,2] [,3]
## [1,] "a" "c" "e"
## [2,] "b" "d" "f"
Matrices can also be created using the column-bind cbind() and row-bind rbind() functions. The vectors being bound must be of equal length.
v1 <- 1:4
v2 <- 5:8
cbind(v1, v2)
## v1 v2
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
rbind(v1, v2)
## [,1] [,2] [,3] [,4]
## v1 1 2 3 4
## v2 5 6 7 8
v3 <- 9:12
cbind(v1, v2, v3)
## v1 v2 v3
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
Adding on to matrices
m1 <- cbind(v1, v2)
m1
## v1 v2
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
# add a new column
cbind(m1, v3)
## v1 v2 v3
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
# add a new row
rbind(m1, c(4.1, 8.1))
## v1 v2
## [1,] 1.0 5.0
## [2,] 2.0 6.0
## [3,] 3.0 7.0
## [4,] 4.0 8.0
## [5,] 4.1 8.1
Adding names
m2 <- matrix(1:12, nrow = 4, ncol = 3)
m2
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
# the dimension attribute shows this matrix has 4 rows and 3 columns
attributes(m2)
## $dim
## [1] 4 3
# add row names as an attribute
rownames(m2) <- c("row1", "row2", "row3", "row4")
m2
## [,1] [,2] [,3]
## row1 1 5 9
## row2 2 6 10
## row3 3 7 11
## row4 4 8 12
attributes(m2)
## $dim
## [1] 4 3
##
## $dimnames
## $dimnames[[1]]
## [1] "row1" "row2" "row3" "row4"
##
## $dimnames[[2]]
## NULL
# add column names
colnames(m2) <- c("col1", "col2", "col3")
m2
## col1 col2 col3
## row1 1 5 9
## row2 2 6 10
## row3 3 7 11
## row4 4 8 12
attributes(m2)
## $dim
## [1] 4 3
##
## $dimnames
## $dimnames[[1]]
## [1] "row1" "row2" "row3" "row4"
##
## $dimnames[[2]]
## [1] "col1" "col2" "col3"
Subsetting matrices
A generic form of matrix subsetting looks like:
matrix[rows, columns]
# subset for rows 1 and 2 but keep all columns
m2[1:2, ]
## col1 col2 col3
## row1 1 5 9
## row2 2 6 10
# subset for columns 1 and 3 but keep all rows
m2[ , c(1, 3)]
## col1 col3
## row1 1 9
## row2 2 10
## row3 3 11
## row4 4 12
# subset for both rows and columns
m2[1:2, c(1, 3)]
## col1 col3
## row1 1 9
## row2 2 10
# use a vector to subset
v <- c(1, 2, 4)
m2[v, c(1, 3)]
## col1 col3
## row1 1 9
## row2 2 10
## row4 4 12
# use names to subset
m2[c("row1", "row3"), ]
## col1 col2 col3
## row1 1 5 9
## row3 3 7 11
Note that subsetting matrices with the [ operator will simplify the results to the lowest possible dimension. To avoid this we can use the drop = FALSE argument
# simplifying results in a named vector
m2[, 2]
## row1 row2 row3 row4
## 5 6 7 8
# preserving results in a 4x1 matrix
m2[, 2, drop = FALSE]
## col2
## row1 5
## row2 6
## row3 7
## row4 8
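Matrices also support arithmetic and row-wise or column-wise summaries. The following sketch uses the same 2 x 3 layout introduced earlier; the object name `m` is only illustrative.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

dim(m)             # 2 3
t(m)               # transpose: a 3 x 2 matrix
m * 2              # element-wise multiplication
m %*% t(m)         # matrix product: a 2 x 2 matrix
rowSums(m)         # 9 12
apply(m, 2, max)   # column maxima: 2 4 6
```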
A data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor).
Creating data frames
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi))
# structure of a data frame
str(df)
## 'data.frame': 3 obs. of 4 variables:
## $ col1: int 1 2 3
## $ col2: chr "this" "is" "text"
## $ col3: logi TRUE FALSE TRUE
## $ col4: num 2.5 4.2 3.14
# number of rows
nrow(df)
## [1] 3
# number of columns
ncol(df)
## [1] 4
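data.frame() has an argument stringsAsFactors that specifies whether character columns should be converted to factors. Since R 4.0.0 the default is FALSE, and the helper default.stringsAsFactors() that used to supply the default is deprecated, as the warning below shows. We can also state the choice explicitly by setting stringsAsFactors = FALSE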
default.stringsAsFactors()
## Warning: 'default.stringsAsFactors' is deprecated.
## Use '`stringsAsFactors = FALSE`' instead.
## See help("Deprecated")
## [1] FALSE
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi),
                 stringsAsFactors = FALSE)
# note that col2 is of class character
str(df)
## 'data.frame': 3 obs. of 4 variables:
## $ col1: int 1 2 3
## $ col2: chr "this" "is" "text"
## $ col3: logi TRUE FALSE TRUE
## $ col4: num 2.5 4.2 3.14
Converting pre-existing structures to a data frame
v1 <- 1:3
v2 <- c("this", "is", "text")
v3 <- c(TRUE, FALSE, TRUE)
# convert same length vectors to a data frame using data.frame()
data.frame(col1 = v1, col2 = v2, col3 = v3)
# convert a list to a data frame using as.data.frame()
l <- list(item1 = 1:3, item2 = c("this", "is", "text"), item3 = c(2.5, 4.2, 5.1))
as.data.frame(l)
# convert a matrix to a data frame using as.data.frame()
m1 <- matrix(1:12, nrow = 4, ncol = 3)
as.data.frame(m1)
Adding on to data frames
df
# add a new column
v4 <- c("A", "B", "C")
cbind(df, v4)
Adding attributes to data frames
rownames(df) <- c("row1", "row2", "row3")
df
attributes(df)
## $names
## [1] "col1" "col2" "col3" "col4"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] "row1" "row2" "row3"
Change the existing column names by using colnames() or names()
# add/change column names with colnames()
colnames(df) <- c("col_1", "col_2", "col_3", "col_4")
df
# add/change column names with names()
names(df) <- c("col.1", "col.2", "col.3", "col.4")
df
Subsetting data frames
# subsetting by row numbers
df[2:3, ]
# subsetting by row names
df[c("row2", "row3"), ]
# subsetting columns like a list
df[c("col.2", "col.4")]
# subsetting columns like a matrix
df[, c("col.2", "col.4")]
# subset for both rows and columns
df[1:2, c(1, 3)]
# use a vector to subset
v <- c(1, 2, 4)
df[, v]
Note that subsetting data frames with the [ operator will simplify the results to the lowest possible dimension. To avoid this we can set drop = FALSE.
# simplifying results in a vector
df[, 2]
## [1] "this" "is" "text"
# preserving results in a 3x1 data frame
df[, 2, drop = FALSE]
Subset data frames based on conditional statements
head(mtcars)
# using brackets
mtcars[mtcars$mpg > 20, ]
# using the subset function
subset(mtcars, mpg > 20)
# using brackets
mtcars[mtcars$mpg > 20 & mtcars$cyl == 6, ]
# using the subset function
subset(mtcars, mpg > 20 & cyl == 6)
# using brackets
mtcars[mtcars$mpg > 20 & mtcars$cyl == 6, c("mpg", "cyl", "wt")]
# using the subset function
subset(mtcars, mpg > 20 & cyl == 6, c("mpg", "cyl", "wt"))
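Two further operations come up often alongside subsetting: adding a computed column and sorting rows. The sketch below works on a copy of the built-in mtcars data; the names `cars`, `cars_sorted` and `wt_kg` are made up for illustration. The conversion relies on wt being recorded in units of 1000 lbs, as stated in ?mtcars.

```r
cars <- mtcars[, c("mpg", "cyl", "wt")]

# add a computed column: weight in kilograms (wt is in 1000 lbs)
cars$wt_kg <- cars$wt * 1000 * 0.4536

# sort rows by decreasing mpg
cars_sorted <- cars[order(-cars$mpg), ]
head(cars_sorted, 3)

# mean mpg for each number of cylinders
aggregate(mpg ~ cyl, data = cars, FUN = mean)
```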
A common task in data analysis is dealing with missing values. In R, missing values are often represented by NA (NA = not available)
Identifying missing values using is.na()
x <- c(1:4, NA, 6:7, NA)
is.na(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
is.na(df)
## col1 col2 col3 col4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE TRUE
is.na(df$col4)
## [1] FALSE FALSE FALSE TRUE
Location of NAs
# identify location of NAs in vector
which(is.na(x))
## [1] 5 8
Number of NAs
# identify count of NAs in data frame
sum(is.na(df))
## [1] 3
Total missing values in each column
colSums(is.na(df))
## col1 col2 col3 col4
## 1 1 0 1
Recode missing values
# vector with missing data
x <- c(1:4, NA, 6:7, NA)
# recode missing values with the mean
x[is.na(x)] <- mean(x, na.rm = TRUE)
round(x, 2)
## [1] 1.00 2.00 3.00 4.00 3.83 6.00 7.00 3.83
# data frame that codes missing values as 99
df <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2))
# change 99s to NAs
df[df == 99] <- NA
df
# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
# recode missing values in col4 with the column mean
df$col4[is.na(df$col4)] <- mean(df$col4, na.rm = TRUE)
df
Exclude missing values
# A vector with missing values
x <- c(1:4, NA, 6:7, NA)
# including NA values produces an NA output
mean(x)
## [1] NA
# excluding NA values calculates the mathematical operation for all non-missing values
mean(x, na.rm = TRUE)
## [1] 3.833333
Subset of observations (rows) in our data that contain no missing data
# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
complete.cases(df)
## [1] TRUE FALSE TRUE FALSE
# subset with complete.cases to get the complete rows
df[complete.cases(df), ]
# or subset with `!` operator to get the incomplete rows
df[!complete.cases(df), ]
# or use na.omit() to get the complete rows
na.omit(df)
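The single-column recoding shown earlier can be extended to every numeric column at once. This is only one possible approach, sketched with a made-up data frame `dat`; mean imputation is used here because it is the method already shown above, not because it is always appropriate.

```r
dat <- data.frame(col1 = c(1, 2, 3, NA),
                  col2 = c("this", NA, "is", "text"),
                  col4 = c(2.5, 4.2, 3.2, NA))

# identify the numeric columns
num_cols <- sapply(dat, is.numeric)

# replace NAs in each numeric column with that column's mean
dat[num_cols] <- lapply(dat[num_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

dat
colSums(is.na(dat))   # only the character column still has a missing value
```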
Text file formats use delimiters to separate the different elements in a line, and each record occupies its own line in the text file. To import a text file we can use read.table(), which is a multipurpose function in base R for importing data. read.csv() and read.delim() are special cases of read.table(). Alternatively, we can use the function read_csv() from the readr package or the function fread() from the data.table package, which are faster.
variable 1,variable 2,variable 3
10,beer,TRUE
25,wine,TRUE
8,cheese,FALSE
mydata <- read.csv("mydata.csv")
mydata
# this is the same
mydata <- read.table("mydata.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# invoke a spreadsheet-style data viewer on a matrix-like R object
View(mydata)
str(mydata)
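To try the import without an existing file, the sample contents shown above can first be written to a temporary file. In this sketch the path `csv_path` comes from tempfile() and is purely illustrative. Note that read.csv() turns the spaces in the header into dots by default.

```r
# write the sample contents to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("variable 1,variable 2,variable 3",
             "10,beer,TRUE",
             "25,wine,TRUE",
             "8,cheese,FALSE"),
           csv_path)

# import it
mydata <- read.csv(csv_path)

names(mydata)   # "variable.1" "variable.2" "variable.3"
str(mydata)
```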
To import Excel files, we can export the Excel file as a CSV file and then import it into R using read.csv(). We can also use the functions in the readxl package.
With the file.choose() function we can choose the file we want to open interactively through a file dialog.
d <- read.csv(file.choose())
The foreign package can be used to read data stored by other statistical software (e.g., .sav files from SPSS).
Importing R object files
load("mydata.RData")
load(file = "mydata.rda")
readRDS("mydata.rds")
We can use attach() to add a dataset to R's search path. After that we no longer need the $ sign to refer to the dataset; we can call its variables directly by name. For example:
# mean(Temp)
mean(airquality$Temp)
## [1] 77.88235
attach(airquality)
mean(Temp)
## [1] 77.88235
We should never attach two datasets that have the same variable names as this could lead to confusion.
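One way to avoid that kind of confusion is with(), which evaluates an expression inside a dataset without modifying the search path. A short sketch using the same airquality data (if a dataset has been attached, detach() removes it again):

```r
# evaluate an expression using the columns of airquality
with(airquality, mean(Temp))

# several columns can be used in the same expression:
# mean temperature for each month
with(airquality, tapply(Temp, Month, mean))
```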
write.table() is the multipurpose function in base R for exporting data. The functions write.csv() and write.csv2() are special cases of write.table() in which the defaults have been adjusted for comma-separated files. We can also use write_csv() from the readr package.
df <- data.frame(var1 = c(10, 25, 8),
                 var2 = c("beer", "wine", "cheese"),
                 var3 = c(TRUE, TRUE, FALSE),
                 row.names = c("billy", "bob", "thornton"))
# write to a csv file
write.csv(df, file = "export_csv")
# write to a csv and save in a different directory
write.csv(df, file = "folder/subfolder/subsubfolder/export_csv")
# write to a csv file with added arguments
write.csv(df, file = "export_csv", row.names = FALSE, na = "MISSING!")
Exporting R object files
Sometimes we may need to save data or other R objects outside of our workspace. There are three primary ways that people tend to save R data/objects: as .RData, .rda, or .rds files. .rda is just short for .RData; therefore, these file extensions represent the same underlying object type. We use the .rda or .RData file types when we want to save several, or all, objects and functions that exist in our global environment.
On the other hand, if we only want to save a single R object such as a data frame, function, or statistical model result, it's best to use the .rds file type. We can use .rda or .RData to save a single object, but the benefit of .rds is that it only saves a representation of the object and not the name, whereas .rda and .RData save both the object and its name. As a result, with .rds the saved object can be loaded into a named object within R that is different from the name it had when originally saved.
# save() can be used to save multiple objects in our global environment,
# in this case we save two objects to a .RData file
x <- stats::runif(20)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file = "xy.RData")
# save.image() is just a short-cut for 'save my current workspace',
# i.e. all objects in our global environment
save.image()
# save a single object to file
saveRDS(x, "x.rds")
# restore it under a different name
x2 <- readRDS("x.rds")
identical(x, x2)
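By contrast, load() restores objects under the names they had when saved. The sketch below illustrates this with a temporary file; the path `rdata_path` and the objects `p` and `q` are made up for the example.

```r
rdata_path <- tempfile(fileext = ".RData")

p <- 1:3
q <- "text"
save(p, q, file = rdata_path)

# remove the objects, then restore them from the file
rm(p, q)
restored <- load(rdata_path)

restored   # load() returns the names of the restored objects: "p" "q"
p
```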
plot() is the generic function for plotting R objects.
x <- rnorm(100)
plot(x)
plot(rnorm(100), type = "l", col = "red")
hist() computes and plots histograms
hist(rnorm(100))
d <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
plot(d$a, type = "l", ylim = range(d), lwd = 3, col = "red")
lines(d$b, type = "s", lwd = 2, col = "green")
points(d$c, pch = 20, cex = 4, col = "blue")
Type ?plot.default to learn more about the arguments of the function.
We will also see the package ggplot2 for plotting data: https://ggplot2.tidyverse.org/
Functions allow us to automate common tasks. Key steps to creating a new function:
# Define the function
fnRescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  return((x - rng[1]) / (rng[2] - rng[1]))
}
# Execute the function
x <- c(0, 5, 10)
fnRescale01(x)
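Functions can also take arguments with default values and check their inputs before doing any work. The following variation on the function above is only a sketch: the name `fnRescale` and the arguments `lower` and `upper` are hypothetical additions, not part of the original example.

```r
# Rescale x to the interval [lower, upper]; defaults reproduce the 0-1 case
fnRescale <- function(x, lower = 0, upper = 1) {
  # stop early with an error if the inputs are not usable
  stopifnot(is.numeric(x), lower < upper)
  rng <- range(x, na.rm = TRUE)
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

fnRescale(c(0, 5, 10))                # 0.0 0.5 1.0
fnRescale(c(0, 5, 10), upper = 100)   # 0 50 100
```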
The txtProgressBar() and setTxtProgressBar() functions from the utils package can be used to show a text progress bar in the R console. This is useful during long computations.
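A minimal sketch of how these two functions fit together is shown below. The loop body is a placeholder standing in for a long computation, and the number of iterations (20) is arbitrary.

```r
n <- 20

# create the progress bar; style = 3 shows a percentage
pb <- txtProgressBar(min = 0, max = n, style = 3)

total <- 0
for (i in seq_len(n)) {
  total <- total + i          # placeholder for a long computation
  setTxtProgressBar(pb, i)    # update the bar to the current iteration
}

# close the bar when finished
close(pb)
total
```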
2.3 Comments
Write comments with #
R is case sensitive, so we need to make sure we write capital letters where necessary.