Introduction to R
Notes from the Johns Hopkins Coursera Data Science Specialisation
R Basics
Assignment
x <- 1
Printing to screen
print(x)
# or just
x
Creation of an integer sequence
Vector of values 1 to 20
x <- 1:20
Objects / Data Types
R has five basic atomic classes of objects
- Character
- Numeric (real numbers)
  - NaN - Not a number
  - Inf - Infinity
- Integer
  - Integers must be created explicitly, as R creates a numeric by default. Use the suffix L, for example x <- 1L
- Complex (1 + 4i)
- Logical (TRUE, FALSE)
The basic object is a vector. A vector can only contain one object type. Lists can contain objects of different types.
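The types listed above can be checked with class(). The following is a minimal sketch; the variable names are illustrative only.

```r
# Inspect the class of a value of each atomic type
types <- c(
  class("a"),     # "character"
  class(1),       # "numeric"
  class(1L),      # "integer"
  class(1 + 4i),  # "complex"
  class(TRUE)     # "logical"
)
types

# NaN and Inf are special numeric values
pos_inf <- 1 / 0   # Inf
not_num <- 0 / 0   # NaN
class(pos_inf)     # "numeric"
class(not_num)     # "numeric"
```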
Vectors and Lists
# Example of using c() to perform concatenations on different types
x <- c(0.1, 0.3) # Numeric
x <- c(TRUE, FALSE) # Logical
# Create an empty vector of length 10 - the vector is initialised with default values (0 for numeric)
x <- vector("numeric", length = 10)
# Vectors cannot have mixed types but it will not error when types are mixed.
# By default R will coerce the data types to be the same.
# The below example would become a character vector
x <- c(1.7, "a")
# Explicit Coercion / Casting
x <- 0:6 # create integer sequence
as.numeric(x) # convert to a numeric sequence
as.character(x) # convert to a character sequence
# Lists example - note that a list can contain mixed types
x <- list(1, "a", TRUE, 1 + 4i)
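As a rough illustration of the coercion behaviour described above, the sketch below mixes types in c() and then attempts an explicit coercion that cannot succeed. The variable names are made up for the example.

```r
# Implicit coercion: the more general type wins
a <- c(TRUE, 2)     # logical is coerced to numeric
b <- c("a", TRUE)   # logical is coerced to character
a                   # [1] 1 2
b                   # [1] "a"    "TRUE"

# Explicit coercion that does not make sense produces NA (with a warning)
bad <- suppressWarnings(as.numeric(c("1", "b")))
bad                 # [1]  1 NA
```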
Matrices
Vector with a dimension attribute (dimension is an integer vector with a length of two)
# create a new matrix with two rows and three columns
x <- matrix(nrow = 2, ncol = 3)
# This will return the dimension attributes passed to the matrix method
dim(x) # [1] 2 3
# create a matrix populated with a sequence one to six
# Note: matrices are filled column by column (down the first column, then the second and so on), not row by row.
x <- matrix(1:6, nrow = 2, ncol = 3)
# A vector can be transformed into a matrix by adding a dimension to its attributes
x <- 1:10
dim(x) <- c(2,5) # two rows and five columns
# They can also be created by performing column-binding or row-binding
x <- 1:3
y <- 10:12
cbind(x, y) # take these two vectors and bind them as two separate columns
rbind(x, y) # take these two vectors and bind them as two separate rows
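To make the column-first fill order concrete, this small sketch compares the default with byrow = TRUE, an argument of matrix() that fills row by row instead. The variable names are illustrative.

```r
# Default: values fill down each column in turn
by_col <- matrix(1:6, nrow = 2, ncol = 3)
# byrow = TRUE: values fill across each row in turn
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]  # [1] 1 3 5
by_row[1, ]  # [1] 1 2 3
```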
Factors
A factor is a vector representing categorical data
- This data can be ordered or unordered
- Can be thought of as an integer vector where each integer has a label
- Are self describing so generally better than using integers. Male and Female make more sense than 1 and 2 for gender data.
# creation of a new factor
x <- factor(c("yes", "yes", "no"))
# frequency of occurrence
table(x)
# Return the vector as the integer version of itself
unclass(x)
# Something to note with factors is R will set the baseline to what comes alphabetically first.
# In the case of the example below this would be no. To force R to use yes as the baseline
# you can specify it through the levels attribute (important in linear modelling)
x <- factor(
c("yes", "yes", "no"),
levels = c("yes", "no")
)
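The effect of the levels argument can be seen by inspecting the levels and the underlying integer codes. A minimal sketch with illustrative variable names:

```r
answers <- c("yes", "yes", "no")

# Default: levels are sorted alphabetically, so "no" is the baseline
default_factor <- factor(answers)
levels(default_factor)      # [1] "no"  "yes"

# Explicit levels: "yes" becomes the baseline
custom_factor <- factor(answers, levels = c("yes", "no"))
levels(custom_factor)       # [1] "yes" "no"
as.integer(custom_factor)   # [1] 1 1 2
```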
Missing Values
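- NA - Not set / missing. NA values are not just numeric; there are also integer NA, character NA and so on.
- NaN - Not a number, the result of an undefined mathematical operation. A NaN value is also NA, but an NA value is not NaN
# Logical tests
# is.na() tests for NA (it is also TRUE for NaN)
# is.nan() tests for NaN only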
# Example of is.na()
x <- c(1, 2, NA)
is.na(x) # [1] FALSE FALSE TRUE
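The difference between the two tests shows up when a vector holds both kinds of missing value. A small sketch (the name `vals` is illustrative):

```r
vals <- c(1, NA, NaN)

na_flags  <- is.na(vals)    # [1] FALSE  TRUE  TRUE
nan_flags <- is.nan(vals)   # [1] FALSE FALSE  TRUE
```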
Data Frames
- Data frames are used to store tabular data.
- They are stored as a special type of list with each column being the same length.
- Each column can have different data types.
- Every row of a data frame has a name.
# Create a data frame with two columns, foo and bar
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
# list the number of rows
nrow(x)
# list the number of columns
ncol(x)
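The properties listed above (a class per column, a name per row) can be inspected directly. A minimal sketch, using an illustrative data frame called `df`:

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

col_classes <- sapply(df, class)  # foo is "integer", bar is "logical"
row_labels  <- row.names(df)      # default row names "1" to "4"

# Columns can be accessed like list elements
df$foo                            # [1] 1 2 3 4

# Rows can be selected with a logical vector
bar_rows <- df[df$bar, ]          # the two rows where bar is TRUE
```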
Names Attribute
x <- 1:3
# By default the integer sequence will not have any names associated with the values
names(x) # NULL
# Elements can be named though
names(x) <- c("foo", "bar", "something")
names(x) # [1] "foo" "bar" "something"
# Lists can also have names
x <- list(a = 1, b = 2)
# So can matrices - here we use dimnames, which takes a list(vector of row names, vector of column names)
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
Reading Data
Common reading methods in R
- read.table and read.csv - used for reading in tabular data
- readLines - for reading in lines from a text file
- source - reading in R code (inverse of dump)
- dget - reading in R code files (inverse of dput)
- unserialize - reading in R objects in a binary form
read.table(
file = "", # name of the file or a connection
header = TRUE, # does the first line hold the column names
sep = ",", # field separator, for example a comma for csv files
colClasses = c(), # character vector giving the class of each column
nrows = 5, # maximum number of rows to read
comment.char = "", # the comment character; "" turns comment handling off
skip = 0, # number of lines to skip at the start of the file
stringsAsFactors = TRUE # convert character columns to factors (the default is FALSE since R 4.0.0)
)
# "Generally" you can call read.table with only the file param
read.table("test.txt")
# read.csv will set the separator to comma
read.csv("test.csv")
Large Datasets
- Set comment.char = "" if the file contains no comments
- Set nrows if possible - can help R with memory management (you can over estimate)
- Set colClasses to the expected data types. This means R does not have to infer the type. You can also sample the data and then set the class types before performing a large read
initial <- read.table("data.txt", nrows = 100)
classes <- sapply(initial, class)
all <- read.table("data.txt", colClasses = classes)
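The sampling approach above can be tried end to end on a throwaway file. The sketch below is self-contained: the temporary file, its columns and the variable names are all invented for the illustration.

```r
# Write a small example file to a temporary location
path <- tempfile(fileext = ".txt")
example <- data.frame(id = 1:200, score = runif(200), label = "a")
write.table(example, path, row.names = FALSE)

# Read a sample of rows and let R infer the column classes
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Read the whole file with the classes supplied up front.
# nrows is deliberately an overestimate and comment handling is turned off.
full <- read.table(path, header = TRUE, colClasses = classes,
                   nrows = 250, comment.char = "")
n_read <- nrow(full)   # 200

unlink(path)           # remove the temporary file
```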
Textual Formats
- Data formats that contain contextual information like data type.
- Two examples are dump / source and dput / dget
Generally
- Not very space efficient
- Work nicely with version control
# dput
y <- data.frame(a = 1, b = "a")
dput(y) # this will print to the console
# dump / source
x <- "foo"
y <- data.frame(a = 1, b = "a")
dump(c("x", "y"), file = "data.R")
rm(x, y) # remove the variables that were created
source("data.R") # load in the dumped data
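dput can also write to a file, which dget then reads back. A minimal round-trip sketch, assuming a temporary file is an acceptable destination (the names `original` and `restored` are illustrative):

```r
original <- data.frame(a = 1, b = "a")

# Write the textual representation to a file
path <- tempfile(fileext = ".R")
dput(original, file = path)

# Read it back into a new object
restored <- dget(path)
same <- isTRUE(all.equal(original, restored))

unlink(path)   # remove the temporary file
```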
Connecting to external data
- file - open a file
- gzfile, bzfile - opens a compressed gzip / bzip2 file
- url - opens a webpage
# read in a csv file
con <- file("data.txt", "r")
data <- read.csv(con)
close(con)
# read some lines
con <- gzfile("words.gz")
x <- readLines(con, 10)
close(con)
# read a webpage
con <- url("http://www.google.com.au", "r")
x <- readLines(con)
close(con)
head(x)
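The open / read / close pattern can be tested without any existing files by writing a small compressed file first. This is only a sketch; the file and its contents are made up.

```r
path <- tempfile(fileext = ".gz")

# Open a gzip connection for writing, write three lines, then close it
con <- gzfile(path, "w")
writeLines(c("alpha", "beta", "gamma"), con)
close(con)

# Open the same file for reading and take the first two lines
con <- gzfile(path, "r")
first_two <- readLines(con, n = 2)
close(con)

first_two      # [1] "alpha" "beta"
unlink(path)   # remove the temporary file
```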
Subsetting
# Vectors
# With a single set of brackets the return type will be the same as the original
# For example the below vectors return another vector when accessed with the single set of brackets
x <- c("a", "b", "c", "d", "e")
x[1] # [1] "a"
x[1:2] # [1] "a" "b"
x[x > "d"] # [1] "e"
u <- x > "d" # [1] FALSE FALSE FALSE FALSE TRUE
x[u] # [1] "e"
# Lists
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
x[1] # $foo [1] 1 2 3 4
x[[1]] # [1] 1 2 3 4
x$bar # [1] 0.6
x[["bar"]] # [1] 0.6
x["bar"] # $bar [1] 0.6
x[c(1, 3)] # $foo [1] 1 2 3 4
# $baz [1] "hello"
# The double bracket has to be used over the $ when the name is calculated
name <- "foo"
x[[name]] # [1] 1 2 3 4
x$name # NULL
# Matrices
x <- matrix(1:6, 2, 3)
# By default when selecting single elements of a matrix a vector is returned
x[1, 2] # [1] 3
x[1, ] # [1] 1 3 5
x[, 2] # [1] 3 4
# Dropping of dimensions can be turned off with drop = FALSE, which returns a 1 x 1 matrix here
x[1, 2, drop = FALSE]
# Partial Matching
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
# find foo with a partial match
x$f # [1] 1 2 3 4
# Note that the double bracket operator by default looks for exact matches
x[["f"]] # NULL
x[["f", exact = FALSE]] # [1] 1 2 3 4
Removing NA values
x <- c(1, 2, NA)
bad <- is.na(x)
x[!bad] # [1] 1 2
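When several vectors need to stay aligned, complete.cases() from the stats package (attached by default) flags the positions where none of them is missing. A short sketch with illustrative vectors:

```r
nums  <- c(1, 2, NA, 4)
chars <- c("a", NA, "c", "d")

# TRUE only where both vectors have a value
good <- complete.cases(nums, chars)   # [1]  TRUE FALSE FALSE  TRUE

nums[good]    # [1] 1 4
chars[good]   # [1] "a" "d"
```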
for
x <- c("a", "b", "c", "d")
for(i in 1:4) {
print(x[i])
}
for(i in seq_along(x)) {
  # Example of next - skip the rest of the current iteration
  if(x[i] == "b") {
    next
  }
  print(x[i])
}
for(letter in x) {
print(letter)
}
for(i in 1:4) print(x[i])
while
count <- 0
while(count < 10) {
print(count)
count <- count + 1
}
repeat
count <- 0
repeat {
if(count < 10) {
count <- count + 1
} else {
break
}
}
Functions
# Basic function
add2 <- function(x, y) {
x + y
}
# Example of returning a vector with a default value in the method argument
above <- function(x, n = 10) {
use <- x > n
x[use]
}
# Calculate the mean of each matrix column (returns a vector of the column means)
columnmean <- function(y, removeNA = TRUE) {
  nc <- ncol(y)
  means <- numeric(nc)
  for(i in seq_len(nc)) {
    # na.rm is the argument of mean() that controls the removal of NA values
    means[i] <- mean(y[, i], na.rm = removeNA)
  }
  means
}
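A column mean function of this kind can be checked against the built-in colMeans(). The sketch below uses its own helper, `column_mean`, written for this illustration, so that it runs on its own.

```r
# Illustrative helper: mean of each column, optionally dropping NA values
column_mean <- function(y, remove_na = TRUE) {
  vapply(
    seq_len(ncol(y)),
    function(i) mean(y[, i], na.rm = remove_na),
    numeric(1)
  )
}

m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3)

with_removal    <- column_mean(m)                     # [1] 1.5 5.0
without_removal <- column_mean(m, remove_na = FALSE)  # [1]  NA   5
builtin         <- colMeans(m, na.rm = TRUE)          # [1] 1.5 5.0
```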
Handy methods
# Create a sequence from an integer. Similar to 1:5, can be paired with nrow
x <- seq_len(5)
x # [1] 1 2 3 4 5
Loop Functions
- lapply - loop over a list and evaluate a function on each of the elements
- sapply - same as lapply but try to simplify the result
- apply - apply a function over the margins of an array
- tapply - (table apply) apply a function over subsets of a vector
- mapply - multivariate version of lapply
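tapply and mapply are not covered further in these notes, so here is a brief sketch of each. The data and variable names are invented for the example.

```r
# tapply: apply a function to subsets of a vector defined by a grouping vector
values <- c(1, 2, 3, 4, 5, 6)
groups <- c("a", "a", "b", "b", "b", "c")
group_means <- tapply(values, groups, mean)   # a = 1.5, b = 4, c = 6

# mapply: apply a function over several argument vectors in parallel.
# This calls rep(1, 3), rep(2, 2) and rep(3, 1)
reps <- mapply(rep, 1:3, 3:1)
reps[[1]]   # [1] 1 1 1
reps[[2]]   # [1] 2 2
reps[[3]]   # [1] 3
```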
lapply
lapply takes a list (or attempts to coerce its input to a list) and always returns a list. The below example takes the list with elements a and b and returns the mean of each of those elements.
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)
# $a
# [1] 3
#
# $b
# [1] 0.0296824 (this value will vary as rnorm is random)
x <- 1:4
# runif returns random uniform values; its first argument is the number of values to generate.
# Here lapply passes 1 through 4 to runif in turn, so the first element of the result
# is a vector of length 1, the second a vector of length 2, and so on.
# Note that the arguments after the function name (min and max) are passed straight through to runif
lapply(x, runif, min = 0, max = 10)
sapply
sapply will try and simplify the result of lapply
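A short sketch of that simplification, using an illustrative list: when every result has length one sapply returns a vector, and when every result has the same length greater than one it returns a matrix.

```r
x <- list(a = 1:5, b = c(2, 4, 6))

as_list <- lapply(x, mean)    # a list with elements a = 3 and b = 4
means   <- sapply(x, mean)    # a named numeric vector: a = 3, b = 4
ranges  <- sapply(x, range)   # a 2 x 2 matrix, one column per list element
```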
apply
apply is used to evaluate a function over the margins of an array
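For a matrix, margin 1 refers to the rows and margin 2 to the columns. A minimal sketch with an illustrative matrix:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums  <- apply(m, 1, sum)    # [1]  9 12
col_means <- apply(m, 2, mean)   # [1] 1.5 3.5 5.5
```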
Debugging
traceback # prints the call stack after an error
debug # flags a function for debug mode
browser # suspends the execution of a function at the point where it is called
Generating random numbers
- rnorm - generate random normal variates with a given mean and standard deviation
- dnorm - evaluate the normal probability density at a point
- pnorm - evaluate the cumulative distribution function of the normal distribution
- rpois - generate random Poisson variates with a given rate
Probability distributions normally have the following four functions: d (for density), r (for random number generation), p (for cumulative distribution), q (for quantile function)
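As a quick illustration of the four prefixes, the sketch below uses the standard normal distribution (mean 0, standard deviation 1) for d, p and q, and a Poisson distribution with an arbitrarily chosen rate of 2 for r.

```r
density_at_0 <- dnorm(0)     # height of the density at 0, equal to 1 / sqrt(2 * pi)
prob_below_0 <- pnorm(0)     # P(X <= 0) = 0.5
median_value <- qnorm(0.5)   # the value with half the probability below it: 0

set.seed(1)
draws <- rpois(5, lambda = 2)   # five random Poisson counts with rate 2
```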
set.seed(1)
rnorm(5) # returns 5 random values
rnorm(5) # will return 5 different values
set.seed(1)
rnorm(5) # will return the same 5 that were generated the first time
# draw a random sample
sample(1:10, 4) # pick 4 entries from the vector 1-10
sample(letters, 4) # pick 4 random letters from the alphabet
sample(1:10, replace = TRUE) # sample with replacement, so the same value can be returned more than once
Basic functions
List what is in the current working directory
dir()
# R's current working directory
getwd()
# set the working directory
setwd("path/to/the/wd")
Load in an R file
source("mycode.R")