Session-1 Getting Started with R

In the world of data analysis and programming, everything boils down to the application of functions to data. Understanding how to manipulate and work with different types of data is fundamental to mastering the art of programming.

1.1 Functions and Data

  • Data: Data can take various forms, from simple values such as the number 4 or the string “four” to more complex structures such as matrices.

4, “four”, 4.000, \(\left[ \begin{array}{ccc} 4 & 4 & 4 \\ 4 & 4 & 4\end{array}\right]\)

  • Functions: Functions are the tools we use to process data. They can be as basic as addition or as intricate as logarithms, and they follow specific rules to transform input data into output, possibly with side effects.

\(\log{}\), \(+\) (two arguments), \(<\) (two), \(\mod{}\) (two), mean (one)

A function acts like a machine, taking input objects (arguments) and producing an output object (return value), all according to a predefined rule.
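To make the machine picture concrete, here is a minimal sketch of a user-defined function in R. The name rectangle_area and its arguments are made up for illustration and do not come from the text above.

```r
# Hypothetical example: two input objects (arguments) go in,
# one output object (the return value) comes out.
rectangle_area <- function(width, height) {
  width * height  # the value of the last expression is returned
}

area <- rectangle_area(3, 4)
area

# Built-in functions follow the same pattern: mean() takes one vector
average <- mean(c(2, 4, 6))
average
```

Calling rectangle_area(3, 4) applies the rule "multiply the two arguments", so the same function can be reused with any pair of numbers.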

1.2 Types of Data

As you delve deeper into programming, you’ll encounter different types of data:

  • Booleans: These are direct binary values, written TRUE or FALSE in R.

  • Integers: Whole numbers, including both positive and negative values, as well as zero.

  • Characters: These are fixed-length blocks of bits with special encoding. They are the building blocks of strings, which are sequences of characters.

  • Floating Point Numbers: These are numbers represented as a fraction (with a finite number of bits) times an exponent, like \(1.87 \times {10}^{6}\).

  • Missing or Ill-Defined Values: R provides special values such as NA (missing) and NaN (“not a number”, for example the result of 0/0) to represent missing or undefined data.

Understanding the intricacies of these data types is crucial for effective programming and data analysis. So, let’s dive in and explore the world of functions and data!
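As a quick sketch, the types listed above can be inspected with typeof(). The specific values are only illustrations.

```r
# One value of each basic type
type.logical   <- typeof(TRUE)     # "logical"
type.integer   <- typeof(4L)       # "integer"; the L suffix requests an integer
type.double    <- typeof(4)        # "double"; plain numbers are floating point
type.character <- typeof("four")   # "character"
type.sci       <- typeof(1.87e6)   # "double"; scientific notation for 1.87 x 10^6

# Missing and ill-defined values
is.na(NA)        # TRUE
0 / 0            # NaN
is.nan(0 / 0)    # TRUE
is.na(0 / 0)     # TRUE: is.na() also reports NaN as missing
```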

1.3 R as calculator

R is a versatile programming language that can be used as a powerful calculator. Its ability to perform basic arithmetic operations and more advanced mathematical calculations makes it a handy tool for quick calculations and data manipulation.

You can use R as a very, very fancy calculator

Command              Description
+, -, *, /           add, subtract, multiply, divide
^                    raise to the power of
%%                   remainder after division (ex: 8 %% 3 = 2)
( )                  change the order of operations
log(), exp()         natural logarithm and exponential (ex: log(10) = 2.303)
sqrt()               square root
round()              round to the nearest whole number (ex: round(2.3) = 2)
floor(), ceiling()   round down or round up
abs()                absolute value

Here are a few examples of using R as a calculator:

1.3.1 Basic Arithmetic Operations:

# Addition
3 + 5
## [1] 8
# Subtraction
10 - 7
## [1] 3
# Multiplication
4 * 6
## [1] 24
# Division
12 / 3
## [1] 4
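The command table in 1.3 also lists %% and parentheses, which the four examples above do not exercise. A short sketch with arbitrary numbers:

```r
# Remainder after division
rem1 <- 8 %% 3       # 2
rem2 <- 17 %% 5      # 2

# Parentheses change the order of operations
no.parens   <- 2 + 3 * 4      # 14: multiplication happens first
with.parens <- (2 + 3) * 4    # 20

# Exponentiation binds more tightly than a leading minus sign
neg.square <- -2^2      # -4
square.neg <- (-2)^2    # 4
```

When in doubt about precedence, adding parentheses costs nothing and makes the intent explicit.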

1.3.2 Exponents and Roots:

# Exponentiation
2^3
## [1] 8
# Square Root
sqrt(25)
## [1] 5

1.3.3 Trigonometric Functions:

# Sine
sin(pi/2)
## [1] 1
# Cosine
cos(pi)
## [1] -1
# Tangent
tan(pi/4)
## [1] 1

1.3.4 Logarithms:

# Natural Logarithm
log(10)
## [1] 2.303
# Common Logarithm (base 10)
log10(100)
## [1] 2

1.3.5 Absolute Value:

abs(-7)
## [1] 7

1.3.6 Rounding Numbers:

# Round to a specific number of decimal places
round(3.14159265, 2)
## [1] 3.14
# Round down
floor(4.9)
## [1] 4
# Round up
ceiling(4.1)
## [1] 5

1.3.7 Random Numbers:

# Generate a random number between 0 and 1
runif(1)
## [1] 0.4783
# Generate a random integer between 1 and 100
sample(1:100, 1)
## [1] 53

1.3.8 Comparisons:

Comparisons are binary operators: they take two objects, such as numbers, and return a Boolean.

7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE
7 <= 5
## [1] FALSE
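One caveat when comparing for equality: floating point numbers are stored with a finite number of bits, so two results that are mathematically equal can differ in the last digits. The sketch below uses all.equal(), which compares within a small tolerance. The numbers are only an illustration.

```r
# Exact comparison can fail for floating point results
exact.equal <- (0.1 + 0.2 == 0.3)
exact.equal   # FALSE

# Compare within a small tolerance instead
near.equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
near.equal    # TRUE
```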

1.3.9 Boolean operators:

Basically “and” and “or”:

(5 > 7) & (6*7 == 42)
## [1] FALSE
(5 > 7) | (6*7 == 42)
## [1] TRUE

(We will see the special doubled forms, && and ||, later.)

1.3.10 More types

  • typeof() function returns the type

  • is.foo() functions return Booleans for whether the argument is of type foo

  • as.foo() (tries to) “cast” its argument to type foo — to translate it sensibly into a foo-type value

Special case: as.factor() will be important later for telling R when numbers are actually encodings and not numeric values. (E.g., 1 = High school grad; 2 = College grad; 3 = Postgrad)

typeof(4)
## [1] "double"
is.numeric(4)
## [1] TRUE
is.na(4)
## [1] FALSE
is.character(4)
## [1] FALSE
is.character("4")
## [1] TRUE
is.character("four")
## [1] TRUE
is.na("four")
## [1] FALSE
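The examples above cover typeof() and the is.foo() family. A small sketch of the as.foo() casts follows. The education codes reuse the encoding mentioned above, the vector itself is made up, and factor() with labels is used alongside as.factor() only to show the human-readable names.

```r
# Casting between basic types
num.from.text <- as.numeric("4")      # 4
text.from.num <- as.character(4)      # "4"
int.from.dbl  <- as.integer(4.9)      # 4: the fractional part is dropped

# A cast that cannot succeed gives NA (R also issues a warning)
bad.cast <- suppressWarnings(as.numeric("four"))

# Numbers that are really category codes (hypothetical data)
education <- c(1, 3, 2, 1)
education.codes <- as.factor(education)
levels(education.codes)               # "1" "2" "3"

# factor() can attach readable labels to the same codes
education.labelled <- factor(education, levels = 1:3,
                             labels = c("High school grad", "College grad", "Postgrad"))
as.character(education.labelled)
```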

1.4 Variables and Data Types

Variables are fundamental in programming, serving as containers for storing and manipulating data. R supports various data types, including numeric, character, and logical types.

We can give names to data objects; these named objects are variables.

Some variables are built in, for example:

pi
## [1] 3.142

Variables can be arguments to functions or operators, just like constants:

pi*10
## [1] 31.42
cos(pi)
## [1] -1

1.4.1 Numeric

Numeric variables store numeric values like integers and decimals.

Examples:

# Whole number (R stores this as a double; write 25L for a true integer)
age <- 25
age
## [1] 25
# Decimal
temperature <- 98.6
temperature
## [1] 98.6

1.4.2 Character:

Character variables store text data enclosed in quotes.

Examples:

# Single character
gender <- 'M'
gender
## [1] "M"
# Text string
name <- 'Alice'
name 
## [1] "Alice"
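A few basic operations on character variables, as a sketch. The greeting is an invented example.

```r
name <- 'Alice'

# Single and double quotes create the same kind of value
same.quotes <- ('Alice' == "Alice")   # TRUE

# Number of characters in a string
name.length <- nchar(name)            # 5

# Join strings; paste() separates the pieces with a space by default
greeting <- paste("Hello", name)      # "Hello Alice"
greeting
```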

1.4.3 Logical:

Logical variables represent binary values (TRUE or FALSE).

Examples:

# Logical condition
is_student <- TRUE
is_student
## [1] TRUE
# Logical comparison
has_license <- 2 > 1
has_license
## [1] TRUE

1.5 Basic Operators in R

In R, you can use various operators to perform different types of operations, including arithmetic, comparison, and logical operations.

1.5.1 Arithmetic Operators

These operators perform basic mathematical calculations:

# Addition
result_add <- 5 + 3
result_add
## [1] 8
# Subtraction
result_sub <- 10 - 7
result_sub
## [1] 3
# Multiplication
result_mul <- 4 * 6
result_mul
## [1] 24
# Division
result_div <- 12 / 3
result_div
## [1] 4
# Exponentiation
result_exp <- 2^3
result_exp
## [1] 8

1.5.2 Comparison Operators

Comparison operators compare two values and return a logical result:

# Equal to
result_equal <- 5 == 5
result_equal
## [1] TRUE
# Not equal to
result_not_equal <- 10 != 7
result_not_equal
## [1] TRUE
# Greater than
result_greater_than <- 8 > 5
result_greater_than
## [1] TRUE
# Less than or equal to
result_less_equal <- 4 <= 4
result_less_equal
## [1] TRUE

1.5.3 Logical Operators

Logical operators combine logical values:

# Logical AND
result_and <- TRUE & FALSE
result_and
## [1] FALSE
# Logical OR
result_or <- TRUE | FALSE
result_or
## [1] TRUE
# Logical NOT
result_not <- !TRUE
result_not
## [1] FALSE

1.6 Assignment operator

Most variables are created with the assignment operator, <- or =

time.factor <- 12
time.factor
## [1] 12
time.in.years = 2.5
time.in.years * time.factor
## [1] 30

The assignment operator also changes values:

time.in.months <- time.in.years * time.factor
time.in.months
## [1] 30
time.in.months <- 45
time.in.months
## [1] 45
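One point worth spelling out: assignment stores the value computed at that moment, not a formula. The sketch below reuses the variable names from above with a changed value to show that time.in.months does not update on its own.

```r
time.factor <- 12
time.in.years <- 2.5
time.in.months <- time.in.years * time.factor
time.in.months                 # 30

# Changing an input does not change values computed earlier
time.in.years <- 4
time.in.months                 # still 30
months.before <- time.in.months

# Re-run the assignment to pick up the new value
time.in.months <- time.in.years * time.factor
time.in.months                 # 48
```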

1.7 Pipe operator

The %>% operator in R is part of the magrittr package and is commonly referred to as the “pipe” operator. It is used to chain together multiple operations or functions in a way that enhances code readability and conciseness, particularly when working with data manipulation and transformation tasks.

Here’s how the %>% operator works:

  • It takes the result of the expression on its left-hand side and passes it as the first argument to the function on its right-hand side.

  • It can be used to chain together a series of operations, allowing you to perform a sequence of actions on a data frame or other objects.

  • It eliminates the need for nested function calls, making code more linear and easier to understand.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Chain together operations on a data frame
result <- data.frame(x = 1:10, y = 11:20) %>%
  filter(x > 5) %>%
  mutate(z = x + y) %>%
  select(x, z)
# The result will contain the filtered and mutated data frame
result
##    x  z
## 1  6 22
## 2  7 24
## 3  8 26
## 4  9 28
## 5 10 30

In this example, the %>% operator is used to filter rows, create a new column, and select specific columns in a data frame, making the code more readable and structured. It is a valuable tool for improving the clarity of data manipulation pipelines in R.
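To see what "eliminating nested calls" means, the sketch below writes the same small computation three ways. So that it runs without any package, it uses base R's native pipe |>, which assumes R 4.1 or later. For simple chains like this one, %>% from magrittr can be written in the same positions once the package is loaded.

```r
values <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Intermediate variables make each step explicit
roots    <- sqrt(values)
average  <- mean(roots)
stepwise <- round(average, 1)

# Piped: each result becomes the first argument of the next call
# (native pipe, assumes R >= 4.1)
piped <- values |> sqrt() |> mean() |> round(1)

piped   # 2.5
```

All three forms compute the same value; the piped version simply reads left to right in the order the operations happen.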

1.8 Creating and indexing vectors

In R, a vector is a fundamental data structure that stores a collection of values of the same data type. You can create vectors using various methods and access their elements through indexing.

1.9 The workspace

What names have you defined values for?

ls()
##  [1] "age"                 "gender"             
##  [3] "githubs"             "has_license"        
##  [5] "is_student"          "name"               
##  [7] "result"              "result_add"         
##  [9] "result_and"          "result_div"         
## [11] "result_equal"        "result_exp"         
## [13] "result_greater_than" "result_less_equal"  
## [15] "result_mul"          "result_not"         
## [17] "result_not_equal"    "result_or"          
## [19] "result_sub"          "temperature"        
## [21] "time.factor"         "time.in.months"     
## [23] "time.in.years"

Getting rid of variables:

rm("time.in.months")
ls()
##  [1] "age"                 "gender"             
##  [3] "githubs"             "has_license"        
##  [5] "is_student"          "name"               
##  [7] "result"              "result_add"         
##  [9] "result_and"          "result_div"         
## [11] "result_equal"        "result_exp"         
## [13] "result_greater_than" "result_less_equal"  
## [15] "result_mul"          "result_not"         
## [17] "result_not_equal"    "result_or"          
## [19] "result_sub"          "temperature"        
## [21] "time.factor"         "time.in.years"

  • Using names and variables makes code easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read

  • Avoid “magic constants” (hard-coded values); use named variables instead

  • Use descriptive variable names

    • Good: num.students <- 35
    • Bad: ns <- 35
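A small sketch of the difference between hard-coded values and named variables. The weights and scores are illustrative, and all variable names here are hypothetical.

```r
# Harder to read: what do 0.4, 25, 0.6 and 87 stand for?
grade <- 0.4 * 25 + 0.6 * 87

# Easier to read and to change later
midterm.weight <- 0.4
final.weight   <- 0.6
midterm.score  <- 25
final.score    <- 87
course.grade <- midterm.weight * midterm.score + final.weight * final.score
course.grade   # 62.2
```

If the weighting changes, only the two weight variables need editing, and every later calculation that uses them picks up the change when re-run.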

1.10 Vectors

  • Group related data values into one object, a data structure

  • A vector is a sequence of values, all of the same type

  • c() function returns a vector containing all its arguments in order
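Because all values in a vector share one type, c() converts mixed arguments to a common type. A brief sketch with arbitrary values:

```r
# Numbers and a string: everything becomes character
mixed.text <- c(1, "a", TRUE)
mixed.text            # "1"    "a"    "TRUE"
typeof(mixed.text)    # "character"

# Logicals and numbers: TRUE becomes 1, FALSE becomes 0
mixed.num <- c(TRUE, FALSE, 2)
mixed.num             # 1 0 2
typeof(mixed.num)     # "double"

# c() also joins existing vectors end to end
joined <- c(c(1, 2), c(3, 4))
joined                # 1 2 3 4
```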

1.10.1 Creating Vectors

# 1. Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
numeric_vector
## [1] 1 2 3 4 5
# 2. Creating a character vector
character_vector <- c("apple", "banana", "cherry")
character_vector
## [1] "apple"  "banana" "cherry"
# 3. Creating a logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
logical_vector
## [1]  TRUE FALSE  TRUE
# 4. Creating a sequence of numbers
sequence_vector <- 1:10
sequence_vector
##  [1]  1  2  3  4  5  6  7  8  9 10
# 5. Creating a repeated vector
repeat_vector <- rep(0, times = 5)
repeat_vector
## [1] 0 0 0 0 0
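Two related helpers that are often useful when building vectors are seq() and the each argument of rep(). A sketch with arbitrary values:

```r
# Sequences with a step other than 1
quarter.steps <- seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00
countdown     <- seq(10, 2, by = -2)      # 10 8 6 4 2

# A sequence with a fixed number of points
grid <- seq(0, 1, length.out = 11)
length(grid)                              # 11

# Repeat the whole vector, or each element in turn
whole.repeat <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
each.repeat  <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```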

1.10.2 Indexing vectors

# 6. Indexing by position
first_element <- numeric_vector[1]
first_element
## [1] 1
# 7. Indexing multiple elements
subset_vector <- numeric_vector[c(2, 4)]
subset_vector
## [1] 2 4
# 8. Indexing using logical condition
filtered_vector <- numeric_vector[numeric_vector > 3]
filtered_vector
## [1] 4 5
# 9. Named vector elements
names(numeric_vector) <- c("one", "two", "three", "four", "five")
numeric_vector
##   one   two three  four  five 
##     1     2     3     4     5
# 10. Accessing by name
element_by_name <- numeric_vector["two"]
element_by_name
## two 
##   2

1.10.3 Vector arithmetic

Operators apply to vectors “pairwise” or “elementwise”:

students <- c("SaiKumar", "Aditi", "Akshay", "Arun", "Deepika")
final <- c(87, 45, 98, 80, 75) # Final exam scores
midterm <- c(25, 28, 26, 28, 25) # Midterm exam scores
midterm + final # Sum of midterm and final scores
## [1] 112  73 124 108 100
(midterm + final)/2 # Average exam score
## [1] 56.0 36.5 62.0 54.0 50.0
course.grades <- 0.4*midterm + 0.6*final # Final course grade
course.grades
## [1] 62.2 38.2 69.2 59.2 55.0
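Expressions such as 0.4*midterm work because R recycles the shorter operand to match the longer one. A sketch of that rule, reusing the midterm scores from above; the 5-point bonus is invented for illustration.

```r
midterm <- c(25, 28, 26, 28, 25)

# A single number is recycled across every element
with.bonus <- midterm + 5      # hypothetical 5-point bonus
with.bonus                     # 30 33 31 33 30

# A shorter vector is recycled too, when its length divides evenly
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled                       # 11 22 13 24
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which usually signals a mistake in the data.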

1.10.4 Pairwise comparisons

Is the final score higher than the midterm score?

midterm 
## [1] 25 28 26 28 25
final
## [1] 87 45 98 80 75
final > midterm
## [1] TRUE TRUE TRUE TRUE TRUE

Boolean operators can be applied elementwise:

(final < midterm) & (midterm > 80)
## [1] FALSE FALSE FALSE FALSE FALSE
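Since TRUE counts as 1 and FALSE as 0 in arithmetic, a Boolean vector can be summarised directly. A sketch using the exam scores from above; the cutoff of 80 is arbitrary.

```r
final   <- c(87, 45, 98, 80, 75)
midterm <- c(25, 28, 26, 28, 25)

high.final <- final > 80
high.final                # TRUE FALSE TRUE FALSE FALSE

n.high    <- sum(high.final)     # how many finals are above 80: 2
frac.high <- mean(high.final)    # what fraction: 0.4

any.dropped  <- any(final < midterm)   # did anyone score lower on the final? FALSE
all.improved <- all(final > midterm)   # did everyone score higher? TRUE
```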

1.10.5 Functions on vectors

Command                              Description
sum(vec)                             sums up all the elements of vec
mean(vec)                            mean of vec
median(vec)                          median of vec
min(vec), max(vec)                   the smallest or largest element of vec
sd(vec), var(vec)                    the standard deviation and variance of vec
length(vec)                          the number of elements in vec
pmax(vec1, vec2), pmin(vec1, vec2)   elementwise maximum or minimum (ex: pmax(quiz1, quiz2) returns the higher of quiz 1 and quiz 2 for each student)
sort(vec)                            returns vec in sorted order
order(vec)                           returns the indices that sort vec
unique(vec)                          lists the unique elements of vec
summary(vec)                         gives a five-number summary, plus the mean, for numeric vec
any(vec), all(vec)                   useful on Boolean vectors: is any element TRUE? are all elements TRUE?

1.10.6 Functions on vectors: examples

course.grades
## [1] 62.2 38.2 69.2 59.2 55.0
mean(course.grades) # mean grade
## [1] 56.76
median(course.grades)
## [1] 59.2
sd(course.grades) # grade standard deviation
## [1] 11.6
sort(course.grades)
## [1] 38.2 55.0 59.2 62.2 69.2
max(course.grades) # highest course grade
## [1] 69.2
min(course.grades) # lowest course grade
## [1] 38.2
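Several functions from the table are not exercised above. The sketch below tries order(), pmax(), pmin() and unique(). The grades are retyped from the output above, and the quiz scores are invented for illustration.

```r
students <- c("SaiKumar", "Aditi", "Akshay", "Arun", "Deepika")
course.grades <- c(62.2, 38.2, 69.2, 59.2, 55.0)

# order() gives positions, which can then index another vector
grade.order <- order(course.grades)
grade.order                                   # 2 5 4 1 3
ranking <- students[order(course.grades, decreasing = TRUE)]
ranking                                       # best grade first

# Elementwise maximum and minimum (hypothetical quiz scores)
quiz1 <- c(7, 9, 5)
quiz2 <- c(8, 6, 5)
best.quiz  <- pmax(quiz1, quiz2)              # 8 9 5
worst.quiz <- pmin(quiz1, quiz2)              # 7 6 5

# Distinct values, in order of first appearance
distinct <- unique(c(3, 1, 3, 2, 1))          # 3 1 2
```

Unlike sort(), which returns the sorted values themselves, order() returns positions, so it can be used to rearrange a different vector (here the student names) in the same order.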

1.10.7 Referencing elements of vectors

students 
## [1] "SaiKumar" "Aditi"    "Akshay"   "Arun"    
## [5] "Deepika"

Vector of indices:

students[c(2,4)]
## [1] "Aditi" "Arun"

Vector of negative indices: excludes the elements at the specified indices

students[c(-1,-3)]
## [1] "Aditi"   "Arun"    "Deepika"

which() returns the indices of the TRUE elements of a Boolean vector:

course.grades
## [1] 62.2 38.2 69.2 59.2 55.0
a.threshold <- 90 # A grade = 90% or higher
course.grades >= a.threshold # vector of booleans
## [1] FALSE FALSE FALSE FALSE FALSE
a.students <- which(course.grades >= a.threshold) # Applying which() 
a.students
## integer(0)
students[a.students] # Names of A students
## character(0)
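No grade reaches 90 here, so the results above are empty. To see which() return actual indices, the sketch below repeats the steps with a lower cutoff. The name pass.threshold and the value 60 are assumptions for illustration only.

```r
students <- c("SaiKumar", "Aditi", "Akshay", "Arun", "Deepika")
course.grades <- c(62.2, 38.2, 69.2, 59.2, 55.0)

pass.threshold <- 60   # hypothetical cutoff
passed <- which(course.grades >= pass.threshold)
passed                               # 1 3
passed.names <- students[passed]
passed.names                         # "SaiKumar" "Akshay"

# A Boolean vector can also be used as the index directly
same.names <- students[course.grades >= pass.threshold]

# which.max() gives the position of the largest element
top.student <- students[which.max(course.grades)]
top.student                          # "Akshay"
```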

1.10.8 Named components

You can give names to elements or components of vectors

students
## [1] "SaiKumar" "Aditi"    "Akshay"   "Arun"    
## [5] "Deepika"
names(course.grades) <- students # Assign names to the grades
names(course.grades)
## [1] "SaiKumar" "Aditi"    "Akshay"   "Arun"    
## [5] "Deepika"
course.grades[c("Aditi", "Akshay", "Arun", "Frank")] # Grades for 3 students; "Frank" is not a name in the vector, so R returns NA
##  Aditi Akshay   Arun   <NA> 
##   38.2   69.2   59.2     NA

Note the labels in what R prints; these are not actually part of the value
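A short sketch of what "not part of the value" means in practice: the names travel with the elements through indexing and elementwise arithmetic, but they can be dropped without changing the numbers. Only two of the grades are retyped here to keep the example small.

```r
course.grades <- c(SaiKumar = 62.2, Aditi = 38.2)

# The label is kept when indexing by name and doing arithmetic
curved <- course.grades["Aditi"] + 5
curved                       # printed with the label Aditi

# Drop the label to get the bare number
bare <- unname(course.grades["Aditi"])
bare                         # 38.2

# Double brackets extract a single element without its name
also.bare <- course.grades[["Aditi"]]

# Functions that summarise the whole vector return an unnamed result
total <- sum(course.grades)
```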