Session 4: Data Frames and Data Manipulation

Data frames are a fundamental data structure in R, serving as a cornerstone for data manipulation and analysis. They are akin to tables in a relational database or spreadsheets, providing an organized way to store and work with structured data.

In a data frame, data is organized into rows and columns, where each column can contain different types of data, such as numeric, character, or factors. This flexibility makes data frames a versatile choice for handling real-world datasets, as they can accommodate diverse data types and structures.

R offers a rich ecosystem of packages and functions for data manipulation with data frames. Popular packages like dplyr and tidyr enable users to filter, transform, summarize, and reshape data with ease. With these tools, you can clean messy datasets, aggregate information, and prepare data for analysis or visualization.
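
To give a flavor of what this looks like (a minimal sketch, assuming the dplyr package is installed), a typical pipeline filters, transforms, and summarizes in a few chained steps:

```r
library(dplyr)

# Filter, transform, and summarize the built-in mtcars data frame
avg <- mtcars %>%
  filter(cyl == 4) %>%              # keep only 4-cylinder cars
  mutate(kpl = mpg * 0.4251) %>%    # add a kilometers-per-liter column
  summarize(avg_kpl = mean(kpl))    # average fuel efficiency
avg
```

We'll unpack each of these verbs over the coming sessions.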

Data frames are central to tasks like exploratory data analysis, data cleaning, and model building. They allow data scientists and analysts to efficiently explore data, identify patterns, and make data-driven decisions.

Proficiency in working with data frames is a critical skill for anyone involved in data science: they are the structure through which raw data becomes meaningful insight, and they lay the foundation for effective analysis and modeling.

4.1 Intro to Data Frames

In the world of data analysis and manipulation, data frames stand as a fundamental pillar. But what exactly are data frames, and why are they so crucial?

Data Frames Defined:

A data frame is a structured data object that resembles a table or spreadsheet. It consists of rows and columns, where each column can contain different types of data—numeric, character, or factors. Think of data frames as a versatile and organized way to store and work with data.

The Significance of Data Frames:

Data frames are the backbone of data analysis in R for several reasons:

Structured Storage: Data frames provide an organized structure for storing data. With rows representing observations and columns representing variables or attributes, they make it easier to maintain the integrity of your data.

Data Variety: In real-world scenarios, data comes in diverse forms—numbers, text, categories, dates, and more. Data frames are capable of accommodating this variety, allowing you to work with mixed data types seamlessly.

Data Manipulation: Data frames are the canvas on which you can perform a wide range of data manipulation operations. Whether you need to filter, sort, aggregate, or transform your data, data frames offer the necessary tools.

Data Exploration: When you embark on an exploratory data analysis journey, data frames serve as your trusty companion. You can use them to examine the structure of your data, identify trends, and uncover insights.

Data Visualization: Many data visualization libraries, like ggplot2, are designed to work effortlessly with data frames. This integration simplifies the process of creating informative charts and plots.

Over the coming sessions, we'll see how data frames support each of these tasks.

4.2 Creating and Working with Data Frames

Data frames are versatile containers for your data, and you can easily create them using built-in datasets in R. Let’s see how:

# Create a data frame from the built-in 'mtcars' dataset
df <- mtcars

# View the first few rows of the data frame
head(df)
##                    mpg cyl disp  hp drat    wt  qsec
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
## Valiant           18.1   6  225 105 2.76 3.460 20.22
##                   vs am gear carb
## Mazda RX4          0  1    4    4
## Mazda RX4 Wag      0  1    4    4
## Datsun 710         1  1    4    1
## Hornet 4 Drive     1  0    3    1
## Hornet Sportabout  0  0    3    2
## Valiant            1  0    3    1

This code snippet uses the ‘mtcars’ dataset to create a data frame named ‘df.’ You can substitute ‘mtcars’ with other datasets available in R.

Accessing and Manipulating Data within Data Frames

Once you have a data frame, you’ll want to access and manipulate the data within it. Here are some common operations:

Accessing columns:

# Access the 'mpg' column
df$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
## [21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Filtering rows:

# Filter rows where 'mpg' is greater than 20
filtered_df <- df[df$mpg > 20, ]

Creating new columns:

# Create a new column 'kpl' by converting 'mpg' to kilometers per liter
df$kpl <- df$mpg * 0.4251

Summary statistics:

# Calculate summary statistics for the 'mpg' column
summary(df$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.4    15.4    19.2    20.1    22.8    33.9

For the exercise that follows, load the tidyverse and pull the Cars93 dataset from the MASS package:

library(tidyverse)
Cars93 <- MASS::Cars93

4.2.1 The table() Function

(a) Use the table() function to tabulate the data by DriveTrain and Origin.

table(Cars93$DriveTrain, Cars93$Origin)
##        
##         USA non-USA
##   4WD     5       5
##   Front  34      33
##   Rear    9       7

(b) Repeat part (a), this time using the count() function.

Cars93 %>% 
  count(DriveTrain, Origin)
##   DriveTrain  Origin  n
## 1        4WD     USA  5
## 2        4WD non-USA  5
## 3      Front     USA 34
## 4      Front non-USA 33
## 5       Rear     USA  9
## 6       Rear non-USA  7

(c) Does it look like foreign car manufacturers had different DriveTrain production preferences compared to US manufacturers?
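
One way to assess this (a sketch using base R's prop.table()) is to compare proportions within each Origin rather than raw counts; margin = 2 normalizes each column to sum to 1:

```r
# Share of each DriveTrain within USA and non-USA manufacturers
Cars93 <- MASS::Cars93  # already loaded above
round(prop.table(table(Cars93$DriveTrain, Cars93$Origin), margin = 2), 2)
# Front-wheel drive dominates for both: about 0.71 (USA) vs 0.73 (non-USA)
```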

The counts for each DriveTrain category are nearly the same for US and non-US manufacturers, so the table suggests they had similar DriveTrain production preferences.

Joining Columns and Rows with cbind() and rbind():

Combining Data Horizontally with cbind():

Sometimes, you may need to combine data frames horizontally, i.e., adding columns. The cbind() function comes in handy for this task:

# Create two sample data frames
df1 <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                  Score = c(95, 88, 72))

df2 <- data.frame(ID = c(101, 102, 103),
                  Grade = c("A", "B", "C"))

# Combine the data frames horizontally
combined_df <- cbind(df1, df2)

# View the result
combined_df
##      Name Score  ID Grade
## 1   Alice    95 101     A
## 2     Bob    88 102     B
## 3 Charlie    72 103     C

The cbind() function combines ‘df1’ and ‘df2’ horizontally based on their row order, creating ‘combined_df.’

Combining Data Vertically with rbind():

To stack data frames on top of each other, you can use the rbind() function:

# Create two sample data frames
df1 <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                  Score = c(95, 88, 72))

df2 <- data.frame(Name = c("David", "Eve"),
                  Score = c(78, 89))

# Combine the data frames vertically
stacked_df <- rbind(df1, df2)

# View the result
stacked_df
##      Name Score
## 1   Alice    95
## 2     Bob    88
## 3 Charlie    72
## 4   David    78
## 5     Eve    89

The rbind() function vertically combines ‘df1’ and ‘df2,’ stacking ‘df2’ below ‘df1.’
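
Both cbind() and rbind() match data purely by position, which is fragile if the two tables are sorted differently. When they share a key column, a key-based join with merge() is safer (a small sketch with made-up sample data):

```r
# Two sample tables that share an 'ID' key but differ in row order
roster <- data.frame(ID = c(101, 102, 103),
                     Name = c("Alice", "Bob", "Charlie"))
scores <- data.frame(ID = c(103, 101, 102),
                     Score = c(72, 95, 88))

# merge() aligns rows on the shared 'ID' column, regardless of row order
merged_df <- merge(roster, scores, by = "ID")
merged_df
##    ID    Name Score
## 1 101   Alice    95
## 2 102     Bob    88
## 3 103 Charlie    72
```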

These functions are valuable when you need to merge datasets or add additional information to your existing data frames. In the next section, we’ll explore reading data from external sources.

4.3 Reading Data from CSV, Excel, and Other Formats

Importing Data into R:

Data analysis often involves working with external datasets. R provides various functions to import data from different file formats, such as CSV, Excel, and more. Let’s discuss some of these functions:

read.csv(): Used to read CSV files.

Example:

# Read data from a CSV file
data <- read.csv("data/MobileOS_US.csv")

head(data)
##        MobileOS X2009.01 X2009.02 X2009.03 X2009.04
## 1       Android     1.75     5.47     6.12     5.56
## 2 BlackBerry OS     2.16    12.36     4.74     5.52
## 3          Brew     0.00     0.00     0.00     0.00
## 4           iOS    59.18    60.48    68.06    69.79
## 5            LG     0.00     0.00     0.00     0.00
## 6         Linux     3.33     0.00     0.00     0.00
##   X2009.05 X2009.06 X2009.07 X2009.08 X2009.09
## 1     5.47     0.30     5.19     5.45     5.94
## 2    12.98    20.07    20.13    17.69    16.80
## 3     0.00     0.00     0.00     0.00     0.00
## 4    63.10    59.98    56.64    57.60    55.92
## 5     0.00     0.00     0.14     0.75     0.79
## 6     0.00     0.00     0.00     0.00     0.00
##   X2009.10 X2009.11 X2009.12 X2010.01 X2010.02
## 1     6.83     8.23    10.62    11.90    12.86
## 2    17.73    23.14    19.66    20.13    24.24
## 3     0.00     0.00     0.00     0.00     0.00
## 4    56.24    51.90    53.26    53.58    50.63
## 5     0.66     0.57     0.57     0.52     0.44
## 6     0.00     0.00     0.00     0.00     0.00
##   X2010.03 X2010.04 X2010.05 X2010.06 X2010.07
## 1    14.44    13.37     9.75     9.63    16.85
## 2    28.67    30.46    30.56    31.80    32.77
## 3     0.00     0.00     0.00     0.00     0.00
## 4    45.49    43.33    42.58    40.06    37.99
## 5     0.40     0.26     0.06     0.05     0.18
## 6     0.00     0.00     0.00     0.00     0.00
##   X2010.08 X2010.09 X2010.10 X2010.11 X2010.12
## 1    19.10    23.22    23.34    23.77    25.46
## 2    34.42    29.45    32.02    34.34    32.15
## 3     0.00     0.00     0.00     0.00     0.00
## 4    36.55    37.32    35.16    32.96    33.55
## 5     0.18     0.15     0.12     0.10     0.09
## 6     0.00     0.00     0.00     0.00     0.00
##   X2011.01 X2011.02 X2011.03 X2011.04 X2011.05
## 1    26.68    29.31    31.83    30.81    32.23
## 2    27.78    23.39    20.66    23.31    23.45
## 3     0.00     0.00     0.00     0.00     0.00
## 4    35.04    36.74    39.01    36.80    34.09
## 5     0.08     0.07     0.06     0.05     0.05
## 6     0.00     0.00     0.00     0.00     0.00
##   X2011.06 X2011.07 X2011.08 X2011.09 X2011.10
## 1    34.79    35.80    39.37    39.52    40.03
## 2    16.60    15.87    13.83    12.20    10.10
## 3     0.00     0.00     0.00     0.00     0.00
## 4    36.05    38.65    38.10    39.12    40.40
## 5     0.05     0.04     0.02     0.02     0.02
## 6     0.00     0.00     0.00     0.00     0.03
##   X2011.11 X2011.12 X2012.01 X2012.02 X2012.03
## 1    39.67    40.92    41.77    41.66    40.83
## 2     7.60     6.52     5.83     5.23     4.93
## 3     0.00     0.00     0.00     0.00     0.01
## 4    42.79    42.52    45.37    47.42    48.35
## 5     0.02     0.01     0.01     0.01     0.01
## 6     0.05     0.05     0.05     0.05     0.06
##   X2012.04 X2012.05 X2012.06 X2012.07 X2012.08
## 1    40.69    40.44    39.77    40.31    40.85
## 2     4.25     4.02     3.64     3.24     2.84
## 3     0.08     0.09     0.11     0.11     0.11
## 4    48.85    48.82    50.21    50.43    50.40
## 5     0.01     0.01     0.01     0.01     0.01
## 6     0.06     0.06     0.05     0.06     0.06
##   X2012.09 X2012.10 X2012.11 X2012.12 X2013.01
## 1    41.30    40.30    39.83    40.22    39.40
## 2     2.64     2.39     2.23     1.89     1.71
## 3     0.12     0.12     0.12     0.11     0.10
## 4    50.46    50.75    50.63    50.49    52.69
## 5     0.00     0.01     0.01     0.01     0.01
## 6     0.06     0.05     0.05     0.06     0.05
##   X2013.02 X2013.03 X2013.04 X2013.05 X2013.06
## 1    38.83    40.14    40.08    39.88    39.36
## 2     1.61     1.49     1.43     1.38     1.38
## 3     0.10     0.10     0.09     0.09     0.09
## 4    53.48    52.50    53.51    54.24    54.85
## 5     0.01     0.01     0.01     0.01     0.01
## 6     0.05     0.06     0.06     0.06     0.06
##   X2013.07 X2013.08 X2013.09 X2013.10 X2013.11
## 1    38.77    39.05    38.46    38.66    39.18
## 2     3.53     5.11     5.71     5.85     5.33
## 3     0.08     0.07     0.07     0.06     0.06
## 4    53.91    52.22    51.95    51.54    51.63
## 5     0.10     0.09     0.09     0.09     0.10
## 6     0.11     0.13     0.16     0.17     0.12
##   X2013.12 X2014.01 X2014.02 X2014.03 X2014.04
## 1    39.25    38.09    40.37    40.72    41.38
## 2     5.01     3.79     2.96     1.88     1.62
## 3     0.05     0.04     0.04     0.04     0.04
## 4    51.75    54.49    52.87    53.02    52.50
## 5     0.09     0.08     0.08     0.08     0.08
## 6     0.10     0.08     0.07     0.09     0.09
##   X2014.05 X2014.06 X2014.07 X2014.08 X2014.09
## 1    42.72    42.93    42.83    43.43    42.33
## 2     1.54     1.53     1.48     1.23     1.12
## 3     0.03     0.03     0.03     0.03     0.03
## 4    51.17    51.36    51.51    51.40    53.31
## 5     0.07     0.06     0.07     0.06     0.05
## 6     0.09     0.09     0.09     0.10     0.06
##   X2014.10 X2014.11 X2014.12 X2015.01 X2015.02
## 1    41.36    45.66    47.32    45.40    44.73
## 2     0.92     1.03     0.88     0.80     0.73
## 3     0.02     0.03     0.02     0.02     0.02
## 4    55.22    51.00    49.54    51.38    52.18
## 5     0.04     0.04     0.03     0.03     0.03
## 6     0.04     0.04     0.03     0.02     0.02
##   X2015.03 X2015.04 X2015.05 X2015.06 X2015.07
## 1    44.48    47.51    48.52    48.12    46.79
## 2     0.71     0.57     0.46     0.32     0.37
## 3     0.02     0.02     0.02     0.02     0.02
## 4    52.59    49.50    48.64    49.38    50.68
## 5     0.03     0.03     0.03     0.03     0.03
## 6     0.02     0.02     0.04     0.02     0.03
##   X2015.08 X2015.09 X2015.10 X2015.11 X2015.12
## 1    46.37    47.15    47.25    46.03    45.65
## 2     0.34     0.30     0.33     0.28     0.37
## 3     0.02     0.02     0.02     0.02     0.02
## 4    51.17    50.37    50.09    51.64    51.86
## 5     0.03     0.04     0.04     0.03     0.03
## 6     0.04     0.05     0.05     0.03     0.03
##   X2016.01 X2016.02 X2016.03 X2016.04 X2016.05
## 1    45.05    44.68    46.07    45.87    45.53
## 2     0.24     0.24     0.20     0.22     0.24
## 3     0.02     0.02     0.02     0.02     0.02
## 4    52.90    53.37    52.16    52.42    52.78
## 5     0.03     0.03     0.02     0.02     0.02
## 6     0.02     0.02     0.02     0.02     0.02
##   X2016.06 X2016.07 X2016.08 X2016.09 X2016.10
## 1    44.76    46.19    44.13    46.30    43.75
## 2     0.19     0.17     0.17     0.16     0.15
## 3     0.02     0.02     0.02     0.02     0.01
## 4    53.67    52.41    54.47    51.72    54.77
## 5     0.01     0.01     0.01     0.02     0.01
## 6     0.02     0.02     0.02     0.03     0.01
##   X2016.11 X2016.12 X2017.01 X2017.02 X2017.03
## 1    44.91    45.14    44.97    44.90    45.43
## 2     0.13     0.13     0.13     0.12     0.11
## 3     0.01     0.01     0.01     0.01     0.01
## 4    53.83    53.70    53.89    54.01    53.57
## 5     0.01     0.01     0.01     0.01     0.01
## 6     0.01     0.01     0.01     0.01     0.01
##   X2017.04 X2017.05 X2017.06 X2017.07 X2017.08
## 1    45.07    45.40    46.04    46.91    45.09
## 2     0.11     0.10     0.08     0.08     0.08
## 3     0.01     0.01     0.01     0.01     0.01
## 4    53.88    53.49    53.07    52.17    54.12
## 5     0.01     0.01     0.01     0.01     0.01
## 6     0.01     0.00     0.00     0.00     0.01
##   X2017.09 X2017.10 X2017.11 X2017.12 X2018.01
## 1    45.33    44.37    44.01    45.04    45.99
## 2     0.09     0.07     0.06     0.05     0.04
## 3     0.01     0.01     0.01     0.01     0.01
## 4    53.85    54.96    55.42    54.42    53.44
## 5     0.01     0.01     0.01     0.01     0.01
## 6     0.01     0.01     0.01     0.01     0.01
##   X2018.02 X2018.03 X2018.04 X2018.05 X2018.06
## 1    45.48    43.24    45.79    45.09    45.94
## 2     0.03     0.03     0.03     0.03     0.03
## 3     0.01     0.01     0.01     0.01     0.01
## 4    54.02    56.26    53.60    54.40    53.60
## 5     0.01     0.01     0.01     0.02     0.02
## 6     0.00     0.01     0.06     0.01     0.01
##   X2018.07 X2018.08 X2018.09 X2018.10 X2018.11
## 1    47.39    46.47    45.96    43.57    37.57
## 2     0.03     0.04     0.04     0.04     0.04
## 3     0.01     0.01     0.01     0.01     0.01
## 4    52.19    53.13    53.70    56.08    62.01
## 5     0.01     0.01     0.01     0.01     0.01
## 6     0.01     0.01     0.01     0.01     0.01
##   X2018.12 X2019.01 X2019.02 X2019.03 X2019.04
## 1    42.75    42.43    42.29    44.89    45.07
## 2     0.04     0.05     0.06     0.06     0.06
## 3     0.01     0.01     0.01     0.01     0.01
## 4    56.91    57.22    57.33    54.74    54.61
## 5     0.00     0.00     0.00     0.00     0.00
## 6     0.01     0.01     0.01     0.01     0.01
##   X2019.05
## 1    46.13
## 2     0.03
## 3     0.01
## 4    53.62
## 5     0.00
## 6     0.01

read.table(): A general function for reading tabular data, including CSV files. The sep argument must match the file's actual delimiter; note in the output below that the header has collapsed into a single dot-filled column name, which is the telltale sign of a separator mismatch. Example:

# Read data from a tab-delimited file
data <- read.table("data/tab_Sep_file", header = TRUE, sep = "\t")
head(data)
##   Date.......Revenue......Expenses......Profit....
## 1      2023-01-01  50000       30000         20000
## 2      2023-01-02  48000       31000         17000
## 3      2023-01-03  52000       29000         23000
## 4      2023-01-04  55000       32000         23000
## 5      2023-01-05  49000       28000         21000
## 6      2023-01-06  51000       33000         18000

read_excel() (from the readxl package): Used for reading Excel files. Example:

# Read data from an Excel file
library(readxl)
data <- read_excel("data/MobileOS_US.xlsx")
head(data)
## # A tibble: 6 × 126
##   MobileOS      `2009-01` `2009-02` `2009-03` `2009-04`
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 Android            1.75      5.47      6.12      5.56
## 2 BlackBerry OS      2.16     12.4       4.74      5.52
## 3 Brew               0         0         0         0   
## 4 iOS               59.2      60.5      68.1      69.8 
## 5 LG                 0         0         0         0   
## 6 Linux              3.33      0         0         0   
## # ℹ 121 more variables: `2009-05` <dbl>,
## #   `2009-06` <dbl>, `2009-07` <dbl>, `2009-08` <dbl>,
## #   `2009-09` <dbl>, `2009-10` <dbl>, `2009-11` <dbl>,
## #   `2009-12` <dbl>, `2010-01` <dbl>, `2010-02` <dbl>,
## #   `2010-03` <dbl>, `2010-04` <dbl>, `2010-05` <dbl>,
## #   `2010-06` <dbl>, `2010-07` <dbl>, `2010-08` <dbl>,
## #   `2010-09` <dbl>, `2010-10` <dbl>, …

Hands-on exercise:

Let's put this into practice by reading a CSV file with read.csv(). Here we load the 'MobileOS_US.csv' file again:

# Read data from a CSV file
data <- read.csv("data/MobileOS_US.csv")

# View the first few rows
head(data)
##        MobileOS X2009.01 X2009.02 X2009.03 X2009.04
## 1       Android     1.75     5.47     6.12     5.56
## 2 BlackBerry OS     2.16    12.36     4.74     5.52
## 3          Brew     0.00     0.00     0.00     0.00
## 4           iOS    59.18    60.48    68.06    69.79
## 5            LG     0.00     0.00     0.00     0.00
## 6         Linux     3.33     0.00     0.00     0.00
## ... (remaining columns omitted; the output is identical to the read.csv() example shown earlier)

This code reads the CSV file and loads it into the ‘data’ data frame. You can then explore and analyze the data within R.

Importing data is a crucial step in any data analysis project, as it allows you to work with real-world datasets. In the next section, we’ll cover handling missing data effectively.

4.4 Dealing with Missing Data

4.4.1 Detecting Missing Values

Missing data is a common challenge in data analysis. In R, you can identify missing values using the is.na() function or the complete.cases() function.

4.4.1.1 Example 1: Using is.na()

# Create a vector with missing values
data <- c(10, 15, NA, 25, 30)

# Check for missing values
missing_values <- is.na(data)
missing_values
## [1] FALSE FALSE  TRUE FALSE FALSE

Example 2: Using complete.cases()

# Create a data frame with missing values
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(25, 30, NA, 28, 35)
)
head(df)
##      Name Age
## 1   Alice  25
## 2     Bob  30
## 3 Charlie  NA
## 4    <NA>  28
## 5     Eve  35
# Check for complete cases (rows without missing values)
complete_rows <- complete.cases(df)
complete_rows
## [1]  TRUE  TRUE FALSE FALSE  TRUE
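
To see how much is missing in each column at once, you can combine is.na() with colSums():

```r
# The same data frame as above, with one NA in each column
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(25, 30, NA, 28, 35)
)

# Count missing values per column
colSums(is.na(df))
## Name  Age 
##    1    1
```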

4.4.2 Handling Missing Data

Once you’ve identified missing values, you can choose how to handle them. Common strategies include removing rows with missing data or imputing missing values with a specific value or statistic.

Example 1: Removing Rows with Missing Data

# Remove rows with missing values
df_clean <- df[complete.cases(df), ]
df_clean
##    Name Age
## 1 Alice  25
## 2   Bob  30
## 5   Eve  35

Example 2: Imputing Missing Values

# Impute missing values with the mean
mean_age <- mean(df$Age, na.rm = TRUE)
df$Age[is.na(df$Age)] <- mean_age
df
##      Name  Age
## 1   Alice 25.0
## 2     Bob 30.0
## 3 Charlie 29.5
## 4    <NA> 28.0
## 5     Eve 35.0
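
Mean imputation is sensitive to outliers; a common alternative, sketched here on a fresh copy of the data, is to impute with the median instead:

```r
# Recreate the data frame with missing values
df2 <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(25, 30, NA, 28, 35)
)

# Impute missing Age values with the median, which resists outliers
median_age <- median(df2$Age, na.rm = TRUE)
df2$Age[is.na(df2$Age)] <- median_age
df2$Age
## [1] 25 30 29 28 35
```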

4.5 Checking Data Types and Structure in R

4.5.1 Understanding Data Types

In R, it’s essential to understand the data types of variables in your data frames. Common data types include numeric, character, factor, logical, and more. The str() function allows you to inspect the structure of your data frame and view data types.

4.5.2 Example 1: Using str()

# Create a sample data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25.0, 30.6, 28.8),
  Married = c(TRUE, FALSE, TRUE)
)

# View the structure of the data frame
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ Name   : chr  "Alice" "Bob" "Charlie"
##  $ Age    : num  25 30.6 28.8
##  $ Married: logi  TRUE FALSE TRUE

The output of str(df) will display the data types and structure of the data frame.

Checking and Modifying Data Types

Sometimes, you may need to change the data type of a variable. R provides functions like as.numeric(), as.character(), and others for this purpose.

Example 2: Modifying Data Types

# Convert Age column to numeric
str(df$Age)
##  num [1:3] 25 30.6 28.8
df$Age <- as.integer(df$Age)
# View the updated structure
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ Name   : chr  "Alice" "Bob" "Charlie"
##  $ Age    : int  25 30 28
##  $ Married: logi  TRUE FALSE TRUE

The as.integer() function changes the Age column's type to integer. Note that it truncates toward zero rather than rounding: 30.6 becomes 30 and 28.8 becomes 28. Apply round() first if you want conventional rounding.

Example 3: Changing Character to Factor

# Create a character vector
colors <- c("Red", "Green", "Blue", "Red", "Green")

# Convert it to a factor
factor_colors <- as.factor(colors)
str(factor_colors)
##  Factor w/ 3 levels "Blue","Green",..: 3 2 1 3 2

In this example, a character vector is converted to a factor using as.factor().

Example 4: Converting Factor to Character

# Convert factor_colors back to character
char_colors <- as.character(factor_colors)
str(char_colors)
##  chr [1:5] "Red" "Green" "Blue" "Red" "Green"

Here, the as.character() function is used to convert a factor back to a character vector.

Example 5: Logical Data Type

# Create a logical vector
is_student <- c(TRUE, FALSE, TRUE)

# Check the data type and structure
str(is_student)
##  logi [1:3] TRUE FALSE TRUE

This example demonstrates the logical data type.

Example 6: Date Data Type

# Create a date vector
dates <- as.Date(c("2023-09-01", "2023-09-15"))

# Check the data type and structure
str(dates)
##  Date[1:2], format: "2023-09-01" "2023-09-15"

Here, we use as.Date() to create a date vector.

Example 7: Complex Data Type

# Create a complex vector
complex_numbers <- complex(real = c(1, 2, 3), imaginary = c(4, 5, 6))

# Check the data type and structure
str(complex_numbers)
##  cplx [1:3] 1+4i 2+5i 3+6i

This example introduces the complex data type for handling complex numbers in R.

4.6 Data Cleaning Basics

Data cleaning is a critical step that ensures your dataset is accurate and ready for analysis. Common cleaning tasks in R, such as fixing inconsistent text, removing duplicate rows, and checking grouped summaries, take only a few lines of code.
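
As a minimal sketch (using a small made-up data frame), here are three everyday cleaning steps: trimming stray whitespace, standardizing text case, and dropping duplicate rows:

```r
# A small, deliberately messy data frame
messy <- data.frame(
  Name = c("  alice", "Bob ", "Bob ", "CHARLIE"),
  Age = c(25, 30, 30, 28)
)

messy$Name <- trimws(messy$Name)                       # strip leading/trailing spaces
messy$Name <- tools::toTitleCase(tolower(messy$Name))  # standardize capitalization
clean <- unique(messy)                                 # drop exact duplicate rows
clean
##      Name Age
## 1   Alice  25
## 2     Bob  30
## 4 Charlie  28
```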

Hands-on:

library(tidyverse)
Cars93 <- MASS::Cars93

(a) Use group_by() and summarize() commands on the Cars93 data set to create a table showing the average Turn.circle of cars, broken down by vehicle Type and DriveTrain

Cars93 %>%
  group_by(Type, DriveTrain) %>%
  summarize(mean(Turn.circle))
## `summarise()` has grouped output by 'Type'. You can
## override using the `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   Type [6]
##    Type    DriveTrain `mean(Turn.circle)`
##    <fct>   <fct>                    <dbl>
##  1 Compact 4WD                       37  
##  2 Compact Front                     38.8
##  3 Compact Rear                      35.5
##  4 Large   Front                     42  
##  5 Large   Rear                      43.8
##  6 Midsize Front                     40.5
##  7 Midsize Rear                      39  
##  8 Small   4WD                       33.5
##  9 Small   Front                     35.3
## 10 Sporty  4WD                       39.5
## 11 Sporty  Front                     37  
## 12 Sporty  Rear                      41.2
## 13 Van     4WD                       41.8
## 14 Van     Front                     41.8

(b) Are all combinations of Type and DriveTrain shown in the table? If not, which ones are missing? Why are they missing?

Some are missing; for example, there is no row for Large 4WD cars, because no vehicles in the dataset fall into that category.

sum(Cars93$Type == "Large" & Cars93$DriveTrain == "4WD")
## [1] 0

(c) Add the argument .drop = FALSE to your group_by command, and then re-run your code. What happens now?

Cars93 %>%
  group_by(Type, DriveTrain, .drop = FALSE) %>%
  summarize(mean(Turn.circle))
## `summarise()` has grouped output by 'Type'. You can
## override using the `.groups` argument.
## # A tibble: 18 × 3
## # Groups:   Type [6]
##    Type    DriveTrain `mean(Turn.circle)`
##    <fct>   <fct>                    <dbl>
##  1 Compact 4WD                       37  
##  2 Compact Front                     38.8
##  3 Compact Rear                      35.5
##  4 Large   4WD                      NaN  
##  5 Large   Front                     42  
##  6 Large   Rear                      43.8
##  7 Midsize 4WD                      NaN  
##  8 Midsize Front                     40.5
##  9 Midsize Rear                      39  
## 10 Small   4WD                       33.5
## 11 Small   Front                     35.3
## 12 Small   Rear                     NaN  
## 13 Sporty  4WD                       39.5
## 14 Sporty  Front                     37  
## 15 Sporty  Rear                      41.2
## 16 Van     4WD                       41.8
## 17 Van     Front                     41.8
## 18 Van     Rear                     NaN

The .drop argument, which is TRUE by default, controls whether combinations of grouping variables that never appear together are dropped. With .drop = FALSE, the zero-count combinations still appear in the table, but their summary cells show NaN ("not a number"), since the mean of an empty set is undefined.

(d) Having a car with a small turn radius makes city driving much easier. What Type of car should city drivers opt for?

Answer:

Small cars appear to have smaller turn radii.
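
We can check this by averaging Turn.circle by Type alone:

```r
library(dplyr)           # loaded earlier via the tidyverse
Cars93 <- MASS::Cars93

Cars93 %>%
  group_by(Type) %>%
  summarize(avg_turn = mean(Turn.circle)) %>%
  arrange(avg_turn)      # Small cars have the smallest average turn circle
```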

(e) Does the vehicle’s DriveTrain appear to have an impact on turn radius?
Answer:

No: within each vehicle Type, no DriveTrain is consistently associated with a larger or smaller turn circle.

Understanding data types and ensuring they match the nature of your data is crucial for accurate analysis and visualization in R.