Session-4 Data Frames and Data Manipulation
Data frames are a fundamental data structure in R, serving as a cornerstone for data manipulation and analysis. They are akin to tables in a relational database or spreadsheets, providing an organized way to store and work with structured data.
In a data frame, data is organized into rows and columns, where each column can contain different types of data, such as numeric, character, or factors. This flexibility makes data frames a versatile choice for handling real-world datasets, as they can accommodate diverse data types and structures.
R offers a rich ecosystem of packages and functions for data manipulation with data frames. Popular packages like dplyr and tidyr enable users to filter, transform, summarize, and reshape data with ease. With these tools, you can clean messy datasets, aggregate information, and prepare data for analysis or visualization.
Data frames are central to tasks like exploratory data analysis, data cleaning, and model building. They allow data scientists and analysts to efficiently explore data, identify patterns, and make data-driven decisions.
Data frames are the backbone of data manipulation in R, enabling users to transform raw data into meaningful insights. Proficiency in working with data frames is a critical skill for anyone involved in data science, as it lays the foundation for effective data analysis and modeling.
4.1 Intro to Data Frames
In the world of data analysis and manipulation, data frames stand as a fundamental pillar. But what exactly are data frames, and why are they so crucial?
Data Frames Defined:
A data frame is a structured data object that resembles a table or spreadsheet. It consists of rows and columns, where each column can contain different types of data—numeric, character, or factors. Think of data frames as a versatile and organized way to store and work with data.
The Significance of Data Frames:
Data frames are the backbone of data analysis in R for several reasons:
Structured Storage: Data frames provide an organized structure for storing data. With rows representing observations and columns representing variables or attributes, they make it easier to maintain the integrity of your data.
Data Variety: In real-world scenarios, data comes in diverse forms—numbers, text, categories, dates, and more. Data frames are capable of accommodating this variety, allowing you to work with mixed data types seamlessly.
Data Manipulation: Data frames are the canvas on which you can perform a wide range of data manipulation operations. Whether you need to filter, sort, aggregate, or transform your data, data frames offer the necessary tools.
Data Exploration: When you embark on an exploratory data analysis journey, data frames serve as your trusty companion. You can use them to examine the structure of your data, identify trends, and uncover insights.
Data Visualization: Many data visualization libraries, like ggplot2, are designed to work effortlessly with data frames. This integration simplifies the process of creating informative charts and plots.
Over the next course of sessions we’ll see how dataframes are helpful.
4.2 Creating and Working with Data Frames
Data frames are versatile containers for your data, and you can easily create them using built-in datasets in R. Let’s see how:
# Create a data frame from the built-in 'mtcars' dataset
df <- mtcars
# View the first few rows of the data frame
head(df)
## mpg cyl disp hp drat wt qsec
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
## Valiant 18.1 6 225 105 2.76 3.460 20.22
## vs am gear carb
## Mazda RX4 0 1 4 4
## Mazda RX4 Wag 0 1 4 4
## Datsun 710 1 1 4 1
## Hornet 4 Drive 1 0 3 1
## Hornet Sportabout 0 0 3 2
## Valiant 1 0 3 1
This code snippet uses the ‘mtcars’ dataset to create a data frame named ‘df.’ You can substitute ‘mtcars’ with other datasets available in R.
Accessing and Manipulating Data within Data Frames
Once you have a data frame, you’ll want to access and manipulate the data within it. Here are some common operations:
Accessing columns:
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
## [21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
Filtering rows:
Creating new columns:
Summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.4 15.4 19.2 20.1 22.8 33.9
4.2.0.1 2. Table function
(a) Use the table()
function to tabulate the data by DriveTrain and Origin.
##
## USA non-USA
## 4WD 5 5
## Front 34 33
## Rear 9 7
(b) Repeat part (a), this time using the count()
function.
## DriveTrain Origin n
## 1 4WD USA 5
## 2 4WD non-USA 5
## 3 Front USA 34
## 4 Front non-USA 33
## 5 Rear USA 9
## 6 Rear non-USA 7
(c) Does it looks like foreign car manufacturers had different Drivetrain production preferences compared to US manufacturers?
The counts for each Drivetrain category are nearly the same for US and non-US manufacturers. The table suggests that they had similar Drivetrain production preferences. ## Joining Columns and Rows with
cbind()
andrbind()
:
Combining Data Horizontally with cbind()
:
Sometimes, you may need to combine data frames horizontally, i.e., adding columns. The cbind()
function comes in handy for this task:
# Create two sample data frames
df1 <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Score = c(95, 88, 72))
df2 <- data.frame(ID = c(101, 102, 103),
Grade = c("A", "B", "C"))
# Combine the data frames horizontally
combined_df <- cbind(df1, df2)
# View the result
combined_df
## Name Score ID Grade
## 1 Alice 95 101 A
## 2 Bob 88 102 B
## 3 Charlie 72 103 C
The cbind()
function combines ‘df1’ and ‘df2’ horizontally based on their row order, creating ‘combined_df.’
Combining Data Vertically with rbind()
:
To stack data frames on top of each other, you can use the rbind()
function:
# Create two sample data frames
df1 <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Score = c(95, 88, 72))
df2 <- data.frame(Name = c("David", "Eve"),
Score = c(78, 89))
# Combine the data frames vertically
stacked_df <- rbind(df1, df2)
# View the result
stacked_df
## Name Score
## 1 Alice 95
## 2 Bob 88
## 3 Charlie 72
## 4 David 78
## 5 Eve 89
The rbind()
function vertically combines ‘df1’ and ‘df2,’ stacking ‘df2’ below ‘df1.’
These functions are valuable when you need to merge datasets or add additional information to your existing data frames. In the next section, we’ll explore reading data from external sources.
4.3 Reading Data from CSV, Excel, and Other Formats
Importing Data into R:
Data analysis often involves working with external datasets. R provides various functions to import data from different file formats, such as CSV, Excel, and more. Let’s discuss some of these functions:
read.csv()
: Used to read CSV files.
Example:
## MobileOS X2009.01 X2009.02 X2009.03 X2009.04
## 1 Android 1.75 5.47 6.12 5.56
## 2 BlackBerry OS 2.16 12.36 4.74 5.52
## 3 Brew 0.00 0.00 0.00 0.00
## 4 iOS 59.18 60.48 68.06 69.79
## 5 LG 0.00 0.00 0.00 0.00
## 6 Linux 3.33 0.00 0.00 0.00
## X2009.05 X2009.06 X2009.07 X2009.08 X2009.09
## 1 5.47 0.30 5.19 5.45 5.94
## 2 12.98 20.07 20.13 17.69 16.80
## 3 0.00 0.00 0.00 0.00 0.00
## 4 63.10 59.98 56.64 57.60 55.92
## 5 0.00 0.00 0.14 0.75 0.79
## 6 0.00 0.00 0.00 0.00 0.00
## X2009.10 X2009.11 X2009.12 X2010.01 X2010.02
## 1 6.83 8.23 10.62 11.90 12.86
## 2 17.73 23.14 19.66 20.13 24.24
## 3 0.00 0.00 0.00 0.00 0.00
## 4 56.24 51.90 53.26 53.58 50.63
## 5 0.66 0.57 0.57 0.52 0.44
## 6 0.00 0.00 0.00 0.00 0.00
## X2010.03 X2010.04 X2010.05 X2010.06 X2010.07
## 1 14.44 13.37 9.75 9.63 16.85
## 2 28.67 30.46 30.56 31.80 32.77
## 3 0.00 0.00 0.00 0.00 0.00
## 4 45.49 43.33 42.58 40.06 37.99
## 5 0.40 0.26 0.06 0.05 0.18
## 6 0.00 0.00 0.00 0.00 0.00
## X2010.08 X2010.09 X2010.10 X2010.11 X2010.12
## 1 19.10 23.22 23.34 23.77 25.46
## 2 34.42 29.45 32.02 34.34 32.15
## 3 0.00 0.00 0.00 0.00 0.00
## 4 36.55 37.32 35.16 32.96 33.55
## 5 0.18 0.15 0.12 0.10 0.09
## 6 0.00 0.00 0.00 0.00 0.00
## X2011.01 X2011.02 X2011.03 X2011.04 X2011.05
## 1 26.68 29.31 31.83 30.81 32.23
## 2 27.78 23.39 20.66 23.31 23.45
## 3 0.00 0.00 0.00 0.00 0.00
## 4 35.04 36.74 39.01 36.80 34.09
## 5 0.08 0.07 0.06 0.05 0.05
## 6 0.00 0.00 0.00 0.00 0.00
## X2011.06 X2011.07 X2011.08 X2011.09 X2011.10
## 1 34.79 35.80 39.37 39.52 40.03
## 2 16.60 15.87 13.83 12.20 10.10
## 3 0.00 0.00 0.00 0.00 0.00
## 4 36.05 38.65 38.10 39.12 40.40
## 5 0.05 0.04 0.02 0.02 0.02
## 6 0.00 0.00 0.00 0.00 0.03
## X2011.11 X2011.12 X2012.01 X2012.02 X2012.03
## 1 39.67 40.92 41.77 41.66 40.83
## 2 7.60 6.52 5.83 5.23 4.93
## 3 0.00 0.00 0.00 0.00 0.01
## 4 42.79 42.52 45.37 47.42 48.35
## 5 0.02 0.01 0.01 0.01 0.01
## 6 0.05 0.05 0.05 0.05 0.06
## X2012.04 X2012.05 X2012.06 X2012.07 X2012.08
## 1 40.69 40.44 39.77 40.31 40.85
## 2 4.25 4.02 3.64 3.24 2.84
## 3 0.08 0.09 0.11 0.11 0.11
## 4 48.85 48.82 50.21 50.43 50.40
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.06 0.06 0.05 0.06 0.06
## X2012.09 X2012.10 X2012.11 X2012.12 X2013.01
## 1 41.30 40.30 39.83 40.22 39.40
## 2 2.64 2.39 2.23 1.89 1.71
## 3 0.12 0.12 0.12 0.11 0.10
## 4 50.46 50.75 50.63 50.49 52.69
## 5 0.00 0.01 0.01 0.01 0.01
## 6 0.06 0.05 0.05 0.06 0.05
## X2013.02 X2013.03 X2013.04 X2013.05 X2013.06
## 1 38.83 40.14 40.08 39.88 39.36
## 2 1.61 1.49 1.43 1.38 1.38
## 3 0.10 0.10 0.09 0.09 0.09
## 4 53.48 52.50 53.51 54.24 54.85
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.05 0.06 0.06 0.06 0.06
## X2013.07 X2013.08 X2013.09 X2013.10 X2013.11
## 1 38.77 39.05 38.46 38.66 39.18
## 2 3.53 5.11 5.71 5.85 5.33
## 3 0.08 0.07 0.07 0.06 0.06
## 4 53.91 52.22 51.95 51.54 51.63
## 5 0.10 0.09 0.09 0.09 0.10
## 6 0.11 0.13 0.16 0.17 0.12
## X2013.12 X2014.01 X2014.02 X2014.03 X2014.04
## 1 39.25 38.09 40.37 40.72 41.38
## 2 5.01 3.79 2.96 1.88 1.62
## 3 0.05 0.04 0.04 0.04 0.04
## 4 51.75 54.49 52.87 53.02 52.50
## 5 0.09 0.08 0.08 0.08 0.08
## 6 0.10 0.08 0.07 0.09 0.09
## X2014.05 X2014.06 X2014.07 X2014.08 X2014.09
## 1 42.72 42.93 42.83 43.43 42.33
## 2 1.54 1.53 1.48 1.23 1.12
## 3 0.03 0.03 0.03 0.03 0.03
## 4 51.17 51.36 51.51 51.40 53.31
## 5 0.07 0.06 0.07 0.06 0.05
## 6 0.09 0.09 0.09 0.10 0.06
## X2014.10 X2014.11 X2014.12 X2015.01 X2015.02
## 1 41.36 45.66 47.32 45.40 44.73
## 2 0.92 1.03 0.88 0.80 0.73
## 3 0.02 0.03 0.02 0.02 0.02
## 4 55.22 51.00 49.54 51.38 52.18
## 5 0.04 0.04 0.03 0.03 0.03
## 6 0.04 0.04 0.03 0.02 0.02
## X2015.03 X2015.04 X2015.05 X2015.06 X2015.07
## 1 44.48 47.51 48.52 48.12 46.79
## 2 0.71 0.57 0.46 0.32 0.37
## 3 0.02 0.02 0.02 0.02 0.02
## 4 52.59 49.50 48.64 49.38 50.68
## 5 0.03 0.03 0.03 0.03 0.03
## 6 0.02 0.02 0.04 0.02 0.03
## X2015.08 X2015.09 X2015.10 X2015.11 X2015.12
## 1 46.37 47.15 47.25 46.03 45.65
## 2 0.34 0.30 0.33 0.28 0.37
## 3 0.02 0.02 0.02 0.02 0.02
## 4 51.17 50.37 50.09 51.64 51.86
## 5 0.03 0.04 0.04 0.03 0.03
## 6 0.04 0.05 0.05 0.03 0.03
## X2016.01 X2016.02 X2016.03 X2016.04 X2016.05
## 1 45.05 44.68 46.07 45.87 45.53
## 2 0.24 0.24 0.20 0.22 0.24
## 3 0.02 0.02 0.02 0.02 0.02
## 4 52.90 53.37 52.16 52.42 52.78
## 5 0.03 0.03 0.02 0.02 0.02
## 6 0.02 0.02 0.02 0.02 0.02
## X2016.06 X2016.07 X2016.08 X2016.09 X2016.10
## 1 44.76 46.19 44.13 46.30 43.75
## 2 0.19 0.17 0.17 0.16 0.15
## 3 0.02 0.02 0.02 0.02 0.01
## 4 53.67 52.41 54.47 51.72 54.77
## 5 0.01 0.01 0.01 0.02 0.01
## 6 0.02 0.02 0.02 0.03 0.01
## X2016.11 X2016.12 X2017.01 X2017.02 X2017.03
## 1 44.91 45.14 44.97 44.90 45.43
## 2 0.13 0.13 0.13 0.12 0.11
## 3 0.01 0.01 0.01 0.01 0.01
## 4 53.83 53.70 53.89 54.01 53.57
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.01 0.01 0.01 0.01
## X2017.04 X2017.05 X2017.06 X2017.07 X2017.08
## 1 45.07 45.40 46.04 46.91 45.09
## 2 0.11 0.10 0.08 0.08 0.08
## 3 0.01 0.01 0.01 0.01 0.01
## 4 53.88 53.49 53.07 52.17 54.12
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.00 0.00 0.00 0.01
## X2017.09 X2017.10 X2017.11 X2017.12 X2018.01
## 1 45.33 44.37 44.01 45.04 45.99
## 2 0.09 0.07 0.06 0.05 0.04
## 3 0.01 0.01 0.01 0.01 0.01
## 4 53.85 54.96 55.42 54.42 53.44
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.01 0.01 0.01 0.01
## X2018.02 X2018.03 X2018.04 X2018.05 X2018.06
## 1 45.48 43.24 45.79 45.09 45.94
## 2 0.03 0.03 0.03 0.03 0.03
## 3 0.01 0.01 0.01 0.01 0.01
## 4 54.02 56.26 53.60 54.40 53.60
## 5 0.01 0.01 0.01 0.02 0.02
## 6 0.00 0.01 0.06 0.01 0.01
## X2018.07 X2018.08 X2018.09 X2018.10 X2018.11
## 1 47.39 46.47 45.96 43.57 37.57
## 2 0.03 0.04 0.04 0.04 0.04
## 3 0.01 0.01 0.01 0.01 0.01
## 4 52.19 53.13 53.70 56.08 62.01
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.01 0.01 0.01 0.01
## X2018.12 X2019.01 X2019.02 X2019.03 X2019.04
## 1 42.75 42.43 42.29 44.89 45.07
## 2 0.04 0.05 0.06 0.06 0.06
## 3 0.01 0.01 0.01 0.01 0.01
## 4 56.91 57.22 57.33 54.74 54.61
## 5 0.00 0.00 0.00 0.00 0.00
## 6 0.01 0.01 0.01 0.01 0.01
## X2019.05
## 1 46.13
## 2 0.03
## 3 0.01
## 4 53.62
## 5 0.00
## 6 0.01
read.table()
: General function for reading tabular data, including CSV files. Example:
# Read data from a tab-delimited file
data <- read.table("data/tab_Sep_file", header = TRUE, sep = "\t")
head(data)
## Date.......Revenue......Expenses......Profit....
## 1 2023-01-01 50000 30000 20000
## 2 2023-01-02 48000 31000 17000
## 3 2023-01-03 52000 29000 23000
## 4 2023-01-04 55000 32000 23000
## 5 2023-01-05 49000 28000 21000
## 6 2023-01-06 51000 33000 18000
read.xlsx()
(from the readxl
package): Used for reading Excel files. Example:
# Read data from an Excel file
library(readxl)
data <- read_excel("data/MobileOS_US.xlsx")
head(data)
## # A tibble: 6 × 126
## MobileOS `2009-01` `2009-02` `2009-03` `2009-04`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Android 1.75 5.47 6.12 5.56
## 2 BlackBerry OS 2.16 12.4 4.74 5.52
## 3 Brew 0 0 0 0
## 4 iOS 59.2 60.5 68.1 69.8
## 5 LG 0 0 0 0
## 6 Linux 3.33 0 0 0
## # ℹ 121 more variables: `2009-05` <dbl>,
## # `2009-06` <dbl>, `2009-07` <dbl>, `2009-08` <dbl>,
## # `2009-09` <dbl>, `2009-10` <dbl>, `2009-11` <dbl>,
## # `2009-12` <dbl>, `2010-01` <dbl>, `2010-02` <dbl>,
## # `2010-03` <dbl>, `2010-04` <dbl>, `2010-05` <dbl>,
## # `2010-06` <dbl>, `2010-07` <dbl>, `2010-08` <dbl>,
## # `2010-09` <dbl>, `2010-10` <dbl>, …
Hands-on-exercise:
Let’s demonstrate how to read data from a CSV file. Assuming you have a file named “data.csv” with columns ‘Name’ and ‘Age,’ you can use the read.csv()
function:
# Read data from a CSV file
data <- read.csv("data/MobileOS_US.csv")
# View the first few rows
head(data)
## MobileOS X2009.01 X2009.02 X2009.03 X2009.04
## 1 Android 1.75 5.47 6.12 5.56
## 2 BlackBerry OS 2.16 12.36 4.74 5.52
## 3 Brew 0.00 0.00 0.00 0.00
## 4 iOS 59.18 60.48 68.06 69.79
## 5 LG 0.00 0.00 0.00 0.00
## 6 Linux 3.33 0.00 0.00 0.00
## X2009.05 X2009.06 X2009.07 X2009.08 X2009.09
## 1 5.47 0.30 5.19 5.45 5.94
## 2 12.98 20.07 20.13 17.69 16.80
## 3 0.00 0.00 0.00 0.00 0.00
## 4 63.10 59.98 56.64 57.60 55.92
## 5 0.00 0.00 0.14 0.75 0.79
## 6 0.00 0.00 0.00 0.00 0.00
## X2009.10 X2009.11 X2009.12 X2010.01 X2010.02
## 1 6.83 8.23 10.62 11.90 12.86
## 2 17.73 23.14 19.66 20.13 24.24
## 3 0.00 0.00 0.00 0.00 0.00
## 4 56.24 51.90 53.26 53.58 50.63
## 5 0.66 0.57 0.57 0.52 0.44
## 6 0.00 0.00 0.00 0.00 0.00
## X2010.03 X2010.04 X2010.05 X2010.06 X2010.07
## 1 14.44 13.37 9.75 9.63 16.85
## 2 28.67 30.46 30.56 31.80 32.77
## 3 0.00 0.00 0.00 0.00 0.00
## 4 45.49 43.33 42.58 40.06 37.99
## 5 0.40 0.26 0.06 0.05 0.18
## 6 0.00 0.00 0.00 0.00 0.00
## X2010.08 X2010.09 X2010.10 X2010.11 X2010.12
## 1 19.10 23.22 23.34 23.77 25.46
## 2 34.42 29.45 32.02 34.34 32.15
## 3 0.00 0.00 0.00 0.00 0.00
## 4 36.55 37.32 35.16 32.96 33.55
## 5 0.18 0.15 0.12 0.10 0.09
## 6 0.00 0.00 0.00 0.00 0.00
## X2011.01 X2011.02 X2011.03 X2011.04 X2011.05
## 1 26.68 29.31 31.83 30.81 32.23
## 2 27.78 23.39 20.66 23.31 23.45
## 3 0.00 0.00 0.00 0.00 0.00
## 4 35.04 36.74 39.01 36.80 34.09
## 5 0.08 0.07 0.06 0.05 0.05
## 6 0.00 0.00 0.00 0.00 0.00
## X2011.06 X2011.07 X2011.08 X2011.09 X2011.10
## 1 34.79 35.80 39.37 39.52 40.03
## 2 16.60 15.87 13.83 12.20 10.10
## 3 0.00 0.00 0.00 0.00 0.00
## 4 36.05 38.65 38.10 39.12 40.40
## 5 0.05 0.04 0.02 0.02 0.02
## 6 0.00 0.00 0.00 0.00 0.03
## X2011.11 X2011.12 X2012.01 X2012.02 X2012.03
## 1 39.67 40.92 41.77 41.66 40.83
## 2 7.60 6.52 5.83 5.23 4.93
## 3 0.00 0.00 0.00 0.00 0.01
## 4 42.79 42.52 45.37 47.42 48.35
## 5 0.02 0.01 0.01 0.01 0.01
## 6 0.05 0.05 0.05 0.05 0.06
## X2012.04 X2012.05 X2012.06 X2012.07 X2012.08
## 1 40.69 40.44 39.77 40.31 40.85
## 2 4.25 4.02 3.64 3.24 2.84
## 3 0.08 0.09 0.11 0.11 0.11
## 4 48.85 48.82 50.21 50.43 50.40
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.06 0.06 0.05 0.06 0.06
## X2012.09 X2012.10 X2012.11 X2012.12 X2013.01
## 1 41.30 40.30 39.83 40.22 39.40
## 2 2.64 2.39 2.23 1.89 1.71
## 3 0.12 0.12 0.12 0.11 0.10
## 4 50.46 50.75 50.63 50.49 52.69
## 5 0.00 0.01 0.01 0.01 0.01
## 6 0.06 0.05 0.05 0.06 0.05
## X2013.02 X2013.03 X2013.04 X2013.05 X2013.06
## 1 38.83 40.14 40.08 39.88 39.36
## 2 1.61 1.49 1.43 1.38 1.38
## 3 0.10 0.10 0.09 0.09 0.09
## 4 53.48 52.50 53.51 54.24 54.85
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.05 0.06 0.06 0.06 0.06
## X2013.07 X2013.08 X2013.09 X2013.10 X2013.11
## 1 38.77 39.05 38.46 38.66 39.18
## 2 3.53 5.11 5.71 5.85 5.33
## 3 0.08 0.07 0.07 0.06 0.06
## 4 53.91 52.22 51.95 51.54 51.63
## 5 0.10 0.09 0.09 0.09 0.10
## 6 0.11 0.13 0.16 0.17 0.12
## X2013.12 X2014.01 X2014.02 X2014.03 X2014.04
## 1 39.25 38.09 40.37 40.72 41.38
## 2 5.01 3.79 2.96 1.88 1.62
## 3 0.05 0.04 0.04 0.04 0.04
## 4 51.75 54.49 52.87 53.02 52.50
## 5 0.09 0.08 0.08 0.08 0.08
## 6 0.10 0.08 0.07 0.09 0.09
## X2014.05 X2014.06 X2014.07 X2014.08 X2014.09
## 1 42.72 42.93 42.83 43.43 42.33
## 2 1.54 1.53 1.48 1.23 1.12
## 3 0.03 0.03 0.03 0.03 0.03
## 4 51.17 51.36 51.51 51.40 53.31
## 5 0.07 0.06 0.07 0.06 0.05
## 6 0.09 0.09 0.09 0.10 0.06
## X2014.10 X2014.11 X2014.12 X2015.01 X2015.02
## 1 41.36 45.66 47.32 45.40 44.73
## 2 0.92 1.03 0.88 0.80 0.73
## 3 0.02 0.03 0.02 0.02 0.02
## 4 55.22 51.00 49.54 51.38 52.18
## 5 0.04 0.04 0.03 0.03 0.03
## 6 0.04 0.04 0.03 0.02 0.02
## X2015.03 X2015.04 X2015.05 X2015.06 X2015.07
## 1 44.48 47.51 48.52 48.12 46.79
## 2 0.71 0.57 0.46 0.32 0.37
## 3 0.02 0.02 0.02 0.02 0.02
## 4 52.59 49.50 48.64 49.38 50.68
## 5 0.03 0.03 0.03 0.03 0.03
## 6 0.02 0.02 0.04 0.02 0.03
## X2015.08 X2015.09 X2015.10 X2015.11 X2015.12
## 1 46.37 47.15 47.25 46.03 45.65
## 2 0.34 0.30 0.33 0.28 0.37
## 3 0.02 0.02 0.02 0.02 0.02
## 4 51.17 50.37 50.09 51.64 51.86
## 5 0.03 0.04 0.04 0.03 0.03
## 6 0.04 0.05 0.05 0.03 0.03
## X2016.01 X2016.02 X2016.03 X2016.04 X2016.05
## 1 45.05 44.68 46.07 45.87 45.53
## 2 0.24 0.24 0.20 0.22 0.24
## 3 0.02 0.02 0.02 0.02 0.02
## 4 52.90 53.37 52.16 52.42 52.78
## 5 0.03 0.03 0.02 0.02 0.02
## 6 0.02 0.02 0.02 0.02 0.02
## X2016.06 X2016.07 X2016.08 X2016.09 X2016.10
## 1 44.76 46.19 44.13 46.30 43.75
## 2 0.19 0.17 0.17 0.16 0.15
## 3 0.02 0.02 0.02 0.02 0.01
## 4 53.67 52.41 54.47 51.72 54.77
## 5 0.01 0.01 0.01 0.02 0.01
## 6 0.02 0.02 0.02 0.03 0.01
## X2016.11 X2016.12 X2017.01 X2017.02 X2017.03
## 1 44.91 45.14 44.97 44.90 45.43
## 2 0.13 0.13 0.13 0.12 0.11
## 3 0.01 0.01 0.01 0.01 0.01
## 4 53.83 53.70 53.89 54.01 53.57
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.01 0.01 0.01 0.01
## X2017.04 X2017.05 X2017.06 X2017.07 X2017.08
## 1 45.07 45.40 46.04 46.91 45.09
## 2 0.11 0.10 0.08 0.08 0.08
## 3 0.01 0.01 0.01 0.01 0.01
## 4 53.88 53.49 53.07 52.17 54.12
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.00 0.00 0.00 0.01
## X2017.09 X2017.10 X2017.11 X2017.12 X2018.01
## 1 45.33 44.37 44.01 45.04 45.99
## 2 0.09 0.07 0.06 0.05 0.04
## 3 0.01 0.01 0.01 0.01 0.01
## 4 53.85 54.96 55.42 54.42 53.44
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.01 0.01 0.01 0.01
## X2018.02 X2018.03 X2018.04 X2018.05 X2018.06
## 1 45.48 43.24 45.79 45.09 45.94
## 2 0.03 0.03 0.03 0.03 0.03
## 3 0.01 0.01 0.01 0.01 0.01
## 4 54.02 56.26 53.60 54.40 53.60
## 5 0.01 0.01 0.01 0.02 0.02
## 6 0.00 0.01 0.06 0.01 0.01
## X2018.07 X2018.08 X2018.09 X2018.10 X2018.11
## 1 47.39 46.47 45.96 43.57 37.57
## 2 0.03 0.04 0.04 0.04 0.04
## 3 0.01 0.01 0.01 0.01 0.01
## 4 52.19 53.13 53.70 56.08 62.01
## 5 0.01 0.01 0.01 0.01 0.01
## 6 0.01 0.01 0.01 0.01 0.01
## X2018.12 X2019.01 X2019.02 X2019.03 X2019.04
## 1 42.75 42.43 42.29 44.89 45.07
## 2 0.04 0.05 0.06 0.06 0.06
## 3 0.01 0.01 0.01 0.01 0.01
## 4 56.91 57.22 57.33 54.74 54.61
## 5 0.00 0.00 0.00 0.00 0.00
## 6 0.01 0.01 0.01 0.01 0.01
## X2019.05
## 1 46.13
## 2 0.03
## 3 0.01
## 4 53.62
## 5 0.00
## 6 0.01
This code reads the CSV file and loads it into the ‘data’ data frame. You can then explore and analyze the data within R.
Importing data is a crucial step in any data analysis project, as it allows you to work with real-world datasets. In the next section, we’ll cover handling missing data effectively.
4.4 Dealing with Missing Data
4.4.1 Detecting Missing Values
Missing data is a common challenge in data analysis. In R, you can identify missing values using the is.na()
function or the complete.cases()
function.
4.4.1.1 Example 1: Using is.na()
# Create a vector with missing values
data <- c(10, 15, NA, 25, 30)
# Check for missing values
missing_values <- is.na(data)
missing_values
## [1] FALSE FALSE TRUE FALSE FALSE
Example 2: Using complete.cases()
# Create a data frame with missing values
df <- data.frame(
Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
Age = c(25, 30, NA, 28, 35)
)
head(df)
## Name Age
## 1 Alice 25
## 2 Bob 30
## 3 Charlie NA
## 4 <NA> 28
## 5 Eve 35
# Check for complete cases (rows without missing values)
complete_rows <- complete.cases(df)
complete_rows
## [1] TRUE TRUE FALSE FALSE TRUE
4.4.2 Handling Missing Data
Once you’ve identified missing values, you can choose how to handle them. Common strategies include removing rows with missing data or imputing missing values with a specific value or statistic.
Example 1: Removing Rows with Missing Data
## Name Age
## 1 Alice 25
## 2 Bob 30
## 5 Eve 35
Example 2: Imputing Missing Values
# Impute missing values with the mean
mean_age <- mean(df$Age, na.rm = TRUE)
df$Age[is.na(df$Age)] <- mean_age
df
## Name Age
## 1 Alice 25.0
## 2 Bob 30.0
## 3 Charlie 29.5
## 4 <NA> 28.0
## 5 Eve 35.0
4.5 Checking Data Types and Structure in R
4.5.1 Understanding Data Types
In R, it’s essential to understand the data types of variables in your data frames. Common data types include numeric, character, factor, logical, and more. The str()
function allows you to inspect the structure of your data frame and view data types.
4.5.2 Example 1: Using str()
# Create a sample data frame
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25.0, 30.6, 28.8),
Married = c(TRUE, FALSE, TRUE)
)
# View the structure of the data frame
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ Name : chr "Alice" "Bob" "Charlie"
## $ Age : num 25 30.6 28.8
## $ Married: logi TRUE FALSE TRUE
The output of str(df)
will display the data types and structure of the data frame.
Checking and Modifying Data Types
Sometimes, you may need to change the data type of a variable. R provides functions like as.numeric(), as.character(), and others for this purpose.
Example 2: Modifying Data Types
## num [1:3] 25 30.6 28.8
## 'data.frame': 3 obs. of 3 variables:
## $ Name : chr "Alice" "Bob" "Charlie"
## $ Age : int 25 30 28
## $ Married: logi TRUE FALSE TRUE
The as.integer() function is used to change the data type of the Age column to integer.
Example 3: Changing Character to Factor
# Create a character vector
colors <- c("Red", "Green", "Blue", "Red", "Green")
# Convert it to a factor
factor_colors <- as.factor(colors)
str(factor_colors)
## Factor w/ 3 levels "Blue","Green",..: 3 2 1 3 2
In this example, a character vector is converted to a factor using as.factor().
Example 4: Converting Factor to Character
# Convert factor_colors back to character
char_colors <- as.character(factor_colors)
str(char_colors)
## chr [1:5] "Red" "Green" "Blue" "Red" "Green"
Here, the as.character() function is used to convert a factor back to a character vector.
Example 5: Logical Data Type
# Create a logical vector
is_student <- c(TRUE, FALSE, TRUE)
# Check the data type and structure
str(is_student)
## logi [1:3] TRUE FALSE TRUE
This example demonstrates the logical data type.
Example 6: Date Data Type
# Create a date vector
dates <- as.Date(c("2023-09-01", "2023-09-15"))
# Check the data type and structure
str(dates)
## Date[1:2], format: "2023-09-01" "2023-09-15"
Here, we use as.Date() to create a date vector.
Example 7: Complex Data Type
# Create a complex vector
complex_numbers <- complex(real = c(1, 2, 3), imaginary = c(4, 5, 6))
# Check the data type and structure
str(complex_numbers)
## cplx [1:3] 1+4i 2+5i 3+6i
This example introduces the complex data type for handling complex numbers in R.
4.6 Data Cleaning Basics
Data cleaning is a critical step in data analysis to ensure that your dataset is accurate and ready for analysis. In R, you can perform various data cleaning tasks easily.
Hands-on:
(a) Use group_by()
and summarize()
commands on the Cars93 data set to create a table showing the average Turn.circle
of cars, broken down by vehicle Type
and DriveTrain
## `summarise()` has grouped output by 'Type'. You can
## override using the `.groups` argument.
## # A tibble: 14 × 3
## # Groups: Type [6]
## Type DriveTrain `mean(Turn.circle)`
## <fct> <fct> <dbl>
## 1 Compact 4WD 37
## 2 Compact Front 38.8
## 3 Compact Rear 35.5
## 4 Large Front 42
## 5 Large Rear 43.8
## 6 Midsize Front 40.5
## 7 Midsize Rear 39
## 8 Small 4WD 33.5
## 9 Small Front 35.3
## 10 Sporty 4WD 39.5
## 11 Sporty Front 37
## 12 Sporty Rear 41.2
## 13 Van 4WD 41.8
## 14 Van Front 41.8
(b) Are all combinations of Type and DriveTrain shown in the table? If not, which ones are missing? Why are they missing?
Some are missing. E.g., there is no entry for Large 4WD cars. This is because there are no vehicles in this category.
## [1] 0
(c) Add the argument .drop = FALSE
to your group_by
command, and then re-run your code. What happens now?
## `summarise()` has grouped output by 'Type'. You can
## override using the `.groups` argument.
## # A tibble: 18 × 3
## # Groups: Type [6]
## Type DriveTrain `mean(Turn.circle)`
## <fct> <fct> <dbl>
## 1 Compact 4WD 37
## 2 Compact Front 38.8
## 3 Compact Rear 35.5
## 4 Large 4WD NaN
## 5 Large Front 42
## 6 Large Rear 43.8
## 7 Midsize 4WD NaN
## 8 Midsize Front 40.5
## 9 Midsize Rear 39
## 10 Small 4WD 33.5
## 11 Small Front 35.3
## 12 Small Rear NaN
## 13 Sporty 4WD 39.5
## 14 Sporty Front 37
## 15 Sporty Rear 41.2
## 16 Van 4WD 41.8
## 17 Van Front 41.8
## 18 Van Rear NaN
The
.drop
argument, which is set toTRUE
by default, controls whether variable combinatinos that never appear together are dropped. When we set.drop = FALSE
the combinations with 0 counts still appear in the table, but the summary showsNaN
in that cell (not a number).
(d) Having a car with a small turn radius makes city driving much easier. What Type of car should city drivers opt for?
Answer:
Small cars appear to have smaller turn radii.
DriveTrain
appear to have an impact on turn radius?Answer:
There is no consistent association.
Understanding data types and ensuring they match the nature of your data is crucial for accurate analysis and visualization in R.