Data Manipulation in R: Master All Concepts in One Place!
In this guide, we'll explore how to perform data manipulation using the R programming language.
What is Data Manipulation in R?
Data manipulation involves reshaping, cleaning, and transforming data using R's built-in structures, making it ready for analysis and visualization.
Before starting data manipulation, you should know how to import and export data in R (e.g., CSV, SPSS, text files).
Core Data Structures in R
1. Vectors
One-dimensional, ordered collections of elements.
Types: integer, numeric, logical, character, complex.
2. Matrices
Rectangular arrays where all elements are of the same type.
Always two-dimensional (rows and columns); for three or more dimensions, use array().
3. Lists
Flexible containers for elements of any type or structure.
Can store vectors, matrices, data frames, or even other lists.
4. Data Frames
Two-dimensional structures, like database tables or spreadsheets.
Suitable for storing datasets.
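As a rough illustration of the four structures above, here is a minimal sketch that builds one of each. All object names and values are made up for the example.

```r
# A numeric vector: one dimension, one type
v <- c(1.5, 2.5, 3.5)

# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# A list can mix types and structures
l <- list(numbers = v, grid = m, label = "demo")

# A data frame: equal-length columns, each possibly of a different type
df <- data.frame(id = 1:3, score = v, passed = c(TRUE, FALSE, TRUE))

str(df)  # compact overview of the columns and their types
```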
Creating Subsets of Data in R
As datasets grow, it is often more efficient to work with a smaller portion of the data. Selecting such a portion is called subsetting.
Here are some common subsetting methods:
$ Operator
Accesses a single column of a data frame.
iris$Species
[[ ]] Operator
Returns a single element, selected by position or by name. On a data frame, it returns one column as a vector.
iris[[5]]
[ ] Operator
Returns multiple elements based on indices or conditions.
iris[1:5, ]
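The three operators are easy to confuse, so here is a short sketch comparing what each returns on the built-in iris dataset. The variable names and the threshold of 7 are arbitrary choices for illustration.

```r
data(iris)

# `$` and `[[ ]]` both return the column itself (here a factor)
species_a <- iris$Species
species_b <- iris[["Species"]]
identical(species_a, species_b)

# `[ ]` with a single column index keeps the data frame wrapper
species_df <- iris[5]
class(species_df)

# `[ ]` with a logical condition on rows and a vector of column names
long_sepals <- iris[iris$Sepal.Length > 7, c("Sepal.Length", "Species")]
head(long_sepals)
```

In practice, `[[ ]]` is handy when the column name is stored in a variable, while `$` is convenient for interactive work.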
The sample() Function in R
Draws a random sample from a vector. For example, the following call simulates ten rolls of a die:
sample(1:6, 10, replace = TRUE)
Use set.seed() to ensure reproducible results:
set.seed(100)
sample(1:5, 10, replace = TRUE)
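To tie sample() back to subsetting, one common pattern is to sample row numbers and then index the data frame with them. The sketch below assumes the built-in iris dataset and an arbitrary sample size of 10.

```r
data(iris)
set.seed(100)  # arbitrary seed, used only to make the draw repeatable

# Draw 10 distinct row numbers (sampling without replacement is the default)
rows <- sample(nrow(iris), 10)

# Use the row numbers to pull a random subset of the data frame
iris_sample <- iris[rows, ]
nrow(iris_sample)
```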
Applications of Subsetting
1. Removing Duplicates
duplicated(c(1,2,1,3,1,4))
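Note that duplicated() only flags repeated values; it does not remove them. A minimal sketch of the removal step, using the same vector plus a small made-up data frame:

```r
x <- c(1, 2, 1, 3, 1, 4)

dup <- duplicated(x)   # FALSE FALSE TRUE FALSE TRUE FALSE
deduped <- x[!dup]     # keep only the first occurrence of each value: 1 2 3 4
unique(x)              # shortcut that gives the same result

# The same idea works row-wise on a data frame (hypothetical data)
df <- data.frame(id = c(1, 2, 1), value = c("a", "b", "a"))
df_unique <- df[!duplicated(df), ]
```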
2. Identifying Missing Data
complete.cases(data)
na.omit(data)
Example:
data <- read.table(header=TRUE, text='
 subject sex size
       1   M    7
       2   F   NA
       3   F    9
       4   M   11
')
write.csv(data, "table.csv", row.names=FALSE)
file <- read.csv("table.csv")
na.omit(file)
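While na.omit() drops incomplete rows in one step, complete.cases() returns a logical vector that can be used for subsetting or counting. The sketch below rebuilds the same four-row table in memory (the name `survey` is an assumption for the example) so that it runs without touching the file system.

```r
# Same values as the example table, built directly as a data frame
survey <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  size    = c(7, NA, 9, 11)
)

ok <- complete.cases(survey)   # TRUE FALSE TRUE TRUE
clean <- survey[ok, ]          # rows without any NA
n_incomplete <- sum(!ok)       # how many rows were dropped

# Count missing values per column to see where the gaps are
na_per_column <- colSums(is.na(survey))
```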
Adding Calculated Columns
You can compute new fields directly from existing ones:
data(iris)
x <- iris$Sepal.Length / iris$Sepal.Width
head(x)
Using with() for cleaner syntax:
with(iris, Sepal.Length / Sepal.Width)
Using within() to add a new column to the dataset:
iris <- within(iris, ratio <- Sepal.Length / Sepal.Width)
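The difference between with() and within() is easy to miss: with() only evaluates an expression and returns its value, while within() returns a modified copy of the data frame. A small sketch, where `iris2` is a hypothetical name chosen so the original iris stays untouched:

```r
data(iris)  # reload a fresh copy of the built-in dataset

# with(): returns a plain numeric vector, iris itself is unchanged
ratio <- with(iris, Sepal.Length / Sepal.Width)

# within(): returns a copy of the data frame with the new column added
iris2 <- within(iris, ratio <- Sepal.Length / Sepal.Width)

names(iris)   # no "ratio" column here
names(iris2)  # "ratio" has been added
```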
Creating Data Bins or Subgroups
cut() Function
Classifies numeric values into intervals:
frost <- c(1,2,3)
cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High"))
table() Function
Counts the number of items in each bin:
table(cut(frost, 3, include.lowest=TRUE, labels=c("Low", "Med", "High")))
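Passing a single number to cut() splits the data range into that many equal-width intervals. When the bin edges should be fixed in advance, a vector of break points can be supplied instead. The sketch below assumes the Frost column of the built-in state.x77 matrix; the break points 0, 60, 120, and 200 are arbitrary values picked for illustration.

```r
# Frost days per state, taken from the built-in state.x77 matrix
frost_days <- state.x77[, "Frost"]

# Explicit break points (assumed values); include.lowest keeps a value of 0 in the first bin
bins <- cut(frost_days,
            breaks = c(0, 60, 120, 200),
            include.lowest = TRUE,
            labels = c("Low", "Med", "High"))

# Number of states in each bin
bin_counts <- table(bins)
bin_counts
```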
Combining and Merging Datasets in R
1. Add Columns with cbind()
cbind(df1, df2)
2. Add Rows with rbind()
rbind(df1, df2)
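The two calls above assume that df1 and df2 already exist and have compatible shapes. Here is a minimal runnable sketch with hypothetical data frames (all names and values are invented): cbind() needs the same number of rows, and rbind() needs the same column names.

```r
# Hypothetical data frames, invented for illustration
people      <- data.frame(id = 1:3, height = c(150, 160, 170))
weights     <- data.frame(weight = c(50, 60, 70))
more_people <- data.frame(id = 4:5, height = c(180, 190))

# cbind(): place columns side by side (row counts must match)
wide <- cbind(people, weights)

# rbind(): stack rows (column names must match)
tall <- rbind(people, more_people)

dim(wide)
dim(tall)
```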
3. Merge with merge() Function
Example:
states <- as.data.frame(state.x77)
states$Name <- rownames(state.x77)
rownames(states) <- NULL
freezing <- states[states$Frost > 150, c("Name", "Frost")]
big <- states[states$Area > 100000, c("Name", "Area")]
merge(freezing, big)
Types of Joins with merge():
| Join Type | Parameter |
|---|---|
| Natural Join | all = FALSE (default) |
| Full Outer Join | all = TRUE |
| Left Outer Join | all.x = TRUE |
| Right Outer Join | all.y = TRUE |
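To see how the parameters in the table change the result, the following sketch rebuilds the `freezing` and `big` data frames from the merge example and runs each join type. The result names (`inner`, `full`, `left`, `right`) are chosen here for illustration.

```r
# Rebuild the two data frames from the merge() example
states <- as.data.frame(state.x77)
states$Name <- rownames(state.x77)
rownames(states) <- NULL
freezing <- states[states$Frost > 150, c("Name", "Frost")]
big <- states[states$Area > 100000, c("Name", "Area")]

inner <- merge(freezing, big)                # only states present in both
full  <- merge(freezing, big, all = TRUE)    # every state from either side
left  <- merge(freezing, big, all.x = TRUE)  # every state in `freezing`
right <- merge(freezing, big, all.y = TRUE)  # every state in `big`

# Compare row counts; unmatched cells are filled with NA
c(inner = nrow(inner), full = nrow(full), left = nrow(left), right = nrow(right))
```

Because both data frames share only the `Name` column, merge() joins on it automatically; the `by` argument can be used to name the key columns explicitly.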
match() Function in R
Finds the position of elements from one vector in another:
index <- match(freezing$Name, big$Name)
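The result of match() is a vector of positions, with NA wherever no match was found. A small sketch of how those positions might be used, first on toy vectors and then on the `freezing` and `big` data frames (rebuilt here so the block runs on its own):

```r
# Toy example: where does each element of the first vector sit in the second?
pos <- match(c("b", "z", "a"), c("a", "b", "c"))   # 2 NA 1

# Rebuild the data frames from the merge() example
states <- as.data.frame(state.x77)
states$Name <- rownames(state.x77)
rownames(states) <- NULL
freezing <- states[states$Frost > 150, c("Name", "Frost")]
big <- states[states$Area > 100000, c("Name", "Area")]

index <- match(freezing$Name, big$Name)

# Drop the NAs, then pull the matching rows out of `big`
matched <- big[index[!is.na(index)], ]

# %in% gives the same information as a logical vector
in_both <- freezing$Name %in% big$Name
freezing[in_both, ]
```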
Conclusion: Ready to Manipulate Data with R?
These concepts form the foundation for any data science task in R. Whether you're preparing for a data analysis project or enhancing your career in analytics, mastering data manipulation in R is a must.