Concatenating and Splitting Strings in R

Welcome! Here we're going to go through a couple of examples of concatenating, subsetting, and splitting strings in R. The motivation for this article is to show how simple working with strings can be with stringr. Working with strings was one of the areas that seemed intimidating and kept me from moving from Excel to R sooner, but many of the things I needed to do in Excel with strings are quite simple in R.

This article focuses on examples that I've experienced in Excel and how to do them in R. An example of when I've had to concatenate in the past is when someone handed me a dataset that included people's names and phone numbers, but they had not included a column with an id. I concatenated names and phone numbers to create a unique id for users. That's probably something you're not supposed to do (and I'd only recommend it for an ad-hoc analysis without a ton of data), but it worked well enough for this particular use case. Using the "left" and "right" functions in Excel was also pretty common for me, and again, this is very easy to do in R. In this article we're going to cover:

  • Concatenating strings

  • Subsetting strings

  • Splitting strings

To do these string manipulations, we're going to be using the stringr and tidyr libraries. The cheat sheet for the stringr library can be found here. The tidyr cheat sheet can be found here. My friend Yujian Tang will be doing a similar article in Python. You can find Yujian's article here.

 

Concatenate a String in R:

Concatenating is a fancy term for just chaining things together. Being able to manipulate strings is one of the skills that made me feel more comfortable moving away from Excel and towards using code for my analyses. Here, we're just going to call in the stringr, dplyr, and tidyr libraries, create some data, and then concatenate that data. I've chosen to add the code here in a way that is easy to copy and paste, and then I've also added a screenshot of the output.

### install and call the stringr library
#install.packages("stringr")
#install.packages("dplyr")
#install.packages("tidyr")
library(stringr)
library(dplyr)
library(tidyr)  # for the separate function in splitting strings section

####  Create data
column1 <- c("Paul", "Kristen", "Susan", "Harold")
column2 <- c("Kehrer", "Kehrer", "Kehrer", "Kehrer")

##concatenate the columns
str_c(column1, column2)

Super simple, but also rarely what we're actually looking to achieve. Most of the time I'll need some other formatting, like a space between the names. This is super easy and intuitive to do. You can also use the "collapse" parameter to join all of the concatenated results into one single string, and specify the characters that go between them.

## Put a space between the names
str_c(column1, " ", column2)

###  If you were trying to make some weird sentence, I added apostrophes for the names:
str_c("We'll put the first name here: '", column1, "' and we'll put the second name here: '", column2,"'")

###  Using the collapse parameter, you're also able to specify any characters between the concatenations.  So column 1 and 2 will be concatenated, 
###  but each concatenation will be separated by commas

str_c(column1, " ", column2, collapse = ", ")
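
Going back to the ad-hoc id I mentioned in the intro, here is a minimal sketch of how that could look. The phone numbers are made up for illustration, and the underscore separator is just my own choice. It uses the "sep" argument of str_c, which puts a separator between the pieces of each element, while "collapse" squashes all of the results into one string.

```r
library(stringr)

###  Made-up data, for illustration only
first_names <- c("Paul", "Kristen", "Susan", "Harold")
phones <- c("555-0101", "555-0102", "555-0103", "555-0104")

###  sep goes between the pieces of each element, so we get one id per person
user_id <- str_c(first_names, phones, sep = "_")
user_id

###  collapse joins all of the ids into a single string
all_ids <- str_c(user_id, collapse = "; ")
all_ids
```

The ids are only as unique as the name and phone combination, so this is really only something to lean on for a quick, one-off analysis.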

By default, NAs are contagious in str_c: if any piece of a concatenation is NA, the whole result for that element is NA. If you'd like the NA to be treated as text instead, you can leverage the "str_replace_na" function. This might be helpful if you're doing further string manipulation later on and want all your data to be consistent for future manipulations.

###  If you're dealing with NAs, you'll just need to add the "str_replace_na" function if you'd like them to be treated like your other data.
### Here is the default handling of NAs: any element that involves an NA comes back as NA

column3 <- c("Software Engineer", "Data Scientist", "Student", NA)

str_c(column1, " - ", column3)

###  To make this work with the NA, just add "str_replace_na" to the relevant column

str_c(column1, " - ", str_replace_na(column3))
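
To make the difference easy to see side by side, here's a small sketch that stores both results. The vector names and the "Unknown" label are my own choices for illustration; "replacement" is the argument of str_replace_na that controls what the NA is turned into.

```r
library(stringr)

first_names <- c("Paul", "Kristen", "Susan", "Harold")
job_titles <- c("Software Engineer", "Data Scientist", "Student", NA)

###  Default behaviour: the NA in job_titles makes the fourth result NA
with_na <- str_c(first_names, " - ", job_titles)
with_na

###  str_replace_na turns the missing value into text first.
###  "Unknown" is just a label I picked; the default replacement is the text "NA".
labelled <- str_c(first_names, " - ", str_replace_na(job_titles, replacement = "Unknown"))
labelled
```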

Subsetting a String in R:

Here, I'm focused on how to get the first few or the last few characters of a string. I remember in my Excel days that there would sometimes be a need to keep just the 5 characters on the right (or the left), especially if I received data where a couple of columns had already been concatenated and now it needed to be undone. In R, indexes start at 1, meaning the first character in a string (or the first element in a vector) is at position 1 instead of 0. Many languages, including Python, start at 0. So here we go, looking at how you'll get the left and right characters from a string. First we'll get the original string returned, then we'll look at the right, then finally we'll do the same for the left.

### This will give me back the original string, because we're starting from the first letter and ending with the last letter
str_sub(column1, 1, -1)

### Here we'll get the 3 characters from the right.
### So this is similar to the "right" function in Excel.  
### We're telling the function to start at 3 characters from the end (because it's negative) and continue till the end of the string.
str_sub(column1, -3)

### The following would do the same, because the last element in the string is -1.
str_sub(column1, -3, -1)

###  Since the first input after the data is the "start" and the second is the "end", it's very easy to get any number of characters starting at the left of the string.
###  Here we're going from the first character to the third character.  So we'll have the first 3 characters of the string.
str_sub(column1, 1, 3)
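
If you miss the Excel function names, one option is to wrap str_sub in a few small helpers. This is just a sketch: excel_left, excel_right, and excel_mid are hypothetical names I made up, not functions that come with stringr.

```r
library(stringr)

###  Hypothetical helpers that mimic Excel's LEFT, RIGHT, and MID using str_sub
excel_left <- function(x, n) str_sub(x, 1, n)
excel_right <- function(x, n) str_sub(x, -n)
excel_mid <- function(x, start, n) str_sub(x, start, start + n - 1)

first_names <- c("Paul", "Kristen", "Susan", "Harold")

excel_left(first_names, 3)    # first 3 characters of each name
excel_right(first_names, 3)   # last 3 characters of each name
excel_mid(first_names, 2, 3)  # 3 characters starting at position 2
```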


Splitting a String in R:

When you have something like a column for the date that includes the full date, you might want to break that up into multiple columns; one for month, one for day, one for year, day of the week, etc. This was another task that I had previously done in Excel and now do in R. Any of these columns might be super useful in analysis, especially if you're doing time series modeling. For this we can use the separate function from tidyr (that we already loaded above). All we're doing here is passing the data to use and a vector of our desired column headings. By default, separate splits wherever it finds characters that aren't letters or numbers, which is why the comma, the space, and the slashes in our dates all act as separators.

###  Create our data
dates <- c("Tuesday, 9/6/2022", "Wednesday, 9/7/2022", "Thursday, 9/8/2022")

###  Make this into a dataframe for ease of use
dates <- data.frame(dates)

### The separate function will create columns at each separator starting from the left.  If I only gave
### two column names I would be returned just the day of week and the month (with a warning about the dropped pieces).
dates %>% separate(dates, c("day_of_week","month","day","year"))
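
If you'd rather control exactly where the split happens, separate also takes a "sep" argument (a regular expression) and a "convert" argument that turns columns that look like numbers into numbers. Here's a minimal sketch under those assumptions; dates_df, two_cols, and four_cols are just names I picked for illustration. Newer versions of tidyr also offer separate_wider_delim, which I haven't used here.

```r
library(tidyr)

dates_df <- data.frame(
  dates = c("Tuesday, 9/6/2022", "Wednesday, 9/7/2022", "Thursday, 9/8/2022")
)

###  Split only on the comma and space, so the date itself stays in one piece
two_cols <- separate(dates_df, dates, into = c("day_of_week", "date"), sep = ", ")
two_cols

###  Split on the comma-space or a slash, and convert the number-like columns to integers
four_cols <- separate(
  dates_df, dates,
  into = c("day_of_week", "month", "day", "year"),
  sep = ", |/",
  convert = TRUE
)
str(four_cols)
```

Having month, day, and year as numbers instead of text saves a step if you're going to sort or filter on them later.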

And there you have it. These were a couple of examples of string tasks I used to do as a wee analyst in Excel and would now perform in R. Hopefully there is a person out there who needs to perform these tasks and happens to stumble upon this article. If you're looking to learn R, the best classes I've found are from Business Science. This (affiliate) link has a 15% off coupon attached. Although it's possible to buy the courses separately, this bundle would bring you through using R for data science, all the way through advanced shiny and time series. The link to the Business Science courses is here. There is obviously so much more to working with strings than was explained here, but I wanted to just show a couple of very clear and easy to read examples. Thanks for reading and happy analyzing.
