# Problem

We have a few CSV files that have the same columns. We want to merge them together while retaining information about which file each row came from.

# Data

Let’s say the CSV files are file{1,2,3,4}.csv. This is file1.csv:

library(tidyverse)
read_csv("file1.csv")

The other three files are similar.

# Solution summary

This is the complete solution using tidyverse libraries:

library(tidyverse)
csv_names = paste0("file", 1:4, ".csv")
csv_names %>% map(read_csv) %>% set_names(csv_names) %>% bind_rows(.id = "file")

# Breaking it down

## Making the list of files

The first step is to get all the file names in a list. If your filenames have a regular pattern, you can construct that easily with paste. paste0 is a special case of paste with the joining delimiter set to "".

csv_names = paste0("file", 1:4, ".csv")
csv_names
[1] "file1.csv" "file2.csv" "file3.csv" "file4.csv"
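
As a quick check of the relationship between paste and paste0 described above, here is a minimal sketch. The variable names with_paste and with_paste0 are made up for illustration.

```r
# paste() joins its arguments with a space by default; sep = "" removes it.
with_paste = paste("file", 1:4, ".csv", sep = "")

# paste0() behaves like paste() with sep = "".
with_paste0 = paste0("file", 1:4, ".csv")

# Both calls build the same character vector.
identical(with_paste, with_paste0)
```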

## Import each of them

Now we want to import each of these files into a data frame (technically a tibble, since we’re using tidyverse). Fortunately, purrr has a map function that works much like the map function in JavaScript, which I found more intuitive than the more popular lapply family of functions in R.

csv_dfs = csv_names %>% map(read_csv)

length(csv_dfs)
[1] 4
csv_dfs[[3]]
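
If you don't have the files at hand, the same step can be tried on throwaway data. This is only a sketch: the temporary directory and the demo file names and contents are invented for illustration, and the lapply call is shown purely for comparison.

```r
library(tidyverse)

# Write two small throwaway CSV files so the example is self-contained.
tmp_dir = tempfile("csv_demo_")
dir.create(tmp_dir)
demo_names = file.path(tmp_dir, paste0("demo", 1:2, ".csv"))
write_csv(tibble(x = 1:2, y = c("a", "b")), demo_names[1])
write_csv(tibble(x = 3:4, y = c("c", "d")), demo_names[2])

# map() calls read_csv() once per file name and returns a list of tibbles.
demo_dfs = demo_names %>% map(read_csv)

# The base R counterpart does the same job.
demo_dfs_base = lapply(demo_names, read_csv)

# One list element per file.
length(demo_dfs)
```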

## Joining them together

The bind_rows function binds rows together, and has a handy .id argument that creates a new column linking each row back to the data frame it came from.

csv_dfs %>% bind_rows(.id="filename")

This is still not quite what we want: unless we name the elements in our list, bind_rows does not know what to put in the .id column, so it just uses a numeric sequence instead. Lists in R are almost like dictionaries in Python, except that some elements can have a value but no key. Right now our list has only values and no keys. Setting the names for all of them can be done in one shot using set_names.

csv_dfs = csv_dfs %>% set_names(csv_names)
names(csv_dfs)

[1] "file1.csv" "file2.csv" "file3.csv" "file4.csv"
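
To see the keys-versus-values distinction in isolation, here is a small sketch using toy lists. The names unnamed, partly_named, and fully_named are invented for illustration.

```r
library(tidyverse)

# A list with values only: names() returns NULL.
unnamed = list(10, 20)
names(unnamed)

# A list where only some elements have a name: the others get "".
partly_named = list(a = 10, 20)
names(partly_named)

# set_names() assigns every name in one call.
fully_named = unnamed %>% set_names(c("first", "second"))

# Elements can now be looked up by name.
fully_named$second
```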

Now bind_rows can use these names to set the values in the .id column:

csv_dfs %>% bind_rows(.id="filename")
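
The difference the names make can be reproduced without any files. The sketch below uses two tiny in-memory tibbles, and the names a.csv and b.csv are placeholders.

```r
library(tidyverse)

dfs = list(tibble(x = 1:2), tibble(x = 3:4))

# Unnamed list: the .id column holds the position of each data frame.
unnamed_result = dfs %>% bind_rows(.id = "filename")
unnamed_result$filename

# Named list: the .id column holds the names instead.
named_result = dfs %>%
  set_names(c("a.csv", "b.csv")) %>%
  bind_rows(.id = "filename")
named_result$filename
```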

We’re done!

## Modifying each csv before merging

You may want to modify each individual CSV before merging. For example, your original CSV files may have hundreds of columns when you only need a few, and for efficiency you would rather select columns before merging than after. This is easy to do by wrapping read_csv in your own function:


csv_extract = function(csv_name) {
  read_csv(csv_name) %>% select(col_a, col_b)  # placeholder column names; use your own
}
csv_names %>% map(csv_extract) %>% set_names(csv_names) %>% bind_rows(.id = "file")
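
One possible variant of such a wrapper, assuming readr 2.0 or later, passes col_select to read_csv so that the unwanted columns are dropped at read time. This is a sketch, not necessarily the approach used above: the function name csv_extract_cols, the demo file, and its column names are all hypothetical.

```r
library(tidyverse)

# Throwaway file with three columns, of which we keep two.
demo_file = tempfile(fileext = ".csv")
write_csv(
  tibble(id = 1:2, keep_me = c("a", "b"), drop_me = c(TRUE, FALSE)),
  demo_file
)

# Hypothetical wrapper: col_select restricts which columns read_csv() returns.
csv_extract_cols = function(csv_name) {
  read_csv(csv_name, col_select = c(id, keep_me))
}

# Only the selected columns come back.
names(csv_extract_cols(demo_file))
```

The wrapper can then stand in for read_csv inside map, exactly as in the pipeline above.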