Problem
We have a few CSV files that have the same columns. We want to merge them together while retaining information about which file each row came from.
Data
Let’s say the CSV files are file{1,2,3,4}.csv
. This is file1.csv
:
library(tidyverse)
read_csv("file1.csv")
The other three files are similar.
Solution summary
This is the complete solution using tidyverse libraries:
library(tidyverse)
csv_names = paste0("file", 1:4, ".csv")
csv_names %>% map(read_csv) %>% set_names(csv_names) %>% bind_rows(.id = "file")
Breaking it down
Making the list of files
The first step is to get all the file names in a list. If your filenames have a regular pattern, you can construct that easily with paste
. paste0
is a special case of paste with the joining delimiter set to ""
.
csv_names = paste0("file", 1:4, ".csv")
csv_names
[1] "file1.csv" "file2.csv" "file3.csv" "file4.csv"
Import each of them
Now we want to import each of these files into a data frame (technically a tibble since we’re using tidyverse). Fortunately purr
has a map function that works very similarly to map function in Javascript, which I therefore found much more intuitive to use than the more popular lapply
variety of functions in R.
csv_dfs = csv_names %>% map(read_csv)
length(csv_dfs)
[1] 4
csv_dfs[[3]]
Joining them together
The bind_rows
function binds rows together, and has a handy .id
argument that creates a new column holding data that links back the rows to the data frames they came from.
csv_dfs %>% bind_rows(.id="filename")
This is still not quite exactly what we want, because unless we name the elements in our list, bind_rows does not does not know what to put in the .id
column, so it just uses a numeric sequence instead. lists
in R are almost like dictionaries in python, but some elements only has a value and no key. Right now our list has only values and no keys. Setting the name for all of them can be done in one shot using set_names
.
names(csv_dfs)
[1] "file1.csv" "file2.csv" "file3.csv" "file4.csv"
Now bind_rows
can use these names to set the values in the .id
column:
csv_dfs %>% bind_rows(.id="filename")
We’re done!
Modifying each csv before merging
You may want to modify each individual CSV before merging. For example, your original csv files may have 100s of columns but you only want to keep a few, and you don’t want to merge first and then select for efficiency. It is easy to do so by creating a wrapper function around read_csv
:
csv_extract = function(csv_name) {
read_csv(csv_name) %>%
select(Red, Blue, Green) # Dropping Yellow because we're not interested.
}
csv_names = paste0("file", 1:4, ".csv")
csv_names %>% map(csv_extract) %>% set_names(csv_names) %>% bind_rows(.id = "file")
LS0tCnRpdGxlOiAiTWVyZ2luZyBDU1YgZmlsZXMiCm91dHB1dDoKICBodG1sX25vdGVib29rOgogICAgY29kZV9mb2xkaW5nOiBzaG93CiAgICB0b2M6IHllcwogICAgdGhlbWU6IGNlcnVsZWFuCiAgaHRtbF9kb2N1bWVudDoKICAgIGRmX3ByaW50OiBwYWdlZAogICAgdG9jOiB5ZXMKICBwZGZfZG9jdW1lbnQ6IGRlZmF1bHQKLS0tCgo8c3R5bGUgdHlwZT0idGV4dC9jc3MiPgoKYm9keXsgLyogTm9ybWFsICAqLwogICAgICBmb250LXNpemU6IDE0cHg7CiAgfQp0ZCB7ICAvKiBUYWJsZSAgKi8KICBmb250LXNpemU6IDE0cHg7Cn0KaDEudGl0bGUgewogIGZvbnQtc2l6ZTogMzhweDsKfQpoMSB7IC8qIEhlYWRlciAxICovCiAgZm9udC1zaXplOiAyOHB4Owp9CmgyIHsgLyogSGVhZGVyIDIgKi8KICAgIGZvbnQtc2l6ZTogMjJweDsKfQpoMyB7IC8qIEhlYWRlciAzICovCiAgZm9udC1zaXplOiAxOHB4Owp9CmNvZGUucnsgLyogQ29kZSBibG9jayAqLwogICAgZm9udC1zaXplOiAxNHB4Owp9CnByZSB7IC8qIENvZGUgYmxvY2sgLSBkZXRlcm1pbmVzIGNvZGUgc3BhY2luZyBiZXR3ZWVuIGxpbmVzICovCiAgICBmb250LXNpemU6IDE0cHg7Cn0KPC9zdHlsZT4KCiMgUHJvYmxlbQpXZSBoYXZlIGEgZmV3IENTViBmaWxlcyB0aGF0IGhhdmUgdGhlIHNhbWUgY29sdW1ucy4gV2Ugd2FudCB0byBtZXJnZSB0aGVtIHRvZ2V0aGVyIHdoaWxlIHJldGFpbmluZyBpbmZvcm1hdGlvbiBhYm91dCB3aGljaCBmaWxlIGVhY2ggcm93IGNhbWUgZnJvbS4gCgojIERhdGEKTGV0J3Mgc2F5IHRoZSBDU1YgZmlsZXMgYXJlIGBmaWxlezEsMiwzLDR9LmNzdmAuIFRoaXMgaXMgYGZpbGUxLmNzdmA6CgpgYGB7ciBtZXNzYWdlPUZBTFNFfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKcmVhZF9jc3YoImZpbGUxLmNzdiIpCmBgYApUaGUgb3RoZXIgdGhyZWUgZmlsZXMgYXJlIHNpbWlsYXIuIAoKIyBTb2x1dGlvbiBzdW1tYXJ5CgpUaGlzIGlzIHRoZSBjb21wbGV0ZSBzb2x1dGlvbiB1c2luZyB0aWR5dmVyc2UgbGlicmFyaWVzOgoKYGBge3IgbWVzc2FnZT1GQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmNzdl9uYW1lcyA9IHBhc3RlMCgiZmlsZSIsIDE6NCwgIi5jc3YiKSAKY3N2X25hbWVzICU+JSBtYXAocmVhZF9jc3YpICU+JSBzZXRfbmFtZXMoY3N2X25hbWVzKSAlPiUgYmluZF9yb3dzKC5pZCA9ICJmaWxlIikKYGBgCgojIEJyZWFraW5nIGl0IGRvd24gCgojIyBNYWtpbmcgdGhlIGxpc3Qgb2YgZmlsZXMgCgpUaGUgZmlyc3Qgc3RlcCBpcyB0byBnZXQgYWxsIHRoZSBmaWxlIG5hbWVzIGluIGEgbGlzdC4gSWYgeW91ciBmaWxlbmFtZXMgaGF2ZSBhIHJlZ3VsYXIgcGF0dGVybiwgeW91IGNhbiBjb25zdHJ1Y3QgdGhhdCBlYXNpbHkgd2l0aCBgcGFzdGVgLiBgcGFzdGUwYCBpcyBhIHNwZWNpYWwgY2FzZSBvZiBwYXN0ZSB3aXRoIHRoZSBqb2luaW5nIGRlbGltaXRlciBzZXQgdG8gYCIiYC4gCgpgYGB7cn0KY3N2X25hbWVzID0gcGFzdGUwKCJmaWxlIiwgMTo0LCAiLmNzdiIpIApjc3ZfbmFtZXMKYGBgCgojIyBJbXBvcnQgZWFjaCBvZiB0aGVtIAoKTm93IHdlIHdhbnQgdG8gaW1wb3J0IGVhY2ggb2YgdGhlc2UgZmlsZXMgaW50byBhIGRhdGEgZnJhbWUgKHRlY2huaWNhbGx5IGEgdGliYmxlIHNpbmNlIHdlJ3JlIHVzaW5nIHRpZHl2ZXJzZSkuIEZvcnR1bmF0ZWx5IGBwdXJyYCBoYXMgYSBtYXAgZnVuY3Rpb24gdGhhdCB3b3JrcyB2ZXJ5IHNpbWlsYXJseSB0byBtYXAgZnVuY3Rpb24gaW4gSmF2YXNjcmlwdCwgd2hpY2ggSSB0aGVyZWZvcmUgZm91bmQgbXVjaCBtb3JlIGludHVpdGl2ZSB0byB1c2UgdGhhbiB0aGUgbW9yZSBwb3B1bGFyIGBsYXBwbHlgIHZhcmlldHkgb2YgZnVuY3Rpb25zIGluIFIuICAKCmBgYHtyIG1lc3NhZ2U9RkFMU0V9CmNzdl9kZnMgPSBjc3ZfbmFtZXMgJT4lIG1hcChyZWFkX2NzdikKCmxlbmd0aChjc3ZfZGZzKQpjc3ZfZGZzW1szXV0KYGBgCgojIyBKb2luaW5nIHRoZW0gdG9nZXRoZXIKClRoZSBbYGJpbmRfcm93c2BdKGh0dHBzOi8vZHBseXIudGlkeXZlcnNlLm9yZy9yZWZlcmVuY2UvYmluZC5odG1sKSBmdW5jdGlvbiBiaW5kcyByb3dzIHRvZ2V0aGVyLCBhbmQgaGFzIGEgaGFuZHkgYC5pZGAgYXJndW1lbnQgdGhhdCBjcmVhdGVzIGEgbmV3IGNvbHVtbiBob2xkaW5nIGRhdGEgdGhhdCBsaW5rcyBiYWNrIHRoZSByb3dzIHRvIHRoZSBkYXRhIGZyYW1lcyB0aGV5IGNhbWUgZnJvbS4gCgpgYGB7ciByb3dzLnByaW50PTh9CmNzdl9kZnMgJT4lIGJpbmRfcm93cyguaWQ9ImZpbGVuYW1lIikKYGBgCgpUaGlzIGlzIHN0aWxsIG5vdCBxdWl0ZSBleGFjdGx5IHdoYXQgd2Ugd2FudCwgYmVjYXVzZSB1bmxlc3Mgd2UgbmFtZSB0aGUgZWxlbWVudHMgaW4gb3VyIGxpc3QsIGJpbmRfcm93cyBkb2VzIG5vdCBkb2VzIG5vdCBrbm93IHdoYXQgdG8gcHV0IGluIHRoZSBgLmlkYCBjb2x1bW4sIHNvIGl0IGp1c3QgdXNlcyBhIG51bWVyaWMgc2VxdWVuY2UgaW5zdGVhZC4gYGxpc3RzYCBpbiBSIGFyZSBhbG1vc3QgbGlrZSBkaWN0aW9uYXJpZXMgaW4gcHl0aG9uLCBidXQgc29tZSBlbGVtZW50cyBvbmx5IGhhcyBhIHZhbHVlIGFuZCBubyBrZXkuIFJpZ2h0IG5vdyBvdXIgbGlzdCBoYXMgb25seSB2YWx1ZXMgYW5kIG5vIGtleXMuIFNldHRpbmcgdGhlIG5hbWUgZm9yIGFsbCBvZiB0aGVtIGNhbiBiZSBkb25lIGluIG9uZSBzaG90IHVzaW5nIGBzZXRfbmFtZXNgLgoKYGBge3IgbWVzc2FnZT1GQUxTRX0KY3N2X2RmcyA9IGNzdl9uYW1lcyAlPiUgbWFwKHJlYWRfY3N2KSAlPiUgc2V0X25hbWVzKGNzdl9uYW1lcykKCm5hbWVzKGNzdl9kZnMpCmNzdl9kZnMkZmlsZTMuY3N2CmBgYAoKTm93IGBiaW5kX3Jvd3NgIGNhbiB1c2UgdGhlc2UgbmFtZXMgdG8gc2V0IHRoZSB2YWx1ZXMgaW4gdGhlIGAuaWRgIGNvbHVtbjogCgpgYGB7cn0KY3N2X2RmcyAlPiUgYmluZF9yb3dzKC5pZD0iZmlsZW5hbWUiKQpgYGAKCgpXZSdyZSBkb25lISAKCiMjICBNb2RpZnlpbmcgZWFjaCBjc3YgYmVmb3JlIG1lcmdpbmcKCllvdSBtYXkgd2FudCB0byBtb2RpZnkgZWFjaCBpbmRpdmlkdWFsIENTViBiZWZvcmUgbWVyZ2luZy4gRm9yIGV4YW1wbGUsIHlvdXIgb3JpZ2luYWwgY3N2IGZpbGVzIG1heSBoYXZlIDEwMHMgb2YgY29sdW1ucyBidXQgeW91IG9ubHkgd2FudCB0byBrZWVwIGEgZmV3LCBhbmQgeW91IGRvbid0IHdhbnQgdG8gbWVyZ2UgZmlyc3QgYW5kIHRoZW4gc2VsZWN0IGZvciBlZmZpY2llbmN5LiBJdCBpcyBlYXN5IHRvIGRvIHNvIGJ5IGNyZWF0aW5nIGEgd3JhcHBlciBmdW5jdGlvbiBhcm91bmQgYHJlYWRfY3N2YDogCgpgYGB7ciBtZXNzYWdlPUZBTFNFfQoKY3N2X2V4dHJhY3QgPSBmdW5jdGlvbihjc3ZfbmFtZSkgewogIHJlYWRfY3N2KGNzdl9uYW1lKSAlPiUgCiAgICBzZWxlY3QoUmVkLCBCbHVlLCBHcmVlbikgICMgRHJvcHBpbmcgWWVsbG93IGJlY2F1c2Ugd2UncmUgbm90IGludGVyZXN0ZWQuCn0KY3N2X25hbWVzID0gcGFzdGUwKCJmaWxlIiwgMTo0LCAiLmNzdiIpIApjc3ZfbmFtZXMgJT4lIG1hcChjc3ZfZXh0cmFjdCkgJT4lIHNldF9uYW1lcyhjc3ZfbmFtZXMpICU+JSBiaW5kX3Jvd3MoLmlkID0gImZpbGUiKQpgYGAK