I have been working with huge samples recently. When you work with large samples, memory leaks are a common problem. I have been relying heavily on the garbage collector (`gc()`), but it does not help much, so you need to write your code efficiently.
No matter what you do, there are limits, and I have been struggling with one of them. I resolved that problem using callr.
I often prefer to learn packages like this from examples, but I could not find any, so I am sharing one here for anyone who may need it.
```r
# Run the heavy job in a separate R process; all the memory it uses is
# released back to the OS when that process exits.
p = callr::r_bg(func = function(n) {
  library(dplyr)
  library(data.table)
  # Build a large data frame and write it out twice.
  df = data.frame(matrix(data = rep(1, n), ncol = 10))
  write.csv(x = df, file = 'xxxxx.csv')
  write.csv(x = data.frame(matrix(data = rep(1, n), ncol = 10)), file = 'xxxxx2.csv')
  # A dplyr pipeline plus a data.table read and write.
  df2 = df %>% summarise(x = sum(X1))
  df3 = fread('xxxxx.csv')
  fwrite(df3, file = 'x.csv')
}, args = list(n = 10000000))

p$wait()            # block until the background process is done
print('finished')
```
I use the `r_bg` function here to run this process in the background. There are some features of the example above that I needed in my real work:
- It is possible to load libraries inside the function.
- You do not need to do that, though. You can simply call functions with their namespace, for example `data.table::fread()`. I loaded the libraries inside the function because I wanted to use the pipe (`%>%`) from the dplyr package. A namespaced variant is shown in the first sketch after this list.
- You can pass arguments to the function easily, which means, for instance, that you can run this code inside a for loop (see the second sketch after this list).
- You can wait for the background process to finish.
- You can also run multiple background processes at once (see the last sketch after this list). See here.
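To make the namespaced-call point concrete, here is a minimal sketch of the same kind of job with no `library()` calls at all. Because `%>%` is an infix operator, it is awkward to use without attaching a package (which is exactly why I attached dplyr above), so this sketch sticks to base R plus namespaced data.table calls; the small `n` and the file name are just for illustration:

```r
# A sketch: the same kind of job with namespaced calls instead of library().
p = callr::r_bg(func = function(n) {
  df = data.frame(matrix(data = rep(1, n), ncol = 10))
  data.table::fwrite(df, file = 'xxxxx.csv')   # no library(data.table) needed
  df3 = data.table::fread('xxxxx.csv')
  sum(df3$X1)                                  # returned to the parent process
}, args = list(n = 1000))
p$wait()
p$get_result()  # fetch the function's return value from the background R
```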
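Here is what I mean by running it inside a for loop: a sketch that launches the job once per sample size, waits, and prints the result. The sample sizes are made up; the point is that each background R process releases all of its memory when it exits:

```r
# A sketch: one background process per sample size, run sequentially.
for (n in c(1e4, 1e5, 1e6)) {
  p = callr::r_bg(func = function(n) {
    df = data.frame(matrix(data = rep(1, n), ncol = 10))
    nrow(df)  # stand-in for the real computation
  }, args = list(n = n))
  p$wait()    # memory used by the job is freed when the process exits
  print(p$get_result())
}
```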
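And a minimal sketch of several background processes at once: start them all, then wait for each one. The `Sys.sleep()` call is a placeholder for real work:

```r
# A sketch: launch three workers in parallel, then collect their results.
procs = lapply(1:3, function(i) {
  callr::r_bg(func = function(i) {
    Sys.sleep(i)                    # placeholder for a real job
    paste('worker', i, 'finished')
  }, args = list(i = i))
})
for (p in procs) p$wait()           # block until every worker is done
lapply(procs, function(p) p$get_result())
```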