I’m still working with that 6TB of data, and likely will be for a long time. The data are divided up by time: each file covers the full spatial extent for a few time periods. The research group would also like time series for particular locations, but I can’t load all the data into memory at once. My current approach is to load a dataset, then save it as smaller spatial chunks (expect more on file formats and save/load options later). The chunks are small enough in spatial extent that I can open them all and assemble a time series.
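The chunking step looks roughly like the sketch below. This is not my real code: the chunk size, the file naming, and the assumption that each full-extent file holds a single locations-by-time matrix are all made up for illustration. Each chunk is saved as an object called thissave so it can be loaded back by that name later.

chunkfn <- function(infile, nchunkrows=1000) {
    # load() returns the name of what it loaded; this assumes one
    # locations-by-time matrix per file
    fullgrid <- get(load(infile))

    # split the rows (locations) into groups of nchunkrows
    chunkid <- (seq_len(nrow(fullgrid)) - 1) %/% nchunkrows
    for(thischunk in unique(chunkid)) {
        thissave <- fullgrid[chunkid == thischunk, , drop=FALSE]
        # hypothetical naming: chunk id plus the original file name
        save(thissave, file=paste0("chunk", thischunk, "_", infile))
    }
    invisible()
}

Because all of this happens inside a function, the full-extent matrix disappears as soon as the function returns.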
Looping over a large number of files in R, doing things with them, then writing them out again can lead to slow memory leaks, even if the files are overwritten. Hadley Wickham talks about memory management in R in Advanced R. I spent some time poking around with the pryr package, just out of curiosity, but there’s an easier solution: stick all the heavy lifting into a function. As long as the function doesn’t return something that includes its environment, the memory is freed upon exit.
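Here’s a quick way to see the difference; the function names and object sizes are just for demonstration. mem_used() from pryr runs a garbage collection and reports how much memory R is holding.

library(pryr)

# returning a closure captures the function's environment,
# so the big local object survives the call
leaky <- function() {
    big <- rnorm(1e7)       # roughly 80 MB
    function(i) big[i]      # `big` stays alive as long as this closure does
}

# returning a plain value lets `big` be garbage-collected after the call
tidy <- function() {
    big <- rnorm(1e7)
    big[1]
}

mem_used()                  # baseline
f <- leaky(); mem_used()    # stays about 80 MB higher
x <- tidy();  mem_used()    # back near the baseline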
All the file handling (reading and writing) goes into the function.
loadfn <- function(filepatt, outname) {
    # list all the files matching the specified pattern
    filelist <- list.files(pattern=filepatt)
    fulldata <- vector(length(filelist), mode="list")
    for(thisfileno in seq_along(filelist)) {
        load(filelist[thisfileno]) # loads as thissave
        fulldata[[thisfileno]] <- thissave
    }
    fulldata <- do.call("cbind", fulldata)

    # do some other processing

    # rename the object and save it
    assign(outname, fulldata)
    save(list=c(outname), file=paste0(outname, ".RDA"))
    invisible()
}
Then the function is called for the full list of possible patterns.
for(thispatt in filepatterns) {
    loadfn(thispatt, paste(thispatt, "series", sep="."))
}
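Here filepatterns is assumed to be a character vector with one pattern per spatial chunk; if the chunk files were named something like chunk0_…, chunk1_…, and so on, it could be built with something like

filepatterns <- paste0("chunk", 0:99, "_")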
No clean-up, no memory leaks. The operating system no longer kills my process every time I leave it.