I’m still working with that 6TB of data, and likely will be for a long time. The data are divided up by time: each file covers the full spatial extent for a few time periods. The research group would also like to have time series for a particular location, but I can’t load all the data into memory. My current approach is to load one dataset at a time and save it as smaller spatial chunks (expect more on file formats and save/load options later). The chunks are small enough in spatial extent that I can open them all and assemble a time series.
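Roughly, the chunking step looks like the sketch below. The details (object names, file names, a matrix with one row per spatial location) are placeholders rather than my actual layout.
chunkfn <- function(infile, nchunks=100) {
    load(infile) # assume this loads one time period as a big matrix named alldata
    # assign each row (location) to one of nchunks spatial groups
    rowgroup <- cut(seq_len(nrow(alldata)), nchunks, labels=FALSE)
    for(thischunk in seq_len(nchunks)) {
        thissave <- alldata[rowgroup == thischunk, , drop=FALSE]
        save(thissave, file=sprintf("chunk%03d.%s", thischunk, infile))
    }
    invisible()
}
Each chunk file then holds one spatial piece of one time period, saved under the name thissave so it can be reassembled later.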
Looping over a large number of files in R, doing things with them, then writing them out again can lead to slow memory leaks, even if the files are overwritten. Hadley Wickham talks about memory management in R in Advanced R. I spent some time poking around with the pryr package, just out of curiosity, but there’s an easier solution: stick all the heavy lifting into a function. As long as the function doesn’t return something that includes its environment, the memory is freed when the function exits.
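Here is a toy illustration of that point using pryr::mem_change() (assuming pryr is installed; the functions and sizes are made up). A function that allocates a big object and returns nothing releases it on exit, while a function that returns a closure keeps its whole environment, big object included, alive.
library(pryr)
transient <- function() {
    x <- rnorm(1e7) # roughly 80 MB allocated inside the function
    invisible()
}
leaky <- function() {
    x <- rnorm(1e7)
    function() length(x) # the returned closure drags the environment containing x along
}
mem_change(transient()) # net change is roughly zero
mem_change(f <- leaky()) # roughly 80 MB stays allocated for as long as f exists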
In my case, all the file handling (reading and writing) goes into the function.
loadfn <- function(filepatt, outname) {
    # list all the files matching the specified pattern
    filelist <- list.files(pattern=filepatt)
    fulldata <- vector(mode="list", length=length(filelist))
    for(thisfileno in seq_along(filelist)) {
        load(filelist[thisfileno]) # loads as thissave
        fulldata[[thisfileno]] <- thissave
    }
    # combine the pieces into a single object
    fulldata <- do.call("cbind", fulldata)
    # do some other processing
    # rename the object and save it; save() looks up names in the calling
    # frame by default, so it finds the object created by assign()
    assign(outname, fulldata)
    save(list=c(outname), file=paste0(outname, ".RDA"))
    invisible()
}
Then the function is called for the full list of possible patterns.
for(thispatt in filepatterns) {
    loadfn(thispatt, paste(thispatt, "series", sep="."))
}
No clean-up, no memory leaks. The operating system no longer kills my process every time I leave it running.
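And pulling one of the assembled series back in afterwards is just another load(); the chunk name here is hypothetical.
load("chunk042.series.RDA") # creates an object named chunk042.series
str(chunk042.series)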