Files and memory

I’m still working with that 6TB of data, and likely will be for a long time. The data are divided up by time: each file holds the full spatial extent for a few time periods. The research group would also like to have time series for a particular location, but I can’t load all the data into memory. My current approach is to load a dataset, then save it as smaller chunks (expect more on file formats and save/load options later). The chunks are small enough in spatial extent that I can open them all and assemble a time series.
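The chunking step itself is nothing fancy. A minimal sketch, with invented sizes, object names, and file names (the real values come from the loaded dataset, not rnorm), looks something like this:


# one time period's values, one per grid cell
fullgrid <- rnorm(1e6)

# split the cell indices into ten contiguous blocks
blockidx <- split(seq_along(fullgrid),
                  cut(seq_along(fullgrid), breaks=10, labels=FALSE))

# save each block to its own file, under the object name the
# assembly function below expects
for(thisblock in seq_along(blockidx)) {
  thissave <- fullgrid[blockidx[[thisblock]]]
  save(thissave, file=sprintf("time001.block%02d.RDA", thisblock))
}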

Looping over a large number of files in R, doing things with them, then writing them out again can lead to slow memory leaks, even if files are overwritten. Hadley Wickham talks about memory management in R in Advanced R. I spent some time poking around with the pryr package, just out of curiosity, but there’s an easier solution: stick all the heavy lifting into a function. As long as the function doesn’t return something that includes its environment, the memory is freed upon exit.
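To see the effect, pryr’s mem_change() gives a quick check. A minimal sketch, with toy allocation sizes rather than the real data:


library(pryr)

bigfn <- function() {
  x <- rnorm(1e7)  # roughly 80 MB allocated inside the function
  mean(x)          # return a scalar, nothing that captures the environment
}

mem_change(x <- rnorm(1e7))  # about +80 MB: the object stays in the workspace
mem_change(rm(x))            # about -80 MB once it is removed
mem_change(bigfn())          # close to zero: the big vector is freed on exit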

All the file handling (reading and writing) goes into the function.


loadfn <- function(filepatt, outname) {

  # list all the files matching the specified pattern
  filelist <- list.files(pattern=filepatt)

  # one list element per file
  fulldata <- vector(mode="list", length=length(filelist))

  for(thisfileno in seq_along(filelist)) {
    load(filelist[thisfileno]) # loads an object named thissave
    fulldata[[thisfileno]] <- thissave
  }

  # combine the chunks column-wise
  fulldata <- do.call("cbind", fulldata)

  # do some other processing

  # rename the object and save it
  assign(outname, fulldata)
  save(list=outname, file=paste0(outname, ".RDA"))

  # return nothing, so no data and no environment get kept
  invisible()
}

Then the function is called once for each of the possible patterns.


for(thispatt in filepatterns) {
  loadfn(thispatt, paste(thispatt, "series", sep="."))
}
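Here filepatterns is whatever identifies each spatial chunk in the file names; with the hypothetical block naming from the earlier sketch, it would be something like:


filepatterns <- sprintf("block%02d", 1:10)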

No clean-up, no memory leaks. The operating system no longer kills my process every time I leave it running.