In an earlier post, I used mclapply to kick off parallel R processes and to demonstrate inter-process synchronization via the flock package. Although I have been using this approach to parallelism for a few years now, I admit it has some important disadvantages: it works only on a single machine, and it doesn't work on Windows at all.
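For context, the gist of that mclapply-based approach looked roughly like this (a sketch from memory rather than the exact code from that post; the lock path is a placeholder):

# Fork-based parallelism: Unix-only, single machine
library(parallel)
library(flock)

res <- mclapply(1:1000, function(val) {
    # Serialize access to the shared resource across the forked processes
    ll <- lock("/tmp/file.lock")   # placeholder path
    # ... the critical section: write val to a shared file or database ...
    unlock(ll)
    val
}, mc.cores = 4)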
Hence, to test the flock package on Windows, I had to resort to an alternative implementation based on the cluster functions of the parallel package. I liked it quite a bit and learned a few things along the way, so here it is.
require(RSQLite)
require(parallel)
require(flock)

db.path = "C:/ttt.sqlite"
lock.name = "C:/file.lock"

# Create the database with a single "test" table holding one row
con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
df <- data.frame(value=0)
dbWriteTable(con, "test", df, overwrite=TRUE)
dbDisconnect(con)

write.one.value <- function(val) {
    # Take an exclusive lock
    ll = lock(lock.name)
    # The critical section code
    con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
    dbWriteTable(con, "test", data.frame(value=val), append=TRUE)
    dbDisconnect(con)
    # Release the lock
    unlock(ll)
}

write.values = function(cores, db.path, lock.name) {
    if(cores > 1) {
        cl = makeCluster(cores)
        # Load the packages into all slave processes
        clusterEvalQ(cl=cl, library(RSQLite))
        clusterEvalQ(cl=cl, library(flock))
        # Make variables visible in the work-horse function
        # (clusterExport reads them from the global environment by default)
        clusterExport(cl, c("db.path", "lock.name"))
        tt = parSapply(cl=cl, 1:1000, write.one.value)
        stopCluster(cl)
    } else {
        # Rebind the closure environment so that write.one.value sees the
        # db.path and lock.name arguments of this function
        environment(write.one.value) = environment()
        tt = lapply(1:1000, write.one.value)
    }
}

write.values(1, db.path, lock.name)
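To convince yourself that all the writes made it in, a quick check (my addition, not part of the original script) is to count the rows afterwards: the initial row plus the 1000 appended ones.

# Read the table back and count the rows: expect 1001
con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
dbGetQuery(con, "SELECT COUNT(*) AS n FROM test")
dbDisconnect(con)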
The interesting piece is write.values. First, it shows a neat R feature, namely, how to make variables visible in the callee without passing everything as arguments. It also shows how to branch between single-core, single-machine execution (the lapply branch) and parallel execution (yes, via parSapply one can spawn processes even across multiple machines, depending on how the cluster is set up). Why is that important? The errors returned from the single-core (lapply) version are often clearer and more meaningful than those returned from the parallel version. In other words, until things are stable and the bugs are out, the single-core version is essential.
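The variable-visibility trick deserves a tiny standalone illustration (a toy example of mine, not from the script above): reassigning a function's enclosing environment makes the caller's local variables visible inside it.

# x is a free variable in f: it is looked up in f's enclosing environment
f <- function() x + 1
g <- function() {
    x <- 41
    # Rebind f's enclosing environment to g's local frame,
    # just as write.values() does for write.one.value in the lapply branch
    environment(f) <- environment()
    f()   # returns 42
}
g()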
Hi, thanks for sharing your approach!
I have a silly question, but… isn't connecting to and disconnecting from the DB for each write introducing too much overhead?
In general, it could be. In my case, not at all: the time is dominated by the computations I do on each step. Also, keep in mind that SQLite is not a database server, so the overhead of connecting/disconnecting is low.
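If you want to measure that overhead yourself, a quick way (my sketch, reusing the db.path from above) is to time bare connect/disconnect cycles:

# Time 100 connect/disconnect round trips against the SQLite file
system.time(for (i in 1:100) {
    con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
    dbDisconnect(con)
})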