vignettes/driverdrake_vignette.Rmd
The function driver_drake(), located in R/driver.R, runs the GCAM data system, like driver(). However, unlike driver(), driver_drake() skips steps that are already up-to-date, saving time. The central function of drake is drake::make(), which builds the data system. drake's make() is more sophisticated than standard GNU make-based systems because it checks whether the content has substantively changed, rather than only whether a file has been modified, and likewise reruns only the subsequent steps that are actually affected by the latest changes since the previous make().
In this vignette we will go through concrete examples that highlight these benefits. In addition, we will give examples of common tasks when working with drake and provide links to further documentation.
Using driver_drake() significantly speeds up making changes after the initial data system build. However, due to the additional overhead of caching results, the initial data system build may be slower than with regular driver() (see Parallel Computing). On a local Windows machine, the initial run of driver_drake() took 25 minutes and 11 seconds, while the initial run of driver() took 17 minutes and 4 seconds. However, as an example, after editing a single input file, A10.TechChange.csv, driver_drake() updated 44 targets and took 1 minute and 7 seconds to run. With driver(), the full data system would have to be rerun.
The drake package has many features that can be used with gcamdata that are not discussed here. drake's documentation is very good and includes many helpful resources. To learn more about what drake can do and how it works, The drake R Package User Manual is a good place to start.
drake's cache
When running make(), drake stores your targets in a hidden cache, named .drake by default, in your current working directory. Typically a user does not need to manipulate this cache directly, but in some cases they may wish to. For recovery purposes, drake keeps all targets from all runs of make(). To delete this cache and rerun the data system from scratch, you can safely delete the .drake folder. Sometimes when a chunk errors during processing you may be left with a "locked" cache. If the cache is locked, you can force-unlock it with drake::drake_cache()$unlock().
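For reference, a few cache-management commands like the following can be used (a sketch, assuming a .drake cache exists in the working directory; see ?drake::clean for details):

```r
library(drake)

# List the names of the targets currently stored in the cache
cached()

# Remove data no longer associated with any target and reclaim disk space
clean(garbage_collection = TRUE)

# Force-unlock the cache after an interrupted run
drake_cache()$unlock()
```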
Let's explore driver_drake() and see how it can help us run the data system. We'll start by doing an initial run, which will create and store output in the drake cache and create all of our xml files. We run this just as we would driver().
# Load package and run driver_drake, output messages are hidden
devtools::load_all()
driver_drake()
On Windows we may run into the MAX_PATH file path limit after the package is installed. If you get the following error, make sure you load the package with devtools::load_all().
Error in file.rename() :
expanded 'to' name too long
Next, we'll explore two different edits of a chunk and see how driver_drake() responds.
First, let's edit the input A61.globaltech_cost.csv by changing a cost value.
# Copy the file so we can get it back later
example_file <- find_csv_file("energy/A61.globaltech_cost", FALSE)[[1]]
#> Found ../inst/extdata/energy/A61.globaltech_cost.csv
file.copy(from = example_file, to = paste0(example_file, ".bak"))
#> [1] TRUE
# Change one value in file, then rewrite to same path
tmp <- readr::read_lines(example_file)
tmp[9] <- sub("211", "200", tmp[9])
readr::write_lines(tmp, example_file)
# Load and run driver_drake(). Print run time.
devtools::load_all(".")
#> i Loading gcamdata
t1 <- Sys.time()
driver_drake()
#> GCAM Data System v5.1
#> Found 420 chunks
#> Found 4267 chunk data requirements
#> Found 2416 chunk data products
#> 1452 chunk data input(s) not accounted for
#> Warning: missing file_in() files:
#> R/constants.R
#> Warning: Do not run make() from a subdirectory of your project.
#> running make() from: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\vignettes
#> drake project root: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata
#> cache directory: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\.drake
#> > target energy.A61.globaltech_cost
#> > target module_energy_L261.Cstorage
#> > target L261.RsrcCurves_C_low
#> > target L261.RsrcCurves_C_high
#> > target L261.SubsectorShrwtFllt_C
#> > target L261.Rsrc
#> > target L261.StubTech_C
#> > target L261.GlobalTechCoef_C
#> > target L261.SubsectorLogit_C
#> > target L261.Supplysector_C
#> > target L261.GlobalTechShrwt_C_nooffshore
#> > target L261.GlobalTechCost_C_High
#> > target L261.RsrcCurves_C_lowest
#> > target L261.GlobalTechShrwt_C
#> > target L261.GlobalTechCost_C
#> > target L261.RsrcCurves_C
#> > target L261.UnlimitRsrc
#> > target L261.ResTechShrwt_C
#> > target module_energy_batch_Cstorage_xml
#> > target Cstorage.xml
#> > target xml.Cstorage.xml
#> All done.
print(Sys.time()-t1)
#> Time difference of 4.300751 mins
As expected, driver_drake() runs all dependencies of A61.globaltech_cost.csv, since they need to be updated to the new cost value.
Next, we will append another column to the end of that same file, but this time with a year outside of GCAM's model years, so it will get filtered out and should have no further effect.
# Add a column with a year that will be filtered out
tmp[6] <- paste0(trimws(tmp[6]), "i") # tell gcamdata that there will be another integer column
tmp[8] <- paste0(tmp[8], ",2200") # year = 2200
tmp[9] <- paste0(tmp[9], ",211") # value = 211
readr::write_lines(tmp, example_file)
# Load and run driver_drake(). Print run time.
devtools::load_all(".")
#> i Loading gcamdata
t1 <- Sys.time()
driver_drake()
#> GCAM Data System v5.1
#> Found 420 chunks
#> Found 4267 chunk data requirements
#> Found 2416 chunk data products
#> 1452 chunk data input(s) not accounted for
#> Warning: missing file_in() files:
#> R/constants.R
#> Warning: Do not run make() from a subdirectory of your project.
#> running make() from: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\vignettes
#> drake project root: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata
#> cache directory: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\.drake
#> > target energy.A61.globaltech_cost
#> > target module_energy_L261.Cstorage
#> All done.
print(Sys.time()-t1)
#> Time difference of 1.682541 mins
This time, driver_drake() only ran energy.A61.globaltech_cost and its R chunk, since the change affects nothing downstream.
Let's get our original file back and run driver_drake().
# Finally, clean up the changes from this example
file.rename(paste0(example_file, ".bak"), example_file)
#> [1] TRUE
driver_drake()
#> GCAM Data System v5.1
#> Found 420 chunks
#> Found 4267 chunk data requirements
#> Found 2416 chunk data products
#> 1452 chunk data input(s) not accounted for
#> Warning: missing file_in() files:
#> R/constants.R
#> Warning: Do not run make() from a subdirectory of your project.
#> running make() from: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\vignettes
#> drake project root: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata
#> cache directory: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\.drake
#> > target energy.A61.globaltech_cost
#> > target module_energy_L261.Cstorage
#> > target L261.RsrcCurves_C_low
#> > target L261.RsrcCurves_C_high
#> > target L261.SubsectorShrwtFllt_C
#> > target L261.Rsrc
#> > target L261.StubTech_C
#> > target L261.GlobalTechCoef_C
#> > target L261.SubsectorLogit_C
#> > target L261.Supplysector_C
#> > target L261.GlobalTechShrwt_C_nooffshore
#> > target L261.GlobalTechCost_C_High
#> > target L261.RsrcCurves_C_lowest
#> > target L261.GlobalTechShrwt_C
#> > target L261.GlobalTechCost_C
#> > target L261.RsrcCurves_C
#> > target L261.UnlimitRsrc
#> > target L261.ResTechShrwt_C
#> > target module_energy_batch_Cstorage_xml
#> > target Cstorage.xml
#> > target xml.Cstorage.xml
#> All done.
If you edit or delete an XML file, you can quickly and easily get the original file back by running driver_drake().
# Delete wind_reeds_USA.xml
file.remove("xml/wind_reeds_USA.xml")
#> [1] TRUE
# Load and run driver_drake(). Print run time.
devtools::load_all(".")
#> i Loading gcamdata
t1 <- Sys.time()
driver_drake()
#> GCAM Data System v5.1
#> Found 420 chunks
#> Found 4267 chunk data requirements
#> Found 2416 chunk data products
#> 1452 chunk data input(s) not accounted for
#> Warning: missing file_in() files:
#> R/constants.R
#> Warning: Do not run make() from a subdirectory of your project.
#> running make() from: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\vignettes
#> drake project root: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata
#> cache directory: C:\Users\horo597\OneDrive - PNNL\Documents\gcamdata\.drake
#> > target xml.wind_reeds_USA.xml
#> All done.
print(Sys.time()-t1)
#> Time difference of 1.654303 mins
Now we have our file back without re-running the entire data system.
driver_drake()
driver_drake() supports the same arguments as driver() (see ?driver), except write_outputs (since drake must include all outputs in the cache). Thus users can still use stop_before or return_data_map_only as before, but with the benefits of drake: if no modifications were made, the results can simply be generated from the cache. Users can also pass additional arguments to driver_drake(), which will be forwarded on to make(). See ?drake::make for all available options; some useful ones include:
- verbose: integer, controls printing to the console/terminal (default: 1).
- history: logical, whether to record the build history of targets (default: TRUE). This is helpful if you need to recover old data, or perhaps check how some outputs changed between commits. However, given the size of the data produced by gcamdata, it may lead to very large cache sizes. Thus it may be beneficial to set it to FALSE, or at least clean out the cache from time to time.
- memory_strategy: character scalar, the name of the strategy drake uses to load/unload a target's dependencies in memory (default: "speed"). Some options include:
  - "speed": maximizes speed but hogs memory. Recommended for users with at least 5 GB of available RAM.
  - "autoclean": conserves memory but sacrifices speed by unloading outputs after no more targets depend on them. This behavior is similar to that of driver().
Here are some additional examples of calling driver_drake() with some of the alternative arguments discussed above.
# Run with a progress bar
driver_drake(verbose = 2)
# Run, stop before a chunk and conserve memory
driver_drake(stop_before = "module_aglu_LA100.FAO_downscale_ctry", memory_strategy = "autoclean")
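As an aside on the history option: when it is enabled, past builds can be inspected and old values recovered from the cache. A minimal sketch (the target name here is illustrative; the hashes come from the history table itself):

```r
library(drake)

# One row per recorded build of each target
hist <- drake_history(analyze = FALSE)

# Look at the recorded builds of a single target
subset(hist, target == "L261.GlobalTechCost_C")

# Recover an old value by its hash via the cache
cache <- drake_cache()
old_value <- cache$get_value(hist$hash[[1]])
```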
Parallel computing is supported by drake but requires a "backend" to do the work. The two primary backend R packages are clustermq and future. In addition, each of these packages supports two different mechanisms for utilizing multiple cores: multisession, which launches multiple independent sessions of R and communicates between them using a message-passing system; or multicore, which uses just the one R session but creates multiple threads within it. However, we found that when using the multisession mechanism with either package, you must do a full reinstall of the gcamdata package each time you change a target (devtools::load_all() is not sufficient) for the targets to update and build correctly. Also, the multicore option is not supported on Windows.
When running driver_drake() with parallelism, the following argument to make() should be specified in driver_drake(): caching = "worker", so as to avoid wasting time doing synchronization.
clustermq backend
See the clustermq installation guide for installation instructions and options. clustermq requires R version 3.5 or greater. Note that on Mac and Windows a simple install.packages("clustermq") is sufficient. For using clustermq on PIC (PNNL Institutional Computing), we have already installed the full set of required R packages in a shared space, as successfully compiling clustermq was not straightforward due to compiler version issues. To use drake + clustermq on PIC a user can:
1. Use the shared R libraries by setting: export R_LIBS=/pic/projects/GCAM/GCAM-libraries/R/x86_64-pc-linux-gnu-library/3.5
2. Load the zeromq library with: module load zeromq/4.1.4
3. Load R 3.5.1 with: module load R/3.5.1
4. Start an R session and set the global option below
5. Run driver_drake() with arguments such as below
# Load clustermq and set type to multicore
library(clustermq)
options(clustermq.scheduler = "multicore")
# Load and run
devtools::load_all()
driver_drake(parallelism = "clustermq", caching = "worker", jobs = 48)
If you get the following error while trying to load clustermq, make sure you have the zeromq library loaded (module load zeromq/4.1.4).
library(clustermq)
Error: package or namespace load failed for 'clustermq' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/qfs/projects/ops/rh6/R/3.5.1/lib64/R/library/clustermq/libs/clustermq.so':
libzmq.so.5: cannot open shared object file: No such file or directory
On PIC, we had good performance with clustermq, type multicore, and jobs = 48. The initial build took 5 minutes and 40 seconds, as opposed to just over 30 minutes for the driver_drake() build without parallelism. For reference, the build with driver() on PIC took 22 minutes and 41 seconds.
Recall, if you are trying to use clustermq on Windows, multicore is not supported. To use multisession, make sure you reinstall after any changes are made before you try to run driver_drake().
future backend
We did not have good performance with future. On a local Windows machine, the initial build with parallelism = "future" and type multisession took 1 hour and 14 seconds. It took over an hour on PIC as well. We still document it here in case the situation improves in the future.
To use this backend, install the future package and provide the following arguments:
# Run driver_drake with future plan multisession
future::plan(future::multisession)
devtools::load_all()
driver_drake(parallelism = "future", caching = "worker", jobs = 4)
See ?future::plan for all strategy options and explanations.
Loading outputs from the drake cache
You can use the typical arguments to driver_drake(), such as stop_before or return_data_names, to return outputs from the cache after the initial run. If you are unsure whether there have been any modifications since the last run, that is the best way to load them. However, if you are sure the cache is up to date, we have provided a utility method, load_from_cache(), for doing so; it returns the data in the same format as data returned from driver(stop_after = "module_emissions_L121.nonco2_awb_R_S_T_Y"). We therefore recommend using this utility rather than using drake::readd directly.
# We can give a list of files we want to load
data <- load_from_cache(c("L121.nonco2_tg_R_awb_C_Y_GLU"))
data
#> $L121.nonco2_tg_R_awb_C_Y_GLU
#> # A tibble: 2,995,650 x 7
#> GCAM_region_ID Non.CO2 GCAM_commodity GCAM_subsector GLU year value
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 BC_AWB Corn CornC4 GLU023 1971 0.0000501
#> 2 1 CH4_AWB Corn CornC4 GLU023 1971 0.000389
#> 3 1 CO_AWB Corn CornC4 GLU023 1971 0.00682
#> 4 1 H2_AWB Corn CornC4 GLU023 1971 0.000166
#> 5 1 N2O_AWB Corn CornC4 GLU023 1971 0.00000669
#> 6 1 NH3_AWB Corn CornC4 GLU023 1971 0.000145
#> 7 1 NMVOC_AWB Corn CornC4 GLU023 1971 0.00154
#> 8 1 NOx_AWB Corn CornC4 GLU023 1971 0.000208
#> 9 1 OC_AWB Corn CornC4 GLU023 1971 0.000154
#> 10 1 SO2_AWB Corn CornC4 GLU023 1971 0.0000267
#> # ... with 2,995,640 more rows
# We can also combine this with other gcamdata utilities to
# load all input or outputs of a chunk as well
data <- load_from_cache(outputs_of("module_emissions_L121.nonco2_awb_R_S_T_Y"))
#> Error in module_emissions_L121.nonco2_awb_R_S_T_Y("DECLARE_OUTPUTS"): could not find function "module_emissions_L121.nonco2_awb_R_S_T_Y"
data
#> $L121.nonco2_tg_R_awb_C_Y_GLU
#> # A tibble: 2,995,650 x 7
#> GCAM_region_ID Non.CO2 GCAM_commodity GCAM_subsector GLU year value
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 BC_AWB Corn CornC4 GLU023 1971 0.0000501
#> 2 1 CH4_AWB Corn CornC4 GLU023 1971 0.000389
#> 3 1 CO_AWB Corn CornC4 GLU023 1971 0.00682
#> 4 1 H2_AWB Corn CornC4 GLU023 1971 0.000166
#> 5 1 N2O_AWB Corn CornC4 GLU023 1971 0.00000669
#> 6 1 NH3_AWB Corn CornC4 GLU023 1971 0.000145
#> 7 1 NMVOC_AWB Corn CornC4 GLU023 1971 0.00154
#> 8 1 NOx_AWB Corn CornC4 GLU023 1971 0.000208
#> 9 1 OC_AWB Corn CornC4 GLU023 1971 0.000154
#> 10 1 SO2_AWB Corn CornC4 GLU023 1971 0.0000267
#> # ... with 2,995,640 more rows
To utilize drake's features, driver_drake() must generate a drake plan to supply to drake::make(), which builds the data system. The plan is a data frame with columns "target" and "command". Each row is a step in the workflow, and the target is the return value of the corresponding command. drake understands the dependency relationships between targets and commands in the plan, regardless of the order in which they are written. The make() function runs the targets in the correct order and stores the results in a hidden cache. In gcamdata the targets are either inputs/outputs or chunk names. The plan can be obtained by calling driver_drake(return_plan_only = TRUE) and may be useful for debugging or when using the additional drake features described next.
plan <- driver_drake(return_plan_only = TRUE)
#> GCAM Data System v5.1
#> Found 420 chunks
#> Found 4267 chunk data requirements
#> Found 2416 chunk data products
#> 1452 chunk data input(s) not accounted for
#> All done.
# Pick targets to show the commands that would be used to build them
plan %>%
filter(target %in% c("socioeconomics.SSP_database_v9",
"L2052.AgCost_ag_irr_mgmt",
"module_aglu_batch_ag_cost_IRR_MGMT_xml",
"xml.ag_cost_IRR_MGMT.xml"))
#> # A tibble: 4 x 2
#> target command
#> <chr> <chr>
#> 1 L2052.AgCost_ag_irr_mgmt "module_aglu_L2052.ag_prodchange_cost_irr_mgmt[\"L2052.AgCost_ag_irr_mgmt\"]"
#> 2 module_aglu_batch_ag_cost_IRR_MGMT_xml "gcamdata:::module_aglu_batch_ag_cost_IRR_MGMT_xml('MAKE', c(L2052.AgCost_ag_irr_mgmt,L2052.AgCost_bio_irr_mgmt,L2052.AgCost_For))"
#> 3 socioeconomics.SSP_database_v9 "load_csv_files('socioeconomics/SSP_database_v9', FALSE, quiet = TRUE, dummy = file_in('../inst/extdata/socioeconomics/SSP_database_v9.cs~
#> 4 xml.ag_cost_IRR_MGMT.xml "run_xml_conversion(set_xml_file_helper(ag_cost_IRR_MGMT.xml[[1]],\n file_out('xml/ag_cost_IRR_MGMT.xml')))"
You can visualize targets and their dependency relationships with vis_drake_graph(). This function produces an interactive graph that shows how targets are connected within the plan. You can hover over nodes to see the commands of a target and double-click nodes to contract neighborhoods into clusters. To see just the nodes downstream from a specific target, set from = <target_name>. See ?vis_drake_graph for all graph options. Here is an example of how vis_drake_graph() could be used.
devtools::load_all()
#> i Loading gcamdata
# Get the drake plan
plan <- driver_drake(return_plan_only = TRUE)
#> GCAM Data System v5.1
#> Found 420 chunks
#> Found 4267 chunk data requirements
#> Found 2416 chunk data products
#> 1452 chunk data input(s) not accounted for
#> All done.
# Display the dependency graph downstream from module L210.RenewRsrc
vis_drake_graph(plan, from = make.names("L210.RenewRsrc"))
#> Error in loadNamespace(name): there is no package called 'webshot'
See the drake documentation for other features. Some that may be useful with gcamdata include:
- outdated(plan): lists all of the targets that are outdated.
- predict_runtime(plan): drake records the time it takes to build each target and uses this to predict the runtime of the next make().
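For example, both can be combined with the plan from driver_drake() (a sketch, assuming a cache from a previous run exists):

```r
# Generate the plan without building anything
plan <- driver_drake(return_plan_only = TRUE)

# Character vector of targets that the next run would rebuild
drake::outdated(plan)

# Estimate the runtime of the next make(), based on recorded build times
drake::predict_runtime(plan)
```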
Sometimes it is useful to write out intermediate outputs to csv files. This is done for all outputs when using driver(write_outputs = TRUE), but is not necessary when using driver_drake(), since the outputs are saved in the cache. However, if a user would still like to save these csv files, we offer a few examples of how to do so below. In all cases, we recommend running driver_drake() first to ensure the cache is up-to-date.
If there is one file that you would like to save from the cache, you can quickly access and save it using load_from_cache() and save_chunkdata().
# Choose the output from the cache, which will be loaded as a list of tibbles
# (in this case a list of length 1)
load_from_cache("L2072.AgCoef_BphysWater_bio_mgmt") %>%
save_chunkdata()
To save all the outputs from one chunk, we can simply return those outputs from driver_drake().
# Here we can return all the outputs of a chunk using driver_drake
outputs_of("module_energy_L244.building_det") %>%
load_from_cache() %>%
save_chunkdata()
Saving all outputs is not recommended, as it is not usually necessary and will be fairly slow, but it is possible by returning all the necessary data names and then loading them all from the cache.
# Get the names of all outputs
all_output_names <- driver_drake(return_plan_only = T) %>%
# Filter to non-xml module outputs (not from a data module)
dplyr::filter(grepl('^module', command),
grepl('^L[0-9]{3,}', target))
# Load all outputs
load_from_cache(all_output_names$target) %>%
save_chunkdata()