15 Common Use Cases

This chapter includes two common use cases:

  1. The first case study replaces the default household estimation dataset with a locally specific US Census Public Use Microdata Sample (PUMS), a valuable way to make your VE model reflect local conditions, and then rebuilds the packages that rely on the PUMS data for their estimation work.

  2. The second case study shows how to change the data used to build internal VisionEval modules, in this case to adjust future fleet composition.

Both use cases identify how rebuilding the package data differs depending on which VisionEval installation process was used.

15.1 Case Study 1: Using local PUMS data

15.1.1 What are PUMS?

To summarize, the US Census Bureau provides anonymized data in two general forms:

  • Aggregated census tables - These tables provide the total or estimated counts by topic (e.g., total number of persons by age group). The smallest geographic unit is the census block, but not all data are available at that level.

  • Disaggregated PUMS - A sample of individual record-level data for each person or household counted (e.g., a person's age, gender, employment, and the household they belong to). The smallest geographic unit is the Public Use Microdata Area (PUMA), an aggregated area that must include at least 100,000 persons to protect confidentiality.

Most people are at least somewhat familiar with the US Census and the information it collects. The primary function of the US Census is to count the people living in the United States for the federal allocation of political representatives and taxes. However, the US Census has since expanded to include a variety of other useful statistical information regarding demographics and employment. Census data are spatially organized into a hierarchy of sub-divided areas, the smallest of which is the Census Block. Blocks aggregate into Block Groups, Tracts, Counties, and States. See the example figure below:

Example Census geographic hierarchy

source: US Census

The primary census program is the Decennial Census, which is a comprehensive count collected every 10 years. However, because populations can change significantly within a decade, the American Community Survey (ACS) was created to obtain data at more frequent intervals. Rather than a full census, the ACS collects ongoing samples on a monthly basis. These data are then used to publish statistically adjusted 1-year, 3-year, and 5-year estimates. The 1-year estimates use the most recent data but are the least reliable because the sample is smaller, whereas the 5-year estimates pool data from the previous 5 years. Although not exactly equivalent, the 1- and 5-year estimates are often considered analogous to a 1% and a 5% sample of the population.

The summary tables provide the total count of persons, but they are aggregated: they show the total number of persons for each topic, but not for combinations of topics. For example, we may know the count of people by age group, gender, occupation, and household size, but we do not know the count for a particular combination of those variables, or to which household each person belongs. For this reason, the US Census Bureau also releases what it calls a Public Use Microdata Sample (PUMS) using sample data from the ACS.
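The difference between summary tables and microdata can be illustrated with a toy example. The records and column names below are invented for illustration and are not real PUMS fields.

```r
# Toy person-level microdata (invented values, not real PUMS records)
persons <- data.frame(
  household = c(1, 1, 2, 3, 3, 3),
  age_group = c("0-17", "18-64", "65+", "18-64", "18-64", "0-17"),
  worker    = c("no", "yes", "no", "yes", "no", "no")
)

# Aggregated summary tables: one topic at a time
age_counts    <- table(persons$age_group)
worker_counts <- table(persons$worker)

# Microdata also supports cross-tabulating topics, which cannot be
# recovered from the two one-way tables above
age_by_worker <- table(persons$age_group, persons$worker)

# ...and it keeps the link between persons and their household
household_size <- table(persons$household)

print(age_by_worker)
```

The one-way tables correspond to the aggregated census products, while the cross-tabulation and the household link are only possible with record-level data such as PUMS.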

The generalized approach to updating data within a VE package is set out below.

15.1.2 Instructions

15.1.2.1 Step 1) Gather PUMS and replace data:

In this example we will replace the default PUMS data in the VESimHouseholds package with your project-specific local PUMS data. Navigate to the src directory, whose location depends on how you obtained VisionEval. The source code for this package should be located in the VESimHouseholds directory (e.g., C:/Users/<user name>/Documents/VisionEval/src/VESimHouseholds).

Packages will require the data to be in a certain format, and in this case the VESimHouseholds package requires two input data files pums_households.csv and pums_persons.csv.

15.1.2.1.1 A) Download PUMS data

US Census data are available from the Census’ website (https://www.census.gov/), which provides an interface to search, browse, and download Census data in a variety of formats, the most typical being Comma Separated Value (CSV) files. PUMS data can be filtered using the Census data browser, or the entire PUMS tables for States can be downloaded from the legacy FTP website: https://www2.census.gov/programs-surveys/acs/data/pums/

The files are named according to file type (e.g., csv_), record type (“h” for household or “p” for persons), and then the State abbreviation. For example, "csv_haz.zip" contains the household PUMS data for Arizona. Additional documentation can be found here: https://www.census.gov/programs-surveys/acs/microdata/access.html

15.1.2.1.2 B) Process PUMS data.

VE was originally coded using an older PUMS file with slightly different field names, so the fields in newer PUMS files must be renamed to match. A name mapping key is in the table below:

| Table name | VESimHouseholds field | New PUMS field | Description |
|---|---|---|---|
| pums_households.csv | SERIALNO | SERIALNO | Housing/Group Quarters Unit Serial Number |
| | PUMA5 | PUMA | 5% Public Use Microdata Area code |
| | HWEIGHT | WGTP | Housing unit weight |
| | UNITTYPE | TYPEHUGQ | Type of housing unit |
| | PERSONS | NP | Number of persons living in housing unit |
| | BLDGSZ | BLD | Size of building |
| | HINC | HINCP | Household total income in 1999 US dollars |
| pums_persons.csv | AGE | AGEP | Age |
| | WRKLYR | WKL | Worked in year |
| | MILITARY | MIL | In military |
| | INCTOT | PINCP | Person’s total income |

Depending on the file, other pre-processing may be required, such as removing NAs or converting categories. Examples include setting missing (NA) values of HINC to 0, shifting the UNITTYPE scale from {1,2,3} to {0,1,2}, or aggregating the 4-level WKL categories into the 3 levels of WRKLYR. If these conversions are not made, issues may arise in the package building step.
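The renaming and two of the conversions described above can be sketched in base R. The household records below are invented for illustration, and the WKL to WRKLYR aggregation is not shown because it depends on the category definitions of the PUMS year you download.

```r
# Toy ACS-style household records (invented values)
acs_hh <- data.frame(
  SERIALNO = c("0001", "0002", "0003"),
  PUMA     = c("00100", "00100", "00200"),
  WGTP     = c(12, 30, 8),
  TYPEHUGQ = c(1, 1, 3),
  NP       = c(2, 4, 1),
  BLD      = c("02", "02", NA),
  HINCP    = c(54000, NA, 0),
  stringsAsFactors = FALSE
)

# Legacy name = new PUMS name, following the mapping table above
hh_map <- c(SERIALNO = "SERIALNO", PUMA5 = "PUMA", HWEIGHT = "WGTP",
            UNITTYPE = "TYPEHUGQ", PERSONS = "NP", BLDGSZ = "BLD",
            HINC = "HINCP")

# Keep only the mapped columns, in the legacy order, and rename them
pums_hh <- acs_hh[, unname(hh_map)]
names(pums_hh) <- names(hh_map)

# Replace missing household income with 0
pums_hh$HINC[is.na(pums_hh$HINC)] <- 0

# Shift the unit type codes from {1, 2, 3} to {0, 1, 2}
pums_hh$UNITTYPE <- pums_hh$UNITTYPE - 1

print(pums_hh)
```

With real data, the resulting data frame would be written out with write.csv(pums_hh, "pums_households.csv", row.names = FALSE), and the person file would be handled the same way using the person-level rows of the mapping table.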

15.1.2.1.3 C) Replace PUMS files

Once processing is complete, replace the old files in your src/VESimHouseholds/inst/extdata folder with the new pums_households.csv and pums_persons.csv. External data for VisionEval packages are typically located in the inst/extdata folder.
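If you prefer to script the replacement rather than copy the files by hand, a minimal sketch follows. The function name and both arguments are hypothetical; pass the folder holding your processed files and the package source folder from your own installation.

```r
# Copy the processed PUMS files into a package's inst/extdata folder.
# 'processed_dir' and 'pkg_src' are hypothetical paths supplied by the user.
replace_pums_files <- function(processed_dir, pkg_src) {
  files   <- c("pums_households.csv", "pums_persons.csv")
  sources <- file.path(processed_dir, files)

  # Stop early if either processed file is missing
  missing <- files[!file.exists(sources)]
  if (length(missing) > 0) {
    stop("Missing processed file(s): ", paste(missing, collapse = ", "))
  }

  # The destination folder must already exist in the package source
  extdata <- file.path(pkg_src, "inst", "extdata")
  if (!dir.exists(extdata)) stop("Folder not found: ", extdata)

  # Overwrite the default files with the processed ones
  ok <- file.copy(sources, extdata, overwrite = TRUE)
  invisible(all(ok))
}
```

Because the copy overwrites the default files, consider saving a copy of the originals somewhere outside the package folder first.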

15.1.2.2 Step 2) Package building

The objective of rebuilding a package is to install it from its source code into the VisionEval environment. This guide uses the RStudio interface and describes the procedure for rebuilding a single package.

15.1.2.2.1 A) Initialize the VisionEval Environment

To start the VisionEval environment, navigate to your VisionEval runtime directory (e.g., C:/Users/<user name>/Documents/VisionEval) and double click VisionEval.Rproj. The RStudio layout should look similar to the figure below (there may be minor differences):

RStudio window with the VisionEval project loaded

There are two options for the next step: (B1) using RStudio Build Tools, or (B2) using the R native install command. Instructions for both methods are included in steps B1 and B2 below.

15.1.2.2.2 B1) Using RStudio Build Tools
15.1.2.2.2.1 1) Select Configure Build Tools from the Build menu
15.1.2.2.2.2 2) Configure the build tools

i. From “Project build tools”, select “Package” from the drop-down.

ii. For “Package directory”, browse to your source package folder (e.g., C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds).

Click OK. RStudio will flicker and restart.

15.1.2.2.2.3 3) Install from package source

Click the “Build” drop-down from the main banner menu again. This time there will be new options; select “Install Package”.

15.1.2.2.2.4 4) Build again

After one successful build, you must run the build again to ensure that the new data files have been (1) generated and (2) loaded into the VisionEval package.

At this point the new data should be imported and usable through the VESimHouseholds package. The last step is to test whether the updated data are available by inspecting them with the command VESimHouseholds::Hh_df in the RStudio console.

15.1.2.2.3 B2) Using R native install command

The R command install.packages is used to install R packages. Running the command

install.packages("C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds", repos = NULL, type = "source")

within the VisionEval environment will rebuild and install the VESimHouseholds package into VisionEval.

15.1.2.2.4 C) Update Dependent Packages

The final step of incorporating local PUMS data is to update the packages that have built-in estimation processes and use the PUMS data to estimate their models. The PredictHousing module in the VELandUse package uses PUMS to estimate its housing choice model. It is therefore important to rebuild the VELandUse package after rebuilding the VESimHouseholds package, so that the updated PUMS data are available to it. Follow steps B1) or B2) to rebuild the VELandUse package.
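The rebuild order matters, so it can help to script it. The helper below is a hypothetical sketch: the function name, the dry_run switch, and the example path are assumptions, and the source root should point to wherever your package sources live.

```r
# Reinstall source packages in dependency order.
# 'src_root' and 'lib' are hypothetical paths; adjust them to your installation.
rebuild_in_order <- function(src_root, lib = .libPaths()[1],
                             pkgs = c("VESimHouseholds", "VELandUse"),
                             dry_run = FALSE) {
  paths <- file.path(src_root, pkgs)
  if (!dry_run) {
    for (p in paths) {
      # Same call as in step B2, repeated for each package in turn
      install.packages(p, lib = lib, repos = NULL, type = "source")
    }
  }
  invisible(paths)
}

# Preview the install order without installing anything
print(rebuild_in_order("C:/Users/<user name>/Documents/VisionEval/src/modules",
                       dry_run = TRUE))
```

VESimHouseholds is listed first so that the updated PUMS data are installed before VELandUse is rebuilt against them.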

Done!

15.2 Case Study 2: VEPowertrainsAndFuels

There may be scenarios in which we want to study a future fleet mix (e.g., the penetration of electric vehicles) that differs from the default fleet mix that comes with the VEPowertrainsAndFuels package. That is the motivation behind this case study. The default fleet mix can be updated by replacing the hh_powertrain_prop.csv input file, as in Case Study 1, with a version customized for the intended study. The package must then be rebuilt for the new input file to take effect in a VisionEval model run. The steps for rebuilding are similar to those in Case Study 1 and are outlined here.

The input data for the VEPowertrainsAndFuels package are in the VEPowertrainsAndFuels\inst\extdata\ directory. Each of the input files can be updated to reflect changes in the fleet makeup as well as in the fuel types that the vehicles use. The hh_powertrain_prop.csv file contains the proportions of household vehicle powertrain types by vehicle type and vehicle vintage year. This case study presents the steps for updating this input file. A more detailed description of the structure and content of the file can be found in the hh_powertrain_prop.txt file in the same directory. The figure below shows where the input file is located within the source code of the VEPowertrainsAndFuels package.

15.2.1 Instructions

This case study explores the basic level of analysis needed to update the data to ensure integrity and consistency between other data components within the package. Any spreadsheet application can be used to alter the default data values and perform analysis.

This section walks users through a brief analysis that defines a modifying function and demonstrates the effects of the modifications.

15.2.1.1 Step 1) Data

VEPowertrainsAndFuels\inst\extdata\hh_powertrain_prop.csv contains the default powertrain proportions in the package and resembles the table below (the table is compressed to selected years for clarity). The file’s purpose is to provide the sales proportions by vehicle powertrain, vehicle type (auto and light truck), and vehicle vintage year.

ModelYear AutoPropIcev AutoPropHev AutoPropPhev AutoPropBev LtTrkPropIcev LtTrkPropHev LtTrkPropPhev LtTrkPropBev
1975 1 0 0 0 1 0 0 0
2000 1 0 0 0 1 0 0 0
2010 0.8786 0.1213 0 0.0001 0.9820 0.0180 0 0
2020 0.8212 0.0788 0.0202 0.0798 0.9524 0.0143 0.0067 0.0266
2030 0.6676 0.0908 0.0358 0.2058 0.9093 0.0179 0.0106 0.0622
2040 0.5701 0.0922 0.0403 0.2974 0.8996 0.0191 0.0114 0.0698
2050 0.5198 0.0895 0.0407 0.3500 0.8916 0.0193 0.0119 0.0772

The table contains two sets of powertrain proportions: the first four proportion columns are for automobiles (i.e., AutoProp) and the last four are for light trucks (i.e., LtTrkProp). Each set sums to 1, so the proportions in each row sum to 2 (excluding ModelYear).
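The sum-to-1 property is worth checking whenever the file is edited. The sketch below hard-codes the 2010 to 2050 rows from the table above so that it runs on its own; with the real file you would read it with read.csv instead. The tolerance is an assumption that allows for rounding in the four-decimal values.

```r
# Selected rows of the default hh_powertrain_prop.csv, copied from the table above
prop <- data.frame(
  ModelYear     = c(2010, 2020, 2030, 2040, 2050),
  AutoPropIcev  = c(0.8786, 0.8212, 0.6676, 0.5701, 0.5198),
  AutoPropHev   = c(0.1213, 0.0788, 0.0908, 0.0922, 0.0895),
  AutoPropPhev  = c(0, 0.0202, 0.0358, 0.0403, 0.0407),
  AutoPropBev   = c(0.0001, 0.0798, 0.2058, 0.2974, 0.3500),
  LtTrkPropIcev = c(0.9820, 0.9524, 0.9093, 0.8996, 0.8916),
  LtTrkPropHev  = c(0.0180, 0.0143, 0.0179, 0.0191, 0.0193),
  LtTrkPropPhev = c(0, 0.0067, 0.0106, 0.0114, 0.0119),
  LtTrkPropBev  = c(0, 0.0266, 0.0622, 0.0698, 0.0772)
)

# Sum each set of proportions by row
auto_sum  <- rowSums(prop[, grep("^AutoProp", names(prop))])
lttrk_sum <- rowSums(prop[, grep("^LtTrkProp", names(prop))])

# Allow for rounding in the published four-decimal values
tol <- 0.001
stopifnot(all(abs(auto_sum - 1) < tol), all(abs(lttrk_sum - 1) < tol))
```

If stopifnot raises an error after you edit the file, at least one row no longer sums to 1 for that vehicle type.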

15.2.1.2 Step 2) Analysis

Here we will conduct a brief exploratory analysis to demonstrate visually what the data look like and how they will be modified. Using a standard spreadsheet application, we can format and visualize the data as shown in the figure below.

We can see that battery electric vehicles (BEV), specifically automobiles, are projected to make up a growing share of vehicles bought in future years, reaching 35% of automobile sales by 2050. The share of internal combustion engine (ICE) vehicles declines correspondingly.

Let us assume that the state government is deciding whether to aggressively promote BEV cars starting in 2025. The policies cause the share of alternative powertrains (BEV, HEV, and PHEV) to increase more over time. To model this increase, we will use an arbitrary function which adds to the current value of $x$ (i.e., the proportion) at a quadratic rate.

$$
f(x) = x + (x^2) (1 - x)
$$

We use this function to adjust each of the alternative powertrains in the spreadsheet. To ensure that the proportions sum up to 1 for autos and light trucks, respectively, we then calculate the remaining proportion of ICE powertrains by subtracting the total proportion of alternative powertrains. The following figure shows the effect of increasing the share of alternative powertrain at a quadratic rate compared to default data.

We then update the existing hh_powertrain_prop.csv file for the year 2025 and above with the newly calculated values.
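The same calculation can be done in R instead of a spreadsheet. The sketch below applies the adjustment function to the automobile columns for three of the years shown in the Step 1 table; the object names are illustrative, and the light truck columns would be treated the same way.

```r
# Adjustment function from the text: f(x) = x + x^2 * (1 - x)
boost <- function(x) x + x^2 * (1 - x)

# Default automobile proportions for selected years after 2025 (from the Step 1 table)
auto <- data.frame(
  ModelYear    = c(2030, 2040, 2050),
  AutoPropHev  = c(0.0908, 0.0922, 0.0895),
  AutoPropPhev = c(0.0358, 0.0403, 0.0407),
  AutoPropBev  = c(0.2058, 0.2974, 0.3500)
)

# Apply the adjustment to each alternative powertrain column
alt_cols <- c("AutoPropHev", "AutoPropPhev", "AutoPropBev")
adjusted <- auto
adjusted[alt_cols] <- lapply(auto[alt_cols], boost)

# ICE takes whatever share remains, so each row still sums to 1
adjusted$AutoPropIcev <- 1 - rowSums(adjusted[alt_cols])

print(round(adjusted, 4))
```

Because f(x) is never smaller than x for proportions between 0 and 1, every alternative powertrain share increases and the ICE share shrinks to compensate.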

15.2.1.3 Step 3) Build Package

Once the data file has been updated you will need to re-build and re-install the VEPowertrainsAndFuels package for VisionEval to use this new fleet mix data.

We can follow the instructions in Step 2) of Case Study 1 to rebuild the package.

Once the package re-build is complete, your new powertrain data will be ready to use in a VisionEval model run.

15.3 Miscellaneous Information

This section contains miscellaneous information that may be useful for more advanced users.

  1. VisionEval Package Structure

  2. Build from command line

  3. PUMS data processing helper scripts

  4. Modifying package code

15.3.1 VisionEval Package Structure

The source code of VisionEval packages will generally have a structure similar to the following:

src/VEGenericPackage
├───data
│   ├─ GenericPackageSpecifications.rda
│   ├─ GenericPackage_df.rda
│   └─ GenericPackage_ls.rda
├───R
│   ├─ CreateEstimationDatasets.R
│   └─ GenericModel.R
└───inst
    └─ extdata
        ├─ input_data1.csv
        └─ input_data2.txt

  • inst\extdata is where “external” input data sources and reference files will be placed.

  • The R directory contains any R scripts used in the packages. These must be independent non-sequential scripts that do not depend on results from other scripts.

  • data contains the resulting data that VisionEval generates and utilizes.

  • man and inst\module_docs contain the markdown documentation generated during the build process.

15.3.2 Build from command line

While the GUI method is intuitive, it can be convenient to simply execute a build command from a generic R session rather than navigating menu trees in the GUI.

The fundamental command to build an R package can be run from the R console using system("R CMD INSTALL package_path -l lib_path"). The GUI method essentially constructs this command and executes it.

  • package_path is the path to the source code of the package you are building, e.g., "C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds". If your working directory is already the package directory, you can use “.” to denote the local directory.

  • lib_path is the library of the runtime environment, in this case the VisionEval environment, e.g., "C:/Users/<user name>/Documents/VisionEval/ve-lib".

Here’s an example of a command that is used to rebuild VESimHouseholds package from its source code into VisionEval.

system('R CMD INSTALL "C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds" -l "C:/Users/<user name>/Documents/VisionEval/ve-lib"')
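Quoting paths by hand is error-prone, especially when a path contains spaces. A minimal sketch of the same command using base R's shQuote and system2 is shown below; the helper name install_args is hypothetical, and the actual install call is commented out so the example only prints the command.

```r
# Build the arguments for "R CMD INSTALL <package_path> -l <lib_path>".
# shQuote() protects paths that contain spaces.
install_args <- function(package_path, lib_path) {
  c("CMD", "INSTALL", shQuote(package_path), "-l", shQuote(lib_path))
}

args <- install_args(
  "C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds",
  "C:/Users/<user name>/Documents/VisionEval/ve-lib"
)

# Show the command that would be run
cat("R", args, "\n")

# Uncomment to run it; a non-zero status indicates a failed install
# status <- system2(file.path(R.home("bin"), "R"), args)
```

Forward slashes are used in the paths because a backslash inside an R string starts an escape sequence.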

15.3.3 PUMS data processing helper scripts

Processing PUMS data can be challenging for two reasons.

  1. PUMS data evolves, with some field names and levels changing.

  2. The 2000 PUMS are stored in a serial, fixed-position text file structure rather than a common delimited file (e.g., CSV), which makes importing tedious.

Below are some helper scripts for future users to build upon:

NOTE: These scripts may not work with all PUMS file years, operating systems, or R versions. A best effort was made to identify weak points (e.g., unzipping), but correct operation cannot be guaranteed. The scripts are meant to be a starting point, not production-level code.

15.3.4 PUMS File import and header processing

# IMPORTS
library(data.table)
library(tools)


# Function to process PUMS as it is read in
process_acs_pums <- function(PumsFile, type, GetPumas='ALL') {
  # ACS PUMS to legacy Census PUMS fields
  # Make any modifications here as necessary
  meta = list(
    'h' = list(
      SERIALNO = list(acsname = 'SERIALNO', class ='character'),
      PUMA5 = list(acsname='PUMA', class='character'),
      HWEIGHT = list(acsname='WGTP', class='numeric'),
      UNITTYPE = list(acsname='TYPE', class='numeric'),
      PERSONS = list(acsname='NP', class='numeric'),
      BLDGSZ = list(acsname='BLD', class='character'),
      HINC = list(acsname='HINCP', class='numeric')
    ),
    'p' = list(
      SERIALNO = list(acsname = 'SERIALNO', class ='character'),
      AGE = list(acsname='AGEP', class='numeric'),
      WRKLYR = list(acsname='WKL', class='character'),
      MILITARY = list(acsname='MIL', class='numeric'),
      INCTOT = list(acsname='PINCP', class='numeric')
    )
  )
  
  colNames <- lapply(meta, function(x) sapply(x, function(y) y[['acsname']]))
  colclass <- lapply(meta, function(x) sapply(unname(x), function(y) {
    setNames(y[['class']], y[['acsname']])
  }))

  # Build a shell command that streams the zipped file to standard output.
  # Requires the unzip (Windows) or gunzip (other systems) utility on the PATH.
  if (Sys.info()[['sysname']] == 'Windows') {
    cmd <- paste('unzip -p', shQuote(PumsFile))
  } else {
    cmd <- paste('gunzip -cq', shQuote(PumsFile))
  }

  # Read through the shell command if it is a zip file; otherwise read the file directly
  if (grepl('\\.zip$', PumsFile)) {
    df <- fread(cmd = cmd, select = names(colclass[[type]]), colClasses = colclass[[type]])
  } else {
    df <- fread(PumsFile, select = names(colclass[[type]]), colClasses = colclass[[type]])
  }

  # Rename ACS PUMS fields to match legacy Census PUMS fields
  setnames(df, colNames[[type]], names(colNames[[type]]))
  
  return(df)
}

process_2000_pums <- function(PumsFile, GetPumas='ALL') {
  #Read in file and split out household and person tables
  Pums_ <- readLines(PumsFile)
  RecordType_ <- 
    as.vector(sapply(Pums_, function(x) {
      substr(x, 1, 1)
      }))
  H_ <- Pums_[RecordType_ == "H"]
  P_ <- Pums_[RecordType_ == "P"]
  rm(Pums_, RecordType_, PumsFile)
  
  #Define a function to extract specified PUMS data and put in data frame
  extractFromPums <- 
    function(Pums_, Fields_ls) {
      lapply(Fields_ls, function(x) {
        x$typeFun(unlist(lapply(Pums_, function(y) {
          substr(y, x$Start, x$Stop)
        })))
      })
    }
  
  #Identify the housing data to extract
  HFields_ls <-
    list(
      SERIALNO = list(Start = 2, Stop = 8, typeFun = as.character),
      PUMA5 = list(Start = 19, Stop = 23, typeFun = as.character),
      HWEIGHT = list(Start = 102, Stop = 105, typeFun = as.numeric),
      UNITTYPE = list(Start = 108, Stop = 108, typeFun = as.numeric),
      PERSONS = list(Start = 106, Stop = 107, typeFun = as.numeric),
      BLDGSZ = list(Start = 115, Stop = 116, typeFun = as.character),
      HINC = list(Start = 251, Stop = 258, typeFun = as.numeric)
    )
  
  #Extract the housing data and clean up
  H_df <- data.frame(extractFromPums(H_, HFields_ls), stringsAsFactors = FALSE)
  #Extract records for desired PUMAs
  if (GetPumas[1] != "ALL") {
    H_df <- H_df[H_df$PUMA5 %in% GetPumas,]
  }

  #Identify the person data to extract
  PFields_ls <-
    list(
      SERIALNO = list(Start = 2, Stop = 8, typeFun = as.character),
      AGE = list(Start = 25, Stop = 26, typeFun = as.numeric),
      WRKLYR = list(Start = 236, Stop = 236, typeFun = as.character),
      MILITARY = list(Start = 138, Stop = 138, typeFun = as.numeric),
      INCTOT = list(Start = 297, Stop = 303, typeFun = as.numeric)
    )
  
  #Extract the person data and clean up
  P_df <- data.frame(extractFromPums(P_, PFields_ls), stringsAsFactors = FALSE)
  #If not getting data for entire state, limit person records to be consistent
  if (GetPumas[1] != "ALL") {
    P_df <- P_df[P_df$SERIALNO %in% unique(H_df$SERIALNO),]
  }

  return( list('p' = P_df, 'h' = H_df) )
}
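To see how the Start and Stop positions used in process_2000_pums work, the sketch below parses a single made-up fixed-width record. The record content is invented; only the SERIALNO and PUMA5 column positions are taken from the function above.

```r
# A made-up household record: record type in column 1, serial number in
# columns 2-8, padding in columns 9-18, and PUMA in columns 19-23
record <- paste0("H", "0000123", strrep(" ", 10), "00400")

# Field definitions in the same form as HFields_ls above
fields <- list(
  SERIALNO = list(Start = 2,  Stop = 8,  typeFun = as.character),
  PUMA5    = list(Start = 19, Stop = 23, typeFun = as.character)
)

# The first character identifies the record type ("H" or "P")
record_type <- substr(record, 1, 1)

# Cut each field out of the record by position and convert its type
parsed <- lapply(fields, function(f) f$typeFun(substr(record, f$Start, f$Stop)))

str(parsed)
```

Because substr is vectorized, the same field list works unchanged on a whole vector of records.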

15.3.5 PUMS data web-scraping

The process can be automated one step further by downloading the data and running the functions above on the files as they are read in.

# Downloads and processes legacy 2000 PUMS data 
getDecPUMS <- function(STATE, output_dir = NA){ 
  #VARS 
  state_codes <- fread('state.txt') 
  state_codes <- setNames(state_codes$STATE, state_codes$STUSAB) 
  base_url = 'https://www2.census.gov/census_2000/datasets/PUMS/FivePercent' 
   
  # Convert a full state name (e.g., 'Arizona') to its two-letter abbreviation
  if(!is.numeric(STATE) && nchar(STATE) > 2) {
    STATE <- state.abb[match(toTitleCase(STATE),state.name)]
  }
  STATE_NAME <- state.name[match(toupper(STATE),state.abb)]

  # Look up the numeric state code used in the file name
  if(!is.numeric(STATE)) STATE_NUM <- state_codes[toupper(STATE)]

  # Download the PUMS data to tempfile and load directly to data table
  url <- file.path(base_url,
                   STATE_NAME,
                   paste0('REVISEDPUMS5_', sprintf("%02d", STATE_NUM), '.TXT'))

  temp <- tempfile()
  download.file(url, temp)

  # Read .txt to data frames
  PUMS <- process_2000_pums(temp)

  # SAVE OUTPUT 
  if(!is.na(output_dir)) { 
    if(!dir.exists(output_dir)) dir.create(output_dir) 
    fwrite(PUMS[['p']], file.path(output_dir, 'pums_persons.csv')) 
    fwrite(PUMS[['h']], file.path(output_dir, 'pums_households.csv')) 
  } else { 
    return(PUMS) 
  } 
} 
 
# Downloads and processes post-2000 PUMS 
getACSPUMS <- function(STATE, YEAR='2000', GetPumas='ALL', output_dir, save_zip = T){ 
  #VARS 
  try({ 
    state_codes <- fread('state.txt') 
    state_codes <- setNames(state_codes$STATE, state_codes$STUSAB) 
    }) 
  base_url = 'https://www2.census.gov/programs-surveys/acs/data/pums'

  # Convert a full state name (e.g., 'Arizona') to its two-letter abbreviation
  if(!is.numeric(STATE) && nchar(STATE) > 2) {
    STATE <- tolower(state.abb[match(toTitleCase(STATE),state.name)])
  }

  # Download the PUMS data to tempfile and load directly to data table
  PUMS <- lapply(c('p', 'h'), function(f) {
    url <- file.path(base_url, YEAR, '5-Year',
                     paste0('csv_', f, tolower(STATE), '.zip'))

    if(save_zip == F){
      # Keep the .zip extension so process_acs_pums recognizes the file type
      temp <- tempfile(fileext = '.zip')
    } else {
      temp <- file.path(output_dir, basename(url))
    }

    # Download in binary mode so the zip archive is not corrupted
    download.file(url, temp, mode = 'wb')
    df <- process_acs_pums(temp, type=f, GetPumas)

    return(df)
  })
  names(PUMS) <- c('p', 'h')

  # SAVE OUTPUT
  if(!is.na(output_dir)) {
    if(!dir.exists(output_dir)) dir.create(output_dir)
    fwrite(PUMS[['p']], file.path(output_dir, 'pums_persons.csv'))
    fwrite(PUMS[['h']], file.path(output_dir, 'pums_households.csv'))
  } else {
    return(PUMS)
  }
}