Mills Community Cluster used to process 8 million speed measurements in DelDOT statewide speed survey

Presenter: David Racca, Policy Scientist at CADSR, presenting for the research team of Edward Ratledge, Director of CADSR (Center for Applied Demography and Survey Research)

David Racca, policy scientist at the University of Delaware’s Center for Applied Demography and Survey Research (CADSR), led the June 23 session of the High Performance Computing (HPC) Symposium Series. Racca’s presentation focused on one of CADSR’s recent projects, in which Edward Ratledge’s research team used the Mills Community Cluster to process and analyze upwards of 10 million pieces of Delaware GPS travel data per month, including location, speed, and “travel way” measurements.

Racca began the presentation by explaining that CADSR’s primary mission is to ensure that data and information on important public policy issues are actively developed and made readily available. One way the Center is doing that is by working with DelDOT to collect, process, and analyze the GPS signals sent from over 2,400 state-owned and state-operated vehicles in order to develop a statewide travel speed survey. The data will be used in many applications, such as identifying problematic intersections, evaluating travel times along Delaware corridors, and understanding the relationship between land use and road performance.

The research team quickly realized that processing the data on personal computers was not going to cut it. It took the team’s PCs three days to process just five days’ worth of GPS travel data. At that rate, it would be impossible to work through the millions of DelDOT records going back to 2007.

Instead, the team used the Mills Community Cluster with custom-developed Python scripts to handle the massive quantity of data. With the new system, one week’s worth of data could be processed in just two days, more than doubling the effective processing rate.

The system works by capturing raw GPS data from Networkfleet’s server and creating GIS (geographic information system) files from it. Custom Python code references the ArcPy library, which provides a large set of GIS processing functions, and runs the data through ArcGIS software on the Mills cluster. Using this process, the software can find paths between GPS readings, interpolate speeds, build specific “trips,” and screen for errors and anomalies. The team ran and timed the jobs in batches, which kept the process running smoothly.
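The team’s ArcPy pipeline itself was not shown, but the speed-interpolation and error-screening step can be sketched in plain Python. This is a minimal illustration, not the team’s code: the function names, the (timestamp, latitude, longitude) record layout, and the 60 m/s sanity threshold are all assumptions.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in meters."""
    R = 6371000.0  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def interpolate_speeds(readings, max_speed_mps=60.0):
    """Given (timestamp_s, lat, lon) fixes sorted by time, estimate the
    average speed on each segment between consecutive fixes and flag
    anomalies: non-increasing timestamps or physically implausible speeds."""
    segments = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(readings, readings[1:]):
        dt = t1 - t0
        if dt <= 0:
            segments.append((None, "bad timestamp"))
            continue
        speed = haversine_m(la0, lo0, la1, lo1) / dt  # meters per second
        flag = "ok" if speed <= max_speed_mps else "anomaly"
        segments.append((speed, flag))
    return segments
```

In the real pipeline these per-segment speeds would then be snapped to the road network and grouped into “trips”; batching many vehicles’ files into independent jobs is what let the cluster keep the work flowing.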

The system currently runs as a non-parallel process, since ArcGIS does not support parallel processing; however, the team believes it is feasible to one day replace the Arc functions with custom-programmed functions that would allow it. Algorithms would then be used to optimize paths on the link/node network.
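Optimizing paths on a link/node network typically means a shortest-path search such as Dijkstra’s algorithm. A minimal sketch follows; the adjacency-list representation, node labels, and function name are illustrative assumptions, not the team’s planned implementation.

```python
import heapq

def shortest_path(links, source, target):
    """Dijkstra's algorithm on a link/node road network.
    `links` maps node -> list of (neighbor, link_length) pairs.
    Returns (path, total_length), or (None, inf) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length in links.get(node, []):
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]
```

With link lengths (or expected travel times) as weights, the same routine could route between consecutive GPS fixes independently of the Arc functions, which is what would open the door to running many searches in parallel.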

GIS file processing on the Mills cluster could eventually be used to display other data sets, such as employment, property value, and accessibility. Mills’ sheer power allows for numerous applications yet to be imagined. The group also looks forward to using its nodes on Farber to process these data.