Recently, we had the pleasure of sitting down with David Vennergrund, Director, Data and Analytics, at CSRA Inc. to talk about data-driven government and how data science is now fundamental to public policy. During our conversation David explained how disruptive technologies introduced the data-driven revolution and how government agencies are putting data to work to meet their missions.
Jenna Sindle (JS): David thanks for sitting down with us today. We’re interested in learning about how data is changing the ways in which agencies meet their missions in ways from how policy is informed to how agencies operate more efficiently and effectively in this time of tight budgets. Before we dive into the specifics, can you tell us about the origins of this data-driven revolution in the public sector?
David Vennergrund (DV): At its foundation, the data-driven revolution in government agencies was driven by an over-abundance of data and the need for more powerful tools to leverage that data – to put it to work on real-world problems. Not so many years ago we could only leverage a limited amount of the data we generated. We collected structured data, stored it in databases and elaborate data warehouses, and analyzed a tiny portion of it – at great expense in time and money. We analyzed even less of the semi-structured data that came from texts, emails, logs, and other sources. All of this unexamined data had value – but it was not leveraged. Now, thanks to inexpensive data storage and distributed processing frameworks, we are able to leverage this ‘dark data’ as well, which makes the world a much more interesting place.
I like to think of it this way: we’ve recently evolved from an IT epoch where we examined small amounts of data with limited IT resources, and all we could answer were small questions like how many widgets were sold last year. Now that we can access far more data, and have far more resources to ingest, integrate, store, and analyze it, we can answer far more interesting questions and put this data to work to solve some of our biggest societal challenges. Our new-found power to analyze big data moves us from using a healthcare database to make insurance payments to using that same database and its associated “dark” data to deliver personalized healthcare with potentially lifesaving outcomes.
JS: What are some of the tools that are fundamental to this era of big data analytics?
DV: Two major tool developments ushered in the era of big data analytics – open source software and inexpensive computing platforms.
Open-source software ecosystems like Apache Hadoop, Apache Spark, and in-memory databases created powerful distributed data storage and processing frameworks on inexpensive commodity hardware. Pioneers at Facebook, Google, and Amazon solved real-world big data problems, then shared their source code. Previously, organizations were constrained by expensive database licenses, storage, and computing platforms that deterred data analysis – not only because the costs were prohibitive, but because they created silos where data could not be used in different scenarios or environments, or combined with other pieces of information to illuminate new patterns or relationships.
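To make the idea behind those frameworks concrete, here is a minimal sketch of the map/reduce pattern that Hadoop and Spark popularized. This is an illustrative, in-process toy with made-up log lines, not how either framework is actually invoked; in a real cluster the map and reduce steps run in parallel across many commodity machines.

```python
# Local sketch of the map/reduce pattern behind Hadoop and Spark.
# Real frameworks distribute these phases across a cluster; here both
# run in a single process purely to illustrate the programming model.
from collections import defaultdict

def map_phase(records):
    """Emit (word, 1) pairs -- conceptually done in parallel per record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum counts per key -- conceptually done in parallel per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

logs = ["error disk full", "warning disk slow", "error network down"]
print(reduce_phase(map_phase(logs)))  # counts each word across all records
```

The same two-phase structure scales from this toy example to petabyte-sized datasets, which is precisely what made the open-source versions so disruptive.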
The second development was elastic cloud storage and processing. While it seems like a pretty standard part of our environments now, it wasn’t so long ago that cloud storage was a truly disruptive technology. Amazon AWS, Microsoft Azure, and other cloud service providers have democratized data storage. Not only has data storage been commoditized and thus made more economical, it has also enabled elastic resources: organizations can ramp up computing resources when they need to process data, and ramp them back down once the analytic phase is complete. In this way data storage and compute are no longer a significant capital expense, but a modest operating expense.
The combination of open-source software frameworks and elastic cloud storage and processing facilitated collaborative efforts that drove more innovative data science tools. For example, R and Python have thriving data science communities that create and share powerful data science packages by the hundreds. This drive for more powerful tools to analyze ever-increasing data sets has spread far from its origins at Google, Facebook, and Amazon to data-driven government organizations across the Civil, Health, Defense, Homeland Security, and Intelligence agencies.
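As a small illustration of the kind of analysis those Python communities make routine, the sketch below answers the earlier “how many widgets were sold last year” style of question and goes one step further by fitting a trend line. The yearly sales figures are hypothetical, and the sketch uses only the standard library; in practice packages like NumPy, pandas, or scikit-learn would do this in a line or two.

```python
# Least-squares trend fit over hypothetical yearly widget sales.
# Standard-library-only sketch; real data science work would lean on
# packages such as NumPy or scikit-learn for the same computation.

def fit_trend(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [2013, 2014, 2015, 2016]   # hypothetical data
sales = [120, 135, 160, 180]       # widgets sold per year (hypothetical)

slope, intercept = fit_trend(years, sales)
print(f"Sales grow by roughly {slope:.1f} widgets per year")  # 20.5/year
```

Moving from counting last year’s sales to projecting next year’s is exactly the shift from small questions to more interesting ones described above.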
JS: Speaking of federal agencies, how are they using data to meet the mission?
DV: There are literally hundreds of data science projects across federal agencies in production and under development. I will share a few examples we have contributed to: National Institutes of Health (NIH) genomic data analytics is leading to precision medicine; Centers for Medicare and Medicaid Services (CMS) uses data to improve healthcare delivery and reduce fraud, waste, and abuse; Federal Aviation Administration (FAA) integrates unstructured data, including weather information, flight data, and migratory bird patterns to improve flight safety; and Veterans Affairs data analytics ensure that veterans receive their benefits in a timely manner.
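To give a flavor of the fraud, waste, and abuse work mentioned above, here is a deliberately simple baseline technique: flagging claim amounts that sit far from the mean by a z-score test. This is an illustrative sketch with hypothetical numbers, not how any agency system actually works; production fraud detection uses far richer features and models.

```python
# Illustrative sketch (not an agency system): flag outlier claim amounts
# with a simple z-score test, a common baseline for anomaly detection.
from statistics import mean, stdev

def flag_outliers(amounts, threshold=2.0):
    """Return indices of claims more than `threshold` std devs from the mean."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [i for i, a in enumerate(amounts)
            if abs(a - mu) / sigma > threshold]

claims = [210, 195, 205, 220, 198, 4000, 215]  # hypothetical claim amounts
print(flag_outliers(claims))  # flags index 5, the 4000 claim
```

A human analyst would then review the flagged claims; the value of the analytics is in narrowing millions of records down to the handful worth investigating.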
In addition, federal agencies understand the value of data across an agency. They have created and staffed a new “Chief” role – the Chief Data Officer – a leader who leverages and analyzes data and oversees its publication and sharing with other agencies and the public at Data.gov.
JS: With all this data-driven activity is there any truth to the claims that we’re facing a shortage of data scientists?
DV: On the one hand, we are facing a shortage of senior-level data scientists – the highly trained Ph.D.s who develop data algorithms and conduct advanced data analysis. Ph.D. programs take several years to complete, so it will be a while before Ph.D. data scientists graduate at a significant rate. This shortage is real and well documented.
On the other hand, what many people discount, if not miss, is that we have two untapped segments of our population that are ideally suited to cross-training in data science programs.

First, consider the digital natives recently graduated from colleges and universities. While they might not have much formal training in the data sciences (programming, statistics, machine learning, database modeling), they have an innate understanding of how data is leveraged to meet their needs. They are surrounded by and use AI algorithms to suggest purchases, find travel routes, meet friends, and handle many other daily tasks and social interactions. These digital natives can be trained in self-service tools to integrate, analyze, and visualize data. And they can get that training online – YouTube, OpenStack, and the like are today’s assistant professors.

Next, consider our existing workforce of data integration, database, report, and dashboard architects and developers. This workforce may be finding that their core skills are less needed as demand for structured data warehouses drops. Yet they have a deep understanding of data and how to move, organize, and analyze it. This workforce is quite valuable and relevant – and with cross-training can fill the gap and become data scientists.