Analytica is building tools at the Centers for Medicare and Medicaid Services (CMS) Office of Information Technology that will help data flow seamlessly between patients, providers, caregivers, researchers, innovators, and payers.
At the Interoperability Forum in Washington, DC, CMS Administrator Seema Verma said:

"As the head of CMS, one of my main missions is to break down any and all barriers to interoperability and create that one-stop shop for health data that will help inform our health care decisions with a complete picture of our medical history."

The Analytica Data Architecture team at CMS OIT is supporting Administrator Verma's vision of a common language between CMS and partner systems. The team is developing prototypes focused on making data sets easy to share and on building widely usable Application Programming Interfaces (APIs).

Analytica also built the CMS Enterprise Logical Data Model (ELDM). This large model, implemented in erwin (https://www.erwin.com/), describes the interactions between CMS business systems at the data level. The intent of this work is to establish a common set of terms, metadata, and interface definitions. With this model in place, it is possible to build systems that leverage this shared data foundation.
Today, Analytica is pursuing two approaches:
- Amundsen Metadata Mapping and Discovery
- Amazon Deequ – Data Quality as a Service
Amundsen Metadata Mapping and Discovery
Amundsen (amundsen.io) is an open-source data discovery and metadata engine written by Lyft (www.lyft.com) to organize its large, disparate corpus of data. It was designed to let data scientists discover and trust data sources by establishing each dataset's provenance and by mapping the linkages between datasets in a consistent way. At CMS, Amundsen is being linked to the ELDM so that Data Scientists and Architects can accurately connect datasets, design interoperable tools, and move data between systems reliably. Amundsen is built on a dual-database platform: text searches are performed with Elasticsearch (https://www.elastic.co/), while a Neo4j (www.neo4j.com) graph database shows how the data is interconnected. This context makes it possible for CMS and its partners to make informed decisions about data sources and their suitability for specific uses.
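The dual-database design can be illustrated with a minimal sketch: a text index answers "which datasets mention this term?" while a graph of lineage edges answers "where did this dataset come from?" Plain Python dictionaries stand in here for Elasticsearch and Neo4j, and all dataset names and descriptions are hypothetical.

```python
# Minimal sketch of Amundsen-style discovery: a text index for search
# plus a graph of lineage edges for provenance. Plain Python stands in
# for Elasticsearch (search) and Neo4j (graph); all names are hypothetical.

# "Search index": dataset name -> description (Elasticsearch stand-in)
search_index = {
    "claims_2023": "Medicare Part B claims, calendar year 2023",
    "provider_registry": "National registry of enrolled providers",
    "claims_summary": "Monthly rollup derived from claims_2023",
}

# "Graph": directed lineage edges, derived dataset -> source datasets
lineage = {
    "claims_summary": ["claims_2023", "provider_registry"],
}

def search(term: str) -> list[str]:
    """Full-text search over dataset descriptions."""
    term = term.lower()
    return [name for name, desc in search_index.items() if term in desc.lower()]

def provenance(name: str) -> list[str]:
    """Walk lineage edges to find every upstream source of a dataset."""
    sources = []
    for parent in lineage.get(name, []):
        sources.append(parent)
        sources.extend(provenance(parent))
    return sources

print(search("claims"))              # ['claims_2023', 'claims_summary']
print(provenance("claims_summary"))  # ['claims_2023', 'provider_registry']
```

In the real system, the Elasticsearch index makes datasets findable by keyword while the Neo4j graph records the provenance relationships, which is what lets a data scientist judge whether a source can be trusted for a given use.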
Amazon Deequ – Data Quality as a Service
In 2019, Amazon announced (https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) that it was making Deequ available as an open-source project. Deequ lets you calculate data quality metrics on datasets, define and verify data quality constraints, and report how datasets change over time. It is implemented on top of Apache Spark and is designed to scale to large (terabyte-scale) datasets. Analytica's prototype work at CMS is exploring ways to use Deequ as a processing engine that would provide Data Quality as a Service (DQaaS) to the CMS data community. Deequ would augment the Amundsen prototype by adding these kinds of insights about CMS datasets:
- Completeness (Are there a lot of blank values?)
- Compliance (Are all of the values in a column formatted as dates?)
- Correlation (Are dataset columns related to each other?)
- Distinctness (Are the dataset values all the same, or do they vary?)
- Unique Values (Does a column contain only a small set of values, such as "True"/"False" or "x"/" "?)
These insights are especially valuable when the datasets are huge or rapidly growing.
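In Deequ itself these metrics are computed by Spark jobs over full datasets; the functions below are only a plain-Python sketch of what three of the metrics measure, run against a small hypothetical column of values.

```python
# Plain-Python sketch of three Deequ-style data quality metrics.
# Deequ computes these with Spark at terabyte scale; this illustrates
# only what each metric means, using a hypothetical column of values.
import re

def completeness(column):
    """Fraction of non-blank values (how many values are missing?)."""
    return sum(v not in (None, "") for v in column) / len(column)

def compliance(column, pattern):
    """Fraction of values matching a format, e.g. an ISO date."""
    return sum(bool(re.fullmatch(pattern, v)) for v in column if v) / len(column)

def distinctness(column):
    """Fraction of distinct values -- low when values repeat heavily."""
    return len(set(column)) / len(column)

dates = ["2024-01-01", "2024-02-15", "", "2024-02-15", "not a date"]
print(completeness(dates))                      # 0.8 (one blank of five)
print(compliance(dates, r"\d{4}-\d{2}-\d{2}"))  # 0.6 (three valid dates)
print(distinctness(dates))                      # 0.8 (four distinct values)
```

A DQaaS layer would run checks like these automatically as datasets are registered or refreshed, so that the quality scores appear alongside the metadata that Amundsen already surfaces.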
By pursuing these initiatives, Analytica is building tools and services that will bring Administrator Verma's vision of interoperability and openness to life across CMS.