The IRS publishes Form 990 data in a machine-readable format through Amazon Web Services (AWS). However, the publicly available data omits donor information or other personally identifiable information. Furthermore, the data includes filings and associated data from filing year 2011 to the present and is available as a downloadable XML file from the web via AWS. Also, only 60 percent of all Form 990 returns are electronically filed with the IRS. Previously, this Form 990 data was only available in image files, and those filings that are not eFiled are still available only in this format.
The project involves extracting, transforming, and loading an IRS 990 dataset to a CDOS designated internal server, and then subsetting it to Colorado for publication to data.colorado.gov, the Colorado Information Marketplace (CIM).
Problem and Solution
CDOS required help in extracting, transforming and loading this dataset into their designated internal server. The database itself was created by the transformation of all available IRS Form 990 XML data available in the AWS S3 bucket into a database instance on a CDOS server. This also included monthly updates of the data and the subset of Charities doing business in Colorado. This was furthermore published with monthly updates to the Colorado Information Marketplace.
The solution would initially publish and later automate the extract of completed IRS Form 990 files (2011-current) for the US, transform the data subset to Colorado, update metadata to support publishing, and load into to the CDOS internal server (US data) and subsequently CIM (CO data).
- Create ETL IRS Form 990 files for the US.
- Upload of ETL scripts and .bak file to CDOS FTP
- Automate ETL IRS Form 990 files for the US
- Provide user manuals for script package to implement monthly updates from IRS AWS to CDOS
- Source Map
- Reverse Engineered xpaths
- Transform Map
- Reverse Engineered xpaths mapped to pdf lines
- Schema reconciliation between years and versions
- Fields to build into CIM query
- Selected thematic tables and their fields
- Publish data to CIM, metadata csv sheets, set up automation
- Complete metadata csv sheets, publish to CIM, connect
- Data Cleaning
- Comparison of data cleaning options
- Post database build table validation
- Database User Manual
- Project Management
- On site meetings and agenda logs
Outcome and Benefit
All charities across the nation file the IRS 990 form. Also, groups publish the data to an AWS bucket in XML format. Charities regulators in each state want access to the data so they can better assess the performance and ethical nature of charities operating in their states, but charities regulators cannot read XML data. Consequently, Xentity took the data from the IRS and converted it into data that people can easily access, subset it to Colorado and published it to the CIM https://data.colorado.gov/browse?q=990&sortBy=relevance – in 10 tables. Also, each of these pieces of data provides valuable insights into Colorado charities. Furthermore, the best part is that it is open and accessible by the general public.