I always wanted to be a data janitor


You know the data wrangling field is becoming more mainstream when The New York Times covers it at this level: "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights" (NYTimes.com).

The article emphasizes that getting data – from sensors, web service feeds, corporate databases, smartphone observations – prepared and ready to be consumed still takes a huge effort:

"Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

It hits home a point we love to belabor and software companies do not: be it ERP, GIS, MIS, or analytics, vendors show great demos, but with data that is already prepared, or with only the easy parts of your data. Everything works great in their environment, and as long as you feed it good data, it performs and rocks. But the demos consistently gloss over what it takes to wrangle, clean up, and prepare that data.
As a huge proponent of the view that good data is a – or the – major problem and barrier, it's nice to see investment moving into startups that help – ClearStory, Trifacta, Paxata, and others in the field. In the meantime, we need to make sure we always stay on top of the best, proven approaches so the team can bring them to the table, using techniques that range from browser-based network applications in NodeJS, to NoSQL databases, to ETL, to simply keeping your Excel skills up.
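
To make the "janitor work" concrete, here is a minimal sketch of a typical cleanup pass. The file, field names, and rules are hypothetical; the shape of the work – normalizing, discarding, retyping – is the point:

```python
import csv

# Hypothetical cleanup pass over a messy sensor extract. The file and
# field names are illustrative, not from a real feed.
def clean_rows(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Drop rows with no reading at all rather than guessing a value.
            if not row.get("reading"):
                continue
            yield {
                "station_id": row["station_id"].strip().upper(),
                "timestamp": row["timestamp"].strip(),
                "reading": float(row["reading"]),  # retype text to number
            }

cleaned = list(clean_rows("sensor_feed.csv"))
print(f"{len(cleaned)} usable rows")
```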
In poking at the O'Reilly book "Doing Data Science" and its discussion of data wrangling and janitor work, it's not a bad read to pick up. A great quote in the NYT article from the Doing Data Science author is:

"You prepared your data for a certain purpose, but then you learn something new, and the purpose changes."

But the article does tend to focus on big data alone, and not big data in an open data world.

Even if the data is scrubbed for machine-to-machine processing, as the article emphasizes, that still doesn't address the fact that, with the push for data catalogs, data curators and creators HATE – with a passion – metadata creation. It's one more burdensome step at the end of a long cycle. Furthermore, beyond it being the right thing to do, there is currently no real incentive to assure the data is tagged properly.

Let's take the metadata needed to tag data for proper use, using a real-world example recently discussed with a biological data analytics expert.

A museum wants some biological data to support an exhibit. It has a student collect 100 scientific observation records on larvae using well-calibrated sensors, and the data gets published. An academic study on climate change finds the data. The data shows a lower larvae count than historical records and appears to demonstrate heat changes driving the decline. The data shows the sensors and tools used were scientifically accurate, so it is sourced and used. This is a great example of misuse of data. True, the data was gathered with great sensors and the right techniques, but it was never intended as a complete study of the area – when the student hit 100 records, they were out of there. An observation is not just any other observation: its intended use is vitally important to denote.
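
One lightweight remedy is to make intended use a first-class metadata field rather than an afterthought. The sketch below is illustrative only – the field names are our own invention, not from any metadata standard:

```python
# Illustrative only: dataset metadata that records intended use up front.
# The schema is our own invention, not an established metadata standard.
larvae_observations = {
    "title": "Larvae observations, museum exhibit collection",
    "sensors": "well calibrated",
    "record_count": 100,
    "intended_use": "exhibit support; opportunistic sampling",
    "not_suitable_for": ["regional population trends", "climate change studies"],
    "note": "collection stopped at a quota of 100 records, not at full area coverage",
}

def check_use(metadata, proposed_use):
    # Refuse a use the data creators explicitly warned against.
    if proposed_use in metadata.get("not_suitable_for", []):
        raise ValueError(f"dataset not suitable for: {proposed_use}")

check_use(larvae_observations, "climate change studies")  # raises ValueError
```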


Another great example, from a slightly different use angle: a major government records organization wanted to open up its inventory of records data. That inventory is meticulously groomed – the data is very accurate, entry goes through a rigorous workflow, and the metadata and access to the actual data record are kept accurate in real time. The metadata on data use is perfect, and lawyers and other FOIA experts are well versed in proper use. But the metadata that would help someone discover a record among the billions was not prepared to suit the more natural ways people explore, discover, and look up records.

Historically, armies of librarians were tasked with searching the records, but they have been replaced with web 1.0 online search systems that lack the natural language interpretive skills (or NLP) and signals a librarian would apply. Even where such capabilities exist, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals – what we called interpretative signals when we discussed them back in 2013.

This is another great example of overlooked data preparation: "Publish its metadata in the card catalog standard, and my work is done – let the search engine figure it out." Once again, though, the search engine will just tell you the museum larvae record matches your science study need.
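
For contrast, here is a toy sketch of what tuning with interpretative signals might look like: rather than relying on a single catalog-field match, the ranker blends several weighted signals. The signal names, weights, and records are invented for illustration:

```python
# Toy ranker blending several signals instead of one catalog-field match.
# Signal names, weights, and records are invented for illustration.
WEIGHTS = {"title_match": 3.0, "use_match": 5.0, "recency": 1.0, "popularity": 0.5}

def score(record, query, current_year=2014):
    q = query.lower()
    signals = {
        "title_match": 1.0 if q in record["title"].lower() else 0.0,
        "use_match": 1.0 if q in record.get("intended_use", "").lower() else 0.0,
        "recency": 1.0 / (current_year - record.get("year", 1900) + 1),
        "popularity": min(record.get("downloads", 0) / 1000.0, 1.0),
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())

records = [
    {"title": "Museum larvae records", "intended_use": "exhibit support", "year": 2014, "downloads": 40},
    {"title": "Regional larvae survey", "intended_use": "climate study", "year": 2012, "downloads": 900},
]
for r in sorted(records, key=lambda r: score(r, "climate"), reverse=True):
    print(r["title"], round(score(r, "climate"), 2))
```

With intended use weighted heavily, the survey actually meant for climate work outranks the museum's opportunistic records – something a bare keyword match would not guarantee.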


Go Code Colorado recognized with a StateScoop 50 award


Colorado has done something quite innovative and has been recognized with a StateScoop 50 award for State Innovation of the Year. 

And Xentity was glad to be a part of it.
Colorado’s Office of the Secretary of State invented and delivered Go Code Colorado, an app development competition for using public data to address challenges identified by the business community.  Three teams built winning apps and business plans and will have their technology licensed by the state for a year.  These teams have formed businesses; some are in discussions with investors and some have left their day jobs to focus on building their new venture.
Here’s the genius of Go Code Colorado: the State took funds from the filing fees that businesses pay to register their businesses and GAVE THEM BACK to the business community through Go Code Colorado… AND incubated three new promising businesses in the process.
Plans are to build on this first year and run Go Code Colorado for a second year… on a larger scale, improving on many aspects based on lessons learned from year one.
Xentity had the pleasure of participating in Go Code Colorado as the State’s data management provider — sourcing, curating, improving and publishing 38 datasets from state agencies from Higher Ed to Labor and Employment.
You can see more about Go Code and Xentity's support in our April blog entry: Go Code Colorado Open Data Effort is going into its final weeks. Also, check out the photo gallery of the final event.

Go Code Colorado Open Data Effort is going into its final weeks


States all around have gotten into the open data movement. Colorado has as well, and its recent Go Code Colorado effort is a unique entry into the foray (http://gocode.colorado.gov/).

Go Code Colorado was created to help Colorado companies grow, by giving them better and more usable access to public data. Teams will compete to build business apps, creating tools that Colorado businesses actually need, making our economy stronger.


The following is a great video summarizing the event, produced by the State and one of Xentity's colleagues, Engine7 Media.




Xentity is very proud to be supporting this innovative government solution

Xentity was awarded the IT consulting support contract for the Business Intelligence Center platform and data catalog, which supports the now-branded Go Code Colorado initiative. Xentity's consultants have provided the data and technology resources to manage and advise the publication of public-sector data to the Colorado Information Marketplace, and to provide technical support to developers who participate in the Challenge.

Xentity primarily has provided data platform support. We have provided data readiness analysis, data architecture guidance, project management, and the data analysts to "wrangle" the data (aka ETL) and get the datasets onto the platform. We have also provided IT and data support on-site at the multiple locations and events, to assure the challenge participants and finalists get the support they need to access and use the data and services. Finally, we are supporting the technical review of applications, to assure these applications can have a life beyond the "hackathon" stage.
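
As a rough illustration of that wrangle-then-load step, here is a generic sketch. The endpoint, file name, and column handling are hypothetical placeholders, not the actual Colorado Information Marketplace API:

```python
import csv
import json
import urllib.request

# Generic wrangle-then-publish sketch. The endpoint and file names are
# hypothetical placeholders, not the actual Colorado Information Marketplace API.
CATALOG_URL = "https://data-portal.example/api/upload"

def load_and_normalize(path):
    # Normalize header names so different agencies' extracts land in one shape.
    with open(path, newline="") as f:
        return [
            {k.strip().lower().replace(" ", "_"): v.strip() for k, v in row.items()}
            for row in csv.DictReader(f)
        ]

def publish(rows):
    req = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(rows).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

rows = load_and_normalize("labor_employment_extract.csv")  # hypothetical agency file
print(publish(rows))
```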

The final stages are coming in the first 10 days of May. The 10 finalists have demonstrated very viable solutions toward the goal of helping make our economy stronger.

Some more background and detail on how we got here

(The following is from the State as guidance to this effort)


Colorado government agencies possess large volumes of public business and economic data. This data can help businesses with strategic planning, but it exists in so many different places and formats that most businesses find it difficult to use. The Secretary of State's office will address this problem through the creation of the Business Intelligence Center (BIC). BIC seeks to aggregate and analyze data available to the business community.

This effort is led by the Colorado Secretary of State. The Secretary of State’s office interacts with hundreds of thousands of business entities, charities, and nonprofits in the state. The Secretary of State’s office collects, manages, and disseminates large amounts of basic data about those organizations and wanted to make the data useful to Colorado businesses. 

The Department sought to make this data more useful and collaborated with the Leeds School of Business at the University of Colorado to publish the Quarterly Business and Economic Indicator Report. This report combines Department data with other economic data collected by the Leeds School to provide meaningful economic information to the business community. For instance, new business filings are a leading indicator of job creation. With this and other information provided in the report, the business community can make smarter decisions that will grow the Colorado economy.

Since first publishing the report in 2012, the Secretary of State has received comments from many members of the business community asking to see more detailed data regarding economic trends, in order to better understand the distribution of commerce in Colorado. This includes access to the location, size, vibrancy, and concentration of key business nodes. While this level of detail would be tremendously helpful, the Department cannot provide the information because multiple state agencies collect the desired data, and it is not readily available in a common place or even a common format.

A central data collection point is needed. During meetings with other government agencies, Department staff concluded that these data requests could be met by aggregating the information spread throughout various agencies and databases into a single tool – breaking down agency silos and better cataloging existing resources. Department staff also concluded that access to and availability of the data are not enough. To make the raw data useful to the vast majority of business owners, data analysis and visualization tools are needed. These conclusions led to the Business Intelligence Center project.

The Business Intelligence Center consists of a centralized data catalog that combines public data into a meaningful tool for businesses. 

The vision for this project is two-fold. First, it consolidates public data relevant to businesses on a single platform. Second, it gives businesses the tools to make the data useful. The second goal is achieved through a civic apps challenge – the Colorado Business Innovation Challenge – that will give financial incentives to the technology community to build web and mobile applications that use state and other data to solve existing business challenges.

The data platform is akin to an information clearinghouse. It will make data sources currently dispersed over multiple government departments and agencies accessible in a common location. This platform will offer Colorado businesses unprecedented access to public data that is validated and relevant to short- and long-term needs. Besides enhancing businesses' access to state data, the BIC will also contribute to economic growth. The creation of the BIC will make data available to all Colorado businesses at no additional cost. Currently, only large entities with the time, staff, and budget to engage in detailed statistical analysis can use these data sets. Providing this data to businesses of every type and size in Colorado offers a unique opportunity to contribute to economic development. The BIC will nurture key industry networks and lay the foundation for a digital infrastructure that will continue to expand and improve over time.

The Colorado Business Innovation Challenge is an innovative way to create solutions and ensure the BIC is useful to Colorado businesses.

Simply making the data available is insufficient for most business owners. To truly help the vast majority of businesses – especially small businesses – tools must be developed to present the data in a useful and consumable form. Normally, government agencies develop tools to fill this information vacuum, but historically government has not been successful at developing highly useful and effective tools. A new approach is needed – that approach is the Colorado Business Innovation Challenge.

Modeled after the "civic apps" challenges that have been run in multiple cities across the United States and internationally, the Challenge presents the software development community with problem questions and then asks that community to create possible solutions. At the end of the challenge, the Secretary of State will license the most innovative and implementable web or mobile application. The best design will receive a contract with the Secretary of State to make the application available to the public on the Business Intelligence Center platform. The Department will also pursue partnerships with the Colorado technology and startup industry to provide additional incentives, such as mentoring, hosting, and office space, to the Challenge winners. The long-term intent of the program is not only to create an environment that fosters community involvement through the Challenge, but also to sustain the tools developed in the Challenge.

Exploring Public Domain Maps and Imagery – Historic Denver West


Doing some light examination of Denver West, the following shows just two aspects of free public domain data – historic maps and historic imagery. There is so much more beyond imagery and maps, and even more maps and imagery than shown here (e.g., MODIS, LANDSAT, and other NASA products).

Historic Maps

USGS Topo Map 1-meter Quads are, clockwise:

(Download this KML file to see all options, or pick one of the years listed after each quad to grab that historic map. A short script for listing the KML's contents follows this list.)

Golden – Download – 1939, 1942, 1944, 1957 (6 different dates for 15 quads, 1939-1965)

Arvada – Download – 1941, 1944, 1950, 1957, 1965 (5 different dates for 12 quads, 1941-1965)

Morrison – Download – 1938, 1942, 1947, 1957, 1965 (5 different dates for 11 quads, 1938-1965)

Fort Logan – Download – 1941, 1948, 1957, 1965 (4 different dates for 8 quads, 1939-1965)
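
If you grab the KML file linked above, a few lines of Python are enough to list what it offers. The local file name here is a hypothetical copy of that download; the tags follow the standard KML 2.2 namespace:

```python
import xml.etree.ElementTree as ET

# List the placemarks in the historic-topo KML linked above. The local file
# name is a hypothetical copy of that download; tags use the KML 2.2 namespace.
KML_NS = "{http://www.opengis.net/kml/2.2}"

tree = ET.parse("historic_topo_quads.kml")
for placemark in tree.getroot().iter(f"{KML_NS}Placemark"):
    name = placemark.findtext(f"{KML_NS}name", default="(unnamed)")
    print(name)
```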

Historic Imagery

Historic imagery, by type and year, with sample metadata (actual files range from 20-200 MB):

Digital Orthophoto Quadrangles (DOQs), 1994
  • Entity ID: DI00000000912683
  • Acquisition Date: 23-SEP-94
  • Map Name: FORT LOGAN
  • State: CO

National Aerial Photography Program (NAPP), 1988
  • Entity ID: NP0NAPP001033224
  • Coordinates: 39.8125, -105.03125
  • Acquisition Date: 10-JUL-88

National High Altitude Photography (NHAP), 1983
  • Coordinates: 39.671135, -105.043794
  • Acquisition Date: 25-JUN-80
  • Scale: 120400
  • Download ~30MB file

Space Acquired Photography / Single Frame Records (black-and-white, natural color, and color infrared aerial photographs; 400 or 1,000 dpi), 1978
  • Entity ID: AR1VEQP00010142
  • Coordinates: 39.688989, -105.053705
  • Acquisition Date: 01-SEP-78
  • Scale: 78000

Aerial Photo Mosaics (used when creating early/mid-era USGS topo maps), 1953
  • Entity ID: ARDDA001260930776
  • Coordinates: 39.5, -105.5
  • Acquisition Date: 25-SEP-53
  • Scale: 63299
  • Download the 30MB file

High Resolution Orthoimagery (corrected; generally 0.3-meter, color), 2002
  • Coverage: HistoricOrthoCoverageAreas.kml

Declass 1 (1996) Stereo Images, 1965
  • Entity ID: DS1027-1015DA011
  • Coordinates: 39.69, -104.516
  • Camera Resolution: Stereo Medium
  • Acquisition Date: 10-DEC-65

Declass 2 (2002) Stereo Images, 1966
  • Entity ID: DZB00403500080H001015
  • Coordinates: 42.49, -103.45
  • Acquisition Date: 10-DEC-66
  • Camera Resolution: 2 to 4 feet

How Open Data Contributes Toward Better Interagency Collaboration and Orchestration at all Levels


In a recent email thread with Xentity, NASCIO, and members of Smart Lean Government, the following thoughts on open data were offered by Eric Sweden, Program Director for Enterprise Architecture & Governance at the National Association of State Chief Information Officers (NASCIO), and are republished with his permission.

I believe open data contributes toward better interagency collaboration and orchestration at all levels – notwithstanding that PII is, and must be, specifically removed from open data initiatives. But there is a place for open data in serving the individual needs of citizens – for example, clinical epidemiology: employing population data, and even specific population data, in evaluating prognosis and treatment regimes. Think of the value in public health and medical services to underserved populations AND really anyone else. Trends, patterns, and correlations will surface for a similar approach and strategy in other government lines of business – we're just at the brink of this kind of data exploitation.

I'm looking beyond life events and also considering the complete Smart Lean Government concept. Life events are a critical element, but there are also events abstracted up from individuals to communities. So we move up an upside-down pyramid from life events to "community events" or "community issues." Consider open data – and the larger concept of open government – in enabling better government, and thus a necessary part of Smart Lean Government. Think about how governments are able to work better together in collaboration, and how that leads to sharing data and information.

For example, the Minnesota Department of Public Safety and Department of Transportation are working together to draw the necessary correlations between crash data (from DPS) and speed, road-condition, and weather data (from DOT) to develop a strategy for safer roads and highways.
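
As a sketch of what that correlation looks like mechanically, the snippet below joins crash counts to road-condition records by road segment and day. The files, columns, and threshold are hypothetical:

```python
import csv
from collections import defaultdict

# Hypothetical join of DPS crash data to DOT road-condition data by road
# segment and day. File names, columns, and the threshold are illustrative.
crash_counts = defaultdict(int)
with open("dps_crashes.csv", newline="") as f:
    for row in csv.DictReader(f):
        crash_counts[(row["segment_id"], row["date"])] += 1

with open("dot_conditions.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["segment_id"], row["date"])
        # Surface correlation candidates: repeated crashes on icy segments.
        if crash_counts[key] >= 3 and row["surface"] == "ice":
            print(key, crash_counts[key], "crashes on an icy segment")
```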

This particular example resonates with the "Imperatives of 21st Century Government Services" from volume one of the practical guide, and with steps 1-4 of the "Sustainable Shared Services Lifecycle Model" from volume two of the practical guide.

This example is at the community event level – but it impacts every individual and family that uses those roads and highways.