I always wanted to be a data janitor


You know the data wrangling field is becoming mainstream when the NYTimes covers it at this level: “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” (NYTimes.com).

The article emphasizes that getting data – from sensors, web service feeds, corporate databases, smartphone observations – prepared and ready to be consumed still takes enormous effort:
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

It hits home a point we love to belabor and software companies do not: be it ERP, GIS, MIS, or analytics, vendors show great demos, but with data already prepared – or with some of your data, but only the easy stuff. It works great in their environment, and as long as you feed it good data, it performs and rocks. But the demos consistently gloss over what it takes to wrangle, clean up, and prepare that data.
As huge proponents of the view that bad data is a major problem – if not the major barrier – it is nice to see startup investment moving in to help: ClearStory, Trifacta, Paxata, and other start-ups in the field. In the meantime, we need to stay on top of the best proven approaches so the team can bring them to the table: everything from browser-based network applications in NodeJS, to NoSQL databases, to ETL, to simply keeping your Excel skills sharp.
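
To make the janitor work concrete, here is a minimal Python/pandas sketch of the kind of cleanup step we mean. The file and column names are hypothetical stand-ins, not from any particular project:

    import pandas as pd

    # Load a hypothetical raw sensor extract; real feeds are rarely tidy.
    raw = pd.read_csv("sensor_observations.csv")

    # Normalize column names: strip whitespace, lowercase, underscore-separate.
    raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

    # Coerce types; unparseable values become NaN/NaT instead of crashing the load.
    raw["observed_at"] = pd.to_datetime(raw["observed_at"], errors="coerce")
    raw["larvae_count"] = pd.to_numeric(raw["larvae_count"], errors="coerce")

    # Drop rows left unusable after coercion, and de-duplicate repeated readings.
    clean = raw.dropna(subset=["observed_at", "larvae_count"]).drop_duplicates()
    clean.to_csv("sensor_observations_clean.csv", index=False)

Even a well-behaved feed typically needs a dozen such steps before it is ready to explore, which is exactly where the 50 to 80 percent of time goes.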
Poking at the O'Reilly book “Doing Data Science” and its discussion of data wrangling and janitor work, it is not a bad read to pick up. A great quote in the NYT article from the book's author:
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,”

But the article does tend to focus on big data, and not big data in an open data world.

Even if the data is scrubbed for machine-to-machine processing, as the article emphasizes, that still doesn't address the fact that – with the push for data catalogs – data curators and creators HATE metadata creation with a passion. It is one more burdensome step at the end of a long cycle. Furthermore, beyond it being the right thing to do, there is currently no real incentive to ensure the data is tagged properly.

Let's take the metadata needed to tag data for proper use, via a real-world example recently discussed with a biological data analytics expert.

A museum wants biological data to support an exhibit. It has a student collect 100 scientific observation records on larvae using well-calibrated sensors, and the data gets published. An academic study on climate change finds the data, which shows a lower larvae count than historical records and appears to demonstrate the impact of heat changes. The metadata shows the sensors and tools used were scientifically accurate, so the data is cited and used. This is a great example of misuse of data. True, the data was gathered with great sensors and the right techniques, but it was never intended as a complete survey of the area – when the student hit 100 records, they were out of there. An observation is not just an observation; denoting its intended use is vitally important.
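
As a sketch of the missing piece (the field names below are illustrative, not from any formal metadata standard), the catalog record needs intended-use fields alongside the usual quality fields, so a downstream reviewer or search engine can flag the mismatch:

    # Illustrative catalog record: the quality fields alone look rigorous;
    # the intended-use fields are what would prevent the misuse above.
    larvae_dataset = {
        "title": "Larvae observations for museum exhibit",
        "record_count": 100,
        "sensor_calibration": "certified, well-calibrated field sensors",
        "intended_use": "exhibit illustration",
        "sampling_design": "convenience sample; collection stopped at 100 records",
        "suitable_for": ["education", "outreach"],
        "not_suitable_for": ["trend analysis", "climate research"],
    }

    def fit_for(purpose: str, record: dict) -> bool:
        """Crude fitness-for-use check a catalog could run before recommending data."""
        return purpose not in record["not_suitable_for"]

    print(fit_for("climate research", larvae_dataset))  # False: flag it, don't recommend it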


Another great example, from a slightly different use angle: a major government records organization wanted to leverage its inventory of records data, which is meticulously groomed. The data is very accurate, and each entry goes through a rigorous workflow to ensure its metadata, and access to the actual data record, is accurate in real time. The metadata on data use is perfect, and lawyers and other FOIA experts are well versed in proper use. But the metadata that would help someone discover a record among the billions was never prepared around the more natural ways people explore, look up, and discover records.

Historically, armies of librarians were tasked with searching the records; they have been replaced with web 1.0 online search systems that lack the natural language interpretation (NLP) and the signals a librarian would apply. Even where such systems exist, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals – what we call interpretative signals, which we discussed back in 2013.

This is another great example of overlooked data preparation: “Publish its metadata in the card catalog standard, and my work is done – let the search engine figure it out.” Once again, though, the search engine will just tell you the museum larvae record matches your science study need.
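
To make the idea of interpretative signals concrete, here is a toy Python sketch; the signals and weights are invented for illustration. Keyword matching alone cannot tell the exhibit dataset from a proper survey, while even one fitness-for-use signal demotes the wrong record:

    # Two toy records: a keyword match alone cannot tell them apart.
    records = [
        {"title": "larvae observations", "purpose": "exhibit", "complete_survey": False},
        {"title": "larvae population survey", "purpose": "research", "complete_survey": True},
    ]

    def keyword_score(query: str, record: dict) -> float:
        """Web 1.0 style: count query terms appearing in the title."""
        return float(sum(term in record["title"] for term in query.split()))

    def signal_score(query: str, record: dict) -> float:
        """Layer interpretative signals on top: intended use, survey completeness."""
        score = keyword_score(query, record)
        score += 2.0 if record["purpose"] == "research" else -2.0  # fitness-for-use
        score += 1.0 if record["complete_survey"] else 0.0         # coverage signal
        return score

    best = max(records, key=lambda r: signal_score("larvae climate", r))
    print(best["title"])  # "larvae population survey" - the exhibit data is demoted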


Go Code Colorado recognized with a StateScoop 50 award


Colorado has done something quite innovative and has been recognized with a StateScoop 50 award for State Innovation of the Year. 

And Xentity was glad to be a part of it.
Colorado’s Office of the Secretary of State invented and delivered Go Code Colorado, an app development competition in which teams use public data to address challenges identified by the business community.  Three teams built winning apps and business plans and will have their technology licensed by the state for a year.  These teams have formed businesses; some are in discussions with investors and some have left their day jobs to focus on building their new ventures.
Here’s the genius of Go Code Colorado: the State took funds from the filing fees that businesses pay to register their businesses and GAVE IT BACK to the business community through Go Code Colorado… AND incubated three new promising businesses in the process.
Plans are to build on this first year and run Go Code Colorado for a second year… on a larger scale and improving on many aspects based on lessons learned from year one.
Xentity had the pleasure of participating in Go Code Colorado as the State’s data management provider — sourcing, curating, improving and publishing 38 datasets from state agencies from Higher Ed to Labor and Employment.
You can see more about Go Code and Xentity’s support in our April blog entry: Go Code Colorado Open Data Effort is going into its final weeks. Also, check out the photo gallery of the final event.

Go Code Colorado Open Data Effort is going into its final weeks


States all around have gotten into the open data movement. Colorado has as well, and its recent Go Code Colorado effort is a unique entry into the foray (http://gocode.colorado.gov/).

Go Code Colorado was created to help Colorado companies grow, by giving them better and more usable access to public data. Teams will compete to build business apps, creating tools that Colorado businesses actually need, making our economy stronger.


The following great video, produced by the State and one of Xentity’s colleagues, Engine7 Media, summarizes the event.




Xentity is very proud to be supporting this innovative Government Solution

Xentity was awarded the IT consulting support contract for the Business Intelligence Center platform and data catalog, which supports the now-branded Go Code Colorado initiative. Xentity’s consultants have provided the data and technology resources to manage and advise the publication of public sector data to the Colorado Information Marketplace, and to provide technical support to developers who participate in the Challenge.

Xentity has primarily provided data platform support: data readiness analysis, data architecture guidance, project management, and the data analysts to “wrangle” the data (aka ETL) to get the datasets onto the platform. We have also provided IT and data support on-site at the multiple locations and events, to assure the challenge participants and finalists get the support they need to be successful in accessing and using the data and services. Finally, we are supporting the technical review of applications, to assure these applications can have a life beyond the “hackathon” stage.
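
As a rough illustration of what “data readiness analysis” can look like in practice, here is a short sketch; the rules are simplified stand-ins we invented, not the platform's actual publication requirements:

    import pandas as pd

    # Invented, simplified readiness rules; real platform requirements differ.
    REQUIRED_METADATA = {"title", "description", "agency", "update_frequency"}

    def readiness_issues(df: pd.DataFrame, metadata: dict) -> list:
        """Return the problems blocking publication; an empty list means ready."""
        issues = [f"missing metadata field: {f}"
                  for f in sorted(REQUIRED_METADATA - metadata.keys())]
        issues += [f"column '{c}' is entirely empty"
                   for c in df.columns if df[c].isna().all()]
        if df.duplicated().any():
            issues.append("dataset contains duplicate rows")
        return issues

    df = pd.read_csv("labor_statistics.csv")  # hypothetical agency extract
    print(readiness_issues(df, {"title": "Labor statistics", "agency": "CDLE"}))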

The final stages come in the first 10 days of May. The 10 finalists have demonstrated very viable solutions toward the goal of helping make our economy stronger.

Some more background and detail on how we got here

(The following is from the State as guidance to this effort)


Colorado government agencies possess large volumes of public business and economic data. This data can help businesses with strategic planning, but it exists in so many different places and formats that it is difficult for most businesses to use. The Secretary of State’s office will address this problem through the creation of the Business Intelligence Center (BIC). BIC seeks to aggregate and analyze data available to the business community.

This effort is led by the Colorado Secretary of State. The Secretary of State’s office interacts with hundreds of thousands of business entities, charities, and nonprofits in the state. The Secretary of State’s office collects, manages, and disseminates large amounts of basic data about those organizations and wanted to make the data useful to Colorado businesses. 

The Department sought to make this data more useful and collaborated with the Leeds School of Business at the University of Colorado to publish the Quarterly Business and Economic Indicator Report. This report combines Department data with other economic data collected by the Leeds School to provide meaningful economic information to the business community. For instance, new business filings are a leading indicator of job creation. With this and other information provided in the report, the business community can make smarter decisions that will grow the Colorado economy.

Since first publishing the report in 2012, the Secretary of State has received comments from many members of the business community asking to see more detailed data regarding economic trends, in order to better understand the distribution of commerce in Colorado. This includes access to the location, size, vibrancy, and concentration of key business nodes. While this level of detail would be tremendously helpful, the Department cannot provide the information because multiple state agencies collect the desired data and it is not readily available in a common place, or even a common format.

A central data collection point is needed. During meetings with other government agencies, Department staff concluded that these data requests could be met by aggregating the information spread throughout various agencies and databases into a single tool, breaking down agency silos and better cataloging existing resources. Department staff also concluded that access to and availability of the data are not enough. To make the raw data useful to the vast majority of business owners, data analysis and visualization tools are needed. These conclusions led to the Business Intelligence Center project.

The Business Intelligence Center consists of a centralized data catalog that combines public data into a meaningful tool for businesses. 

The vision for this project is two-fold. First, it consolidates public data relevant to businesses on a single platform. Second, it gives businesses the tools to make the data useful. The second goal is achieved through a civic apps challenge – the Colorado Business Innovation Challenge – that will give financial incentives to the technology community to build web and mobile applications that use state and other data to solve existing business challenges.

The data platform is akin to an information clearing house. It will make data sources currently dispersed over multiple government departments and agencies accessible in a common location. This platform will offer Colorado businesses unprecedented access to public data that is validated and relevant to short- and long-term needs. Besides enhancing businesses’ access to state data, the BIC will also contribute to economic growth. The creation of the BIC will make data available to all Colorado businesses at no additional cost. Currently, only large entities with the time, staff, and budget to engage in detailed statistical analysis can use these data sets. Providing this data to businesses of every type and size in Colorado is a unique opportunity to contribute to economic development. The BIC will nurture key industry networks and lay the foundation for a digital infrastructure that will continue to expand and improve over time.

The Colorado Business Innovation Challenge is an innovative way to create solutions and ensure the BIC is useful to Colorado businesses.

Simply making the data available is insufficient for most business owners. To truly help the vast majority of businesses – especially small businesses – tools must be developed to present the data in a useful and consumable form. Normally government agencies develop tools to fill this information vacuum, but historically the government has not been successful at developing highly useful and effective tools. A new approach is needed – that approach is the Colorado Business Innovation Challenge.

Modeled after “civic apps” challenges that have been run in multiple cities across the United States and internationally, the Challenge presents the software development community with problem questions and then asks that community to create possible solutions. At the end of the challenge, the Secretary of State will license the most innovative and implementable web or mobile application. The best design will receive a contract with the Secretary of State to make the application available to the public on the Business Intelligence Center platform. The Department will also pursue partnerships with the Colorado technology and startup industry to provide additional incentives, such as mentoring, hosting, and office space, to the Challenge winners. The long-term intent of the program is not only to create an environment for fostering community involvement through the Challenge, but also to sustain the tools developed in the Challenge.

In 2014 Every Business will be Disrupted by Open Technology


The article “In 2014 Every Business will be Disrupted by Open Technology” raises some key points about disruptive patterns. A theme we picked up on was the movement of open data and open platforms from the bleeding and leading edge into popular adoption.

As the article notes:

Yet the true impact begins not with invention, but adoption. That’s when the second and third-order effects kick in. After all, the automobile was important not because it ended travel by horse, but because it created suburbs, gas stations and shopping malls.

A few tangible themes we picked up on are:

  1. Stand-alone brands are shifting to open platforms for co-creation
  2. Open platforms built on a brand's intellectual property enhance its value rather than threaten it
  3. Open platforms provide enterprise-level capabilities to the smallest players, leveling the playing field, accelerating innovation, and amplifying competition
  4. Open platforms for co-creation shift the focus away from driving out inefficiencies and toward the power of networking and collaboration to create value

Where it's happening already: Federal, State, City, Commercial

FedFocus 2014 also emphasized that, with the budget appropriations for 2014 and 2015, two big disruptive areas will continue to be open data and big data – especially with the May 2013 release of the White House Open Data memorandum, which went into effect in November 2013. The memorandum will impact open data by:

Making Open and Machine Readable the New Default for Government Information, this Memorandum establishes a framework to help institutionalize the principles of effective information management at each stage of the information's life cycle to promote interoperability and openness. Whether or not particular information can be made public, agencies can apply this framework to all information resources to promote efficiency and produce value.

We are seeing states get into the mix as well, with open data movements like Go Code Colorado (http://gocode.colorado.gov/):

Go Code Colorado was created to help Colorado companies grow, by giving them better and more usable access to public data. Teams will compete to build business apps, creating tools that Colorado businesses actually need, making our economy stronger.

Also at the city level, the City of Raleigh, North Carolina is well recognized for its award-winning Open Data Portal.


We had previously tweeted about how IBM opened up their Watson cognitive computing API to developers… publicly. This is a big deal. They know that with an open platform ecosystem they not only get more use – which means more comfort, which means more apps – but that every transaction on the platform they are legally allowed to learn from improves the interpretative signals that make Watson so wicked smart. The article points out this example as well.


And back to national data assets: they are moving ahead to make their data more distributable over the cloud – moving data closer to cloud applications, and offering data via web services where datasets are too large, or updated too often, to sync, download, or sneakernet.
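
From the consumer side, that shift looks like paging through a web service rather than bulk downloading a file. A minimal sketch, assuming a hypothetical paged API (the endpoint and parameter names are placeholders, not a real service):

    import requests

    # Hypothetical paged data service; endpoint and parameters are placeholders.
    BASE_URL = "https://data.example.gov/api/records"

    def fetch_all(page_size=1000):
        """Stream records page by page instead of downloading the whole dataset."""
        offset = 0
        while True:
            resp = requests.get(BASE_URL, params={"limit": page_size, "offset": offset})
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                return
            yield from batch
            offset += page_size

    for record in fetch_all():
        ...  # process each record as it arrives; nothing is synced or stored locally

The dataset never lands on your disk, which is the point when it is too large or changes too often to copy around.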

Xentity and its partners have been at the forefront of all these movements.

We have enjoyed being on the leading edge since the early phases of this movement. Our architectures dwell less on commodity IT – which is not to undersell the importance of affordable, fast, robust, scalable, enabling IT services and data center models – and focus more on putting the “I” back in IT.

We have been moving National Geospatial Data Assets into these delivery models as data products and services (Xentity is awarded USGS IDIQ for Enterprise and Solution Architecture), supporting the architecture of data.gov (Xentity chosen to help re-arch data.gov), and recently supporting the data wrangling on Colorado's state open data efforts. We are examining “Can a predictable supply chain for geospatial data be done” and actively participating in NSF EarthCube, which looks to “Imagine a world… where you can easily plot data from any source and visualize it any way you want.” We have presented concepts

Our architecture methods (Developing a Transformation Approach) are slanted to examine mission-oriented performance gains, process efficiencies, data lifecycle management orientation, integrated service models, and balancing the technology footprint while innovating. For instance, we are heavily involved in the new ACT-IAC Smart Lean Government effort to look at aligning services across government and organizational boundaries around community life events, much like other nations are beginning to do.

Xentity is very excited about the open data movement, its supported platforms, and the traction it is getting in industry. This may move us forward from information services into the popular space of knowledge services (Why we focus on spatial data science).


GAO releases report on FGDC Role and Geospatial Information


GAO released a report on use of geospatial information titled “OMB and Agencies Can Reduce Duplication by Making Coordination a Priority”. The Reader's Digest version: focus on integrating data.


We tend to agree. FGDC is currently very focused on a service-enabling management model (the Geoplatform) to accomplish this. It is bold, but if its role as a service provisioner can directly or indirectly get it into the game on the real problem of data lifecycle management, it will have a chance to address it.

The point being, FGDC knows direct IT operations is not its role. But it also saw that being a sideline judge, with neither carrot nor stick, would not garner the direction and recommendations GAO suggests. FGDC is getting on the playing field: taking on the open service provider role, being the broker, using that role to drive IT costs down, and shifting the freed-up money toward the data issues cited. It is a bold, unique approach, and there remain many questions about whether a traditionally non-operational group can develop the culture to be effective. The proof will show over the next two years.

Below is our summary of the strategic direction for FGDC’s Geoplatform.

The challenges and recommendation sections are:

  1. FGDC Had Not Made Fully Implementing Key Activities for Coordinating Geospatial Data a Priority
  2. Departments Had Not Fully Implemented Important Activities for Coordinating and Managing Geospatial Data
  3. Theme-lead Agencies Had Not Fully Implemented Important Activities for Coordinating and Managing Geospatial Data
  4. OMB Did Not Have Complete and Reliable Information to Identify Duplicative Geospatial Investments

Our review of Background – then and now

First, the foundation FGDC has put in place: the Federal Geographic Data Committee (FGDC) has always been a catalyst and leader enabling the adoption and use of geospatial information.

The Federal Geographic Data Committee (FGDC) has been successfully creating the geospatial building blocks for the National Spatial Data Infrastructure (NSDI) and empowering users to exploit the value of geospatial information.  The FGDC has been leading the development of the NSDI by creating the standards and tools to organize the asset inventory, enhance data and system interoperability and increase the use of national geospatial assets. The FGDC has successfully created policy, metadata, data and lifecycle standards, clearinghouses, catalogs, segment architectures and platforms that broaden the types and number of geospatial users while increasing the reuse of geospatial assets. [1] 

What is next? The Geospatial Platform and NGDA portfolio will be the mechanism for adoption of shared geospatial services to create customer value

Recently, the FGDC and its partners have expanded their vision to include the management and development of a shared services platform and a National Geospatial Data Asset (NGDA) portfolio. The goals are to “develop National Shared Services Capabilities, Ensure Accountability and Effective Development and Management of Federal Geospatial Resources, and Convene Leadership of the National Geospatial Community”, benefitting the communities of interest with cost savings and improved process and decision making. [2]

As the FGDC continues on the road to establish a world class geospatial data, application and service infrastructure, it will face significant challenges “where the Managing Partner, along with a growing partner network, will move from start‐up and proof‐of‐concept to an operational Geospatial Platform”.[3]

Xentity has reviewed the FGDC’s current strategy, business plan and policies and identified the following critical issues that need to be solved to attain the goals:

  • Building and maintaining a federated, “tagged” [4] standards-based NGDA and an open, interoperable Geospatial Platform. The assets need to provide sufficient data quantity, quality, and service performance to attract and sustain partner and customer engagement [5]
  • Developing a customer base with enough critical mass to justify the FGDC portfolio and provide an “Increased return on existing geospatial investments by promoting the reuse of data application, web sites, and tools, executed through the Geospatial Platform” [6]
  • Improving service management and customer-partner relationship capabilities to accelerate the adoption of the interoperable “shared services” vision and satisfy customers [7]
  • Executing simple, transparent, and responsive task order and requirements management processes that result in standards-based interoperable solutions [8]

The Big Challenges

Establish the financial value and business impact of the FGDC’s Portfolio!

The Geospatial Platform and NGDA will provide valuable cost-saving opportunities for their adopters. They will save employees' time, avoid redundant data acquisition and management costs, and improve decision making and business processes. The financial impact on government and commercial communities could be staggering. It is a big, and unknown, figure.

The Geospatial Platform by definition and design is a powerful, efficient technology with the capacity to generate a significant return on investment. It is a community investment and requires community participation to realize the return. The solution will need to assist the communities with the creation and sharing of return-on-investment information, cost modeling, case studies, funding strategies, tools, and references, and continue to build the investment justification. The solution will need to optimize funding enhancement and be responsive to shorter-term “spot” or within-current-budget opportunities while always positioning for long-term sustainability. The FGDC Geospatial Platform Strategic Plan suggests a truly efficient capability could create powerful streamlined channels between much broader stakeholder communities, including citizens, the private sector, or other government-to-government interfaces. Similar to the market and business impacts of GPS, DOQ, and satellite imaging technology, the platform could in turn promote more citizen satisfaction, private sector growth, or multiplier effects on engaged lines of business.

Getting a big return will demand continuous creative thinking to develop the investment, funding, management, and communication approaches needed to realize and calculate the value. It is a complex national challenge involving many organizations, geospatial policy, and conflicting requirements, interests, and intended uses.

The key is demonstrable successes.  Successes become the premise for investment strategy and cost savings for the customers.  Offering “a suite of well‐managed, highly available, and trusted geospatial data, services, and application, web site for use by Federal agencies—and their State, local, Tribal, and regional partners” [9] is the means to create the big value.  

”A successful model of enterprise service delivery will create an even greater business demand for these assets while reducing their incremental service delivery costs.” [10]

FGDC has to create and tell a compelling “geospatial” value proposition story

Successfully implementing the FGDC’s vision will demand a robust set of outreach and marketing capabilities. The solution will need to help construct the platform’s value proposition and marketing story to build and inform the community. The objective is to ensure longer-term sustainable funding and community participation. The solution will need to bring geospatial community awareness, incentive modeling, financial evaluation tools, multi-channel communication, and funding development experience to the FGDC. The solution will need to have transparently developed and implemented communication and marketing strategies that have led to growth in customer base, alternative portfolio funding models, and shared services environments for the geospatial communities. The solution's approach will need to be transparent, engage the customers and partners, and continuously build the community.

This is a challenging time to obtain needed capital and win customers, even for efficient economic engines like shared geospatial data and services. The community outreach will need to be impactful and trusted, and to tell the story of efficiencies, cost savings, and higher quality information. The platform and NGDA must impact the customers' program objectives. Figure 1 – FGDC Performance and Value Framework – shows how the platform's value chain aligns with the types of performance benefits that can be realized throughout its inherent processes. The supporting team will need to use this model to organize the “story” and convince customers and partners that the platform can:

  • Provide decision makers with content that they can use with confidence to support daily functions and important issues,
  • Provide consistency of base maps and services that can be used by multiple organizations to address complex issues,
  • Eliminate the need to choose from redundant geospatial resources by providing access to preferred data, maps and services [11]

As the approach is implemented, the FGDC, its partners, and the communities of interest will have successfully accelerated the adoption and use of location-based information. Users will recognize the value offering and reap the benefits to their operations and bottom line. The benefits will be measurable and support the following FGDC business case objectives:

  • Increasing Return on Existing Investments, Government Efficiency, Service Delivery
  • Reducing (unintentional) Redundancy and Development and Management Costs
  • Increasing Quality and Usability [12]

Our Suggested Solution

FGDC's challenges require a PMO, integrated lifecycle management, a partner focus, and blended experience, with an integrated approach and single voice designed to meet FGDC's strategic objectives and provide a world-class shared services and data portfolio. Doing this, they can integrate organizations, data, and service provision.

A solution like this would provide the program, partner and customer relationship management, communications, development, and operational capabilities required to successfully implement the FGDC’s vision and business plan. The focus will need to:

  1. Coordinate cross-agency tasks and portfolio needs, in agile program management coordination with a single voice,
  2. Implement an understanding of critical lifecycle processes to manage and operate the data, technology, capital assets, and development projects for a secure cloud-based platform,
  3. Have communications and outreach focused on communities, for partner and customer engagement in lifecycle decisions, and
  4. Finally, make sure the secretariat staff and team has rotating collective experience, with representatives and contractors who have successfully performed at this scale across all functional areas, with domain knowledge in geospatial, technology, program, service, development, and operations.

The strategy, collective experience, and techniques will enable FGDC to provide a single voice across all management domains (PMO, Development, Operations, and Service Management) for customer engagement. The approach will need to be integrated with the existing FGDC operating model, creating a sum value greater than that of its individual parts, and will help create the relationships needed to develop trusted partner services.


[1] Geospatial Platform Business Plan (Redacted Final), page 7
[2] Draft NSDI Strategic Plan 2014–2016 V2, page 2
[3] Geospatial Platform Business Plan (Redacted Final), page 28
[4] Ibid., page 11
[5] Ibid., page 9
[6] Ibid., page 26
[7] Ibid., page 4
[8] Ibid., page 6
[9] Geospatial Platform Business Plan (Redacted Final), page 2
[10] DOI Geospatial Services Blueprint, 2007
[11] Geospatial Platform Business Plan (Redacted Final), page 13
[12] Geospatial Platform Business Plan (Redacted Final), Appendix A
[13] OMB Circular A-16 Supplemental Guidance, page 12
[14] Geospatial Platform Business Plan (Redacted Final), page 12
[15] Ibid., page 36
[16] ITSM Service Operations V3.0
[17] Ibid., page 26