Presented: Knowledge, Information, and Data Architecture Implementations at Transportation Research Board Annual Meeting (TRB)

As part of the TRB annual meeting, Xentity Architect Jim Barrett presented a workshop session on one of Xentity's concepts: shifting where data, information, and knowledge are invested in support of changing needs. The session was titled "Knowledge, Information, and Data (KID) Architecture Implementations". First, the Concept Presentation explains the general concept, then […]

EarthCube Architecture Implementation Planning is Underway

The National Science Foundation (NSF) has tasked the EarthCube Science Support Office (ESSO) with creating a detailed architecture and implementation plan from the EarthCube architecture blueprint defined at the May 2016 Architectural Framework Workshop and in discussions at the 2016 EarthCube All Hands meeting. NSF has outlined an aggressive schedule, and ESSO is working with a […]

Can a predictable supply chain for geospatial data be achieved?


Creating Predictability in the Government’s Geospatial Data Supply Chain…

This article expands upon the presentation "What does geodata.gov mean to data.gov," given at the First International Open Government Data Conference in November 2010, as well as the GAO report on the FGDC's role in geospatial information, which places a similar emphasis on getting the data right.

Would it be valuable to establish predictability in the government geospatial data supply chain? 

As examples, what if one could be guaranteed that every year or two the United States Census Bureau, in cooperation with state and local authorities, produced a high-quality, updated county boundary dataset, or that HHS produced a geocoded, attributed list of all the hospitals in the country, validated by the health care providers? Of course it would be valuable, and it could provide the means to minimize redundant data purchasing, collection, and processing.

If the answer is “of course”, then why haven’t we done so already? 

It is a simple concept, but one without an implementation strategy. Twenty years after the establishment of Circular A-16 and the FGDC metadata content standards, we are still looking at metadata from a dataset-centric point of view – that is, for "what has been" and not for "what will be". Knowing what is coming, and when it is coming, enables one to plan.

The model can be shifted to the "what will be" perspective if we adopt a systems-driven data lifecycle view, which means looking at data predictability and crowdsourcing.

It may seem ironic, in the age of crowdsourcing, to argue for predictable data lifecycle releases of pedigreed information and seemingly deny the power of the crowd. But the fact remains that civilian government entities in the US systematically collect and produce untold volumes of geospatial information (raster, vector, geocodable attributes) through many systems, including earth observation systems, mission programs using human capital, business IT systems, regulatory mandates, funding processes, and cooperative agreements between multiple agencies and all levels of government. The governments in the US are enormous geospatial data aggregators, but much of this work is accomplished in systems that owners and operators view as special but not "spatial".

An artificial boundary or perception has been created that geospatial data is different than other types of data and by extension so are the supporting systems. 

There remain challenges with data resolution, geometry types, attribution, etc., but more importantly there is a management challenge here. All of these data aggregation systems have, or could have, a predictable data lifecycle accompanied by publishing schedules and processing authority metadata. Subsequently, the crowd and geospatial communities could use their digital muscle to complement these systems' resources if that is their desire, and all government programs would be informed by having predictable data resources.

What is required is communicating the systems' outputs, owners, and timetables.

Once a data baseline is established, the geospatial users and crowd could determine the most valuable content gaps and use their resources more effectively; in essence, creating an expanded and informed community. To date, looking for geospatial information is more akin to an archaeological discovery process than searching for a book at the library.
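To make that concrete, here is a minimal sketch of what one entry in such a publishing calendar could look like. This is only an illustration: the field names, cadence, and dates are hypothetical, not drawn from any FGDC or data.gov schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetReleasePlan:
    """One entry in a hypothetical geospatial publishing calendar."""
    dataset: str         # the system's output
    steward: str         # owning organization / processing authority
    cadence_months: int  # how often a refresh is promised
    last_release: date   # "what has been"
    next_release: date   # "what will be" -- the piece usually missing

# Illustrative only: the cadence and dates are invented for the example.
county_boundaries = DatasetReleasePlan(
    dataset="County boundaries",
    steward="U.S. Census Bureau",
    cadence_months=24,
    last_release=date(2010, 1, 1),
    next_release=date(2012, 1, 1),
)

# Users and the crowd can now plan around the next release
# instead of digging for whatever already exists.
print(f"{county_boundaries.dataset}: next refresh {county_boundaries.next_release}")
```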

What to do?

Not to downplay the significance of geospatial and subject matter experts publishing value-added datasets and metadata into clearinghouses and catalogs, but we would stand to gain much more by determining which finite number of systems aggregate and produce the geospatial data, and by creating a predictable publishing calendar.

In the current environment of limited resources, Xentity seeks to support efforts such as the FGDC, data.gov, other National Geospatial Data Assets, and OMB to help shift the focus to these primary sources of information, enabling the community of use and organizing the community of supply. This model would include publishing milestones, both past and future, that could be used to evaluate mission and geospatial end-user requirements, allow crowdsourcing to contribute, and simplify searching for quality data.

Xentity makes Top 50 Fastest Growing Tech Companies List

Silicon Review has selected Xentity for inclusion in its Top 50 Fastest Growing Tech Companies List. Silicon Review described its selection approach as spanning multiple dimensions – size, markets, and penetration factors – to help determine those acknowledged in the list. SR conducted a detailed interview with Xentity and published an article on Xentity: How are […]

I always wanted to be a data janitor


You know the "data wrangling" field is becoming more mainstream when the NYTimes is covering it at this level: "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights" (NYTimes.com).

The article emphasizes the point that the amount of time to get the data – sensors, web services feeds, corporate databases, smartphone data observations – prepared and ready to be consumed is still a huge effort. 
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

It hits home a point that we love to belabor and the software companies do not. Be it ERP, GIS, MIS, or analytics – they show great demos, but with data already prepared, or with some of your data, but only the easy stuff. It works great in their environment, and as long as you give it good data, it performs and rocks! But the demos continually gloss over what it takes to wrangle, clean up, and get that data prepared.
As huge proponents of the view that good data is a – or the – major problem or barrier, it's nice to see investment beginning to move into startups that help: ClearStory, Trifacta, Paxata, and other start-ups in the field. In the meantime, we need to make sure we always stay on top of the best, proven approaches, and the team will bring them to the table – using various techniques, from browser-based network applications in NodeJS, to NoSQL databases, to ETL, to simply keeping your Excel skills up.
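For a flavor of that janitor work, here is a minimal data-cleanup sketch in pandas. The file name and column names are hypothetical, invented purely to illustrate the kind of preparation the article describes.

```python
import pandas as pd

# Hypothetical messy export of sensor observations
# (the file and its columns are invented for this example).
df = pd.read_csv("observations.csv")

# Typical janitor work: normalize headers, coerce types, drop junk.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["observed_at"] = pd.to_datetime(df["observed_at"], errors="coerce")
df["reading"] = pd.to_numeric(df["reading"], errors="coerce")
df = df.dropna(subset=["observed_at", "reading"]).drop_duplicates()

# Only now does the "insight" part of the job begin.
print(df.describe())
```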
Poking at the O'Reilly book "Doing Data Science" and its discussion of data wrangling/janitor work, it's not a bad read to pick up. A great quote in the NYT article from the Doing Data Science author is:
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,”

But the article does tend to focus on big data, and not big data in an open data world

Even if the data is scrubbed for machine-to-machine processing, as the article emphasizes, it still doesn't address the fact that, with the push for data catalogs, data curators and creators HATE – with a passion – metadata creation. It's one more burdensome step at the end of a long cycle. Furthermore, there is currently a major lack of true incentive, other than it being the right thing to do, to assure the data is tagged properly.

Let's take the metadata needed to tag data for proper use, via a real-world example recently discussed with a biological data analytics expert.

A museum wants to get some biological data to support an exhibit. They have a student collect 100 scientific observation records on larvae using well-calibrated sensors. The data gets published. An academic study on climate change sees the data. The data shows that there is a lower larvae count than historically and demonstrates heat changes impacting such. The data shows the sensors and tools used were scientifically accurate. The data is sourced and used. This is a great example of misuse of data. True, the data was gathered using great sensors and the right techniques, but it was not intended as a complete study of the area. When the student hit 100, they were done. An observation is not just an observation; its intended use is vitally important to denote.
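To make the point concrete, here is a minimal sketch of how the intended use might be denoted in the record's metadata. The structure and field names are hypothetical, not any formal metadata standard.

```python
# Hypothetical metadata for the museum larvae dataset (fields invented).
larvae_observations = {
    "title": "Larvae observations for museum exhibit",
    "record_count": 100,
    "sensor_calibration": "verified",
    "collection_method": "field observations by a student collector",
    "intended_use": "illustrative sample for an exhibit",
    "not_suitable_for": [
        "complete ecological survey of the area",
        "historical population trend analysis",
    ],
    "stop_condition": "collection stopped at 100 records, not at full area coverage",
}

# A climate study that checked intended_use and not_suitable_for before
# citing the data would catch the mismatch described above.
```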

 

Another great example, from a slightly different use angle: a major government records organization wanted to take its inventory of records data, which is meticulously groomed. Its data is very accurate and goes through a rigorous workflow to make sure its metadata, and access to the actual data record, are prepared to be accurate in real time. The metadata on data use is perfect and accurate, and lawyers and other FOIA experts are well-versed in proper use. But the actual metadata to help discover something in the billions of records was not suitably prepared for the more natural ways people would explore the data to discover and look up the records.

Historically, armies of librarians would be tasked with searching the records, but they have been replaced with web 1.0 online search systems that have neither the natural language processing (NLP) nor the interpretive skills the librarian would apply. Even where such capabilities exist, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals – what we call interpretive signals – which we discussed back in 2013.

This is another great example of overlooked data preparation. "Publish its metadata in the card catalog standard, and my work is done – let the search engine figure it out." Once again, though, the search engine will just tell you the museum larvae record matches your science study need.
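As a rough sketch of what preparing a record with interpretive signals could mean, here is a hypothetical enrichment step run before indexing. The synonym table, field names, and function are invented for illustration; real search signals number in the thousands and go far beyond synonyms.

```python
# Hypothetical: enrich a catalog record with interpretive/search signals
# before handing it to a search engine. Everything here is invented.
SYNONYMS = {
    "larvae": ["grubs", "immature insects"],
    "county boundary": ["county line", "administrative boundary"],
}

def enrich(record: dict) -> dict:
    """Attach the alternate terms a librarian would have applied mentally."""
    signals = []
    for term, alternates in SYNONYMS.items():
        if term in record["title"].lower():
            signals.extend(alternates)
    record["search_signals"] = signals
    return record

print(enrich({"title": "Larvae observations for museum exhibit"}))
```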