Why we focus on spatial data science

Blog post edited by Matt Tricomi

The I in Information Technology is so broad – why does our first integrated data science problem focus on spatial data? On the face of our Services Catalog, it doesn't seem to fit. We get asked this a lot, and this is our reason. Like geospatial itself, the answer is multi-dimensional, spanning different ways of thinking, audiences, maturity levels, progressions, science, modeling, and time:

[Figure: time progression of public web content (in green) against the increasing complexity of analysis (in red)]

In green, along the x-axis, is the time progression of public web content. The summary point is that the data period took the longest – about 10 to 15 years – and data can only get better as it matures, now roughly 25 years into its life on the web. We are in the information period now, but moving swiftly into the knowledge period. Just look at how much richer scientific data visualizations have become, and how dependent we are on the internet. Think about how much you were on the web in 1998 compared to 15 years later – IT IS IN YOUR POCKET now.

This isn’t just our theory.

RadarNetworks put together a visual of the progression through the web eras. Web 1.0 was websites – content and early commerce sites. Web 2.0 raised the web community with blogs, and the web began to link collaboratively built information with wikis. Web 3.0 is ushering in the semantic direction and building integrated knowledge.

Even scarier, public web content progression lags several business domains – Intelligence, Financial, Energy, Retail, and Large Corporate Analytics, though not necessarily in that order. Meaning, this curve reflects public maturity; those other domains have different and faster curves.

Consider the recent discussions about intelligence analysis linking social/internet data with profiles, Facebook/Google privacy and its use for personalized advertising, the level of detail Salesforce knows about you and why companies pay so much per license/seat, how energy exploration optimizes where to drill in harder-to-find areas, or the sheer complexity and risk of financial derivatives as the world market goes. How we integrate public content – googling someone, or using the internet to learn more and faster – usually lags those technologies. The reason: public content does not make money. It is the same reason the DoD invented the internet – it was driven by the security of the U.S., which makes money, which makes power.

So, that digression aside (as we have been told, “well, my industry is different”), the public progression does follow an exponential curve driven by Moore’s Law in IT capability – roughly every two years, computing power doubles at the same cost (paraphrasing). The fact that we can do more, faster, at the same quality means we can continue to increase the complexity of analysis, shown in red. And there appears to be a stall: not in moving toward knowledge, but in moving toward wisdom. It is true our knowledge will continue to increase VERY fast, but what we do with it as a society is the “fear” as we move toward this singularity so quickly.
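As a back-of-the-envelope aside (not from the original figure), the short Python sketch below shows what that paraphrased doubling claim implies over a few decades; the year spans are arbitrary illustrations.

```python
# Illustrative only: if IT capability doubles roughly every two years at the
# same cost (the paraphrased Moore's Law claim above), the multiplier after
# n years is 2 ** (n / 2).
for years in (2, 10, 20, 30):
    multiplier = 2 ** (years / 2)
    print(f"{years:>2} years -> ~{multiplier:,.0f}x capability at the same cost")
```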

Fast is an understatement – very fast, even for an exponential progression – because it is hard to convey and digest the magnitude of just how quickly it is moving. We moved from:

  • the early 90s: simply putting history up there, experimenting, and publishing general content with loose hyperlinking and web logs;
  • to the late 90s: conducting eCommerce, doing math/financial interaction modeling and simulations, and building product catalogs with metadata that allowed us to relate items – if a user liked that quality or metadata in one thing, they might like something else over here;
  • to the early 2000s: engineering solutions, including social and true community solutions that began to build on top of relational data and the network effect, use semantics, and continually share content on timelines and where a photo was taken as GPS devices began to appear in our pockets;
  • to the 2010s, or today, where we are looking for new ways to collaborate, find new discoveries in the cloud, and use the billions and billions of sensors and data streams to create more powerful, more knowledgeable applications.

Another way to digest this progression is via the table below.

| Web Version | Time | DIKW | Web Maturity | Knowledge Domain Leading Web | Data Use Model on Web | Data Maturity on Web |
|---|---|---|---|---|---|---|
| 0.9 | early 90s | Data | Content | History | Experimental | Logs |
| 1.0 | 1995 | +Info | | History | Experimental | Content |
| 1.1 | 1997 | | | Math | Experimental | Relational |
| 1.2 | 1999 | | +Commerce | Math | Hypothetical | Metadata |
| 1.3 | 2002 | | | Engineering | Hypothetical | Spatial |
| 2.0 | 2005 | +Knowledge | +Community | Engineering | Computational | Temporal |
| 2.1 | 2010s | | | Engineering | Computational | Semantic |
| 3.0 | 2015 (and the predictable web) | Knowledge | +Collaboration | Science | Data as 4th paradigm | TempoSpatial (goes public) |
| 4.0 | 2020–2030 | Wisdom in sectors | Advancing Collaboration with 3rd world core | Advancing Science into Shared Services – Philosophical is out year | Robot/Ant data quality | Sentiment and Predictive (goes public/useful) – Sensitive is out year |

Now, think of the last teenager who could maintain eye contact in a conversation with an adult while holding a phone in their hand and not be distracted by the Pavlovian response to a text, tweet, Instagram, etc. Now imagine, ten years from now, when it is not tidbits of data but, as a call comes in, auto-searches on terms they are not even aware of appearing in augmented reality – advice on how to react to the sentiment they just received, not just the information. The emotional knowledge quotient will be a Google query – “What do I do when?” – rather than critical thinking and live-and-learn.

So, taking it back to the “now”: though this blog is lacking specific citations (blogs do allow us to cheat, but our research sources will make sure to detail and source our analysis), if you agree that spatial mapping for professionals arrived in the early 2000s, agree that it has now hit the public, and understand that spatially tagging data has passed the tipping point with the advent of smartphones, map apps, local scouts, augmented reality directions, and multi-dimensional modeling integrating GIS and CAD with the web, then you can see that the data science maturity stage we are in with the largest impact right now is – geospatial.

Geospatial data is different. Prior to geospatial, data is not dimension-based. It has many attribute and categorical facets, but it does not have to be stored in a mathematical or picture form with a specific relation to a position on the earth. Spatial data – GIS, CAD, lat/longs – has to be stored numerically in order to calculate upon it, and furthermore it has to be related to a grounding point. Essentially, geospatial means storing vector maps or pixel (raster) maps. When you begin to put that together for tens of millions of streams, you get a very large, complicated, spatially referenced hydrography dataset. It gets even more complicated when you overlay 15-minute time-based data such as water attributes (flow, height, temperature, quality, changes, etc.). It gets more complicated still when you combine that data with other dimensions such as earth elevations, and need to relate across domains of science speaking different languages to calculate, say, how fast water may carry a certain contaminant down a slope after a river bank or levee collapses.
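To make the “stored numerically against a grounding point” idea concrete, here is a minimal Python sketch (ours, not from the post) that treats two hypothetical stream gauge locations as WGS84 lat/long pairs and computes the great-circle distance between them with the haversine formula; once coordinates are numeric and earth-referenced, such calculations become possible.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 lat/long points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Two hypothetical stream gauge locations stored as plain numbers (lat, lon)
gauge_a = (39.7392, -104.9903)
gauge_b = (38.8339, -104.8214)
print(f"{haversine_km(*gauge_a, *gauge_b):.1f} km apart")
```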

Before we can get to those more complex scenarios, geospatial data is the next progression in data complexity.

That said, definitely check out our Geospatial Integrated Services and Capabilities.

Can a predictable supply chain for geospatial data be done?

Blog post added by Wiki Admin

Creating Predictability in the Government’s Geospatial Data Supply Chain…

This article expands upon the presentation “What does geodata.gov mean to data.gov,” given at the First International Open Government Data Conference in November 2010, as well as the GAO report on the FGDC role and geospatial information, which emphasizes a similar focus on getting the data right.

Would it be valuable to establish predictability in the government geospatial data supply chain? 

As examples, what if one could be guaranteed that every year or two the United States Census Bureau, in cooperation with state and local authorities, produced a high quality, updated county boundary dataset, or that HHS produced a geocoded, attributed list of all the hospitals in the country, validated by the health care providers? Of course it would be valuable, and it could provide the means to minimize redundant data purchasing, collection, and processing.

If the answer is “of course”, then why haven’t we done so already? 

It is a simple concept, but one without an implementation strategy. Twenty years after the establishment of Circular A-16 and the FGDC metadata content standards, we are still looking at metadata from a dataset-centric point of view – that is, for “what has been” and not for “what will be”. Knowing what is coming, and when it is coming, enables one to plan.

The model can be shifted to the “what will be” perspective if we adopt a systems-driven data lifecycle view, which means looking at data predictability and crowdsourcing.

It may seem ironic, in the age of crowdsourcing, to argue for predictable data lifecycle releases of pedigreed information and to seemingly deny the power of the crowd. But the fact remains that civilian government entities in the US systematically collect and produce untold volumes of geospatial information (raster, vector, geocodable attributes) through many systems, including earth observation systems, mission programs using human capital, business IT systems, regulatory mandates, funding processes, and cooperative agreements between multiple agencies and all levels of government. The governments in the US are enormous geospatial data aggregators, but much of this work is accomplished in systems that owners and operators view as special but not “spatial”.

An artificial boundary or perception has been created that geospatial data is different than other types of data and by extension so are the supporting systems. 

There remain challenges with data resolution, geometry types, attribution, etc., but more importantly there is a management challenge here. All of these data aggregation systems have, or could have, a predictable data lifecycle accompanied by publishing schedules and processing-authority metadata. The crowd and the geospatial communities could then use their digital muscle to complement these systems' resources if that is their desire, and all government programs would be informed by having predictable data resources.

What is required is communicating each system's outputs, owner, and timetables.
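As a hedged illustration of what such a communication might look like, the sketch below shows one possible shape for a forward-looking publishing record; every field name and value is invented for the example, not drawn from any actual FGDC or data.gov schema.

```python
# Hypothetical "what will be" record: the producing system, its owner, and a
# timetable that users and the crowd can plan around. All values are invented.
upcoming_release = {
    "dataset": "National county boundaries",
    "producing_system": "Census boundary program",  # assumed system name
    "owner": "U.S. Census Bureau",
    "update_cycle_years": 2,
    "next_release": "2014-06-30",                   # planned, not actual
    "authority": "OMB Circular A-16",
}

# With a calendar of such records, finding what is coming becomes a query
# rather than an archaeological dig.
calendar = [upcoming_release]
due_next_year = [r for r in calendar if r["next_release"].startswith("2014")]
print(due_next_year)
```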

Once a data baseline is established, the geospatial users and crowd could determine the most valuable content gaps and use their resources more effectively; in essence, creating an expanded and informed community. To date, looking for geospatial information is more akin to an archaeological discovery process than searching for a book at the library.

What to do?

Not to downplay the significance of geospatial and subject matter experts publishing value-added datasets and metadata into clearinghouses and catalogs, but we would stand to gain much more by determining which finite set of systems aggregates and produces the geospatial data, and by creating a predictable publishing calendar.

In the current environment of limited resources, Xentity seeks to support efforts such as the FGDC, data.gov, other National Geospatial Data Assets, and OMB to help shift the focus to these primary sources of information that enable the community of use and organize the community of supply. This model would include publishing milestones, both past and future, that could be used to evaluate mission and geospatial end-user requirements, allow crowdsourcing to contribute, and simplify the search for quality data.