The Four V's of BigData – Variety, Veracity, Velocity, Volume

Blog post edited by Matt Tricomi

If we are simply talking about lots of retail data, lots of sales data, lots of management data, lots of metadata, we aren’t talking BigData – though, for some reason, those data are being pushed through and into the new architectures anyway. Yes, sure, the retail, defense, and intelligence worlds have been sifting through huge data stores for years.

But the marketing-coined term BigData does not refer just to the Volume of the data. There are four V’s of big data, which we have enjoyed using as we learned them through our information exchanges and partnering with IBM.

The first two v’s focus on mission side: Variety and Veracity

Variety (how many categories of data it covers) includes technology (sources, formats (e.g. numeric, text, objects, geocoded, vector, raster, structured, unstructured – email, video, audio, etc.), methods) and legal considerations (complexity, privacy, jurisdiction). Essentially, it means working with various types of data across various dimensions (temporal, geospatial, sentiment, metadata, logs, etc.).

Veracity (understanding the authority of the data) includes known data quality, type of data, and data management maturity, so that you can understand how much you can trust that the data is right and accurate.

The other two focus on speed and amount: Velocity and Volume.

Volume (how much data) – capturing, processing, reporting on, and managing a large volume of data.

Velocity (how often it changes, or how close to real-time it is) – analyzing and exploiting lots and lots of new data in real time.

Changes to the V’s over the last 15 years


Veracity is the newer one. The concept of data quality has always been the orphaned step-child. It is the I in IT. It’s the part of the iceberg under the water. All IT vendors want to sell speed, handling lots of data, and some commodity variety support, but once sold, you are on your own (or $300/hour for mission customization). But we are happy to see IBM got there and added it.

As of late, Value has been introduced as well. Paraphrasing a Spanish article on the topic: even if you can produce information, if there is no real action that can be taken with it, it is not Valuable to the organization.

Then again, some have accused IBM of stealing the V’s from Gartner’s 2001 development of the V’s – a bit of a V-gate controversy.

Nonetheless, if anything the dispute proves that the concept is worth fighting over; dig into its meaning and you see it has its merits.

Will geoscience go for a shared service environment

Blog post edited by Wiki Admin

Will geoscience go for a shared service environment?

As the previous “How can we help geoscience to move their data to shared services” blog noted, unless we align the stakeholders, get a clear line of sight on their needs, and focus on earning trust and demonstrating value, the answer is no. But let’s say we are moving that way. How do we get started to fund such an approach?

Well, first off, the current grant and programmatic funding models are not designed to develop shared services or interoperable data for the geosciences. Today, there are many geoscientists who are collaborating between disciplines and, as a result, improving the quality of knowledge and scaling the impact of their research. It is also well established that the vast majority operate individually or in small teams. Geoscientists, rightly so, continue to be very focused on targeted scientific objectives and not on enabling other scientists; it is a rare case when they have the necessary resources or skills. With the bright shiny object of data-driven science/Big Data, do we have the Big Head wagging the body of the geoscientist community? Xentity sees opportunities to develop funding strategies to execute collaborative, performance-based, cross-discipline geoscience. The funding model has been this way since World War II, which greatly expanded on the war-time success of onesy-twosy grants to universities. There has been some movement towards hub-and-spoke grant funding models, but we are still out to get our PhD stripes, get our CVs bigger, and keep working with the same folks. I know it is a surly and cynical view. OK, the reality is they are doing amazing work, but in their own fields, and anything that slows down their work for the greater good lacks incentive.

Also, there are few true shared services that are managed and extended to the community as products and services should be. Data-driven science, which is our fourth paradigm of science, has been indirectly “demanding” that scientific organizations and their systems become 24×7 service delivery providers. We have been asking IT programmers to become service managers and scientists to become product managers or data managers. With a few exceptions, it has not worked. Geoscientists are still struggling to find and use basic data/metadata and to produce quality metadata (only 60% meet quality standards per EarthCube studies) for their own purposes, let alone make the big leap to Big Data and analytics. Data-driven science requires not only a different business or operating model, but a much clearer definition of the program, as well as of scientists’ roles and expectations. It requires new funding strategies, incentive models, and a service delivery model underpinned by the best practices of product management and service delivery.

Currently – and this is my favorite – there is limited to no incentive for most geoscientists to think beyond their immediate needs. If geoscientists are to be encouraged to increase the frequency and volume of cross-discipline science, there need to be enablement services, interoperable data, and information products that solve repetitive problems and provide incentive for participation. We need to develop the necessary incentive and management models to engage and motivate geoscientists, develop a maturity plan for the engineering of shared geoscience services, and develop resourcing strategies to support its execution. Is this new funding models, new recognition models, new education, gamification, crowdsourcing, increased competition, changed performance evaluation? Not sure, as any change to the “game” rules can, and usually does, introduce new loopholes and ways to “game” the system.

The concept of shareable geoscience data, information products, and commodity or analytical computing services has an existing operating precedent in the IT domain – shared services. Shared services could act as a major incentive for participation. An approach would identify the most valuable cross-cutting needs based on community stakeholder input. The team would use this information to develop a demand-driven plan for shared service planning and investment. As an example, a service-based commodity computing platform could be developed to support both the Big Head and the Long Tail, act as an incentive to participation, and perform highly repetitive data exchange operations.

How does one build and sustain a community as large and diverse as the geosciences? 

The ecosystem of geoscience is very complex from a geographic, discipline, and skill-level point of view. How does one engage so diverse a community in a sustainable manner? “Increased visibility of stakeholder interests will accelerate stakeholder dialogue and alignment – avoiding “dead ends” and pursuing opportunities.” The stakeholders range from youthful STEM students to stern old-school emeritus researchers, and from high-volume, high-frequency producers of macro-scale data to a single scientist with a geographically targeted research topic. It is estimated that 80-85% of the science is done in small projects. That is an enormous intellectual resource that, if engaged, can be made more valuable and productive.

Here is a draft target value chain:

The change or shift puts a large emphasis on upfront collaborative idea generation, team building, knowledge sharing via syndication, and new forms of work decomposition in the context of crowd participation (Citizen Science and STEM). The recommended change in the value chain begins to accommodate the future needs of the community. However, the value chain becomes actionable based on the capabilities associated with the respective steps. Xentity has taken the liberty of alliteratively defining these four classes of capabilities, or capability clusters, as:

Encouragement, Engagement, Enablement, and Execution.

Encouragement capabilities are designed to incentivize or motivate scientists and data suppliers to participate in the community and garner their trust. They are designed to increase collaboration and the quality and value of idea generation, and will have a strong network- and community-building multiplier effect.

Questions

  • How can new scientific initiatives be collaboratively planned for and developed?
  • How can one identify potential collaborators across disciplines?
  • How can one’s scientific accomplishments and recognition be assured and credited?
  • What are the data possibilities, and how can I ensure that the data will be readily available?
  • How can scientific idea generation be improved?

Capabilities

  • Incentives based on game theory
  • Collaboration, crowd funding, crowdsourcing, and casting
  • Needs Analysis
  • Project Management and work definition
  • Credit-for-work Services

Engagement capabilities include the geoscience participant outreach and communication capabilities required to build and maintain the respective communities within the geoscience areas. These are the services that will provide the community the ability to discuss and resolve where the most valued changes will occur within the geosciences community and who else should be involved in the effort.

Questions

  • What participants are developing collaborative key project initiatives?
  • What ideas have been developed and vetted within the broadest set of communities?
  • Who, with similar needs, may be interested in participating in my project?
  • How can Xentity cost share?

Capabilities

  • Customer Relationship Management
  • Promotions
  • Needs Analysis
  • Communications and Outreach
  • Social and Professional Networking

Enablement capabilities are technical and infrastructure services designed to eliminate acquisition, data processing, and computing obstacles and save scientists time and resources. They are designed to solve frequently recurring problems that keep a wide variety and number of geoscience stakeholders from focusing on their core competency – the creation of scientific knowledge. Enablement services, if implemented and supported, will have a strong cost-avoidance multiplier effect for the community as a whole.

Questions

  • How does one solve data interoperability challenges for data formats and context?
  • How do I get data into the same geographic coordinate system or scale of information? (see the reprojection sketch after this list)
  • How can I capture and bundle my meta-information and scientific assets to support publication, validation, and curation easily?
  • How can I get access to extensible data storage throughout the project lifecycle?
  • Where and how can I develop an application with my team?
  • How can I bundle and store my project datasets and other digital assets for later retrieval?
  • How can I get scalable computing resources without having to procure and manage servers to complete my project?

Capabilities

  • Workflow
  • Process Chaining
  • Data Interoperability
    • Data transformations
    • Semantics
    • Spatial Encoding and Transformation
    • Data Services
  • Publishing
  • Curation
  • Syndication
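One concrete example of the Data Interoperability and Spatial Encoding capabilities above is getting supplier data into a common coordinate reference system. The following is a minimal sketch, assuming the open-source pyproj library and hypothetical sample points; it illustrates the idea rather than describing any specific shared service.

# Reproject lat/long (WGS84, EPSG:4326) points into a common projected
# system (Web Mercator, EPSG:3857) so datasets from different suppliers line up.
from pyproj import Transformer

# always_xy=True keeps coordinate order as (longitude, latitude).
to_web_mercator = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

# Hypothetical sample points from two different data suppliers.
points_lon_lat = [(-105.2705, 40.0150), (-104.9903, 39.7392)]

for lon, lat in points_lon_lat:
    x, y = to_web_mercator.transform(lon, lat)
    print(f"({lon:.4f}, {lat:.4f}) -> ({x:.1f}, {y:.1f}) meters")

The same pattern generalizes to scale and semantic alignment: agree on a common target, publish the transformation as a service, and let each supplier keep producing in its native form.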

Execution capabilities comprise the key management-oriented disciplines required to support shared infrastructure and services, or to help evolve a highly federated set of valuable assets (“edges”) to be more usable and valuable to the evolving community over time.

Questions

  • How do we collectively determine what information might require a greater future investment?
  • What are the right incentives in the grant processes?
  • What are the future funding models?
  • What models should be invested in?
  • Which technologies should be evaluated for the shared assets?
  • What upcoming shared data or technology needs are common to a large number of participants?

Capabilities

  • Governance
  • IT Service Management (ITSM)
  • Product Management
  • Performance Management
  • Requirements Management
  • Data Management
  • Data Supply Management
  • Data Life Cycle Management
  • Funding
  • Grants and processing

So, why did we develop these classes of capabilities? 

They represent, at the macro level, a way to organize a much larger group of business, operating, and technical services that have been explicitly discussed in NSF EarthCube efforts over the last 3-4 years. We then derived these outputs from analysis and associated them with the most important business drivers. Check out this draft relationship of capabilities, drivers, and rationale:

Rationale and drivers by capability cluster:

Engage
  • Rationale: The best way to create communities and identify common needs and objectives; begin to build trust and value awareness; bring the respective communities into an environment where they can build out their efforts and sustain collaborative approaches.
  • Drivers: Agency (how to navigate planned versus emergent change), intellectual property rights, infrastructure winners and losers, agreement on data storage, preservation, and curation policies and procedures, incentives to share data and data sharing policies, and trust between data generators and data users.

Encourage
  • Rationale: The best models to incentivize scientists and data producers to participate and collaborate. Xentity has developed game-theory-based approaches and large-scale customer relationship management solutions.
  • Drivers: Social and cultural challenges: motivations and incentives, self-selected or closely-held leadership, levels of participation, types of organizations, and collaboration among domain and IT specialists.

Enable
  • Rationale: The most costly data processing obstacles – the lowest-common-denominator, highest-impact problems commonly found in shared service environments. We have developed enterprise service analysis tools for cost-benefit for the DOI geospatial community, so we have seen this work.
  • Drivers: 80% of scientist data needs can be expressed as standard data products, and 80% of scientist time is spent getting data into the proper form for research analysis.

Execute
  • Rationale: A governance model that will increase the “edge effect” between the legacy and future capabilities and a very diverse set of communities; simple planning capabilities that empower scientists to work complex cross-discipline ideas amongst themselves, define work, and coordinate with the power of the crowd. We have designed collaborative environments and crowd-based frameworks for data collection and analysis with corresponding performance management systems.
  • Drivers: Conceptual and procedural challenges: Time (short-term funding decisions versus the long-term time-scale needed for infrastructures to grow); Scale (choices between worldwide interoperability and local optimization).

So why don’t we do it?

Well, this does introduce an outside approach into a close-knit geoscience community that is very used to solving problems for itself. Having a facilitated method from outside consulting, or even teaming with agency operations that have begun moving this route for their national geospatial data assets, is not seen as something that fits their culture. We are still learning of hybrid ways we can collaborate and help the geoscientists set up such a framework, but for now it is still a bit of a foreign concept. While there is some willingness in the geoscientist community to adopt models that work for other sectors, industries, and operational models, the lack of familiarity is causing a lot of hesitation – which goes back to the earning-trust factor and finding ways to demonstrate value.

Until then, we will keep plugging away, connecting with the geoscience community in hopes that we can help them advance their infrastructure, data, and integration to improve earth science initiatives. In the meantime, we will remain one of the few top nations without an operational, enterprise national geoscience infrastructure.

Why we focus on spatial data science

Blog post edited by Matt Tricomi

The I in Information Technology is so broad – why is our first integrated data science focus on spatial data? It doesn’t seem to fit at face value with our Services Catalog. We get asked this a lot, and this is our reason; like geospatial itself, the answer is multi-dimensional, spanning different ways of thinking, audiences, maturity, progressions, science, modeling, and time:


In green, along the x-axis, is the time progression of public web content. The summary point is that data took the longest period – about 10-15 years – and data can only get better as it matures past 25 years of being on the web. We are in the information period now, but moving swiftly into the knowledge period. Just look at how much more scientific data visualization there is, and how dependent we are on the internet. Just think how much you were on the web in 1998 compared to 15 years later – IT IS IN YOUR POCKET now.

This isn’t just our theory.

RadarNetworks put together the visual of progressing through the web eras. Web 1.0 was websites or Content and early Commerce sites. Web 2.0 raised the web community with blogs and the web began to link collaboratively built information with wikis. Web 3.0 is ushering in the semantic direction and building integrated knowledge.

Even scarier, Public Web Content progression lags several business domains, but not necessarily in this leading order: Intelligence, Financial, Energy, Retail, and Large Corporate Analytics. Meaning, this curve reflects the Public maturity, and those other domains have different and faster curves. 

Consider the recent discussions on intelligence analysis linking social/internet data with profiles, Facebook/Google privacy and its use for personalized advertising, the level of detail Salesforce knows about you and why companies pay so much per license/seat, how energy exploration optimizes where to drill in harder-to-find areas, or the absolute complexity and risk of financial derivatives as the world market goes. The way we integrate public content – googling someone, or using the internet to learn more and faster – usually lags those technologies. Reason: the public uses do not make money. It is the same reason the DoD invented the internet – it was driven by the security of the U.S., which makes money, which makes power.

So, that digression aside (as we have been told “well, my industry is different”), the public progression does follow an exponential curve that matches the Moore’s Law driving factor in IT capability – every two years, computing power doubles at the same cost (paraphrasing). The fact that we can do more, faster, at quality means we can continue to increase the complexity of our analysis, shown in red. And there appears to be a stall: we keep moving toward knowledge, but not towards wisdom. It is true our knowledge will continue to increase VERY fast, but what we do with that as a society is the “fear” as we move towards this singularity so fast.
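As a back-of-the-envelope illustration of that doubling (a sketch of the paraphrase above, not a precise restatement of Moore’s Law), a few lines of Python show why 15 years feels so dramatic:

# Relative computing capability if capability doubles roughly every 2 years.
def relative_capability(years_elapsed: float, doubling_period_years: float = 2.0) -> float:
    return 2 ** (years_elapsed / doubling_period_years)

for years in (2, 10, 15, 20):
    print(f"After {years:>2} years: ~{relative_capability(years):,.0f}x the starting capability")

Fifteen years at that pace is roughly a 180x jump, which is about the gap between the 1998 desktop web and the web in your pocket.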

Fast is an understatement – very fast even for an exponential progression – as it is hard to emote and digest the magnitude of just how fast it is moving. We moved from:

  • the early 90s, simply putting history up there, experimenting, and publishing general content with loose hyperlinking and web logs;
  • to the late 90s, conducting eCommerce, doing math/financial interaction modeling and simulations, and building product catalogs with metadata that allowed us to relate items and say that if a user liked that quality or metadata in something, they might like something else over here;
  • to the early 2000s, engineering solutions, including social and true community solutions that began to build on top of relational data and the network effect, use semantics, and continually share content on timelines and where a photo was taken as GPS devices began to appear in our pockets;
  • to the 2010s, or today, where we are looking for new ways to collaborate, find new discoveries in the cloud, and use the billions and billions of sensors and data streams to create more powerful, more knowledgeable applications.

Another way to digest this progression is via the table below.

Web Version | Time | DIKW | Web Maturity | Knowledge Domain Leading Web | Data Use Model on Web | Data Maturity on Web
0.9 | early 90s | Data | Content | History | Experimental | Logs
1.0 | 1995+ | Info | | History | Experimental | Content
1.1 | 1997 | | | Math | Experimental | Relational
1.2 | 1999 | | +Commerce | Math | Hypothetical | Metadata
1.3 | 2002 | | | Engineering | Hypothetical | Spatial
2.0 | 2005+ | Knowledge | +Community | Engineering | Computational | Temporal
2.1 | 2010s | | | Engineering | Computational | Semantic
3.0 | 2015+ (the predictable web) | Knowledge | +Collaboration | Science | Data as 4th paradigm | TempoSpatial (goes public)
4.0 | 2020-2030 | Wisdom in sectors | Advancing Collaboration with 3rd-world core | Advancing Science into Shared Services (Philosophical is out-year) | Robot/Ant data quality | Sentiment and Predictive (goes public/useful; Sensitive is out-year)

Now, think of the last teenager who could maintain eye contact in a conversation with an adult while holding a phone in their hand and not be distracted by the Pavlovian response to a text, tweet, Instagram, etc. Now imagine, ten years from now, when it is not tidbits of data but, as a call comes in, auto-searches on terms they are not even aware of appearing in augmented reality – advice on how to react to the sentiment they just received, not just the information. The emotional knowledge quotient will be googled on the spot – “What do I do when...?” – versus critical thinking and living and learning.

So, taking it back to the “now” – though this blog is lacking specific citations (blogs do allow us to cheat, but our research sources will make sure to detail and source our analysis) – if you agree that spatial mapping for professionals arrived in the early 2000s, agree that it has now hit the public, and understand that spatially tagging data has passed the tipping point with the advent of smartphones, map apps, local scouts, augmented reality directions, and multi-dimensional modeling integrating GIS and CAD with the web, then you can see that the data science maturity stage we are in with the largest impact right now is – Geospatial.

Geospatial data is different. Prior to geospatial, data is non-dimension-based: it has many attributable and categorical facets, but it does not have to be stored in a mathematical or picture form with a specific relation to a position on the earth. Spatial data – GIS, CAD, lat/longs – has to be stored in a numerical fashion in order to calculate upon it. Furthermore, it has to be related to a grounding point. Essentially, geospatial means storing vector maps or pixel (raster) maps. When you begin to put that together for tens of millions of streams, you get a very large, complicated, spatially referenced hydrography dataset. It gets even more complicated when you overlay 15-minute time-based data such as water attributes (flow, height, temperature, quality, changes, etc.) with that. It is even more complicated when you combine that data with other dimensions such as earth elevations and need to relate across domains of science, speaking different languages, to be able to calculate how fast water may carry a certain contaminant down a slope after a river bank or levee collapses.
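To make the “stored in a numerical fashion” point concrete, here is a minimal sketch (illustrative only; the segment name, coordinates, and readings are hypothetical) of a vector stream feature whose geometry is a list of lat/long vertices with a 15-minute flow series attached, plus a great-circle length calculation that is only possible because the geometry is grounded to earth coordinates:

import math
from dataclasses import dataclass, field

@dataclass
class StreamSegment:
    """A vector feature: geometry stored as numeric (lat, lon) vertices."""
    name: str
    vertices: list[tuple[float, float]]
    flow_cfs: dict[str, float] = field(default_factory=dict)  # 15-minute readings keyed by timestamp

def haversine_km(p1: tuple[float, float], p2: tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

segment = StreamSegment(
    name="Example Creek reach",
    vertices=[(39.7392, -105.2000), (39.7410, -105.1902)],
    flow_cfs={"2014-05-01T00:00": 312.0, "2014-05-01T00:15": 318.5},
)
print(f"Reach length ~{haversine_km(*segment.vertices):.2f} km")

Multiply that by tens of millions of reaches, add a reading every 15 minutes, and the storage and computation problem described above becomes clear.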

Before we can get to those more complex scenarios, we need to master geospatial data – the next progression in data complexity.

That said, definitely check out our Geospatial Integrated Services and Capabilities

Data Science Research Areas Punch List

Blog post added by Wiki Admin

In launching this Data Science Research service area, the following are the areas of data science research we are actively pursuing.

Solutions 

  • Tactical Industry and Trend Context Reports: Data Visualizations
  • Implementable Changes in Industrial Engineering Practices
  • Linking Research & Commercial Industries
  • Crowd and Commercial Effectivity
  • Place-based and Geospatial business cases and impact levels by data theme, product, and service types
  • Semantic vs. machine learning applications for integrating large Corporate or Government common datasets
  • Real-Time National or World Scale Topological Big Data Modeling and Decision Support 
  • Proving existing major corporate and government datasets, social information data quality and semantic readiness, and existing or new platforms and applications to support Smart Cities in simulation environments such as urban planning, decision making, and policy/rules-based intelligence (aka real-world SimCity and Civilization models)
  • Remote Sensing Integration with BigData Sources and Analytics
  • New Energy Model Research & Development Repository and Social Network Enhancement
  • Information Patterns & Historical Analysis
  • Integrating Computer and Library Science Techniques
  • Blending Machine Learning and Semantic Web
  • Historical Timeline Visualizations – knowledge, technical evolution
  • Roadmap Prediction Visualizations
  • AI/Robotic Integration with Decision Making
  • Data Supply Chain models analysis in support of creating data ecosystem flow for major static and real-time datasets.
  • Impacts of Next Generation or Internet2 architectures on existing content and datasets

Management

Data Science and Architecture Management Research

  • Integrating Academic GeoScience Communities using Architecture Methods 
  • Investigating how Bill/Policy Motives align with Federal Portfolio
  • Leveraging Architecture concepts to advise and improve bills
  • Real-World Enterprise Architecture analysis
  • Federal readiness for architecture and change management maturity by agency using 
  • Performance Measurement analysis for management and budget policies
  • Reduction and impact evaluation of burden on government agencies for data calls – 
  • Value-measurement on policies and metadata
  • Strategic progression of maturing datasets (i.e. What dataset to build next and butterfly effect?)
  • Realistic blending of private sector and public sector best of breed techniques
  • Historical context analysis for current information management policy and bills for future decisions
  • Analyze policy shaping techniques (i.e. market-driven policy, policy reformation, protectionism policy, new value transition or adoption) diversity by industry.
  • Improving Product Management Subjectivity
  • Agile Project Management
  • Architecture Methodology
  • Data Supply Chain Management efficiency patterns
  • Integrating Geospatial Architectures into Industry
  • Industry Acceleration & Stabilization Evaluations
  • Gaming theory application readiness for Corporate and Government policy and increasing energy and quality output (i.e. MMORPG, Social Network, Strategy games, incentive models, talent/skill development, state of integration such as Mechanical Turk models).

Data Science Research Concepts

Blog post added by Wiki Admin

Our corporate goal is two-fold:

  • Analyze and assure forward-thinking concepts are applicable
  • Increase our expert staff knowledge base

First, we need to assure our concepts and consultants are current, relevant to our partners, clients, and industry, and forward thinking; second, this allows us to excite, retain, and grow our talent knowledge base for a longer period of time than typical consulting-only firms can, which allows us to lower personnel costs and maintain lower overhead costs.

We are actively working on establishing new research partnerships with academia (e.g. major university research hubs, local STEM programs), government (e.g. Federal programs, state, and local), municipal services (e.g. water utilities), and engineering/scientific service companies.

In considering the types of innovative data science research where Xentity is seeking to make a transformative impact, we group our pursuits into Data Science Solutions Research and Data Science and Architecture Management Research.