The Divide Between Scientists and Data Providers

Trusting a person you have never met or an organization you have never collaborated with is hard enough. Now compound that with trusting the 0s and 1s generated by one of those people or groups. What were their methods? What were their sources? How do I find the data? What area does it cover? What groupings or classification lists did they use, and how did they choose them? How up to date is it? How does it relate to that other group's data, collected a different way?

Trust requires confidence that consistent, clear methods were applied; that consistency is what creates reliability. And that trust is a scary thing for someone in the field of science. The science community is tribal, so scientists tend to source their data from a 'buddy' scientist. We're not saying they are mob-like, but there is a sense of "He's with me, so you can trust that data." It goes the other way as well: a scientist who messes up their data sourcing, who accidentally or intentionally changed a sensor configuration or survey method, will be professionally 'drawn and quartered'. If you think the current #CancelCulture is bad, the science community is even more brutal with excommunication. Take Galileo, who was only vindicated 350-plus years after the fact, or the dozens of scientists, such as Turing, cast out for non-scientific reasons.

The Thing About Scientists

Scientists are passionate about their studies, and they receive funding in onesie-twosies. Much like information and technology architecture, the funding models for scientists were shaped by World War II. The NSF was started after World War II, once quietly shrewd investments that advanced the US and Allied cause – i.e., the Manhattan Project – became celebrated. Science and engineering in aerospace, computer science, the biological sciences, and other fields were funded in specific labs, in what is seen as the largest period of scientific advancement in US and possibly world history.

Those labs created crack teams set up to compete against other labs to be first. We see this in the marketplace today in the races for vaccines between labs, companies, and countries. That competition consequently leads to lower cooperation at times, and it leads to tribal protectiveness. While that protectiveness is good for protecting integrity, it also means service-based science – meaning sharing data and re-using others' data – is not in their blood.

We are hopeful, though. Data scientists in service industries and product operations will likely be the earlier adopters. For that to happen, however, data providers and deliverers need to step up our game. Here are five ways data programs can gain the trust of data scientists.

#1 Lower IT Burden And Offer Easy-To-Access Infrastructure

It may seem odd that making data more trustworthy starts with getting the infrastructure right. But "lowest cost, easiest to access" is the number one thing we hear. The reason is that the typical IT environment is a burden, a pain, "the company computer guy". We need to make data scientists' lives easier.

  • Provisioning simplicity and federated integration are key. Can you get them on and off a free or credit-based infrastructure for some quick testing and validating? That matters even more than setting them up in production environments; they need that trust-testing space first.
  • Can that environment offer tools, data slices, and services set up like a well-organized workbench of containers for rapid testing? If data computing and storage sit on a shared platform, it is far more accessible to data scientists; find ways to lower the barrier to entry and ongoing use. When they can use shared platforms that bring computing and storage closer together, they are more likely to trust the results and the program that produced them. Seeing is believing. Make cloud access and setup easy, make sign-ons more integrated, and make provisioning web-form-driven.
  • Improved IT operations and reliable service level management. Continue to be IT Service Management oriented, moving from reactive to proactive service models and, as many advanced tech companies have, from managed models toward a service culture.
  • While maintaining information resource management compliance in planning, budgeting, architecture, privacy, security, and records, make sure the mission being served comes first. Avoid medieval IT project and program management.

#2 Demonstrate And Automate Your Readiness Assessments

Assess your data for readiness, then show it. After that, find ways your organization can automate data readiness assessments. If you publish your data in open data venues, those venues should promote readiness assessment scores or ratings, highlight whether an assessment has been done, and allow inspection of the results.

The data provider, when creating and aggregating data, should reduce the labor burden of doing this – given the quality of metadata out there, data scientists don't have time to do it anyway. Can assessments be automated during or before transformations? For instance, when data is brought in via the pipeline, can the extract-and-load process run rule testing before transforming anything? In many cases we can ask, "do we need to transform bad data at all?" And can we also save a history of the assessments, imports, uploads, and extracts?

Likely, with rules-based processing and simple regular expression engines, 80% of the evaluations can be pre-done to help data scientists dig into the data. Get the evaluations pre-done to weed out the "muck" so they can analyze the "best" that is left over.
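As a rough illustration, here is a minimal Python sketch of that idea: a handful of regular-expression rules applied to an extract before any transformation, with a simple assessment history record saved alongside the import. The field names, patterns, and file name are hypothetical, not drawn from any particular data program.

```python
import re
import csv
import json
from datetime import datetime, timezone

# Hypothetical readiness rules: each field name maps to a regex the raw value
# must match before we bother transforming or loading the record.
RULES = {
    "station_id": re.compile(r"^[A-Z]{2}\d{4}$"),                        # e.g. "CO1234"
    "observed_at": re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),  # ISO 8601 timestamp
    "temperature_c": re.compile(r"^-?\d+(\.\d+)?$"),                     # numeric string
}

def assess(rows):
    """Score each extracted row against the rules and build a history record."""
    passed, failed = [], []
    for row in rows:
        errors = [field for field, rule in RULES.items()
                  if not rule.match(str(row.get(field, "")))]
        (failed if errors else passed).append({"row": row, "errors": errors})
    # Save an assessment history entry alongside the import, as suggested above.
    history = {
        "assessed_at": datetime.now(timezone.utc).isoformat(),
        "total": len(rows),
        "passed": len(passed),
        "failed": len(failed),
    }
    return passed, failed, history

if __name__ == "__main__":
    # "extract.csv" is a placeholder for whatever the extract-and-load step produces.
    with open("extract.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    passed, failed, history = assess(rows)
    print(json.dumps(history, indent=2))
```

The point is not the specific rules but that the check runs before transformation and leaves behind an inspectable record of what was assessed and when.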

#3 Make Data FAIR – (FAIR Data Principles)

Over the last five years, there has been a rallying point around the well-known FAIR Data Principles, a set of principles for making data Findable, Accessible, Interoperable and Reusable. For organizations that focus on aggregating rather than developing data, understanding and applying the FAIR Data Principles is a good first step.

  • Findable – The successful entities of the internet have made discovery the number one priority for engagement. SEO, metadata, schema.org markup, being findable in the wild, and being findable in targeted, registered catalogs with richer provenance will get you very far (a minimal schema.org sketch follows this list).

<https://www.youtube.com/watch?v=FRP0MBNoieY>

  • Accessible – Let users get what they need in as few clicks as possible. Furthermore, create a great 'shopping' experience to explore the metadata, preview products, and tier levels of access within IT and business policy bounds. Ask yourself: can you make the data available in various forms, such as real-time services or downloads that move easily across federated clouds, beyond the traditional server-to-client download?
  • Interoperable – Make it easy to 'drag and drop' services into the application or modeling environment, or to bring the data quickly to the storage closest to compute.
  • Reusable – Data needs to be replicable and combinable/transferrable into different settings (new platforms, new applications, etc.). Data scientists need data that will remain relevant, even for future studies, and accessible, which is crucial when conducting new data-related experiments in the future.
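On the Findable point, a common pattern is to embed a schema.org Dataset description in the dataset's landing page so search engines and open-data harvesters can index it. Below is a minimal sketch of that markup, generated from Python for convenience; the dataset name, DOI, URLs, and keywords are all hypothetical placeholders.

```python
import json

# Hypothetical dataset description expressed as schema.org JSON-LD.
# Embedding a block like this in a landing page is one common way to make a
# dataset discoverable by search engines and open-data catalogs.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Stream Gauge Observations",          # hypothetical title
    "description": "Hourly stream gauge readings, 2015-2024.",
    "identifier": "https://doi.org/10.xxxx/example",       # placeholder DOI
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "keywords": ["hydrology", "stream gauge", "time series"],
    "creator": {"@type": "Organization", "name": "Example Data Program"},
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://data.example.org/stream-gauge.csv",
    }],
}

# Emit the <script> block a landing page would embed.
print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print("</script>")
```

The richer the provenance captured here (creator, license, distribution), the further the same record goes in registered catalogs as well as in-the-wild search.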

#4 Have Robust, Responsive, Configurable and Fast Solutions for Data in Motion and at Rest

We have wondered whether pure scientists – such as earth scientists – will go for shared-environment solutions. What about the data scientists who support the domain scientists or end-industry analysts in energy, transportation, defense, agriculture, finance, emergency/hazards, CIO services, land management, and more?

It's less about a single platform environment and more about exposing data in loosely coupled ways so it can be integrated in federated ways. Platforms are often organization-specific, and only collaborative communities with lots of "have nots" (e.g., no budget) will want shared platforms. After infrastructure, data scientists need data that is stored and moved quickly but securely. If you deliver data and avoid redundancies, data scientists will have more confidence in the program. Data scientists are analytical people who follow specified methods and instructions, so if you want them to trust your data, you have to pipeline it efficiently, securely, and easily.

The Great Thing About These Platforms

These platforms need to handle the 4 V's: big data (Volume), fast-moving data (Velocity), lots of flavors and features (Variety), and demonstrated stewardship and reliable handling (Veracity). As such, our platforms need to handle the data side of the architecture as well as the knowledge side, and hook into the Management Information System (MIS) information products and services. Our solutions need:

  • Smarter data pipelines that handle updating and validating in real time (a minimal sketch follows this list)
  • Data stream hubs for pipelining big, fast data
  • Data lakes that keep data in its native/raw format, ready for future questions, rather than pre-transforming it
  • A data analytics toolset to support the various modeling, linking, brokering, and visualization needs
  • Graph databases behind the analytics to support generationally faster edge and node processing
  • Data Microservices to access operational data stores directly
  • Even integration into traditional warehouses 
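To make the first and third items concrete, here is a minimal, plain-Python sketch of the "validate in motion, keep raw at rest" pattern: records stream through a lightweight validation step while every record is also landed untouched in a data-lake-style raw zone for future questions. The directory path, field names, and sample records are hypothetical, and the `stream` function stands in for whatever stream hub or message queue a real program would use.

```python
import json
import pathlib
from datetime import datetime, timezone

# Hypothetical landing area for untransformed records (the "raw zone" of a data lake).
RAW_ZONE = pathlib.Path("lake/raw")

def stream(records):
    """Stand-in for a stream hub consumer (e.g., reading from a message queue)."""
    yield from records

def validate(record):
    """Lightweight in-flight checks applied before the record reaches analytics."""
    return isinstance(record.get("sensor_id"), str) and "value" in record

def land_raw(record):
    """Persist the untransformed record so later studies can reprocess it as-is."""
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    (RAW_ZONE / f"{stamp}.json").write_text(json.dumps(record))

if __name__ == "__main__":
    incoming = [
        {"sensor_id": "A-100", "value": 12.7},
        {"value": 3.1},   # missing sensor_id: flagged in motion, still kept at rest
    ]
    for rec in stream(incoming):
        land_raw(rec)
        print("valid" if validate(rec) else "needs review", rec)
```

The design choice worth noting is that validation never blocks the raw landing: the lake keeps everything in its natural format, while the pipeline surfaces what needs review before it reaches downstream analytics.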

#5 Make It Easy To Join Birds-of-a-Feather Communities And Plug Into Existing Ones

We need ways to bring 'tribes' together and ways for people to learn about each other. Can you plug your data into communities? Where can you score yourself or your organization, and perhaps be recognized for scoring others? Where can you participate in virtual forums to develop new tribal or broader relationships? Are those places like ResearchGate?

Experience and Confidence Equal Trust

Building trust is about the experience, and building confidence in the reliability of the data is multidimensional. Ease is huge: easy to find, access, integrate, and validate. Proof is next, so the data stands up to scrutiny. Finally, there is the understated aspect of bringing the science communities together by making data open to those communities. Scientists are not just tribal; they are also analytical people. They want to see proof, more than anything, before they can trust you. So show them that proof, and show them your processes.