Integrating Data Streams To Reduce Redundancy At Work
In our previous blog, we discussed data pipelines and how they improve efficiency in the workforce. In summary, the concept of a data pipeline is to “leverage sets of tools and processes that extract data from multiple sources and insert it into a data warehouse or some other kind of tool or application”. The term ‘data pipeline’ has historically been overshadowed by its most popular pattern, ETL – Extract, Transform, Load. Now that compute, network, and storage have advanced so far, the less popular patterns built around the four key actions (Collect, Govern, Transform, and Share) are making a run again. These patterns create new ways to establish a truly robust, bullet-proof data pipeline that allows organizations to move data safely, efficiently, and thoroughly. There are also many more players in the data integration and pipeline world, in many different flavors.
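To make the four key actions concrete, here is a minimal, illustrative sketch of a pipeline composed of those stages. The types and function names are hypothetical placeholders for explanation only, not any particular product’s API:

```typescript
// Minimal, illustrative sketch of the four key pipeline actions.
// All types and names below are hypothetical placeholders.

type RawRecord = { source: string; payload: unknown };
type GovernedRecord = RawRecord & { owner: string; classification: string };
type CuratedRecord = { key: string; value: unknown };

// Collect: pull data from one or more upstream sources.
async function collect(sources: string[]): Promise<RawRecord[]> {
  return sources.map((source) => ({ source, payload: {} }));
}

// Govern: attach ownership, classification, and access rules.
function govern(records: RawRecord[]): GovernedRecord[] {
  return records.map((r) => ({ ...r, owner: "data-steward", classification: "public" }));
}

// Transform: reshape governed records into an analyst-ready form.
function transform(records: GovernedRecord[]): CuratedRecord[] {
  return records.map((r, i) => ({ key: `${r.source}-${i}`, value: r.payload }));
}

// Share: deliver curated records to a warehouse, API, or download point.
async function share(records: CuratedRecord[]): Promise<void> {
  console.log(`Published ${records.length} curated records`);
}

// A data pipeline is these four actions composed end to end.
async function runPipeline(sources: string[]): Promise<void> {
  await share(transform(govern(await collect(sources))));
}

runPipeline(["sensor-feed", "business-app-export"]).catch(console.error);
```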
Find Redundant Activities And Replace Them
Given the varying patterns, we thought we’d share what we beat our drum about the most when it comes to data pipeline solutions: find redundant activities and replace them. By redundant work activities, we mean activities with little or no impact on the intended end results; redundancy itself is something unnecessary or expendable. A primary example is “Workforce Task Redundancy”, where data collection, governance, transformation, and data sharing are conducted in separate silos, creating redundant data calls (collect), data streams, data stores, and points of delivery.
How Redundancy Issues Typically Look
Typically, organizations dealing with redundancy face a few common pain points. For starters, data is not in a form that is easily consumable by an analyst. That data may also be spread across various business applications, in differing formats. Oftentimes, various IoT devices send or hold data as well, so there is no central source of information, making it very difficult to trace. Finally, access to raw data is often limited by firewall or other security restrictions.
From there, organizations run into various “redundancy statistics”, the direct results of these pain points. Two are typical: quality is lost through redundancy, and workforce costs increase as staff spend time on redundant work. A solution would divide the effort of pulling and transforming the data so that analysts can focus on using quality data to perform their analysis.
These implementations lead to five typical tactical improvements that arise as a result of integrating data pipelines:
- Reliability – The process performs accurately with each execution.
- Flexibility – Pipelining allows the opportunity to create new derivatives or data products.
- Scalability – The ability to handle increasing or decreasing demand through automated computing.
- Economy – Going to a fully automated process helps cut back on costs and errors.
- Transparency – Pipelining allows for a more traceable method of moving datasets.
Case Studies
With that in mind, here are a couple of examples of redundancy we’ve analyzed, along with the opportunities for improvement we identified:
Science Data Agencies
In science data agencies, Xentity has often seen very common pain points. Datasets are often not fully indexed or easily accessible, and data formats vary inconsistently from source to source. When agencies reach the transformation step, the process is unfortunately neither traceable nor transparent, which can make repeating it difficult. All of this ties back to the fact that many of these agencies end up manually compiling a final dataset. It’s inefficient, untraceable, and the consistency of the data is always in question.
One science data agency in particular ran into these pain points. The results were losses in quality and time, along with unnecessary workforce costs. Staff ran a manual, labor-intensive transformation of the data that was inefficient, and because the process was not consistently repeatable, quality suffered. Only one staff member performed the work, and given the limited staff resources, the focus at the time was on performing the steps rather than expanding the ways analysts could use the data product. So, not only do we see a loss of quality and unnecessary costs, but also a loss of time that could have gone toward expanding methods of using the available data.
As Such…
Given these issues, we’ve found the following steps critical in implementing solutions. Keep in mind, many of these were implemented with architecture technologies such as AWS Lambda, SNS, SQS, AWS API Gateway, GDAL, PostgreSQL & PostGIS, and NodeJS, using Amazon AWS infrastructure as the environment for our efforts (a sketch of what one automated step might look like follows the list below):
- Going from a manual process to a fully automated process provides a more efficient method of pipelining data, giving science agencies like the one in this case a more economical approach that drives down costs.
- Granting agencies the reliability of an accurately performed process by integrating data pipelines.
- Providing the flexibility to create new derivatives or data products for different types of analysts.
- Introducing the ability to scale without additional staff investment. A single staff member no longer serves as the proverbial bottleneck; instead, a computer does the job.
- Introducing transparency & traceability for every transformation in each step, allowing the organization to validate each step to ensure repeatability and that the actual process is running smoothly.
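As a hedged illustration of the kind of automated step described above, here is a minimal sketch of an AWS Lambda handler (NodeJS/TypeScript, AWS SDK v3) that reacts to a new raw file landing in S3, records a traceability message on SQS, and notifies downstream transformation workers via SNS. The queue URL, topic ARN, and environment variable names are hypothetical placeholders, not the agency’s actual configuration:

```typescript
import { S3Event } from "aws-lambda";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sqs = new SQSClient({});
const sns = new SNSClient({});

// Hypothetical configuration values supplied to the Lambda function.
const AUDIT_QUEUE_URL = process.env.AUDIT_QUEUE_URL ?? "";
const DERIVATIVE_TOPIC_ARN = process.env.DERIVATIVE_TOPIC_ARN ?? "";

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = record.s3.object.key;

    // Traceability: log every transformation trigger so each step can be audited.
    await sqs.send(
      new SendMessageCommand({
        QueueUrl: AUDIT_QUEUE_URL,
        MessageBody: JSON.stringify({ bucket, key, receivedAt: new Date().toISOString() }),
      })
    );

    // Fan-out: let downstream transformation jobs (for example, GDAL workers
    // writing results to PostGIS) pick up the new object without manual hand-offs.
    await sns.send(
      new PublishCommand({
        TopicArn: DERIVATIVE_TOPIC_ARN,
        Message: JSON.stringify({ bucket, key }),
      })
    );
  }
};
```

The point of the sketch is the pattern, not the specific services: every hand-off that was previously a person emailing a file becomes an event with an audit trail.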
CIO Services Redundancies
In the CIO office, there are various information management functions whose governance policies were created separately. For example, Records Management has ties to its ‘cousin’, open records requests (e.g., in the Federal space, FOIA). Privacy is an attribute extension of records, yet it also governs system access itself. Security covers all levels of architecture in the functional sense, but not in the content sense. Capital Planning is somewhat tied to budget, yet not necessarily on the same calendar, so information sharing tends to be manual. Enterprise Architecture tends to be triggered by ad-hoc analysis, so unless data streams are continuously integrated, its work is done as needed and then becomes shelfware. Then there are other side policies, in accessibility, audit reporting, compliance reporting, and legal cases, that also collect similar data streams.
Obviously, this overlapping collection of system, application, hosting, dataset, functionality, and technology information, along with its attribution, creates major workforce redundancies and loses major amounts of time in terms of agility of response. Finally, and likely most important, the time spent on manual data pipelining leaves less time for analyzing and creating better solutions, products, or services, which is why the CIO office exists in the first place.
Given these issues, we’ve found the following steps critical in implementing solutions:
- Invest in maintaining Enterprise Metadata data pipelines through a single point of discovery (see the sketch after this list)
- Leverage existing data streams before creating new data calls
- Write the ‘toolset’ into policy so that collection of the metadata is required
- Add data streams based on user-story priority, balanced against complexity (including data readiness)
- Finally, when investing in maintenance, invest in BI analysts, data janitors, and a lead who serves as a ‘listening’ analyst on the governance teams. This increases agility in creating new capabilities for the toolset.
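To illustrate the “single point of discovery” idea, here is a minimal sketch of an enterprise metadata catalog in which each governance function (records, privacy, security, capital planning, EA) registers or queries the same entries instead of running its own data call. The interfaces, the in-memory store, and the sample values are hypothetical simplifications, not a specific product:

```typescript
// Hypothetical shape of a catalog entry shared across governance functions.
interface MetadataEntry {
  id: string;                           // e.g. a system or dataset identifier
  kind: "system" | "application" | "dataset" | "technology";
  owner: string;
  streams: string[];                    // data streams that already feed this entry
  attributes: Record<string, string>;   // e.g. records schedule, privacy status
}

class EnterpriseMetadataCatalog {
  private entries = new Map<string, MetadataEntry>();

  // A governance team registers (or enriches) an entry once; others reuse it.
  register(entry: MetadataEntry): void {
    const existing = this.entries.get(entry.id);
    if (!existing) {
      this.entries.set(entry.id, entry);
      return;
    }
    this.entries.set(entry.id, {
      ...existing,
      ...entry,
      streams: [...new Set([...existing.streams, ...entry.streams])],
      attributes: { ...existing.attributes, ...entry.attributes },
    });
  }

  // Any function (FOIA, privacy, security, capital planning) discovers what
  // already exists before issuing a new data call.
  find(kind: MetadataEntry["kind"]): MetadataEntry[] {
    return [...this.entries.values()].filter((e) => e.kind === kind);
  }
}

// Usage: the privacy office reuses the stream the records team already registered.
const catalog = new EnterpriseMetadataCatalog();
catalog.register({
  id: "grants-system",
  kind: "system",
  owner: "records-management",
  streams: ["records-inventory-feed"],
  attributes: { recordsSchedule: "schedule-A" },
});
catalog.register({
  id: "grants-system",
  kind: "system",
  owner: "privacy-office",
  streams: ["records-inventory-feed"],
  attributes: { privacyImpact: "assessed" },
});
console.log(catalog.find("system"));
```

The design choice being illustrated is simply that attribution accumulates on one shared entry per system, rather than each office maintaining its own spreadsheet of the same systems.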
What’s The Big Deal Here?
Quite a bit, actually. But the most important thing is this big conclusion: integrating data pipelines does indeed reduce redundancy in work activities. It brings back the quality lost to redundancy, and it provides a reliable, economical, flexible, traceable, and transparent solution. One that allows organizations and agencies like the ones in the two case studies above to implement and integrate a transformative solution into the data portion of their industry. At Xentity, we are all about transformation for the sake of efficient solutions, so we absolutely love stuff like this. If you want to know more, check out our website. Our services section can provide a lot more insight on the matter.