To Be a Data Scientist You Need to Be a Data Engineer First

In the data industry, there are many positions that require an expertise in data methodology in order to be successful. Two positions in particular deserve our absolute attention at the moment: data scientists and data engineers. As a data consultancy, Xentity recognizes the need to differentiate between these two positions. A fantastic blog from Cognitive Class lays out the basic definition for data scientists and engineers, and in the article below we take a look at the basic assertions from this blog article.

By definition, data engineers are the data professionals who prepare big data for analysis. They are the software engineers who design, build and integrate data from various resources. They work to optimize the ‘data ecosystem’ of their company for proper, more efficient use. By comparison, data scientists take the volumes of raw data integrated by data engineers and transform them into useful insights through analysis and modeling. They are problem solvers and presenters of information who need experience working with different kinds of datasets. These are two very different positions with a synergy between them, a relationship from which both benefit.

So What?

At Xentity, we keep several other factors in mind when nurturing successful data scientists and data engineers. To us, data engineers need to be capable of cleaning and integrating what is known as ‘dirty data’, which is usually inaccurate, incomplete or inconsistent. By comparison, data scientists must be capable of developing advanced algorithms and models to support analysis. Both positions, however, need a keen eye for anomalies and a certain mastery at the technical level.

One question that comes up from reading this blog is whether you need to be a data engineer first in order to be an effective data scientist. Our conclusion: once you become a data scientist, you always need to think like a data engineer first and foremost.

What Their Duties Bring To Their Organizations

To understand why data scientists need to be data engineers first, we need to identify what each role brings to the table. Data engineers, for example, recommend and design data systems. This can involve a lot of data gathering, data transformation, and business rules about data evaluation or data format. And, as mentioned before, good data engineers have a keen eye and an analytical mind to recommend and design those systems. They also need to be constantly thinking about the purpose of the data, such as how users need to use it.

For instance, in analyzing the download statistics for the US National Map Service (in particular, for hydrography), the value of a data engineer lies in designing the download data so that it answers users’ questions about the data itself (hydrography, topography, etc.). This involves cleaning up and analyzing that data to better understand how to provide insight into the services it supports, thereby improving business processes and creating efficiencies. In other words, data engineers strive to prepare and understand the data itself so that it can be used in the most effective and efficient manner. This applies to just about any business, as organizations are always looking to better serve customers with their services.

By comparison, data scientists try to answer business questions posed either by themselves or by the business. They need to evaluate the quality of the data, its completeness, and the four V’s of data: volume, velocity, variety and veracity. Data scientists strive to evaluate and use data in order to provide the greatest value to the organization. They provide that value through expertise in both evaluating the data itself and answering the business questions that arise as a result.
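As a rough illustration of what evaluating completeness can look like in practice, here is a minimal Python sketch. It is not Xentity's tooling: the function name `completeness`, the field names, and the sample download rows are all hypothetical, invented only to show the idea of measuring how much of each field is actually filled in before any analysis begins.

```python
from typing import Any


def completeness(records: list[dict[str, Any]], fields: list[str]) -> dict[str, float]:
    """Return, for each field, the share of records that hold a usable value."""
    total = len(records)
    report = {}
    for field in fields:
        # A value counts as missing when it is absent, None, or an empty string.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical download-log rows, invented for illustration only.
downloads = [
    {"dataset": "hydrography", "state": "CO", "bytes": 1200},
    {"dataset": "topography", "state": "", "bytes": 800},
    {"dataset": None, "state": "UT", "bytes": 950},
    {"dataset": "hydrography", "state": "CO", "bytes": None},
]

report = completeness(downloads, ["dataset", "state", "bytes"])
print(report)
```

A low score for a field is a signal to go back to the data engineering step before drawing conclusions from that field.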

When Garbage Becomes a Problem

A complication arises when a data scientist attempts to answer questions of the data and that data is, for lack of a better term, total garbage. A term we love to use at Xentity is ‘garbage in, garbage out’, and if you work with us or for us you’re going to hear it a lot. It is a pretty common phrase: if your data source is ‘garbage’, then the result after data processing will be ‘garbage’ as well. For example, consider what would happen if you suddenly received bad data that was not properly prepped for any kind of analysis. If that data has not been properly integrated, cleaned, designed, etc., the only kind of analysis you as a data scientist can possibly put out is…well, garbage. If stick figures are your pleasure, this awesome comic could probably explain it even better.

(Credit to https://xkcd.com/)

This is also why Xentity strongly believes in ‘checkpointing’ often. Basically, the more often we checkpoint, the more likely we are to catch the garbage early, before data analysis begins. Without the expertise and know-how of a data engineer, it becomes extremely hard to avoid poor quality data.
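One way to picture checkpointing is as a small validation gate between pipeline stages. The Python sketch below is only an assumption about how such a gate might look, not a description of Xentity's actual process; the names `checkpoint` and `drop_incomplete`, the required fields, and the sample rows are hypothetical.

```python
def is_missing(value) -> bool:
    """Treat None and empty strings as missing values."""
    return value in (None, "")


def checkpoint(stage: str, rows: list[dict], required: list[str]) -> list[dict]:
    """Stop the pipeline at this stage if any row lacks a required value."""
    bad = [i for i, row in enumerate(rows)
           if any(is_missing(row.get(key)) for key in required)]
    if bad:
        raise ValueError(f"checkpoint '{stage}' failed for rows {bad}")
    return rows


def drop_incomplete(rows: list[dict], required: list[str]) -> list[dict]:
    """A deliberately simple cleaning step: keep only fully populated rows."""
    return [row for row in rows
            if not any(is_missing(row.get(key)) for key in required)]


# Hypothetical raw rows, invented for illustration only.
raw = [
    {"dataset": "hydrography", "bytes": 1200},
    {"dataset": "", "bytes": 800},
    {"dataset": "topography", "bytes": None},
]
required = ["dataset", "bytes"]

# Checkpointing the raw data catches the garbage before any analysis runs.
try:
    checkpoint("after ingest", raw, required)
    caught = None
except ValueError as error:
    caught = str(error)
print(caught)

# After cleaning, the same checkpoint passes and the pipeline can continue.
cleaned = checkpoint("after cleaning", drop_incomplete(raw, required), required)
print(cleaned)
```

The design choice here is to fail loudly at the earliest stage rather than let incomplete rows flow silently into the analysis.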

For Example…

Imagine if a restaurant looked at the data on how people were responding to the ingredients it was using. What if it not only kept the ingredients people did not like, but also removed the ingredients they did like? Or, imagine if a system gathering data failed to ask the proper questions about statistics from that data. If the US National Map Service were asking the wrong kinds of questions about download and usage statistics, data scientists would be doomed to draw the wrong conclusions. It is the same with any kind of organization. If the data is gathered, prepared or analyzed incorrectly, any conclusion that results is going to be tainted and incorrect. Garbage in…you get the idea.

Think Like a Data Engineer, Even If You Are a Data Scientist

What exactly do we mean when we say that in order to be a data scientist, you need to be a data engineer first? It is important to recognize how data is planned and used, and to be able to clean and organize that data prior to data analysis. In other words, you have to understand the data before you attempt to use it. Data engineers provide value to their organizations as they work with their user community to better understand data uses. If users can’t answer any questions about the purpose behind capturing data, the analysis will not be as successful. Again, it’s garbage in and garbage out.

As an aside, you can also be a data custodian whose purpose is cleaning data. What that entails is reformatting and standardizing data so that it is useful for data science. Otherwise, it’s just unstructured garbage. There are a lot of iterations involved in cleaning data, and this is where custodians come into play.
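To make reformatting and standardizing a little more concrete, here is a small Python sketch of a single cleaning pass. It is a simplified assumption of what a custodian's first iteration might do; the function name `standardize` and the sample records are hypothetical, and real data usually needs several such passes with rules specific to the dataset.

```python
def standardize(record: dict) -> dict:
    """Normalize field names and text values so equivalent records match."""
    clean = {}
    for key, value in record.items():
        # Field names: trim, lowercase, and use underscores instead of spaces.
        name = key.strip().lower().replace(" ", "_")
        # Values: collapse repeated whitespace, trim, and lowercase.
        clean[name] = " ".join(str(value).split()).lower()
    return clean


# Hypothetical records describing the same download in two different styles.
messy = [
    {"Dataset Name": "  Hydrography ", "State": "co"},
    {"dataset name": "hydrography", "state": " CO"},
]

standardized = [standardize(record) for record in messy]
print(standardized)
```

Once both records share one format, they can be compared, counted and joined reliably in later analysis.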

Furthermore…

As a reminder, it’s not necessarily that you need to be a data engineer before you are a data scientist. We are not trying to be super literal here. You just need to think from the perspective of a data engineer in order to be an effective data scientist. Take the State of Colorado Business Intelligence Center (BIC) for example. The entire goal of the BIC is to publish open data captured from various governmental and commercial entities, and to make it freely available for developers, businesses, and the community to use in making better decisions. Having a data engineer responsible for processing, validating and publishing this data is incredibly valuable in accomplishing the agencies’ goals.

Note that you can be an effective data engineer and data scientist at the same time. However, they require two very different skill sets. Also, you absolutely need the data engineer’s skills to be an effective scientist. Keep these concepts in mind for your future endeavors:

Always dive deep and look for errors in the data itself.
Strive to not manipulate data to provide the conclusion you want. Instead, represent it as it is, even if it’s not necessarily what you want.
Learn everything about the data while cleaning and organizing it, because learning the issues it has and the tools to fix it is crucial. In other words, keep an open mind and a keen eye.

Turning Thoughts Into Action

It ultimately comes back to the fact that being a data scientist is just like any other career path. Part of that path is learning to think like a data engineer at all times, and you have to keep doing so even after you have become a data scientist. As a data scientist, you simply have to avoid garbage in, garbage out, and to do that, you need to think like a data engineer. Analyze the data with a keen eye and an open mind. Clean out the “muck” that would invalidate your data, not the data that merely contradicts the conclusion you want to reach. There is no point in reaching any kind of conclusion if your data has been “tainted”, so to speak. Seek out errors like a data engineer, and then present your findings like a data scientist. That is how you play the role properly.

Regarding all of this, Xentity is all about the idea of Less Sooner. Being data engineers and data scientists aligns with this concept, in that we deliver iterations with meaningful, valuable data change. This brings us back to the engineer/custodian concept discussed earlier. Checkpointing often is also crucial here, as we have mentioned before, because it helps avoid the ‘garbage in, garbage out’ issue that troubles both data engineers and scientists.

As Such…

Going off the idea of checkpointing often, Xentity strives to help our clients by breaking down issues into manageable chunks. In other words, we strive to provide timely iterations in our services. As a data engineer, when cleaning data, you need to make it manageable for data scientists. Most importantly, data science is about providing answers to goal-oriented questions. In other words, it is about providing valued output to the questions that arise from data.

We at Xentity are extremely passionate about our four major values which are shown here as our “delivery purpose”. So you can imagine we are passionate about a portion of the industry that falls in line with those values. And we are just as passionate about data. That means you can bet we are one-hundred percent convinced of the value that data engineers and their skills and knowledge provide to the data industry. And we are more than willing to make that provision ourselves.