First, What Is Data Wrangling?
Data wrangling is the process of transforming and mapping data from one “raw” format into another, with the intent of making it more appropriate and valuable for a variety of downstream purposes, such as analytics. The field is growing into something big. In fact, you know “data wrangling” is becoming mainstream when the NYTimes is covering it. For example, see the article: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights (NYTimes.com).
This NYTimes article emphasizes a key point that data scientists wrestle with: the laborious amount of time they spend getting data from sensors, web service feeds, corporate databases, smartphone observations, and so on, and then preparing it for consumption. In fact, the article states: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
We are huge proponents of the view that getting good data is a real barrier. So, it is nice to see startup investment beginning to flow into software that helps, including ClearStory, Trifacta, Paxata, and other startups in the field. In the meantime, we need to stay on top of the best proven approaches, with the hope that the team will bring them to the table: everything from browser-based network applications in NodeJS, to NoSQL databases, to ETL, to simply keeping your Excel skills sharp.
In poking at the O’Reilly book “Doing Data Science” and its discussions of data wrangling/janitor work, we find it is not a bad read to pick up. A great quote in the NYT article from the Doing Data Science author is: “You prepared your data for a certain purpose, but then you learn something new, and the purpose changes.”
But The Article Does Tend To Focus On “Big Data”, And Not “Big Data In An Open Data World”
Sure, we can scrub the data for machine-to-machine processing, as the article emphasizes. However, that still does not address the push for data catalogs and metadata creation. Data curators and creators hate this push with a passion: it is one more burdensome step at the end of a long cycle. Furthermore, there is currently a major lack of true incentives, beyond it being the right thing to do, to assure the data is tagged properly.
Let Us Take The Metadata Needed To Tag For Proper Use, Using A Real-World Example
Let’s say a museum wants some biological data to support an exhibit, so they have a student collect 100 scientific observation records on larvae using well-calibrated sensors. The museum publishes the data, and an academic study on climate change later uses it. The data shows a lower larvae count than previous historical records, and the study uses it to demonstrate heat changes impacting the larvae. The metadata shows that the sensors and tools used were scientifically accurate, so the researchers source and use the data.
This is a great example of data misuse. True, the data was gathered using great sensors and the correct techniques, but it was never intended as a complete study of the area. When the student hit 100 records, they were out of there. Not every observation is equivalent, and denoting the intended use of the data is vitally important.
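The larvae scenario above can be sketched in code. This is a minimal, hypothetical example of attaching use-limitation metadata to a dataset record and checking a proposed use against it; the field names (intended_use, completeness, and so on) are invented for illustration, not from any metadata standard.

```python
# A hypothetical dataset record carrying use-limitation metadata.
dataset = {
    "title": "Larvae observations for museum exhibit",
    "records": 100,
    "collection_method": "student field collection, calibrated sensors",
    "completeness": "capped at 100 records; not an exhaustive survey",
    "intended_use": ["museum exhibit", "educational display"],
}

def check_fitness(metadata, proposed_use):
    """Flag uses the data was never collected to support."""
    if proposed_use not in metadata["intended_use"]:
        return (False, "'%s' is not a declared intended use; completeness: %s"
                % (proposed_use, metadata["completeness"]))
    return (True, "declared fit for this use")

ok, note = check_fitness(dataset, "climate change trend analysis")
# ok is False: the climate study would be warned before reusing the data.
```

Even a crude check like this, run at the point of reuse, would have surfaced the mismatch between how the larvae data was collected and how the climate study used it.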
A Different Approach
Here is another great example, from a slightly different angle. A major government records organization wanted to take its inventory of records data online. The data is meticulously groomed and found to be very accurate. It is entered into a database and put through a rigorous workflow to make sure its metadata, and access to the actual data record, is accurate in real time.
The metadata on data use is perfect and accurate, and lawyers and other FOIA experts are well versed in its proper use. But the metadata that would help someone discover a record among billions has not been suitably prepared for the more natural ways people would explore, discover, and look up the records.
Historically, armies of librarians searched for these records. But they have been replaced with web 1.0 online search systems that have no natural language processing (NLP) capabilities programmed in, and that lack the signals a librarian would apply. Even where such systems exist, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals, or what we call interpretative signals, which we discussed back in 2013.
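One small example of what such an interpretative signal might look like: synonym expansion, so a natural-language query can match a formal record title. This is a toy sketch; the synonym table and records are invented for illustration and real search systems layer many such signals together.

```python
# Toy "interpretative signal": expand query terms with known synonyms
# so everyday language can match formal catalog titles.
SYNONYMS = {
    "ship": ["vessel", "boat"],
    "ww2": ["world war ii", "second world war"],
}

def expand_query(query):
    """Expand query terms with synonyms (one simple signal)."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms.update(SYNONYMS.get(term, []))
    return terms

def search(records, query):
    """Return ids of records whose title contains any expanded term."""
    terms = expand_query(query)
    return [rec["id"] for rec in records
            if any(t in rec["title"].lower() for t in terms)]

records = [{"id": "R1", "title": "Naval vessel logs, 1944"}]
# A query for "ship logs" finds R1 only because "ship" expands to "vessel";
# a raw web 1.0 keyword match on "ship" would miss it.
```

A librarian applies hundreds of these substitutions and contextual judgments instinctively; preparing the metadata so software can do the same is exactly the data preparation work being overlooked.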
This is another great example of overlooked data preparation. “Publish its metadata in the card catalog standard, and my work is done; let the search engine figure it out.” Once again, though, the search engine will just tell you the museum larvae record matches your science study need.