International Data Week was held in Denver during the week of September 11-17, 2016. A triumvirate of three organizations including […]
The National Science Foundation (NSF) has tasked the EarthCube Science Support Office (ESSO) with creating a detailed architecture […]
You know the “data wrangling” field is becoming more mainstream if the NYTimes is covering it at this level: “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” (NYTimes.com).
The article emphasizes that getting the data (sensors, web service feeds, corporate databases, smartphone data observations) prepared and ready to be consumed still takes a huge effort.
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
It drives home a point we love to belabor and the software companies do not. Be it ERP, GIS, MIS, or analytics, they show great demos, but with data already prepared, or with some of your data, but only the easy stuff. It works great in their environment, and as long as you give it good data, it performs and rocks! But the demos continually understate what it takes to wrangle, clean up, and prepare that data.
As huge proponents of the view that good data is a, or the, major barrier, it’s nice to see startup investment beginning to move into software that helps: ClearStory, Trifacta, Paxata, and other start-ups in the field. In the meantime, we need to make sure we always stay on top of the best, approved approaches, and the team will bring them to the table, using techniques that range from browser-based network applications in NodeJS to NoSQL databases to ETL to simply keeping your Excel skills up.
Having poked at the O’Reilly book “Doing Data Science” and its discussions of data wrangling and janitor work, we can say it’s not a bad read to pick up. A great quote in the NYT article from the Doing Data Science author is:
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,”
But the article does tend to focus on big data, and not on big data in an open data world.
Even if the data is scrubbed for machine-to-machine processing, as the article emphasizes, that still doesn’t address the fact that, with the push for data catalogs, data curators and creators HATE metadata creation with a passion. It’s one more burdensome step at the end of a long cycle. Furthermore, there is currently no true incentive, other than it being the right thing to do, to assure the data is tagged properly.
Let’s take the metadata needed to tag data for proper use, with a real-world example recently discussed with a biological data analytics expert.
A museum wants some biological data to support an exhibit. They have a student collect 100 scientific observation records on larvae using well-calibrated sensors. The data gets published. An academic study on climate change sees the data. The data shows a lower larvae count than the historical record, which the study takes as evidence of heat changes affecting the population. The data shows the sensors and tools used were scientifically accurate. The data is sourced and used. This is a great example of misuse of data. True, the data was gathered using great sensors and the right techniques, but it was never intended as a complete study of the area. When the student hit 100, they were out of there. An observation is not just an observation; its intended use is vitally important to denote.
Another great example, from a slightly different use angle. A major government records organization has an inventory of records data that is meticulously groomed: entries go through a rigorous workflow to make sure the metadata, and access to the actual data record, are accurate in real time. The metadata on data use is perfect and accurate, and lawyers and other FOIA experts are well-versed in proper use. But the metadata needed to discover something in the billions of records was not prepared for the more natural ways people would explore the data to find and look up records.
Historically, armies of librarians would be tasked with searching the records, but they have been replaced with web 1.0 online search systems that lack the programmed natural language processing (NLP) skills and the signals a librarian would apply. Even where those systems have such skills, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals, or what we call interpretative signals, that we discussed back in 2013.
This is another great example of overlooked data preparation: “Publish its metadata in the card catalog standard, and my work is done; let the search engine figure it out.” Once again though, the search engine will just tell you the museum larvae record matches your science study need.
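The gap between “matches the search” and “fit for the study” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ObservationRecord` class and its `collection_purpose` and `sampling_complete` fields are hypothetical and do not come from any real catalog or metadata standard.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class ObservationRecord:
    """Hypothetical catalog entry; field names are illustrative only."""
    title: str
    keywords: Set[str]
    collection_purpose: str   # assumed field: why the data was gathered
    sampling_complete: bool   # assumed field: was the area fully surveyed?


def keyword_match(record: ObservationRecord, query_terms: Iterable[str]) -> bool:
    # What a naive search engine does: any shared keyword counts as a hit.
    return bool(record.keywords & set(query_terms))


def fit_for_study(record: ObservationRecord, query_terms: Iterable[str]) -> bool:
    # A stricter check that also consults the use metadata.
    return keyword_match(record, query_terms) and record.sampling_complete


museum = ObservationRecord(
    title="Larvae counts for exhibit",
    keywords={"larvae", "count", "temperature"},
    collection_purpose="museum exhibit; collection stopped at 100 records",
    sampling_complete=False,
)

query = ["larvae", "climate"]
print(keyword_match(museum, query))   # the search engine says "match"
print(fit_for_study(museum, query))   # the use metadata says "not for this"
```

The point of the sketch is that the second check is only possible if someone recorded the purpose and completeness of the collection in the first place.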
The workshop theme and community output notes may be of high interest. The focus was more on how Federal Geodata “Operations” / Assets can improve and help Geoscientists through improved interagency coordination.
There are excellent breakout notes on roadblocks, geoscience perspective, concrete steps, etc. across the following topics on the following URL (Google Docs under “notes” links): http://tw.rpi.edu/web/Workshop/Community/GeoData2014/Agenda
Day 1 Breakouts (Culture/Management)
Governmental open data
Interagency coordination of geodata – progress and challenges
Feedback from the academic and commercial sectors
Collaborating environment and culture building
Day 2 Breakouts (Tech)
Data citation and data integration frameworks – technical progress
Experience and best practices on data interoperability
Connections among distributed data repositories – looking forward
The workshop has some fruits coming out of it.
- About 50 people. NOAA and USGS on Fed side primarily.
- Pushing forward on agenda to see if we have progressed on ideation pragmatism since Geodata2011.
- Focus is on Cultural and Financial issues limiting inter-agency connection.
- Term agile government came up often… with some laughs, but some defenders (Relates to our smartleangovernment.com efforts with ACT-IAC)
- Scientists hear Architecture as Big IT contracts and IT infrastructure, not process improvement, data integration, goal/mission alignment, etc., so there are clear vernacular issues.
- FGDC and tons of other standards/organizing bodies seen as competing and confusing
- data.gov and open data policy hot topic (Seen as good steps, low quality data) – “geoplatform” mentioned exactly “zero” times (doh!)
- geo data lifecycle primarily on the latter end of cycle (citations, discovery, publication for reusability, credit) but not much on the project coordination, data acq coordination, no marketplace chatter, little on coordinating sensor investment
- General questions from scientists on how intel groups can be reached
- Big push on ESIP
- Concrete steps suggested were best practices to agencies and professors
- Data Management is not taught, so what do we expect? You get what you pay for.
- Finally, big push on how to tie grassroots efforts and top-down efforts together – grassroots agreed we need to showcase more, earlier, and get into the communities top-down folks are looking at.
- Federal representation was not high, and it was agreed that with limited Government travel budgets, we need to bring these concepts to them: meet at their meetings, agendas, conferences, and circuits, and push these concepts and needs.
Again, lots of great notes from breakouts on roadblocks, geoscience perspective, concrete steps, etc. across the topics above are on the same URL (Google Docs under “notes” links): http://tw.rpi.edu/web/Workshop/Community/GeoData2014/Agenda
Questions we posed in general sessions:
- Of Performance, Portfolio, Architecture, Evangelism, Policy, or other, which is most important to the GeoScientist and needs to be addressed in order to improve inter-agency coordination?
- You noted you want to truly disrupt or re-invent the motivation and other aspects of the culture. What was discussed related to doing so: an inter-agency wiki commons a la Intellipedia? Gamification, and how resource-management MMORPGs create incentives, i.e. a transparent and fun way to incentivize data maturity? Crowdsourcing a la Mechanical Turk to help in cross-agency knowledge-sharing? Hackathon/TC Disrupt competitions to help showcase? A combination, i.e. gamify the metadata lifecycle with a crowd model?
- After registering data, metadata, and good citations, and doing all the data lifecycle management, and if we are to “assume internet”, who is responsible for the SEO rank that helps people find scientific data on the internet? Who assures and enhances schema.org registrations? Who aligns signals to help with keywords and thousands of other potential signals, especially in response to events needing geoscience data? Who helps push data.gov and domain catalogs to be harvested by others?
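For readers unfamiliar with what a schema.org registration involves, a minimal sketch follows. It builds a JSON-LD description using the schema.org `Dataset` type and a handful of its properties (`name`, `description`, `keywords`, `url`). The helper `dataset_jsonld`, the dataset itself, and the example.org URL are hypothetical placeholders, not anything produced at the workshop.

```python
import json
from typing import Dict, Iterable


def dataset_jsonld(name: str, description: str,
                   keywords: Iterable[str], url: str) -> Dict[str, object]:
    """Build a minimal schema.org Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": list(keywords),
        "url": url,
    }


# Hypothetical dataset; every value below is a placeholder.
record = dataset_jsonld(
    name="Example stream gauge observations",
    description="Illustrative record only; not a real dataset.",
    keywords=["hydrology", "stream gauge", "geoscience"],
    url="https://example.org/datasets/stream-gauge",
)

# Serialized form that a catalog page could embed in a
# <script type="application/ld+json"> element for crawlers to read.
markup = json.dumps(record, indent=2)
print(markup)
```

Generating the markup is the easy part; the question posed above is who owns keeping such descriptions accurate and aligned with how people actually search.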
We tweeted a ‘harumph’ for Geekwire’s article “First, we kill all the ‘futurists’”. Then the Policy Horizons Canada group put out a fantastic emerging tech futures map. Futurists are damned if they do and damned if they don’t. On the new study published by Envisioning and Policy Horizons Canada, a blog on Business Insider notes:
On Friday, the group published a giant graphic summarizing emerging technologies and showing when they could become scientifically viable, mainstream, and financially feasible. This follows more detailed graphics (pdf files) showing future innovations in agricultural and natural manufacturing, neurology and cognition, nanotechnology and materials, health, digital and communication technology, and energy.
These predictions may not be so far off.
Moore’s Law is accelerating digital processing well into the hockey stick shift. The web and the flat world are kicking in Metcalfe’s law of network interconnection, creating a tipping point for rapidly adapting to new tech globally.
So, some of these information-related futures will only be delayed by political, geopolitical, or epidemic disruptions at this point. That all said, we’re keeping in mind the main premise of “First, we kill all the ‘futurists’”: that it needs to be more than a smart guy presenting ideas ‘ripped from the pages of Google News Alerts’. Some of the references did feel a bit like that, but it is a conversation starter to get the mind ‘context switched’ from the day-to-day rat race to what could be.
Call it our guilty pleasure or call it – regardless of pinpointing “futurist” timelines – a great way to help teach awareness of the pace of emerging and disruptive tech.
A few findings we enjoyed
First, the graphics are presented the way we love to present tech, by how they change business models:
The near future of technology promises change at an ever-increasing pace while rapidly transforming business models, governments and institutions worldwide. In order to help us make sense of our uncertain future, Policy Horizons Canada engaged Michell Zappa of Envisioning Technology to explore key technologies that are likely to have a profound effect on humanity on a global level and generational timeframe.
In the six areas, the focus is on economic impact, geopolitical (energy), and human-computer interaction and societal impacts.
Neuro and AI
Looking at the slices related to information progression, completeness, and how it gets more compelling and knowledgeable is of course our lens. As noted in Why we focus on spatial data science, we are very interested in the path from research to mainstream of data to information to knowledge to wisdom. We also continuously discover it is true that our graphics are truly still at the whiteboard.
So, we of course are enthralled and drooling over the neurology and cognition aspects. It is great to see agreement with our leanings and concepts that we must invoke sentiment (emotion tracking) prior to having prediction (crime prevention). Yet it looks like the focus is on the facial recognition aspects of emotion. Given that there are so many other pantomimes of liars and other emotions, not to mention composite emotion detection across verbal cues, setting, background, environment, and contemporary context, this does appear a bit aggressive. There is also now an abstraction of emotion through devices (txt, twitter, facebook, etc.) that creates different faces of a person and emotion. It will take large data to integrate, at the civilian level, the HUMINT concepts that the intelligence agencies have access to.
While they nailed some interesting concepts of physical, physiological, and neuro interactions (human-computer interaction), what felt missing in the Neuro area was the fact that computers like Watson went from many servers to one server and then to an openly available service in a matter of five years (from Jeopardy champ to cloud service). What will that capability make 2010 Siri look like in ten years: a novelty, a joke? Already, Microsoft’s Cortana in late 2014 has progressed from lookup and secretarial duties to executive administrative assistant. What will happen in another 10 years? What will happen when major brain mapping or DARPA’s brain mimic efforts produce their research in that time period? What will happen when the storage capacity of the web can handle brain storage?
Will we have personalized, sensitive advisors and therapists? Has the slew of updated sci-fi movies on such cognitive devices painted that new picture (i.e. Transcendence (flop or not), Her)? To believe we can get the emotion in ten years is very bold, but we will have the power of Watson in our tablet or smartphone-like devices in 10 years. What that will bring for intelligence and information will be interesting.
The full publication is at http://www.horizons.gc.ca/eng/content/metascan-3-emerging-technologies-0 and a great way to learn more about the study quickly is at http://envisioning.io/horizons/.
There is so much more. These were a couple of notes on one-sixth of the study. But so as not to spoil your exploration too much more, we’ll just summarize by saying: go in, explore, and get your mind on the possible. As an IBM colleague of ours used to put in his email signature:
A pessimist sees the difficulty in every opportunity; an optimist sees the opportunity in every difficulty.