Ten thoughts on fixing search on our opendata catalogs

Blog post
edited by
Wiki Admin

If you go to any data catalog (academic publication catalogs, government agency opendata clearinghouses, federated catalogs, marketing lists, metadata search sites, or even popular consumer sites), most actually have a lot of great data. But it is extremely hard to make sure you are pulling down data that is actually informative, that is truly information, without spending an ever-increasing amount of time.

So, we are on this great opendata train. The phrase du jour is "too much information"; I say too much data. Data is different from information:

Data are values or sets of values representing a specific concept or concepts. Data become “information” when analyzed and possibly combined with other data in order to extract meaning, and to provide context. The meaning of data can vary according to its context (Source: Federal Enterprise Architecture Data Reference Model).

These sites are more like an eHoarder of data hoping to become an information destination. There is a lot of junk; at the same time, it all started because some things had value, and, well, we are losing perspective.

I think part of the problem is that the metadata is bad. But even when it is good, it sits beside data that is bad. The internet "click" folks rely on this and hijack data discovery on search engines for this exact purpose. They hijack typo'd web site names like netfix.com or the like. They hijack keywords. They manipulate rankings with SEO techniques to push their sites higher in search engines.
In closed communities, it's not intentional manipulation, but there is a lack of incentive to fix discovery.
What are ways we can fix our open catalogs? Here are ten ideas:
  1. Make searching more fun – Take the facets in tools like CKAN and do more with them: kayak.com-style jQuery filters, time-based sliders, charts that pop up with record counts for context. Look, Kayak is, in a nutshell, a site scraper hitting APIs and re-presenting the results in simple ways for referral fees. It works because they made searching easy and, more importantly, made travel searching sort of fun.
  2. Make separate search components from your WebMIS – Stop fronting MIS systems with advanced-form search screens. Keep those if they are required or needed by your 5% power users, but also build a Solr, NoSQL, or other fast search layer that lets you build in search signals as you gather data on users. NodeJS feeds to the search database/index are fast, and millisecond-level updates are fast enough for 99% of cases (see the first sketch after this list).
  3. Use enterprise search instead of rolling your own – Take the search functions of standalone sites across your organization and make them an enterprise service, where each standalone group can still control or have input on its own search signals.
  4. Feed schema.org for SEO with a virtual library card – Beyond traditional SEO tuning, broker relationships or invest in patterns for search engines like Google so they can build good signals/rules on top of your data. Do this by putting schema.org tags on your catalog pages, generated from the metadata you already collect (see the JSON-LD sketch after this list).
  5. Register to be harvested – Get registered on multiple harvesting sites; they may find ways to make your data more discoverable, and when users find you through them, they still see the details on your site (or your site pushes those details as well). The point being, your catalog remains the authoritative source.
  6. Crowdsource and gamify search signal tuning – Can we get crowdsourcing going to dogfood site usage and help build better search signals and rules? The crowd could come from your own organization with corporate awards or gamification, or from true external stakeholders. Bonus, more student power: can we get STEM programs or university systems involved through curricula, projects, etc., since a lot of search signal improvement is really about person-power or machine-to-machine power?
  7. Make events to force data wrangling – In Colorado, we (our team did the data side) just did gocode.colorado.gov as a way to get application developers to build apps off OpenData Colorado. The reward was essentially a reverse contract, which made it legal to give a monetary award, create various set-asides, and incent usage. That usage in turn created more opportunities for exposure as a time-based event, which got data suppliers more engaged in putting things up.
  8. Find ways to share signals – This is more of a speculative theory, but could we feed engines like Watson or Google to build a brain of search patterns, tell them our audience differences by having them scan our data, make some stereo-equalizer-style tweaks, and figure out which rule expression patterns to borrow, sharing signal libraries across catalogs?
  9. Learn more about what our librarians do – Look, our librarians 20 years ago did more than put books back on shelves and give you mean looks over late returns. They also managed what went into the library, helped with complicated inquiries to find information, and even curated across other libraries. Our network of meta-sites, whether grown organically, driven by capital, or driven by the public sector, grew out of computer science and IT, not library science. We need to get computer science/MIS/IT and library science to start dating again: get to know each other again, and remember the good times when we used to be able to find things and help each other out.
  10. Can we score opendata sites? – We have watchdogs on making data open, which is great. They help make sure organizations provide what they are supposed to provide and stay openGov. But this approach would focus more on scoring the reality of discovering what has been provided. For example, we know the lawyer trick for creating problems with discovery: provide the opposing side with so much information that they are inundated and there is not enough time to review it all, and yadda, yadda, legal gamesmanship. Can we find ways to score or watchdog sites on data discoverability, either as part of transparency or as a different type of consumer report?
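
To make ideas 1 and 2 concrete, here is a minimal sketch in Node/TypeScript, assuming a Solr core named "catalog" running at localhost:8983 and illustrative field names (title, keywords, agency, modified) that are not a fixed schema. It pushes freshly wrangled records into the index and then runs a faceted query whose record counts a kayak-style filter UI could chart.

```typescript
// Minimal sketch (Node 18+, global fetch): feed a Solr core and run a faceted query.
// Assumptions: a Solr core named "catalog" at localhost:8983; field names are illustrative.

type CatalogRecord = {
  id: string;
  title: string;
  keywords: string[];
  agency: string;
  modified: string; // ISO-8601 date
};

const SOLR = "http://localhost:8983/solr/catalog";

// Push freshly wrangled records into the search index (commit immediately for the demo).
async function indexRecords(records: CatalogRecord[]): Promise<void> {
  const res = await fetch(`${SOLR}/update?commit=true`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(records),
  });
  if (!res.ok) throw new Error(`Indexing failed: ${res.status}`);
}

// Run a keyword query and ask Solr for facet counts by agency and keyword --
// the record counts a kayak-style filter UI could render as clickable facets or charts.
async function facetedSearch(q: string): Promise<unknown> {
  const params = new URLSearchParams({ q, rows: "10", facet: "true", wt: "json" });
  params.append("facet.field", "agency");
  params.append("facet.field", "keywords");
  const res = await fetch(`${SOLR}/select?${params.toString()}`);
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  return res.json();
}

// Example usage
(async () => {
  await indexRecords([
    {
      id: "obs-2013-001",
      title: "Larvae observations, Pond A",
      keywords: ["biology", "larvae"],
      agency: "Museum of Natural History",
      modified: "2013-06-01",
    },
  ]);
  console.log(JSON.stringify(await facetedSearch("larvae"), null, 2));
})();
```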
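
And for idea 4, a hedged sketch of what "feeding schema.org" can look like: generate a JSON-LD Dataset description from metadata the catalog already captures and embed it in the dataset's landing page so search engines can build signals on top of it. The input fields and URLs are assumptions for illustration, not any particular catalog's schema.

```typescript
// Sketch: turn existing catalog metadata into a schema.org Dataset JSON-LD tag.
// The input shape and URLs are illustrative assumptions.

type CatalogMetadata = {
  title: string;
  description: string;
  keywords: string[];
  landingPage: string;
  downloadUrl: string;
  publisher: string;
};

function toSchemaOrgDataset(meta: CatalogMetadata): string {
  const dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: meta.title,
    description: meta.description,
    keywords: meta.keywords,
    url: meta.landingPage,
    publisher: { "@type": "Organization", name: meta.publisher },
    distribution: {
      "@type": "DataDownload",
      contentUrl: meta.downloadUrl,
      encodingFormat: "text/csv",
    },
  };
  // Embed this tag in the <head> of the dataset's landing page.
  return `<script type="application/ld+json">${JSON.stringify(dataset, null, 2)}</script>`;
}

// Example usage with made-up values
console.log(
  toSchemaOrgDataset({
    title: "Statewide Business Filings",
    description: "Monthly extract of business entity filings.",
    keywords: ["business", "filings", "opendata"],
    landingPage: "https://data.example.gov/dataset/business-filings",
    downloadUrl: "https://data.example.gov/dataset/business-filings.csv",
    publisher: "Example Secretary of State",
  })
);
```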

.02

 

I always wanted to be a data janitor

Blog post
edited by
Wiki Admin

You know the "data wrangling" field is becoming more mainstream when the NYTimes is covering it at this level: "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights" (NYTimes.com).

The article emphasizes that getting the data (sensors, web service feeds, corporate databases, smartphone observations) prepared and ready to be consumed is still a huge effort.
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

It hits home a point we love to belabor and the software companies do not. Be it ERP, GIS, MIS, or analytics, they show great demos, but with data already prepared, or with some of your data, but only the easy stuff. It works great in their environment, and as long as you give it good data, it performs and rocks! But the demos consistently gloss over what it takes to wrangle, clean up, and prepare that data.
As a huge proponent of the view that good data is a major, or even the major, problem and barrier, it's nice to see startup investment moving into software that helps: ClearStory, Trifacta, Paxata, and other start-ups in the field. In the meantime, we need to make sure we always stay on top of the best, approved approaches, and the team will bring them to the table, using various techniques from browser-based network applications in NodeJS, to NoSQL databases, to ETL, to simply keeping your Excel skills up.
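
As a minimal illustration of that kind of janitor work, here is a sketch of a small cleanup pass in Node/TypeScript over a hypothetical set of raw sensor observations; the field names, formats, and rules are made up for the example, not a prescription.

```typescript
// Sketch of a small data-janitor pass: trim text, normalize dates,
// and drop rows missing required values. Field names and rules are illustrative.

type RawRow = { site: string; observed: string; count: string };
type CleanRow = { site: string; observed: string; count: number };

function cleanRows(rows: RawRow[]): CleanRow[] {
  const cleaned: CleanRow[] = [];
  for (const row of rows) {
    const site = row.site.trim();
    const countText = row.count.trim();
    const count = countText === "" ? NaN : Number(countText);
    const date = new Date(row.observed.trim());
    // Skip rows that cannot be repaired automatically.
    if (!site || Number.isNaN(count) || Number.isNaN(date.getTime())) continue;
    cleaned.push({
      site,
      observed: date.toISOString().slice(0, 10), // normalize to YYYY-MM-DD
      count,
    });
  }
  return cleaned;
}

// Example usage with the kinds of messiness a raw feed typically has.
const raw: RawRow[] = [
  { site: "  Pond A ", observed: "2014-06-01", count: "14" },
  { site: "Pond B", observed: "not recorded", count: "9" },
  { site: "Pond C", observed: "2014-06-03", count: "" },
];
console.log(cleanRows(raw)); // only the Pond A row survives
```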
Poking at the O'Reilly book "Doing Data Science" and its discussion of data wrangling/janitor work, it's not a bad read to pick up. A great quote in the NYT article from the Doing Data Science author is:
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,”

But the article does tend to focus on bigdata, and not on bigdata in an opendata world.

Even if the data is scrubbed for machine-to-machine processing, as the article emphasizes, it still doesn't address the fact that, with the push for data catalogs, data curators and creators HATE (with a passion) metadata creation. It's one more burdensome step at the end of a long cycle. Furthermore, beyond it being the right thing to do, there is currently a major lack of true incentive to ensure the data is tagged properly.

Let's take the metadata needed to tag data for proper use, using a real-world example recently discussed with a biological data analytics expert.

A museum wants some biological data to support an exhibit. They have a student collect 100 scientific observation records on larvae using well-calibrated sensors. The data gets published. An academic study on climate change finds the data. The data shows a lower larvae count than historical levels and appears to demonstrate the impact of heat changes. The data shows the sensors and tools used were scientifically accurate. The data is sourced and used. This is a great example of misuse of data. True, the data was gathered using great sensors and the right techniques, but it was not intended as a complete study of the area. When the student hit 100 observations, they were out of there. An observation is not just an observation; its intended use is vitally important to denote.

 

Another great example, with a slightly different use angle: a major government records organization has an inventory of records data that is meticulously groomed. The data is very accurate, and entries go through a rigorous workflow to make sure the metadata and access to the actual data record are accurate in real time. The metadata on data use is perfect and accurate, and lawyers and other FOIA experts are well-versed in its proper use. But the metadata needed to help discover something in the billions of records was not prepared to match the more natural ways people would explore, discover, and look up the records.

Historically, armies of librarians would be tasked with searching the records, but they have been replaced with web 1.0 online search systems that lack the natural language processing (NLP) and interpretative signals a librarian would apply. Even where such systems exist, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals, or what we call interpretative signals, which we discussed back in 2013.
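
To make the idea of an interpretative signal concrete, here is a small, hypothetical sketch (not how the records system in question actually works) that expands a natural-language query with domain synonyms and adds a modest recency boost before ranking results; the synonym table, record shape, and boost rule are invented for illustration.

```typescript
// Hypothetical sketch of two simple interpretative signals applied before a query
// reaches an index: synonym expansion and a small recency boost.

const SYNONYMS: Record<string, string[]> = {
  larvae: ["larva", "juvenile insects"],
  deed: ["land record", "property transfer"],
};

// Expand a user's natural-language terms with known domain synonyms.
function expandQuery(query: string): string[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const expanded = new Set<string>(terms);
  for (const term of terms) {
    for (const synonym of SYNONYMS[term] ?? []) expanded.add(synonym);
  }
  return [...expanded];
}

// Score a record: count term matches in the title, plus a boost for recent records.
type CatalogItem = { title: string; year: number };
function score(item: CatalogItem, terms: string[]): number {
  const title = item.title.toLowerCase();
  const matches = terms.filter((t) => title.includes(t)).length;
  const recencyBoost = item.year >= 2010 ? 0.5 : 0;
  return matches + recencyBoost;
}

// Example usage
const terms = expandQuery("larvae observations");
const items: CatalogItem[] = [
  { title: "Juvenile insects survey, Pond A", year: 2013 },
  { title: "Historic deed index", year: 1952 },
];
const ranked = items
  .map((item) => ({ item, relevance: score(item, terms) }))
  .sort((a, b) => b.relevance - a.relevance);
console.log(ranked);
```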

This is another great example of overlooked data preparation: "Publish its metadata in the card catalog standard, and my work is done; let the search engine figure it out." Once again though, the search engine will just tell you the museum larvae record matches your science study's need.

 

GeoData2014 Post-Mortem

Blog post
added by
Wiki Admin

The workshop theme and community output notes may be of high interest. The focus was more on how Federal Geodata “Operations” / Assets can improve and help Geoscientists through improved interagency coordination. 
There are excellent breakout notes on roadblocks, geoscience perspective, concrete steps, etc. across the following topics on the following URL (Google Docs under “notes” links): http://tw.rpi.edu/web/Workshop/Community/GeoData2014/Agenda
Topics:
Day 1 Breakouts (Culture/Management)
  • Governmental open data
  • Interagency coordination of geodata – progress and challenges
  • Feedback from the academic and commercial sectors
  • Collaborating environment and culture building
Day 2 Breakouts (Tech)
  • Data lifecycle
  • Data citation and data integration frameworks – technical progress
  • Experience and best practices on data interoperability
  • Connections among distributed data repositories – looking forward
Raw Notes:
The workshop had some fruit come out of it.
  • About 50 people. NOAA and USGS on Fed side primarily. 
  • Pushing forward on agenda to see if we have progressed on ideation pragmatism since Geodata2011.
  • Focus is on Cultural and Financial issues limiting inter-agency connection. 
  • The term "agile government" came up often… with some laughs, but also some defenders (relates to our smartleangovernment.com efforts with ACT-IAC)
  • Scientists hear "Architecture" as big IT contracts and IT infrastructure, not process improvement, data integration, or goal/mission alignment, so there are clear vernacular issues.
  • FGDC and tons of other standards/organizing bodies seen as competing and confusing
  • data.gov and open data policy hot topic (Seen as good steps, low quality data) – “geoplatform” mentioned exactly “zero” times (doh!)
  • Geodata lifecycle discussion focused primarily on the latter end of the cycle (citations, discovery, publication for reusability, credit), but not much on project coordination or data acquisition coordination, no marketplace chatter, and little on coordinating sensor investment
  • General questions from scientists interested in how intel groups can be reached
  • Big push on ESIP
  • Concrete steps suggested were best practices to agencies and professors
  • Data Management is not taught, so what do we expect? You get what you pay for.
  • Finally, big push on how to tie grassroots efforts and top-down efforts together – grassroots agreed we need to showcase more, earlier, and get into the communities top-down folks are looking at.
  • Federal representation was not high, and it was agreed that, with limited government travel budgets, we need to bring these concepts to them: meet at their meetings, agendas, conferences, and circuits, and push these concepts and needs there.

Again, lots of great notes from breakouts on roadblocks, geoscience perspective, concrete steps, etc. across the following topics on the following URL (Google Docs under “notes” links): 

http://tw.rpi.edu/web/Workshop/Community/GeoData2014/Agenda

Questions we posed in general sessions:
  • Of performance, portfolio, architecture, evangelism, policy, or other areas, which is most important to the geoscientist and needs to be addressed in order to improve inter-agency coordination?
  • You noted you want to truly disrupt or re-invent the motivation and other aspects of the culture; what was discussed related to doing so? An inter-agency wiki commons a la Intellipedia? Gamification, the way resource-management MMORPGs incent behavior, i.e., a transparent and fun way to incentivize data maturity? Crowdsourcing a la Mechanical Turk to help cross-agency knowledge-sharing? Hackathon or TechCrunch Disrupt-style competitions to help showcase work? A combination, e.g., gamify the metadata lifecycle with a crowd model?
  • After registering data, metadata, and good citations, and doing all the data lifecycle management, and if we are to "assume internet", who is responsible for the SEO rank that helps people find scientific data on the internet? Who assures and enhances schema.org registrations? Who aligns signals to help with keywords and thousands of other potential signals, especially in response to events needing geoscience data? Who helps push data.gov and domain catalogs to be harvested by others?

Support Intercambio and the Culture Jam June 21st 2014

Blog post
edited by
Matt Tricomi

This February 2014, Xentity lost an important member of its family, Consuelo Arias.

She was the Founder's grandmother. She came to the U.S. from Mexico in the World War II era. She believed education was fundamental, that it opened so many opportunities, and that you had to earn it. She learned English early and perfectly (not even allowing the Three Stooges to be shown on TV in her house). She soon trained as a nurse and later as a nurse practitioner, spending her career working at Boston-area hospitals such as Mass General. She raised four children on her own, stressing the pursuit of learning and the power of education. She adored her grandchildren and always welcomed a spontaneous visit. Although she loved to travel, she never drove a car. She was devoted to professionally helping others, and that frequently spilled over to family and friends. Her dry sense of humor, brilliance, and caring ways were her hallmarks.

At Xentity, she supported our early work in the private sector, learning about our impact, and she became our biggest fan. She would check in, scan and send newspaper clippings. Admittedly, she did chuckle initially in 2003 when we got into Government transformation, but as she saw our impact, she cheered us on. When Matt asked in 2008 whether to pursue 8(a) or not, and sought guidance from ‘AyAy’, she said “If you can make more of an impact this way, then do it.” She supported the process of getting 8(a), eventually processed in 2010. 

During her later years, instead of receiving gifts or flowers, she insisted that time and money be donated to charities in areas such as cancer research, animal protection, and education.

You can see a brief video that was played at Culture Jam by clicking here or on the photo.

In Memory of ‘AyAy’, Xentity is sponsoring a fundraising event supporting intercambio.

http://www.intercambioweb.org/

Founded in 2000, the non-profit Intercambio is bridging our communities' divide, helping everyone communicate, starting with language. They educate parents, families, kids, and workers. They promote and sponsor ESOL (English for Speakers of Other Languages) courses for adult immigrants, as well as workshops in life skills, culture training, and citizenship. To date, Intercambio has helped 9,000 immigrants and trained 4,400 volunteers.


They also reach out to the established community to help connect it with the immigrant community. "Culture Jam" is just one of those events. It celebrates the rich diversity in our community. It is a fun event with world-renowned music and a mix that just can't be labeled: salsa, hip-hop, cumbia, funk, merengue, etc.

Celebrating the rich cultural diversity in our community, Left Hand Brewing Company is proud to host Culture Jam for its second year! Three time Grammy-Award winning Ozomatli will return to the Longmont stage, bringing Colorado together for a family-friendly evening of world class music, dancing and the arts to benefit Intercambio Uniting Communities and Longmont YMCA.

More info is at the Intercambio web site.

Go Code Colorado recognized with a StateScoop 50 award

Blog post
edited by
Wiki Admin

Colorado has done something quite innovative and has been recognized with a StateScoop 50 award for State Innovation of the Year. 

And Xentity was glad to be a part of it.
Colorado’s Office of the Secretary of State invented and delivered Go Code Colorado, an app development competition for using public data to address challenges identified by the business community.  Three teams built winning apps and business plans and will have their technology licensed by the state for a year.  These teams have formed businesses; some are in discussions with investors and some have left their day jobs to focus on building their new venture.
Here’s the genius of Go Code Colorado: the State took funds from the filing fees that businesses pay to register their businesses and GAVE THEM BACK to the business community through Go Code Colorado… AND incubated three promising new businesses in the process.
Plans are to build on this first year and run Go Code Colorado for a second year… on a larger scale and improving on many aspects based on lessons learned from year one.
Xentity had the pleasure of participating in Go Code Colorado as the State’s data management provider — sourcing, curating, improving and publishing 38 datasets from state agencies from Higher Ed to Labor and Employment.
You can see more about Go Code and Xentity's support in our April blog entry: Go Code Colorado Open Data Effort is going into its final weeks. Also, check out the photo gallery of the final event.