Ten thoughts on fixing search on our opendata catalogs

Go to any data catalog: academic publication catalogs, government agency opendata clearinghouses, federated catalogs, marketing lists, metadata search sites, even popular consumer sites. Most actually have a lot of great data, but it is extremely hard to make sure you are pulling down data that can actually be informative – data that is truly information – without spending an ever-increasing amount of time.

So, we are on this great opendata train. The phrase du jour is “too much information”; I say it is too much data. Data is different from information:

Data are values or sets of values representing a specific concept or concepts. Data become “information” when analyzed and possibly combined with other data in order to extract meaning, and to provide context. The meaning of data can vary according to its context (Source: Federal Enterprise Architecture Data Reference Model).

These sites are more like an eHoarder of data in hopes of being an information destination. There is a lot of junk, while at the same time it all started because some things had value, and, well, we are losing perspective.

I think part of the problem is that the metadata is bad. But even when it is good, it sits beside data that is bad. The internet “click” folks rely on this and hijack data discovery on search engines for this exact purpose. They hijack typo’d web site names like netfix.com and the like. They hijack keywords. They manipulate SEO techniques to push their sites higher on search engines.
In closed communities it’s not intentional manipulation, but there is a lack of incentive to fix discovery.
What are ways we can fix our open catalogs? Here are ten ideas:
  1. Make searching more fun – Take the facets in tools like CKAN and do more with them: kayak.com-style jQuery filters, time-based sliders, charts that pop up with record counts for context. Kayak, in a nutshell, is a site scraper hitting travel APIs and re-presenting the results in simple ways for referral fees. It works because they made searching easy and, more importantly, made travel searching sort of fun (see the faceting sketch after this list).
  2. Make Separate Search Components from your WebMIS – Stop fronting MIS systems with advanced-form search engines. Keep those if you are required to, or for the 5% of users who need them, but otherwise build a fast Solr or NoSQL search layer that lets you build in search signals as you learn about your users. Node.js feeds into the search database/index are fast, and millisecond-scale updates are fast enough for 99% of cases (see the indexing sketch after this list).
  3. Use Enterprise Search instead of rolling your own – Take the search functions of standalone sites across your organization and turn them into an enterprise service, where each standalone group can still control, or at least have input on, its own search signals.
  4. Feed Schema.org for SEO with a virtual library card – Beyond traditional SEO tuning, broker relationships with, or invest in patterns for, search engines like Google so they can build good signals/rules on top of your data. Do this by putting schema.org tags on your catalog pages, generated from the metadata you already collect (a JSON-LD sketch follows this list).
  5. Register to be harvested – Get registered on multiple harvesting sites; they may find ways to make your data more discoverable, and when users find you there they still see the details on your site, or your site pushes them out. Either way, the record stays authoritative (a harvest-feed sketch follows this list).
  6. CrowdSource and Gamify Search Signal Tuning – Can we get crowdsourcing going to dogfood site usage and help build better search rules and signals? The crowd could come from your own organization, with corporate awards or gamification, or from true external stakeholders. Bonus, more student power: can we get STEM programs or university systems involved through curriculum and projects, since a lot of search signal improvement is really about person-power or machine-to-machine power?
  7. Make Events to force data wrangling – In Colorado, we (our team did the data side) just ran gocode.colorado.gov as a way to get application developers to build apps off OpenData Colorado. The reward was essentially a reverse contract, which made it legal to give a monetary award, create various set-asides, and incent usage. That usage, concentrated in a time-based event, created more opportunities for exposure, which in turn got data suppliers more engaged in putting things up.
  8. Find ways to share signals? – This is more speculative, but could we feed engines like Watson or Google a brain of search patterns, tell them how our audiences differ by having them scan our data, apply some stereo-equalizer-style tweaks, and figure out which rule/expression patterns to borrow from shared signal libraries?
  9. Learn more about what our librarians do – Look, our librarians 20 years ago did more than put books back on shelves and give you mean looks on late returns. They also managed what went into the library, helped with complicated inquiries to find information, and even curated across other libraries. Our network of organically grown, commercial and public-sector meta-sites grew out of computer science and IT, not library science. We need to get computer science/MIS/IT and library science to start dating again. Get to know each other again. Remember the good times, when we could actually find things and helped each other out.
  10. Can we score OpenData sites? – We have watchdogs on making data open, which is great; they help make sure organizations provide what they are supposed to provide and keep things openGov. This approach would instead score the reality of discovering what has been provided. For example, we know the lawyer trick for causing problems with discovery: provide the opposing side with so much information that they are inundated and there is not enough time to do discovery – yadda, yadda, legal gamesmanship. Can we find ways to score or watchdog sites on data discovery, either as part of transparency or as a different type of consumer report? (A toy scoring sketch follows this list.)
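
On idea 1, here is a minimal sketch of the facet query behind a kayak-style filter UI, assuming a Solr-backed catalog. The core name (`datasets`), field names (`publisher`, `modified`), and host are hypothetical placeholders, not anything a specific catalog actually exposes:

```typescript
// Sketch: ask a hypothetical Solr core for facet counts so the UI can render
// kayak-style filters with record counts beside each choice.
async function fetchFacets(keyword: string) {
  const params = new URLSearchParams({
    q: keyword,
    rows: "0",                          // counts only; skip the documents
    facet: "true",
    "facet.field": "publisher",         // hypothetical field
    "facet.range": "modified",          // time-based facet for a date slider
    "facet.range.start": "NOW/YEAR-5YEARS",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1YEAR",
    wt: "json",
  });
  const res = await fetch(`http://localhost:8983/solr/datasets/select?${params}`);
  if (!res.ok) throw new Error(`Solr query failed: ${res.status}`);
  return (await res.json()).facet_counts; // { facet_fields, facet_ranges, ... }
}
```

The counts coming back in `facet_counts` are what let the UI show “Publisher X (1,204)” next to each filter, which is most of what makes the kayak experience feel responsive and fun.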
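
For idea 2, a minimal sketch of the Node.js side-feed, assuming Solr's JSON update API and Node 18+ for the built-in `fetch`. The record shape and core name are my assumptions:

```typescript
// Sketch: a Node.js feed that mirrors MIS record changes into the search
// index so the public search UI never has to query the MIS directly.
interface CatalogRecord {
  id: string;
  title: string;
  description: string;
  tags: string[];
  modified: string; // ISO 8601 timestamp
}

async function indexRecord(record: CatalogRecord): Promise<void> {
  // commitWithin asks Solr to make the doc searchable within ~1 second,
  // which covers the "99% of cases" above.
  const res = await fetch(
    "http://localhost:8983/solr/datasets/update?commitWithin=1000",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify([record]), // Solr accepts a JSON array of docs
    },
  );
  if (!res.ok) throw new Error(`Index update failed: ${res.status}`);
}
```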
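
For idea 4, a sketch of generating schema.org/Dataset JSON-LD from the metadata you already collect, reusing the `CatalogRecord` shape from the sketch above; the landing-page URL pattern is hypothetical. The output drops into a `<script type="application/ld+json">` tag on the dataset page:

```typescript
// Sketch: turn a catalog record into schema.org/Dataset JSON-LD for the
// dataset's landing page, so search engines get structured signals.
function toDatasetJsonLd(record: CatalogRecord): string {
  return JSON.stringify({
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: record.title,
    description: record.description,
    keywords: record.tags,
    dateModified: record.modified,
    url: `https://data.example.gov/dataset/${record.id}`, // hypothetical URL pattern
  });
}
```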
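
For idea 5, one common way to make harvesting work in practice is to publish a feed harvesters can pull, for example a Project Open Data style `/data.json` file of the kind catalog.data.gov consumes. A sketch, again reusing `CatalogRecord`; the field mapping is an assumption, so check the harvester's schema docs for the full list of required fields:

```typescript
// Sketch: publish a Project Open Data v1.1 style data.json feed that
// harvesters can pull; your site remains the authoritative source.
function toDataJson(records: CatalogRecord[]): object {
  return {
    conformsTo: "https://project-open-data.cio.gov/v1.1/schema",
    dataset: records.map((r) => ({
      title: r.title,
      description: r.description,
      keyword: r.tags,
      modified: r.modified,
      accessLevel: "public",
      identifier: `https://data.example.gov/dataset/${r.id}`, // hypothetical
    })),
  };
}
```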
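
And for idea 10, a toy discoverability score. This is purely illustrative: the checks and weights are my assumptions, not any standard a watchdog actually uses:

```typescript
// Sketch: a toy discoverability score (0-100) a watchdog might compute per
// record; the checks and weights are illustrative assumptions.
interface DiscoverabilityInput {
  hasTitle: boolean;
  hasDescription: boolean;
  tagCount: number;
  hasSchemaOrgMarkup: boolean;
  aggregatorListings: number; // how many external catalogs list the record
}

function discoverabilityScore(d: DiscoverabilityInput): number {
  let score = 0;
  if (d.hasTitle) score += 20;
  if (d.hasDescription) score += 20;
  score += Math.min(d.tagCount, 5) * 4;           // up to 20 for keywords
  if (d.hasSchemaOrgMarkup) score += 20;
  score += Math.min(d.aggregatorListings, 4) * 5; // up to 20 for reach
  return score;
}
```

Averaging this across a site's records would give the “consumer report” number: not whether the data is open, but whether anyone can actually find it.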

.02