Ten thoughts on fixing search on our opendata catalogs

Go to any data catalog: academic publication catalogs, government agency opendata clearinghouses, federated catalogs, marketing lists, metadata search sites, even popular consumer sites. Most actually have a lot of great data, but it is extremely hard to make sure you are pulling down data that can actually be informative – data that is truly information – without spending an ever-increasing amount of time.

So, we are on this great opendata train. The phrase du jour is “too much information”; I say it is too much data. Data is different from information:

Data are values or sets of values representing a specific concept or concepts. Data become “information” when analyzed and possibly combined with other data in order to extract meaning, and to provide context. The meaning of data can vary according to its context (Source: Federal Enterprise Architecture Data Reference Model).

These sites are more like an eHoarder of data in hopes of being an information destination. There is a lot of junk, while at the same time it all started because some things had value, and, well, we are losing perspective.

I think part of the problem is that the metadata is bad. But even when it is good, it sits beside data that is bad. The internet “click” folks rely on this and hijack data discovery on search engines for this exact purpose. They hijack typo’d web site names like netfix.com and the like. They hijack keywords. They manipulate SEO techniques to push their sites higher on search engines.
In closed communities it’s not intentional manipulation, but there is a lack of incentive to fix discovery.
What are ways we can fix our open catalogs? Here are ten ideas:
  1. Make searching more fun – Take the facets in tools like CKAN and do more with them: kayak.com-style jQuery filters, time-based sliders, charts that pop up with record counts for context. Kayak, in a nutshell, is a site scraper hitting travel APIs and re-presenting the results in simple ways for referral fees. It works because they made searching easy and, more importantly, made travel searching sort of fun (see the faceting sketch after this list).
  2. Make Separate Search Components from your WebMIS – Stop fronting MIS systems with advanced-form search engines. Keep those if you are required to, or for the 5% of users who need them, but otherwise build a fast Solr or NoSQL search layer that lets you build in search signals as you learn about your users. Node.js feeds into the search database/index are fast, and millisecond-scale updates are fast enough for 99% of cases (see the indexing sketch after this list).
  3. Use Enterprise Search instead of rolling your own – Take the search functions of standalone sites across your organization and turn them into an enterprise service, where each standalone group can still control, or at least have input on, its own search signals.
  4. Feed Schema.org for SEO with a virtual library card – Beyond traditional SEO tuning, broker relationships with, or invest in patterns for, search engines like Google so they can build good signals/rules on top of your data. Do this by putting schema.org tags on your catalog pages, generated from the metadata you already collect (a JSON-LD sketch follows this list).
  5. Register to be harvested – Get registered on multiple harvesting sites; they may find ways to make your data more discoverable, and when users find you there they still see the details on your site, or your site pushes them out. Either way, the record stays authoritative (a harvest-feed sketch follows this list).
  6. CrowdSource and Gamify Search Signal Tuning – Can we get crowdsourcing going to dogfood site usage and help build better search rules and signals? The crowd could come from your own organization, with corporate awards or gamification, or from true external stakeholders. Bonus, more student power: can we get STEM programs or university systems involved through curriculum and projects, since a lot of search signal improvement is really about person-power or machine-to-machine power?
  7. Make Events to force data wrangling – In Colorado, we (our team did the data side) just ran gocode.colorado.gov as a way to get application developers to build apps off OpenData Colorado. The reward was essentially a reverse contract, which made it legal to give a monetary award, create various set-asides, and incent usage. That usage, concentrated in a time-based event, created more opportunities for exposure, which in turn got data suppliers more engaged in putting things up.
  8. Find ways to share signals? – This is more speculative, but could we feed engines like Watson or Google a brain of search patterns, tell them how our audiences differ by having them scan our data, apply some stereo-equalizer-style tweaks, and figure out which rule/expression patterns to borrow from shared signal libraries?
  9. Learn more about what our librarians do – Look, our librarians 20 years ago did more than put books back on shelves and give you mean looks on late returns. They also managed what went into the library, helped with complicated inquiries to find information, and even curated across other libraries. Our network of organically grown, commercial and public-sector meta-sites grew out of computer science and IT, not library science. We need to get computer science/MIS/IT and library science to start dating again. Get to know each other again. Remember the good times, when we could actually find things and helped each other out.
  10. Can we score OpenData sites? – We have watchdogs on making data open, which is great; they help make sure organizations provide what they are supposed to provide and keep things openGov. This approach would instead score the reality of discovering what has been provided. For example, we know the lawyer trick for causing problems with discovery: provide the opposing side with so much information that they are inundated and there is not enough time to do discovery – yadda, yadda, legal gamesmanship. Can we find ways to score or watchdog sites on data discovery, either as part of transparency or as a different type of consumer report? (A toy scoring sketch follows this list.)
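
On idea 1, here is a minimal sketch of the facet query behind a kayak-style filter UI, assuming a Solr-backed catalog. The core name (`datasets`), field names (`publisher`, `modified`), and host are hypothetical placeholders, not anything a specific catalog actually exposes:

```typescript
// Sketch: ask a hypothetical Solr core for facet counts so the UI can render
// kayak-style filters with record counts beside each choice.
async function fetchFacets(keyword: string) {
  const params = new URLSearchParams({
    q: keyword,
    rows: "0",                          // counts only; skip the documents
    facet: "true",
    "facet.field": "publisher",         // hypothetical field
    "facet.range": "modified",          // time-based facet for a date slider
    "facet.range.start": "NOW/YEAR-5YEARS",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1YEAR",
    wt: "json",
  });
  const res = await fetch(`http://localhost:8983/solr/datasets/select?${params}`);
  if (!res.ok) throw new Error(`Solr query failed: ${res.status}`);
  return (await res.json()).facet_counts; // { facet_fields, facet_ranges, ... }
}
```

The counts coming back in `facet_counts` are what let the UI show “Publisher X (1,204)” next to each filter, which is most of what makes the kayak experience feel responsive and fun.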
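
For idea 2, a minimal sketch of the Node.js side-feed, assuming Solr's JSON update API and Node 18+ for the built-in `fetch`. The record shape and core name are my assumptions:

```typescript
// Sketch: a Node.js feed that mirrors MIS record changes into the search
// index so the public search UI never has to query the MIS directly.
interface CatalogRecord {
  id: string;
  title: string;
  description: string;
  tags: string[];
  modified: string; // ISO 8601 timestamp
}

async function indexRecord(record: CatalogRecord): Promise<void> {
  // commitWithin asks Solr to make the doc searchable within ~1 second,
  // which covers the "99% of cases" above.
  const res = await fetch(
    "http://localhost:8983/solr/datasets/update?commitWithin=1000",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify([record]), // Solr accepts a JSON array of docs
    },
  );
  if (!res.ok) throw new Error(`Index update failed: ${res.status}`);
}
```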
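
For idea 4, a sketch of generating schema.org/Dataset JSON-LD from the metadata you already collect, reusing the `CatalogRecord` shape from the sketch above; the landing-page URL pattern is hypothetical. The output drops into a `<script type="application/ld+json">` tag on the dataset page:

```typescript
// Sketch: turn a catalog record into schema.org/Dataset JSON-LD for the
// dataset's landing page, so search engines get structured signals.
function toDatasetJsonLd(record: CatalogRecord): string {
  return JSON.stringify({
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: record.title,
    description: record.description,
    keywords: record.tags,
    dateModified: record.modified,
    url: `https://data.example.gov/dataset/${record.id}`, // hypothetical URL pattern
  });
}
```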
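
For idea 5, one common way to make harvesting work in practice is to publish a feed harvesters can pull, for example a Project Open Data style `/data.json` file of the kind catalog.data.gov consumes. A sketch, again reusing `CatalogRecord`; the field mapping is an assumption, so check the harvester's schema docs for the full list of required fields:

```typescript
// Sketch: publish a Project Open Data v1.1 style data.json feed that
// harvesters can pull; your site remains the authoritative source.
function toDataJson(records: CatalogRecord[]): object {
  return {
    conformsTo: "https://project-open-data.cio.gov/v1.1/schema",
    dataset: records.map((r) => ({
      title: r.title,
      description: r.description,
      keyword: r.tags,
      modified: r.modified,
      accessLevel: "public",
      identifier: `https://data.example.gov/dataset/${r.id}`, // hypothetical
    })),
  };
}
```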
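
And for idea 10, a toy discoverability score. This is purely illustrative: the checks and weights are my assumptions, not any standard a watchdog actually uses:

```typescript
// Sketch: a toy discoverability score (0-100) a watchdog might compute per
// record; the checks and weights are illustrative assumptions.
interface DiscoverabilityInput {
  hasTitle: boolean;
  hasDescription: boolean;
  tagCount: number;
  hasSchemaOrgMarkup: boolean;
  aggregatorListings: number; // how many external catalogs list the record
}

function discoverabilityScore(d: DiscoverabilityInput): number {
  let score = 0;
  if (d.hasTitle) score += 20;
  if (d.hasDescription) score += 20;
  score += Math.min(d.tagCount, 5) * 4;           // up to 20 for keywords
  if (d.hasSchemaOrgMarkup) score += 20;
  score += Math.min(d.aggregatorListings, 4) * 5; // up to 20 for reach
  return score;
}
```

Averaging this across a site's records would give the “consumer report” number: not whether the data is open, but whether anyone can actually find it.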

.02