When moving to the cloud, consider changing your discovery approach

Blog post
edited by
Wiki Admin

We do not want to pave the cowpath (see "What cow paths, space shuttles, and chariots have in common" or "What are some patterns or anti-patterns where architecture and governance can help" for more on this point). Moving to the IT commodity utility model saves money, but it also raises a question: do we simply take the existing MIS architecture and patterns and put them in the cloud, or do we look at new patterns, such as new search indexes, search engines, or NoSQL models, that allow rapid, near-real-time, smart discovery on the read side of the solution?

This will increase the relevancy of your data and digital assets as the market demands that things be easier, simpler, and instantly gratifying.

 

Traditional: Keyword Search Matching 

Many new, large, cloud-hosted database transaction management solutions need fast document, record, object, or content search by facets and keywords, across both metadata and full text, with a quick, pleasant experience that can handle millions of documents, authorities, and lookup lists along with thousands of monthly transactions.

Currently, the architecture clients have invested in is a model that was developed before "big data". These models emulate MIS form-based searches by trained users, with a supporting search engine that does a full scan on any keyword, or some category or facet filtering, to return ALL matching records weighted by closest keyword match. This can handle full-text search as well as facet search, but it tends to be more taxing on computing power, and while the results are accurate, they are not context-aware of popularity, typos, synonyms, etc.
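As a rough illustration of the traditional pattern, here is a minimal Python sketch of a full-scan keyword search. The function name, the sample records, and the simple count-based weighting are assumptions made for illustration, not any particular product's behavior.

```python
from collections import Counter


def full_scan_search(records, query):
    """Scan every record and rank by raw keyword-match count."""
    terms = query.lower().split()
    scored = []
    for record_id, text in records.items():  # touches ALL records on every query
        words = Counter(text.lower().split())
        score = sum(words[t] for t in terms)
        if score > 0:
            scored.append((score, record_id))
    # Highest match count first; ties broken by record id for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]


# Hypothetical sample data
records = {
    1: "vintage bike for sale",
    2: "vintage bicycle frame",
    3: "bike bike parts",
}

print(full_scan_search(records, "bike"))   # [3, 1] - misses the synonym "bicycle"
print(full_scan_search(records, "bikee"))  # [] - a typo returns nothing
```

The scan is exact but has no notion of synonyms, typos, or popularity, and its cost grows with the number of records on every query.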

New Searching architecture is not just Faster, but is Smarter

With NoSQL models, the user still types into a query box, but NoSQL engines can have multiple index-like "signals" that the query engine can look up to better interpret the query and infer what the user may be looking for. The search engine solution would handle, and call for increased investment in, interpretative signals (e.g. fuzzy logic support for popular search weighting, thesauri integration, synonyms, typo recognition, community-based signals, event/trending signals, business rules, profile favorite patterns, etc.). This could also include researching improvements to the description framework, such as improved overlapping categorical alignment, schema.org, and a move towards RDFa.
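To make the idea of "signals" concrete, here is a small Python sketch that layers three of them on top of keyword matching: a synonym lookup, typo tolerance, and a popularity weight. The thesaurus, the popularity values, the 0.8 similarity cutoff, and the scoring formula are all hypothetical choices for illustration; a real engine would maintain these as its own indexes.

```python
import difflib

# Hypothetical thesaurus signal: map variants to one canonical term
SYNONYMS = {"bicycle": "bike", "cycle": "bike"}


def normalize(term, vocabulary):
    """Apply the synonym signal, then the typo signal, to one query term."""
    term = SYNONYMS.get(term.lower(), term.lower())
    if term in vocabulary:
        return term
    # Typo signal: closest known word above an assumed similarity cutoff
    close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else None


def signal_search(records, popularity, query):
    """Rank records by keyword matches boosted by a popularity signal."""
    vocabulary = sorted(
        {SYNONYMS.get(w, w) for text in records.values() for w in text.lower().split()}
    )
    terms = [normalize(t, vocabulary) for t in query.split()]
    terms = [t for t in terms if t]
    scored = []
    for record_id, text in records.items():
        words = [SYNONYMS.get(w, w) for w in text.lower().split()]
        matches = sum(words.count(t) for t in terms)
        if matches:
            # Popularity signal: more popular records get a larger multiplier
            score = matches * (1.0 + popularity.get(record_id, 0.0))
            scored.append((score, record_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]


# Hypothetical sample data
records = {
    1: "vintage bike for sale",
    2: "vintage bicycle frame",
    3: "bike bike parts",
}
popularity = {2: 5.0}  # e.g. record 2 is clicked far more often

print(signal_search(records, popularity, "bikee"))    # [2, 3, 1]
print(signal_search(records, popularity, "bicycle"))  # [2, 3, 1]
```

Here the misspelled "bikee" and the synonym "bicycle" both resolve to the same term, and the popular record ranks first even though it has fewer raw keyword matches.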

When these solutions do not apply

As Apache states of Hadoop, Hadoop (and, by implication, NoSQL more generally) is NOT:

  1. Apache Hadoop is not a substitute for a database – you need something on top for high-performance updates
  2. MapReduce is not always the best algorithm – if each MR job needs to know about the previous one, you lose the parallelization benefits.
  3. Hadoop and MapReduce are not for beginners in Java, Linux, or error debugging – it's open source and emerging, so many of the technologies built on top help with that and are worth the extra layering.

Initial newer search engines are better at metadata search, but not full text and full results

Google solutions or Open Source solutions like MongoDB are fast at addressing these "signals", but are limited at full-text searches of extremely long documents, which are sometimes required by legal, policy, or other regulations. For instance, on a site like CraigsList or Groupon, a user is searching against metadata fields, e.g. "Bike Vintage" between 1950 and 1960, and not most of the text. What is returned, in milliseconds, is the top x hundred results, but the query is not hitting the raw data or every record; instead it is hitting these index-like constructs that hold the record ID. For the results that succeed, the user can then call a URL to go into transaction mode back in Oracle. If an edit is made, the trigger can update the NoSQL index immediately, and the full-text update in Oracle is reasonably fast, but definitely not as fast as the NoSQL index.

There are other solutions in the newer search engine technologies that can address all requirements. For example, a user may want to pull all 10,000 results of a search into their software, move into transaction/edit mode, and commit those edits in Oracle. The NoSQL index can be updated immediately, and the edits become available for near-immediate use for full-text search in Oracle or, in some search engine solutions, in the search engine's own full-text index.
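The index-then-fetch pattern described above can be sketched in Python as follows. The `MetadataIndex` class, its method names, and the in-memory `system_of_record` dictionary (standing in for the transactional database) are hypothetical, and the sketch ignores persistence, concurrency, and ranking.

```python
from collections import defaultdict


class MetadataIndex:
    """Index-like construct: maps metadata terms to record IDs only."""

    def __init__(self):
        self._by_term = defaultdict(set)
        self._year = {}

    def upsert(self, record_id, title, year):
        """Called when a record is created or edited (e.g. from a trigger)."""
        self.remove(record_id)
        for term in title.lower().split():
            self._by_term[term].add(record_id)
        self._year[record_id] = year

    def remove(self, record_id):
        for ids in self._by_term.values():
            ids.discard(record_id)
        self._year.pop(record_id, None)

    def search(self, query, year_from, year_to, limit=100):
        """Return up to `limit` record IDs; never touches the raw records."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self._by_term.get(t, set()) for t in terms))
        hits = [i for i in ids if year_from <= self._year[i] <= year_to]
        return sorted(hits)[:limit]


# Stand-in for the transactional store that holds the full records
system_of_record = {
    10: {"title": "Vintage Bike", "year": 1955, "body": "long description ..."},
    11: {"title": "Vintage Bike", "year": 1972, "body": "long description ..."},
    12: {"title": "Vintage Lamp", "year": 1958, "body": "long description ..."},
}

index = MetadataIndex()
for record_id, row in system_of_record.items():
    index.upsert(record_id, row["title"], row["year"])

hit_ids = index.search("Bike Vintage", 1950, 1960)
print(hit_ids)  # [10]
# Only for the chosen hit do we go back to the system of record
print(system_of_record[hit_ids[0]]["title"])
```

The search path only reads the small term-to-ID structures, which is why it can be fast; an edit calls `upsert` so the index reflects the change right away, while the full record stays in the system of record.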

Exploring NoSQL and new signals will yield faster and smarter results

The point is that improved discovery will not only be faster from a query-return point of view, but will also return smarter results. This makes the discovery process itself faster, moving users more quickly to their intended transactions, because the search results are more aware of language issues, popularity, and personalized user needs. This can be achieved with technologies such as ElasticSearch, possibly Sphinx, or possibly a combination of MongoDB for fast search and the existing Oracle for full-text search.

Some favorite TED talks

Blog post
edited by
Wiki Admin

A business partner last night said, "I don't wake up and turn on my phone, or watch TV, or check email right away. I try to keep it simple…" as several of us waxed rhapsodic about the pre-pocket-tech and pre-internet days and how teenagers' patterns know no other world. And yet he continued, "OK, well, that's not true, I do get my morning dose of TED for inspiration".

It's just one more to add to the many morning intake mediums. People seek personal philosophical guidance in the morning through religion, scripture, reading a story, meditation, prayer, mind-body engagement, or quiet time. People seek temporal context in morning TV news, newspapers, internet feeds, websurfing (can I still use that term?), or tablet time. People seek social engagement with morning coffee at the diner with the guys/gals, spouse and/or kid quality time, the Facebook rise-and-shiner, or other social media digests. And people seek inspiration in any of the above.

Personally, I have yet to find my morning ritual, and I bounce between different mediums. Sometimes it's playing trains or toys or some activity with the family when we get a good rhythm going that morning; sometimes it is tablet browsing when I'm feeling curious about various news or video feeds; sometimes it is mindless TV news digestion; and, probably more rarely than it should be, sometimes it is outside quiet time on a run, bike ride, or walk, or reading or whatnot. Other times the day gets going too fast, there is no interstitial time, and an east coast call to this mountain time zone starts right up.

Though I haven't found my rhythm, here are a few of the TED talks from the last partial decade that I've tweeted out as greatest hits and found inspirational:

Hans Rosling: Stats that reshape your world-view (Jun 2007)

Geoffrey West: The surprising math of cities and corporations (July 2011)

TEDxUofM – Jameson Toole – Big Data for Tomorrow (May 2011)

Eli Pariser: Beware online “filter bubbles” (Mar 2011)

Sugata Mitra: Build a School in the Cloud (Feb 2013)

Deb Roy: The birth of a word (Mar 2011)

-mt

The Four V's of BigData – Variety Veracity Velocity Volume

Blog post
edited by
Matt Tricomi

If we are simply talking about lots of retail data, lots of sales data, lots of management data, lots of metadata, we aren’t talking BigData. Though for some reason, those data are going through and into the new architectures. Yeah, sure the retail, defense, and intelligence worlds have been sifting through huge data stores for years. 

But the marketing-coined term BigData does not refer just to the Volume of the data. There are 4 V's of big data, a framing we have enjoyed using since learning it through our information exchanges and partnering with IBM.

The first two V's focus on the mission side: Variety and Veracity

Variety (how many categories of data it covers) also includes technology (sources; formats, e.g. numeric, text, objects, geocoded, vector, raster, structured, and unstructured such as email, video, audio, etc.; methods) and legal (complexity, privacy, jurisdiction) aspects. Essentially, it means working with various types of data, including various dimensions (temporal, geospatial, sentiment, metadata, logs, etc.).

Veracity (understanding authority of data) includes known data quality, type of data, and data management maturity, so that you can understand how much you can trust that the data is right and accurate

The other two focus on speed and amount: Velocity and Volume

Volume (how much data) – Capturing, processing, reporting on, and managing a large volume of data

Velocity (how often it changes or real-time) – Analyzing and exploiting lots and lots of new data in real-time

Changes to the V’s over the last 15 years

 

 

Veracity is the newer one. The concept of data quality has always been the orphaned step-child. It is the I in IT. It's the part of the iceberg under the water. All IT vendors want to sell speed, the handling of lots of data, and some commodity variety support, but once sold, you are on your own (or paying $300/hour for mission customization). But we are happy to see IBM got there and added it.

As of late, Value has been introduced as well. Paraphrasing the Spanish article: even if you can produce information, if there is no real action that can be taken with it, it's not Valuable to the organization.

Then again, some have accused IBM of stealing the V's from Gartner's 2001 development of the V's, in what has become the V-gate controversy.

Nonetheless, if anything, it proves that the concept is worth fighting over, and when you dig into its meaning, you see it has its merits.