To do BigData, address Data Quality, People and Processes, and Tech Access to Information


As a follow-on to the “cliffhanger” in BigData is a big deal because it can help answer questions fast, there are three top limitations right now: data quality, people and process, and tech access to information.

Let's jump right in.

Number One and by far the biggest – Data Quality

Climate change isn't a myth, but it is the first science ever presented to the public primarily on a data premise. In doing so, researchers prematurely presented models that didn't take the driving variables into account. Their models have changed over and over again; the resolution of their source data has increased; simulations stacked on simulations have produced countless theories that, for most people, can only be demonstrated by Hollywood blockbusters. The point being, we are dealing with inferior data for a world-scale problem, and we jumped into a political, emotionally driven arena with a data report. We will be the frog in slowly warming water, and we will hit that boiling point late, all because we started with a data-justification approach using low-quality data. Are they right that the world is warming? Yes. Do they have enough data to prove the right mitigation, remediation, or policy adjustments? No, and not until we either increase the data quality or take a non-data tack.

People and processes are a generation away.

Our processes in IT have been driven by Defense and GSA business models from the fifties: put anyone managing 0s-and-1s technology in the back. They are nerds, look goofy, can't talk, don't understand what we actually do here, and, by the way, they smell funny. That has been the approach to IT since the 50s. Nothing has changed, except that there are a few baker's dozens of hoodie-wearing, Mountain Dew-drinking night owls who happen to be loaded now, and there is a pseudo-culture of geek chic. We have not matured our investment in people and talent to balance maturity of service, data, governance, design, and product lifecycle, or to embrace that engine culture as core to the business. This means more effective information-sharing processes to get the right information to the right people. It also means investing in the right skills to manage the information-sharing and data lifecycle, not just feeding Doritos and free soda to hackers. I am not as worried about this one. As the baby boomer generation retires, it will leave a massive vacuum; Generation X is too small, and we'll have to groom Generation Y fast. That said, we will stumble through a lot of brain drain, but the market will demand relevancy, which will, albeit slowly, create this workforce model in 10-15 years.

Access to Environments 

If you had asked about this before hosting environments and the cloud existed, the answer would have been limited to massive corporations, defense, intel, and the parts of academia co-investing with those groups. If you can manage the strain of shifting to a big data infrastructure, this barrier should be the least of your problems. If you can give your staff the data they need at the speed they need, so they can process it in parallel without long wait times, you are looking good. Get a credit card, or if you are Government, buy off a Cloud GWAC, and get your governance and policies moving, as they are likely behind and not ready; left alone, they will prolong the siloed-information phenomenon. Focus on the I in IT, and let the CTO respond to the technology stack.

Focus on data quality, have a workforce investment plan, and continue working your information access policies

The tipping point that moves you into Big Data is where these factors combine to require you to deal with complicated enormity at speed, not just for MIS and reports, but to help answer new questions. If you can focus on those things in that order (likely solving them in reverse), you will be able to implement parallelization of data discovery.

This will shorten the distance from A to B and create new economies, new networks, and enable your customer or user base to do things they could not before. It is the train, plane, and automobile factor all over again.

And to throw in the shameless plug, this is what we do. This is Why we focus on spatial data science and Why is change so fundamental.

BigData is a big deal because it can help answer questions fast


BigData is not just size and speed of complex data – it is moving us from information to knowledge

 

As our Why we focus on spatial data science article discusses, the progress of knowledge fields (history to math to engineering to science to philosophy), like the individual pursuit of knowledge, is based on moving from experiments to hypotheses to computation and now to The Fourth Paradigm: Data-Intensive Scientific Discovery. This progression has happened over the course of human history and is now repeating itself, in abstracted form, on the internet.

The early 90s web was about content, history, and experiments. The late 90s web was about transactions, security, and eCommerce. The 2000s web was about engineering entities breaking silos, within companies, organizations, sectors, and communities. The 2010s web has been about increasing collaboration in communication and work production, and entering into knowledge collaboration. The internet's progression is just emulating how human capability developed over history.

When you are ready to move into BigData, it means you are wanting to Answer new questions.

That said, the BigData phenomenon is not about the input of all the raw data and the explosion that the Internet of Things is being touted as. The resource sells, but the end product is the consumed by-product. So let's focus on that by-product: knowledge. It is not about the speed of massive amounts of new, complex, variable-quality data, which our discussion of IBM's 4 V's focuses on.

It is about what we can do on the cheap with technology that previously required supercomputer clusters only the big boys had. Now, with cloud, the internet, and enough standards, if we have good and improving data, we ALL have the environment to answer complicated questions while sifting through the noise. It is about enabling the initial phase of knowledge discovery, the very thing everyone is complaining about on the web right now: “too much information”, “drowning in data”.

The article on Throwing a Lifeline to Scientists Drowning in Data discusses how we need to be able to “sift through the noise” and make search faster. That is the roadblock, the tall pole in the tent, the showstopper.

Parallelizing the search is the killer app – this is the Big Deal, we should call it BigSearch

If you have to search billions of records and map them to another billion records, doing that in sequence is the problem. You need to shorten the time it takes to sift through the noise. That is why Google became an amazing success out of nowhere. They did, and still do, it better than anyone else: sifting through the noise.
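To make the parallelization point concrete, here is a minimal sketch in Python, not any particular engine's implementation: the same scan is split across worker processes so each shard is sifted at the same time. The record data and the match() predicate are made-up stand-ins for illustration.

```python
# Minimal sketch: scanning records in parallel shards instead of one long sequence.
# The records and match() predicate are illustrative stand-ins, not a real engine.
from multiprocessing import Pool

RECORDS = [
    f"record {i} vintage bike" if i % 7 == 0 else f"record {i}"
    for i in range(100_000)
]

def match(record: str) -> bool:
    # Stand-in for whatever matching/scoring a real search engine performs.
    return "vintage bike" in record

def search_chunk(chunk):
    # Scan one shard of the corpus.
    return [r for r in chunk if match(r)]

def parallel_search(records, workers=8):
    # Split the corpus into shards and scan them concurrently; this is the
    # same split-and-scan idea behind sharded indexes in NoSQL search engines.
    size = len(records) // workers + 1
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with Pool(workers) as pool:
        results = pool.map(search_chunk, chunks)
    return [hit for part in results for hit in part]

if __name__ == "__main__":
    hits = parallel_search(RECORDS)
    print(len(hits), "matches")
```

On a single machine this only buys you more cores; the point is that the same division of work is what lets clustered engines keep search times flat as the corpus grows.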

The United States' amazing growth comes down to two things: we have resources, and we figured out how to get to them faster. Each growth phase of the United States was based on that fact alone, plus a bit of stopping the barbarians at the gates, or ourselves from implosion. You could say the same of civilization. Some softball examples out of hundreds:

  • Expanding West dramatically exploded after trains, which allowed for regional foraging and mining
  • Manufacturing dramatically exploded production output, which allowed for city growth
  • Engines shortened time between towns and cities, which allowed for job explosion
  • Highway systems shortened time between large cities, which allowed for regional economies
  • Airplanes shortened time between the legacy railroad time zones, which allowed for national economies
  • Internet shortened access to national resources internationally, which allowed for international economies
  • Computing shortened processing time of information, which allows for micro-targeted economies worldwide

Each “age” resulted in shortening the distance from A to B. Google is sifting through data in the same way. Scientists are likewise trying to sift through defined data sensors, link them together, and ask very targeted simulated or modeled questions. We need to address the barriers limiting entities' success in doing this.

 

When you are ready to move into BigData, it means you are wanting to Answer new questions.


The article on Throwing a Lifeline to Scientists Drowning in Data discusses how we need to be able to “sift through the noise” of an ever faster deluge of sensors and feeds. It is not about the information management models of fast, large retail or defense data. It is about finding the signals you need to know about in order to take advantage.

In controlled environments, like retail and business, this has been done for years on end to guide business analytics and targeted micro-actions. 

For instance, the gambling industry has been doing this for 15-plus years: taking in the transactional data of every pull of every slot machine across all their hotels, linked with the loyalty card you entered, the time of year, when you go, your profile, and your trip patterns. Then, laws allowing, they adjust the looseness of the slots, the coupons provided, and the trip rewards, all to make sure they do what they are supposed to do in capitalism: be profitable.
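As a loose illustration of that kind of pipeline, here is a hedged sketch using pandas; the column names, the tiny sample data, and the offer rule are hypothetical, not any casino's actual model.

```python
# Illustrative sketch only: joining slot-pull transactions with loyalty-card
# profiles to drive per-player offers. All names and rules are hypothetical.
import pandas as pd

pulls = pd.DataFrame({
    "loyalty_id": [101, 101, 202, 202, 202],
    "machine_id": ["A1", "A1", "B7", "B7", "C3"],
    "wager":      [1.00, 1.00, 5.00, 5.00, 0.25],
    "payout":     [0.00, 2.00, 0.00, 0.00, 0.50],
})
profiles = pd.DataFrame({
    "loyalty_id": [101, 202],
    "trips_per_year": [2, 12],
})

# Aggregate per player, join with the profile, then apply a simple offer rule.
per_player = pulls.groupby("loyalty_id").agg(
    total_wagered=("wager", "sum"),
    total_paid=("payout", "sum"),
).reset_index()

report = per_player.merge(profiles, on="loyalty_id")
report["net_loss"] = report["total_wagered"] - report["total_paid"]
report["offer"] = report["trips_per_year"].apply(
    lambda trips: "free-night coupon" if trips >= 10 else "slot credit"
)
print(report)
```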

Even in uncontrolled environments such as intelligence, defense, or internet search, the model is to build analytics on analytics to improve the data quality and lifecycle so that the end analytics can improve. It is sound equalizers on top of the sound board.

Don't go just for the neat tech for your MIS. Go because your users are asking more of you in the data-information-knowledge chain.

Continue on to read more in our follow-on article: BigData is a big deal because it can help answer questions fast

Comparing NoSQL Search Technology Features


Features Comparison Matrix

There are many new document-oriented databases out there. Here is a quick, high-level comparison of features across the five options that were compared when creating the prototype concepts discussed in the blog post When moving to the cloud, consider changing your discovery approach.

 

| Feature | Oracle on AWS | MongoDB on AWS | ElasticSearch | Sphinx | Mongo for quick search + Oracle for full-text |
| --- | --- | --- | --- | --- | --- |
| Type | SQL | BSON | JSON | Mix | Mix |
| EC2 Compatible | Yes | Yes | Yes | Yes | Yes |
| Scale Horizontally | Non-RAC on AWS | Yes | Yes | Yes | Yes |
| License | Paid | Open (AGPL v3) | Open (Apache 2) | Open | Combined #1 and #2 |
| FullText (FT) | Yes | Up to 1GB docs | Yes | Yes | Yes, for Oracle |
| Near/Proximity | Yes | No | Yes | Yes | Yes, on Oracle |
| Conditional Queries | Yes | Yes | Yes | TBD | Yes |
| RegEx | Yes | Yes | Yes+ | No | Yes |
| Facets | Would need to be coded into forms | Aggregation | Yes | Yes | Yes |
| Document Limit | Meets complicated document needs | 16MB (GridFS for larger) | 2GB | ? | Combined #1 and #2 |
| Paging (FT) Results | Yes | No (16MB limit) | Yes | Yes | Combined #1 and #2 |
| Speed: Inserts | ? | Fast | ? | | Combined #1 and #2 |
| Speed: Updates | ? | Fast | ? | | Combined #1 and #2 |
| Speed: Indexing | ? | Fast | Really Fast | 10-15MB of text/sec | Combined #1 and #2 |

Pros / Cons

Oracle

Pros: Likely already invested, easy to do updates in Oracle, ACID for transactions, large workforce

Cons: RAC is not on AWS yet; if on the XML database, index updates are complicated and heavy on CPU/memory regardless of tuning efforts; no smart search components (i.e., no “signals” to provide more search or semantic context yet); public-facing licenses are often priced differently than internal enterprise licenses.

Mongo

Pros: Proves fast to sprint, improve, and add new signals; proves fast for the metadata load, index updates, batch loads, and the search requirement for non-full-text document search.

Cons: Mongo is good for a lot of things, but it cannot meet the full-text search requirement.
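A short pymongo sketch of the kind of metadata/facet query Mongo handles well; the collection and field names are assumptions for illustration.

```python
# Sketch: fast metadata/field search is where Mongo shines.
# Collection and field names are hypothetical.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
docs = client["catalog"]["documents"]

# Index the metadata fields that back the search form.
docs.create_index([("category", ASCENDING), ("year", ASCENDING)])

# Facet-style query: category plus a year range, with sorting and paging.
cursor = (
    docs.find({"category": "bike", "year": {"$gte": 1950, "$lte": 1960}})
    .sort("year", ASCENDING)
    .skip(0)
    .limit(25)
)
for doc in cursor:
    print(doc["_id"], doc.get("title"))
```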

ElasticSearch

Pros: Solr is also a solution for exposing an indexing/search server over HTTP, but ElasticSearch provides a far superior distributed model and ease of use. ElasticSearch uses Lucene v4 to provide some of the most powerful full-text search capabilities available in any open source product. Search comes with multi-language support, a powerful query language, context-aware did-you-mean suggestions, autocomplete, and search snippets. All fields are indexed by default, and all the indices can be used in a single query to return results at breathtaking speed. And you can still do updates in Oracle or a traditional RDBMS directly and just sync with ElasticSearch.

Cons: No built-in security/access control on the RESTful services, but there are two plugins (https://github.com/Asquera/elasticsearch-http-basic and https://github.com/sonian/elasticsearch-jetty), as well as the option of nginx as a reverse proxy. The technology is maturing, with frequent new releases, so your configuration management will be tested. This may require an additional optimization and debug period, but similar document repository and search solutions have been created with this technology.
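For illustration, a hedged sketch of the query-side features mentioned above (fuzziness, did-you-mean suggestions, snippets) using the elasticsearch-py client in its older body-style API; the index name and field names are assumptions.

```python
# Sketch: full-text query with typo tolerance, did-you-mean suggestions,
# and highlighted snippets. Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "query": {
        "match": {
            "body": {
                "query": "vintage bicyle",   # note the deliberate typo
                "fuzziness": "AUTO",         # tolerate small misspellings
            }
        }
    },
    "suggest": {
        "did_you_mean": {
            "text": "vintage bicyle",
            "term": {"field": "body"},       # did-you-mean style suggestions
        }
    },
    "highlight": {"fields": {"body": {}}},   # search snippets
    "size": 20,
}

results = es.search(index="documents", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```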

Sphinx (http://sphinxsearch.com/about/sphinx/ )

Pros: JSON support is currently very new, but Sphinx does support the following. SQL database indexing: Sphinx can directly access and index data stored in MySQL (all storage engines are supported), PostgreSQL, Oracle, Microsoft SQL Server, SQLite, Drizzle, and anything else that supports ODBC. Non-SQL storage indexing: data can also be streamed to the batch indexer in a simple XML format called XMLpipe, or inserted directly into an incremental RT index.

Cons: Sphinx is maturing, but its marketing and overview material does not make it as clear how to get up and running. It is not really JSON-friendly and is a bit more cryptic to plug and play.

Mongo (Read) / Oracle (Transaction/Sync)

Pros: Re-uses the Oracle investment for ACID and licenses; you can still do updates in Oracle directly; Mongo can be updated near real-time and fast; best of both worlds. Oracle could handle the full-text part as a secondary search requirement, which would likely get less use, and Mongo could do the rest. If a quick partial migration or architecture change is digestible but you are not ready for the full swap-out, this option can be stood up fast, is easy to maintain, supports interpretative signals, gives a Google-like experience, and scales. Mongo can handle the < 1GB searches and Oracle can do full text on > 1GB.

Cons: It is the Prius or Volt model: it is a hybrid, so maintaining two tech stacks for a long period of time, which can happen, can be more than a nuisance.
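A rough sketch of the hybrid write path, assuming a Python service layer does the sync; the table, columns, and connection strings are hypothetical, and in practice a trigger or change-data-capture feed could do the same job.

```python
# Sketch of the hybrid pattern: Oracle stays the transactional system of record
# (including the full-text column), and each commit also upserts the searchable
# metadata into MongoDB for the fast read path. All names are hypothetical.
import cx_Oracle  # assumed Oracle driver for this sketch
from pymongo import MongoClient

ora = cx_Oracle.connect("app_user/secret@//dbhost:1521/ORCL")
mongo = MongoClient("mongodb://localhost:27017")["catalog"]["documents"]

def save_record(record_id, title, category, year, full_text):
    # 1) Transactional write stays in Oracle: update, or insert if new.
    cur = ora.cursor()
    cur.execute(
        "UPDATE documents SET title = :title, full_text = :full_text WHERE id = :id",
        title=title, full_text=full_text, id=record_id,
    )
    if cur.rowcount == 0:
        cur.execute(
            "INSERT INTO documents (id, title, category, doc_year, full_text) "
            "VALUES (:id, :title, :category, :doc_year, :full_text)",
            id=record_id, title=title, category=category,
            doc_year=year, full_text=full_text,
        )
    ora.commit()

    # 2) Near-real-time upsert of the read-side metadata into Mongo.
    mongo.update_one(
        {"_id": record_id},
        {"$set": {"title": title, "category": category, "year": year}},
        upsert=True,
    )
```

The division of labor is the point: the full-text and ACID pieces never leave Oracle, while the metadata that drives the fast search experience is mirrored into the NoSQL side immediately after commit.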

Recommendation:

It depends on your sunk investment, constraints, workforce, and needs. Your mileage may vary, but:

  • If you are sunk in Oracle, Mongo/Oracle is recommended
  • If you can move away from Oracle for searching, do not have a full-text search requirement, and want to move fast, Mongo is the winner
  • If you want to move away from Oracle and do have a full-text search requirement, ElasticSearch, the big brother of Solr with a little more steam, is the winner

Better yet, the best way to find out is to do a prototype with a light architecture definition up front. The project can usually be done by 2-3 FTE in 2-4 weeks, assuming a 10GB test data slice, cloud access, a data load, some performance tests, and an AJAX UI test harness. If you need help, let us know. The best way to get buy-in on an architecture, beyond definition and rigor, is demonstrating that it has legs.

 

When moving to the cloud, consider changing your discovery approach


As we do not want to pave that cowpath (see What cow paths, space shuttles, and chariots have in common, or What are some patterns or anti-patterns where architecture and governance can help on this point), we want not only to save money in moving to the IT commodity utility model, but also to consider: do we just take the MIS architecture and pattern and put it in the cloud, or do we look at new patterns, such as new search indexes, engines, or NoSQL models that allow rapid, near-real-time, smart discovery on the read part of the solution?

This will increase the relevancy of your data and digital assets as the market demands that things be easier, simpler, and instantly gratifying.

 

Traditional: Keyword Search Matching 

For many new, large, cloud-hosted database transaction management solutions, the organization needs fast document, record, object, or content search by facets and keywords, across both metadata and full text, with a quick, pleasant experience that can handle millions of documents, authorities, and lookup lists along with thousands of monthly transactions.

Currently, the architecture clients have invested in is a model developed pre-“big data”. These models emulate MIS form-based search by trained users, with a supporting search engine that does a full scan on any keyword, or some category or facet filtering, to return ALL matching records weighted by closest keyword match. This can handle full-text search as well as facet search, but it tends to tax computing power heavily to return accurate results, and the results will not be context-aware of popularity, typos, synonyms, etc.

New Searching architecture is not just Faster, but is Smarter

NoSQL models take input from a query box as well, but NoSQL engines can have multiple index-like “signals” that the query engine can look up to better interpret and infer what the user may be looking for. The search engine solution would handle, and carry an increased investment in, interpretative signals (i.e., fuzzy logic support for popular-search weighting, typo recognition, thesauri and synonym integration, community-based and event/trending signals, business rules, profile favorite patterns, etc.). This could also include researching description framework improvements, such as better overlapping categorization/alignment, schema.org, and a move toward RDFa.
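As one hedged example of baking such signals into the engine rather than the form, the sketch below defines a synonym analyzer at index time and tolerates typos at query time; the index name, synonym list, field names, and mapping layout (newer Elasticsearch style, older body-style client API) are all assumptions.

```python
# Sketch: "interpretative signals" as engine configuration, i.e. a synonym
# filter applied at index time plus fuzzy matching at query time.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="documents",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "my_synonyms": {
                        "type": "synonym",
                        "synonyms": ["bike, bicycle, cycle"],
                    }
                },
                "analyzer": {
                    "synonym_text": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "synonym_text"}
            }
        },
    },
)

# A query for "cycle" now also matches "bike"/"bicycle", and fuzziness
# absorbs small typos such as "bycycle".
es.search(
    index="documents",
    body={"query": {"match": {"title": {"query": "bycycle", "fuzziness": "AUTO"}}}},
)
```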

When these solutions do not apply

As the Apache Hadoop project states, Hadoop (and by implication NoSQL more generally) is NOT:

  1. A substitute for a database – you need something on top for high-performance updates.
  2. Always best served by MapReduce – if your MR jobs need to know about the previous job's results, you lose the parallelization benefits.
  3. For beginners at Java, Linux, or error debugging – it is open source and emerging, so many of the technologies built on top address this and are worth the extra layering.

Initial newer search engines are better at metadata search, but not at full text and full results

Google solutions or open source solutions like MongoDB are fast at addressing these “signals”, but are limited for full-text searches of extremely long documents, which are sometimes required by legal, policy, or other regulations. For instance, when using Craigslist or Groupon, a user searches against metadata fields, e.g. “Bike Vintage” between 1950 and 1960, plus most of the text, and what comes back in milliseconds is the top few hundred results. Those results are not hitting the raw data, nor every record; instead they hit index-like constructs carrying the record ID. From a result, the user can then call a URL to go into transaction mode back in Oracle. If an edit is made, the trigger to update the NoSQL index can fire immediately, and full-text updates can be applied in Oracle reasonably fast, though definitely not as fast as the NoSQL index.
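A brief sketch of that read path, with the NoSQL index returning only record IDs and Oracle touched only when the user moves into edit mode; all names and connection strings are hypothetical.

```python
# Sketch of the read path: millisecond metadata search against the NoSQL
# "index-like construct", then back to Oracle only for transaction/edit mode.
import cx_Oracle  # assumed Oracle driver for this sketch
from pymongo import MongoClient

search_index = MongoClient("mongodb://localhost:27017")["catalog"]["documents"]
ora = cx_Oracle.connect("app_user/secret@//dbhost:1521/ORCL")

# 1) Metadata search against the index, returning only record IDs.
ids = [
    d["_id"]
    for d in search_index.find(
        {"category": "bike", "year": {"$gte": 1950, "$lte": 1960}},
        {"_id": 1},
    ).limit(200)
]

# 2) Only the record the user opens for editing is fetched from Oracle.
cur = ora.cursor()
cur.execute("SELECT id, title, full_text FROM documents WHERE id = :id", id=ids[0])
print(cur.fetchone())
```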

There are other solutions among the newer search engine technologies that can address all the requirements. For example, there may be 10,000 results that a user wants to pull into their software, then move into transaction/edit mode and commit those edits in Oracle; the NoSQL index can be updated immediately and be available for near-immediate use, with full text served by Oracle or, in some search engine solutions, by the engine itself.

Exploring NoSQL and new signals will yield faster and smarter results

The point being, the improved discovery will not only be faster from a query return point of view, but also smarter in the results it returns. This will also make the discovery process itself faster, moving the user to faster action on their intended transactions, as the search results will be more context-aware of language issues, popularity, and the user's personalized needs. This can be achieved with technologies such as ElasticSearch, possibly Sphinx, or possibly a combination of MongoDB for fast search and existing Oracle for full-text search.