Ten thoughts on fixing search on our opendata catalogs

Blog post
edited by
Wiki Admin

If you go to almost any data catalog (academic publication catalogs, government agency open data clearinghouses, federated catalogs, marketing lists, metadata search sites, or even popular portals), most actually have a lot of great data, but it is extremely hard to make sure you are pulling down data that can actually be informative, that is truly information, without spending an ever-increasing amount of time.

So, we are on this great opendata train. The phrase du jour is "too much information"; I say it is too much data. Data is different from information:

Data are values or sets of values representing a specific concept or concepts. Data become “information” when analyzed and possibly combined with other data in order to extract meaning, and to provide context. The meaning of data can vary according to its context (Source: Federal Enterprise Architecture Data Reference Model).

These sites are more like eHoarders of data hoping to become information destinations. There is a lot of junk, and while it all started because some things had value, we are losing perspective.

I think part of the problem is that the metadata is bad. But even when it is good, it sits beside data that is bad. The internet "click" folks rely on this and hijack data discovery on search engines for this exact purpose. They hijack typo'd web site names like netfix.com or the like. They hijack keywords. They manipulate with SEO techniques to get their sites higher on search engines.
In closed communities, it is not intentional manipulation, but there is a lack of incentive to fix discovery.
What are ways we can fix our open catalogs? Here are ten ideas:
  1. Make searching more fun – Take facets, like those in tools such as CKAN, and do more with them: kayak.com-style jQuery filters, time-based sliders, charts that pop up with the context of record counts. Look, Kayak is a site scraper hitting APIs, simply re-presenting the results in clean ways, and collecting referral fees (in a nutshell). All because they made it easy and, more important, made travel searching sort of fun.
  2. Make Separate Search Components from your WebMIS – Stop fronting MIS systems with advanced-form search engines. Keep that if you are required to, or need it for your 5% of power users, but instead build a SOLR, NoSQL, or other fast search index that lets you build in search signals as you learn about your users. NodeJS feeds to the search database/index are fast, and millisecond updates are fast enough for 99% of cases (see the indexing sketch after this list).
  3. Use Enterprise Search instead of rolling your own – Take the search functions of standalone sites in your organization and make them an enterprise service, where the standalone group can still control or have input on their search signals.
  4. Feed Schema.org for SEO with a virtual library card – Beyond traditional SEO tuning, broker relationships with or invest in patterns for search engines like Google so they can build good signals/rules on top of your data. Do this by putting schema.org tags on your data pages, which can be extracted from your input metadata (see the JSON-LD sketch after this list).
  5. Register to be harvested – Get registered on multiple harvesting sites; they may find ways to make your data more discoverable, and when users find you through them, they still see the details on your site (or your site pushes those details as well). The point being, it is still authoritative.
  6. CrowdSource and Gamify Search Signal Tuning – Can we get crowdsourcing going to dogfood site usage and help build better search engine rules and signals? Whether the crowd comes from your own organization (with corporate awards or gamification) or from true external stakeholders. Bonus: More Student Power – can we get STEM programs or university systems involved as part of curricula, projects, etc.? A lot of search signal improvement is really about person-power or machine-to-machine power.
  7. Make Events to force data wrangling – In Colorado, we (our team is doing the data side) just did gocode.colorado.gov as a way to get application developers to build apps off OpenData Colorado. The reward was essentially a reverse contract, which made it legal to give a monetary award, create various set-asides, and incent usage. That usage in turn created more opportunities for exposure as a time-based event, which got data suppliers more engaged in putting things up.
  8. Find ways to share signals? This is more of a speculative theory, but could we feed things like Watson, Google, etc. to build a brain of search patterns, tell it our audience differences by having it scan our data, make some stereo-equalizer-style tweaks, and figure out which rule expression patterns to borrow from shared signal libraries?
  9. Learn more about what our librarians do. Look, our librarians 20 years ago did more than put books back on shelves and give you mean looks on late returns. They also managed what went into the library, helped with complicated inquiries to find information, and even helped curate across other libraries. Our network of organically grown, capital- or public-sector-driven meta-sites grew out of computer science and IT, not library science. We need to get computer science/MIS/IT and library science to start dating again. Get to know each other again. Remember the good times when we used to be able to find things and help each other out.
  10. Can we score OpenData sites? We have watchdogs on making data open, which is great. This helps make sure organizations provide what they are supposed to provide and keep it open. But this approach would be more about scoring the reality of discovering what they have provided. For example, we know the lawyer trick when they want to make problems with discovery: they provide the opposing side with so much information that it is inundated and there is not enough time to do discovery, and yadda, yadda, legal gamesmanship. Can we find ways to score or watchdog sites on data discovery, either as part of transparency or as a different type of consumer report?
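For idea 2, here is a minimal sketch, in TypeScript on a NodeJS runtime, of what feeding a standalone search index from your MIS could look like. The endpoint URL, index name, and record shape are assumptions for illustration; the point is that whatever change feed the MIS exposes pushes updated catalog records into an ElasticSearch-style bulk endpoint, so the public search stays decoupled from the MIS.

```typescript
// Sketch: push changed MIS records into a separate search index (ElasticSearch-style bulk API).
// The URL, index name, and record fields are illustrative assumptions.

interface CatalogRecord {
  id: string;
  title: string;
  description: string;
  keywords: string[];
  updated: string; // ISO timestamp
}

const SEARCH_URL = "http://localhost:9200"; // assumed search endpoint
const INDEX = "catalog";                    // assumed index name

async function pushToSearchIndex(records: CatalogRecord[]): Promise<void> {
  // Bulk format: one action line, then one document line, per record (NDJSON).
  const body =
    records
      .flatMap((r) => [JSON.stringify({ index: { _index: INDEX, _id: r.id } }), JSON.stringify(r)])
      .join("\n") + "\n";

  const res = await fetch(`${SEARCH_URL}/_bulk`, {
    method: "POST",
    headers: { "Content-Type": "application/x-ndjson" },
    body,
  });
  if (!res.ok) throw new Error(`Bulk index failed: ${res.status}`);
}

// Usage: whatever change feed your MIS exposes (queue, trigger, poll) calls this with the
// records that changed, so the index stays milliseconds behind the system of record.
pushToSearchIndex([
  {
    id: "ds-101",
    title: "Business filings",
    description: "Quarterly business filings by county",
    keywords: ["business", "filings"],
    updated: new Date().toISOString(),
  },
]).catch(console.error);
```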
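For idea 4, a minimal sketch of emitting schema.org Dataset markup (JSON-LD) for a catalog record, so search engines can extract structured signals from each landing page. The field mapping is an assumption about what your metadata already holds.

```typescript
// Sketch: emit a schema.org Dataset tag for a catalog record so search engines can build
// signals/rules on top of your data. The metadata fields are illustrative assumptions.

interface DatasetMeta {
  title: string;
  description: string;
  keywords: string[];
  landingPage: string;
  publisher: string;
}

function toJsonLd(meta: DatasetMeta): string {
  const dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: meta.title,
    description: meta.description,
    keywords: meta.keywords,
    url: meta.landingPage,
    publisher: { "@type": "Organization", name: meta.publisher },
  };
  // Drop this tag into the <head> of the dataset's landing page.
  return `<script type="application/ld+json">${JSON.stringify(dataset)}</script>`;
}

console.log(
  toJsonLd({
    title: "New business filings",
    description: "Quarterly new business filings by county",
    keywords: ["business", "filings", "Colorado"],
    landingPage: "https://example.gov/dataset/new-business-filings",
    publisher: "Example State Agency",
  })
);
```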

.02

 

Go Code Colorado Open Data Effort is going into its final weeks

Blog post
edited by
Wiki Admin

States all around have gotten into the Open Data movement. Colorado has as well, and its recent Go Code Colorado effort is a unique entry into the fray (http://gocode.colorado.gov/).

Go Code Colorado was created to help Colorado companies grow, by giving them better and more usable access to public data. Teams will compete to build business apps, creating tools that Colorado businesses actually need, making our economy stronger.

 

The following is a great video summarizing the event, produced by the State and one of Xentity’s colleagues, Engine7 Media.




Xentity is very proud to be supporting this innovative government solution.

Xentity was awarded IT consulting support for the Business Intelligence Center platform and data catalog, which supports the now-branded Go Code Colorado initiative. Xentity’s consultants have provided the data and technology resources to manage and advise the publication of public sector data to the Colorado Information Marketplace and to provide technical support to developers who participate in the Challenge.

Xentity has primarily provided data platform support. We have provided data readiness analysis, data architecture guidance, project management, and the data analysts to “wrangle” the data (aka ETL) to get the datasets onto the platform. We also have provided the IT and data support on-site at the multiple locations and events to ensure the challenge participants and finalists get the support they need to be successful in accessing and using the data and services. Finally, we are supporting the technical review of applications to ensure these applications can have a life beyond the “hackathon” stage.

The final stages come in the first 10 days of May. The 10 finalists have demonstrated very viable solutions toward the goal of helping make our economy stronger.

Some more background and detail on how we got here

(The following is from the State as guidance to this effort)

 

Colorado government agencies possess large volumes of public business and economic data. This data can help businesses with strategic planning, but it exists in so many different places and formats that it is difficult for most businesses to use. The Secretary of State’s office will address this problem through the creation of the Business Intelligence Center (BIC). BIC seeks to aggregate and analyze data available to the business community.

This effort is led by the Colorado Secretary of State. The Secretary of State’s office interacts with hundreds of thousands of business entities, charities, and nonprofits in the state. The Secretary of State’s office collects, manages, and disseminates large amounts of basic data about those organizations and wanted to make the data useful to Colorado businesses. 

The Department sought to make this data more useful and collaborated with the Leeds School of Business at the University of Colorado to publish the Quarterly Business and Economic Indicator Report. This report combines Department data with other economic data collected by the Leeds School to provide meaningful economic information to the business community. For instance, new business filings are a leading indicator of job creation. With this and other information provided in the report, the business community can make smarter decisions that will grow the Colorado economy.

Since first publishing the report in 2012, the Secretary of State has received comments from many members of the business community asking to see more detailed data regarding economic trends in order to better understand the distribution of commerce in Colorado. This includes access to the location, size, vibrancy, and concentration of key business nodes. While this level of detail would be tremendously helpful, the Department cannot provide the information because multiple state agencies collect the desired data and it is not readily available in a common place or even a common format.

A central data collection point is needed. During meetings with other government agencies, Department staff concluded that these data requests could be met by aggregating all the information spread throughout various agencies and databases into a single tool, breaking down agency silos, and better cataloging existing resources. Department staff also concluded that access to and availability of the data are not enough. In order to make the raw data useful to the vast majority of business owners, data analysis and visualization tools are needed. These conclusions led to the Business Intelligence Center project.

The Business Intelligence Center consists of a centralized data catalog that combines public data into a meaningful tool for businesses. 

The vision for this project is two-fold. First, it consolidates public data relevant to businesses on a single platform. Second, it gives businesses the tools to make the data useful. The second goal is achieved through a civic apps challenge, the Colorado Business Innovation Challenge, which will give financial incentives to the technology community to build web and mobile applications that use state and other data to solve existing business challenges.

The data platform is akin to an information clearing house. It will make data sources currently dispersed over multiple government departments and agencies accessible in a common location. 
This platform will offer Colorado businesses unprecedented access to public data that is validated and relevant to short- and long-term needs. Besides enhancing businesses’ access to state data, the BIC will also contribute to economic growth. The creation of the BIC will make data available to all Colorado businesses at no additional cost. Currently, only large entities with the time, staff, and budget to engage in detailed statistical analysis can use these data sets. Providing this data to every type and size of business in Colorado provides a unique opportunity to contribute to economic development. The BIC will nurture key industry networks and lay the foundation for a digital infrastructure that will continue to expand and improve over time.

The Colorado Business Innovation Challenge is an innovative way to create solutions and ensure the BIC is useful to Colorado businesses.

Simply making the data available is insufficient for most business owners. To truly help the vast majority of businesses, especially small businesses, tools must be developed to present the data in a useful and consumable form. Normally government agencies develop tools to fill this information vacuum, but historically the government has not been successful at developing highly useful and effective tools. A new approach is needed; that approach is the Colorado Business Innovation Challenge.

Modeled after “civic apps” challenges that have been run in multiple cities across the United States and internationally, the Challenge presents the software development community with problem questions and then asks that community to create possible solutions. At the end of the challenge, the Secretary of State will license the most innovative and implementable web or mobile application. The best design will receive a contract with the Secretary of State to make the application available to the public on the Business Intelligence Center platform. The Department will also pursue partnerships with the Colorado technology and startup industry to provide additional incentives, such as mentoring, hosting, and office space, to the Challenge winners. The long-term intent of the program is not only to create an environment for fostering community involvement through the Challenge, but also to sustain the tools that are developed in the Challenge.

BigData is a big deal because it can help answer questions fast

Blog post
added by
Wiki Admin

BigData is not just size and speed of complex data – it is moving us from information to knowledge

 

As our Why we focus on spatial data science article discusses, the progress of knowledge fields (history to math to engineering to science to philosophy), or the individual pursuit of knowledge, is based on moving from experiments to hypotheses to computation and now to The Fourth Paradigm: Data-Intensive Scientific Discovery. This progression has happened over the course of human history and is now repeating itself on the internet.

The early 90s web was about content, history, and experiments. The late 90s web was about transactions, security, and eCommerce. The 2000s web was about engineering entities breaking silos: within companies, organizations, sectors, and communities. The 2010s web has been about increasing collaboration in communication and work production, and entering into knowledge collaboration. The internet’s progression is just emulating the development of human capability over history.

When you are ready to move into BigData, it means you want to answer new questions.

That said, the BigData phenomenon is not about the input of all the raw data and the explosion that the Internet of Things is touted as. The resource sells, but the end product is the consumed byproduct, so let’s focus on that byproduct: knowledge. It is not about the speed of massive amounts of new, complex, varying-quality data that our discussion of IBM’s 4 V’s focuses on.

It is about what we can do on the cheap with technology that previously required supercomputer clusters only the big boys had. Now, with the cloud, the internet, and enough standards, if we have good and improving data, we ALL have the environment to answer complicated questions while sifting through the noise. It is about enabling the initial phase of knowledge discovery, the phase everyone is complaining about on the web right now as “too much information” or “drowning in data”.

The article on Throwing a Lifeline to Scientists Drowning in Data discusses how we need to be able to “sift through the noise” and make search faster. That is the roadblock, the tall pole in the tent, the showstopper.

Parallelizing the search is the killer app – this is the Big Deal, we should call it BigSearch

If you have to search billions of records and map them to another billion records, doing that in sequence is the problem. You need to shorten the time it takes to sift through the noise. That is why Google became an amazing success out of nowhere. They did and are currently doing it better than anyone else – sifting through the noise.
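As a toy illustration of that point: if the records are partitioned, a query can fan out to all partitions at once instead of walking them in sequence, so total time tracks the slowest partition rather than the sum of all of them. The shard URLs and response shape below are assumptions for illustration, not any particular product’s API.

```typescript
// Sketch: fan a query out to N partitions in parallel instead of searching them in sequence.
// Shard endpoints and response shape are illustrative assumptions.

interface Hit {
  id: string;
  score: number;
}

async function searchShard(shardUrl: string, query: string): Promise<Hit[]> {
  const res = await fetch(`${shardUrl}/search?q=${encodeURIComponent(query)}`);
  if (!res.ok) return []; // a slow or dead shard should not block the rest
  return (await res.json()) as Hit[];
}

async function parallelSearch(shards: string[], query: string): Promise<Hit[]> {
  // All shards are queried concurrently; total latency is roughly the slowest shard,
  // not the sum of all shards as in a sequential scan.
  const perShard = await Promise.all(shards.map((s) => searchShard(s, query)));
  return perShard
    .flat()
    .sort((a, b) => b.score - a.score)
    .slice(0, 20); // merge and keep the top results
}

parallelSearch(
  ["http://shard1.example", "http://shard2.example", "http://shard3.example"],
  "stream gauge colorado"
).then((hits) => console.log(hits));
```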

The United States’ amazing growth is because of two things: we have resources, and we found out how to get to them faster. Each growth phase of the United States was based on that fact alone, plus a bit of stopping the barbarians at the gates, or ourselves, from implosion. You could say that is civilization. Some softball examples out of hundreds:

  • Expanding West dramatically exploded after trains, which allowed for regional foraging and mining
  • Manufacturing dramatically exploded production output, which allowed for city growth
  • Engines shortened time between towns and cities, which allowed for job explosion
  • Highway systems shortened time between large cities, which allowed for regional economies
  • Airplanes shortened time between the legacy railroad time zones, which allowed for national economies
  • Internet shortened access to national resources internationally, which allowed for international economies
  • Computing shortened processing time of information, which allowed for micro-targeted economies worldwide

Each “age” resulted in shortening the distance from A to B. But Google is sifting through data. Scientists are trying to sift as well through defined data sensors, link them together, and ask very targeted simulated or modeled questions. We need to address the barriers limiting entities’ success at doing this.

 

GAO releases report on FGDC Role and Geospatial Information

Blog post
edited by
Wiki Admin

GAO released a report on the use of geospatial information titled “OMB and Agencies Can Reduce Duplication by Making Coordination a Priority”. The Reader’s Digest version: focus on integrating data.


We tend to agree. FGDC is currently very focused on a service-enabling management model (Geoplatform) to accomplish this. It is bold, but if their role as a service provisioner can directly or indirectly get them in the game to address the real problem of data lifecycle management, they will have a chance to do so.

Point being, FGDC knows its role is not to be in IT operations as its direct goal. But they also saw that being a sideline judge with no carrot or stick would not garner the direction and recommendations that GAO suggests. They are getting on the playing field, taking advantage of the open service provider role, being that broker, and using that role to move IT costs down, enabling those shifts in monies to then focus on the data issues cited. It is a bold and unique approach, and there are many questions about whether a traditionally non-operational group can develop the culture to be effective. The proof will show over the next two years.

Below is our summary of the strategic direction for FGDC’s Geoplatform.

The challenges and recommendation sections are:

  1. FGDC Had Not Made Fully Implementing Key Activities for Coordinating Geospatial Data a Priority
  2. Departments Had Not Fully Implemented Important Activities for Coordinating and Managing Geospatial Data
  3. Theme-lead Agencies Had Not Fully Implemented Important Activities for Coordinating and Managing Geospatial Data
  4. OMB Did Not Have Complete and Reliable Information to Identify Duplicative Geospatial Investments

Our review of Background – then and now

The FGDC has put the foundation in place. The Federal Geographic Data Committee (FGDC) has always been a catalyst and leader enabling the adoption and use of geospatial information.

The Federal Geographic Data Committee (FGDC) has been successfully creating the geospatial building blocks for the National Spatial Data Infrastructure (NSDI) and empowering users to exploit the value of geospatial information.  The FGDC has been leading the development of the NSDI by creating the standards and tools to organize the asset inventory, enhance data and system interoperability and increase the use of national geospatial assets. The FGDC has successfully created policy, metadata, data and lifecycle standards, clearinghouses, catalogs, segment architectures and platforms that broaden the types and number of geospatial users while increasing the reuse of geospatial assets. [1] 

What is next? The Geospatial Platform and NGDA portfolio will be the mechanism for adoption of shared geospatial services to create customer value

Recently, the FGDC and its partners have expanded their vision to include the management and development of a shared services platform and a National Geospatial Data Asset (NGDA) portfolio. The goals are to “develop National Shared Services Capabilities, Ensure Accountability and Effective Development and Management of Federal Geospatial Resources, and Convene Leadership of the National Geospatial Community benefitting the communities of interest with cost savings, improved process and decision making”.[2]

As the FGDC continues on the road to establish a world class geospatial data, application and service infrastructure, it will face significant challenges “where the Managing Partner, along with a growing partner network, will move from start‐up and proof‐of‐concept to an operational Geospatial Platform”.[3]

Xentity has reviewed the FGDC’s current strategy, business plan and policies and identified the following critical issues that need to be solved to attain the goals:

  • Building and maintaining a federated, “tagged”[4] standards-based NGDA and an open interoperable Geospatial platform. The assets need to provide sufficient data quantity and quality with service performance to attract and sustain partner and customer engagement[5]
  • Developing a customer base with enough critical mass to justify the FGDC portfolio and provide an “Increased return on existing geospatial investments by promoting the reuse of data application, web sites, and tools, executed through the Geospatial Platform” [6]
  • Improving Service Management and customer-partner relationship capabilities to accelerate the adoption of the interoperable “shared services” vision and satisfy customers [7]
  • Executing simple, transparent, and responsive Task Order and Requirements management processes that result in standards-based interoperable solutions [8]

The Big Challenges

Establish the financial value and business impact of the FGDC’s Portfolio!

The Geospatial Platform and NGDA will provide valuable cost-saving opportunities for their adopters. They will save employees’ time, avoid redundant data acquisition and management costs, and improve decision making and business processes. The financial impact to government and commercial communities could be staggering; it is a big and unknown figure.

The Geospatial Platform by definition and design is a powerful, efficient technology with the capacity to generate a significant return on investment. It is a community investment and requires community participation to realize the return. The solution will need to assist the communities with the creation and sharing of return-on-investment information, cost modeling, case studies, funding strategies, tools, and references, and continue to build the investment justification. The solution will need to optimize funding enhancement and be responsive to shorter-term “spot” or within-current-budget opportunities while always positioning for long-term sustainability. The FGDC Geospatial Platform Strategic Plan suggests a truly efficient capability could create powerful streamlined channels between much broader stakeholder communities, including citizens, the private sector, and other government-to-government interfaces. Similar to the market and business impacts of GPS, DOQ, and satellite imaging technology, the platform could in turn promote more citizen satisfaction, private sector growth, or multiplier effects on engaged lines of business.

To get a big return, it will demand continuous creative thinking to develop investment, funding, management and communication approaches to realize and calculate the value.  It is a complex national challenge involving many organizations, geospatial policy, conflicting requirements, interests and intended uses.

The key is demonstrable successes.  Successes become the premise for investment strategy and cost savings for the customers.  Offering “a suite of well‐managed, highly available, and trusted geospatial data, services, and application, web site for use by Federal agencies—and their State, local, Tribal, and regional partners” [9] is the means to create the big value.  

”A successful model of enterprise service delivery will create an even greater business demand for these assets while reducing their incremental service delivery costs.” [10]

FGDC has to create and tell a compelling “geospatial” value proposition story

Successfully implementing the FGDC’s vision will demand a robust set of outreach and marketing capabilities. The solution will need to help construct the platform’s value proposition and marketing story to build and inform the community. The objective is to ensure longer-term sustainable funding and community participation. The solution will need to bring geospatial community awareness, incentive modeling, financial evaluation tools, multi-channel communication, and funding development experience to the FGDC. It will need transparently developed and implemented communication and marketing strategies that lead to growth in the customer base, alternative portfolio funding models, and shared services environments for the geospatial communities. And it will need an approach that is transparent, engages the customers and partners, and continuously builds the community.

This is a challenging time to obtain needed capital and win customers, even for efficient economic engines like shared geospatial data and services. The solution will need an approach to community outreach that is impactful and trusted and that tells the story of efficiencies, cost savings, and higher quality information. The platform and NGDA must impact the customers’ program objectives. Figure 1 (FGDC Performance and Value framework) shows how the platform’s value chain aligns with the types of performance benefits that can be realized throughout its inherent processes. The supporting team will need to use its understanding of this model to organize the “story” and convince the customer and partners that the platform can:

  • Provide decision makers with content that they can use with confidence to support daily functions and important issues,
  • Provide consistency of base maps and services that can be used by multiple organizations to address complex issues,
  • Eliminate the need to choose from redundant geospatial resources by providing access to preferred data, maps and services[11] 

As the approach is implemented, the FGDC, its partners, and the Communities of Interest will have successfully accelerated the adoption and use of location-based information. Users will recognize the value offering and reap the benefits to their operations and bottom line. The benefits will be measurable and support the following FGDC business case objectives:

  • Increasing Return on Existing Investments, Government Efficiency, Service Delivery
  • Reducing (unintentional) Redundancy and Development and Management Costs
  • Increasing Quality and Usability[12]

Our Suggested Solution

FGDC’s challenges require a PMO, integrated lifecycle management, a partner focus, and blended experience, with an integrated approach and single voice designed to meet the FGDC’s strategic objectives and provide a world-class shared services and data portfolio. Doing this, they can integrate organizations, data, and service provision.

A solution like this would provide the program, partner and customer relationship management, communications, development, and operational capabilities required to successfully implement the FGDC’s vision and business plan. The focus will need to:

  1. Coordinate cross-agency tasks and portfolio needs through agile program management coordination with a single voice,
  2. Implement an understanding of critical lifecycle processes to manage and operate the data, technology, capital assets, and development projects for a secure cloud-based platform,
  3. Have communications and outreach focused on communities, for partner and customer engagement in the lifecycle decisions, and
  4. Finally, make sure the secretariat staff and team has rotating collective experience, with representatives and contractors who have successfully performed at this scale across all functional areas, with domain knowledge in geospatial, technology, program, service, development, and operations.

The strategy, collective experience, and techniques will enable FGDC to provide a single voice from all management domains (PMO, Development, Operations, and Service Management) for customer engagement. The approach will need to be integrated with the existing FGDC operating model, creating a sum value greater than that of its individual parts. This approach will help create the relationships needed to develop trusted partner services.


[1] Geospatial-Platform-Business-Plan-Redacted-Final, page 7
[2] Draft NSDI Strategic Plan 2014-2016 V2, page 2
[3] Geospatial-Platform-Business-Plan-Redacted-Final, page 28
[4] Ibid., page 11
[5] Ibid., page 9
[6] Ibid., page 26
[7] Ibid., page 4
[8] Ibid., page 6
[9] Geospatial-Platform-Business-Plan-Redacted-Final, page 2
[10] DOI Geospatial Services Blueprint, 2007
[11] Geospatial-Platform-Business-Plan-Redacted-Final, page 13
[12] Ibid., Appendix A
[13] OMB Circular A-16 Supplemental Guidance, page 12
[14] Geospatial-Platform-Business-Plan-Redacted-Final, page 12
[15] Ibid., page 36
[16] ITSM Service Operations V3.0
[17] Ibid., page 26


Comparing NoSQL Search Technology Features

Blog post
edited by
Wiki Admin

Features Comparison Matrix

There are many new document-oriented databases out there. Here is a quick, high-level comparison of the features of five of these newer technologies that were compared when creating the prototype concepts discussed in the blog post “When moving to the cloud, consider changing your discovery approach”.

 

| Feature | Oracle on AWS | MongoDB on AWS | ElasticSearch | Sphinx | Mongo for quick search + Oracle for full-text |
| --- | --- | --- | --- | --- | --- |
| Type | SQL | BSON | JSON | Mix1 | Mix |
| EC2 Compatible | Yes | Yes | Yes | Yes | Yes |
| Scale Horizontally | Non-RAC on AWS | Yes | Yes | Yes | Yes |
| License | Paid | Open (AGPL v3) | Open (Apache 2) | Open | Combined #1 and #2 |
| FullText (FT) | Yes | Up to 1GB docs | Yes | Yes | Yes, for Oracle |
| Near/Proximity | Yes | No | Yes | Yes | Yes, on Oracle |
| Conditional Queries | Yes | Yes | Yes | TBD | Yes |
| RegEx | Yes | Yes | Yes+ | No | Yes |
| Facets | Would need to be coded into forms | Aggregation | Yes | Yes | Yes |
| Document Limit | Meets complicated document needs | 16MB / GridFS | 2GB* | ? | Combined #1 and #2 |
| Paging (FT) Results | Yes | No (16MB limit) | Yes | Yes | Combined #1 and #2 |
| Speed: Inserts | ? | Fast | ? |  | Combined #1 and #2 |
| Speed: Updates | ? | Fast | ? |  | Combined #1 and #2 |
| Speed: Indexing | ? | Fast | Really Fast | 10-15MB of text/sec | Combined #1 and #2 |

(“Combined #1 and #2” means the characteristic is inherited from the Oracle and MongoDB columns combined.)

Pros / Cons

Oracle

Pros: Likely already invested, easy to do updates in Oracle, ACID for transactions, large workforce

Cons: RAC is not on AWS yet; if on an XML database, index updates are complicated and high CPU/memory regardless of tuning efforts; no smart search components (i.e., no “signals” to provide more search or semantic context yet); public-facing licenses are often priced differently than internal enterprise licenses

Mongo

Pros: Proves fast to sprint on, improve, and add new signals to; proves fast for the metadata load, index updates, batch loads, and the search requirement for non-full-text document search

Cons: Mongo is good for a lot of things, but it cannot meet the full-text search requirement.

ElasticSearch

Pros: Solr is also a solution for exposing an indexing/search server over HTTP, but ElasticSearch provides a much superior distributed model and ease of use. Elasticsearch uses Lucene v4 to provide the most powerful full-text search capabilities available in any open source product. Search comes with multi-language support, a powerful query language, context-aware did-you-mean suggestions, autocomplete, and search snippets. All fields are indexed by default, and all the indices can be used in a single query to return results at breathtaking speed. And you can still do updates in Oracle or a traditional RDBMS directly and just sync with ElasticSearch.

Cons: No built-in security for the RESTful services, but there are two plugins (https://github.com/Asquera/elasticsearch-http-basic and https://github.com/sonian/elasticsearch-jetty), as well as the option of running nginx as a reverse proxy. The technology is maturing, with frequent new releases, so your configuration management will be tested. This may require an additional optimization and debugging period, but other document repository and search solutions with similar features have been created with this technology.
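As a concrete illustration of the full-text features mentioned above, here is a minimal sketch of a query against an ElasticSearch index over HTTP, with field boosting and highlighted snippets. The index name ("catalog") and field names are assumptions for illustration.

```typescript
// Sketch: a full-text query against an ElasticSearch index with highlighted snippets.
// Index name ("catalog") and field names are illustrative assumptions.

async function searchCatalog(text: string) {
  const query = {
    query: {
      // Match across several fields, boosting title matches over the rest.
      multi_match: { query: text, fields: ["title^2", "description", "keywords"] },
    },
    highlight: { fields: { description: {} } }, // search snippets in the response
    size: 10,
  };

  const res = await fetch("http://localhost:9200/catalog/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  const data = await res.json();
  return data.hits.hits; // each hit carries _source plus highlight fragments
}

searchCatalog("business filings by county").then((hits) => console.log(hits));
```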

Sphinx (http://sphinxsearch.com/about/sphinx/)

Pros: JSON support is currently very new, but Sphinx does support the following. SQL database indexing: Sphinx can directly access and index data stored in MySQL (all storage engines are supported), PostgreSQL, Oracle, Microsoft SQL Server, SQLite, Drizzle, and anything else that supports ODBC. Non-SQL storage indexing: data can also be streamed to the batch indexer in a simple XML format called XMLpipe, or inserted directly into an incremental RT index.

Cons: Sphinx is maturing, but its marketing and overview material does not make it as clear how to get up and running. It is not really JSON-friendly and is a bit more cryptic to plug and play.

Mongo (Read) / Oracle (Transaction/Sync)

Pros: Re-uses the Oracle investment for ACID and licenses; you can still do updates in Oracle directly; Mongo can be updated near real-time and fast; best of both worlds. Oracle could do the full-text part as a secondary search requirement, which would likely get less use, and Mongo could do the rest. If a quick partial migration or architecture change is digestible but you are not ready for the full swap-out, this is something to move to fast; it is easy to maintain, supports interpretive signals, gives a Google-like experience, and scales. Mongo can handle searches under 1GB and Oracle can do full-text on anything over 1GB.

Cons: It’s the Prius or Volt model: it is a hybrid, so maintaining two tech stacks for a long period of time, which can happen, can be more than a nuisance.
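A minimal sketch of how the routing in that hybrid could look; the 1GB threshold and the two delegate functions are assumptions standing in for a MongoDB metadata query and an Oracle Text full-text query, not a definitive implementation.

```typescript
// Sketch: route a search to Mongo for quick metadata/keyword lookups and fall back to
// Oracle full-text for large-document queries. The threshold and the two delegate
// functions are illustrative assumptions.

type SearchFn = (query: string) => Promise<string[]>;

interface HybridConfig {
  mongoSearch: SearchFn;    // fast metadata/keyword search (e.g., a MongoDB find on indexed fields)
  oracleFullText: SearchFn; // full-text search on large documents (e.g., Oracle Text CONTAINS)
}

function makeHybridSearch(cfg: HybridConfig) {
  return async (
    query: string,
    opts: { fullText?: boolean; docSizeBytes?: number } = {}
  ): Promise<string[]> => {
    const ONE_GB = 1024 ** 3;
    // Full-text requests, or anything over the size threshold, go to the Oracle side;
    // everything else stays on the fast Mongo read path.
    const needsOracle = opts.fullText === true || (opts.docSizeBytes ?? 0) > ONE_GB;
    return needsOracle ? cfg.oracleFullText(query) : cfg.mongoSearch(query);
  };
}

// Usage: wire in the real delegates; simple stubs shown here.
const hybridSearch = makeHybridSearch({
  mongoSearch: async (q) => [`mongo hit for "${q}"`],
  oracleFullText: async (q) => [`oracle full-text hit for "${q}"`],
});
hybridSearch("water rights", { fullText: true }).then((hits) => console.log(hits));
```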

Recommendation:

Depends on your sunk investment, constraints, workforce, and needs. Your mileage may vary, but:

  • If you are sunk in Oracle, Mongo/Oracle is recommended
  • If you can move away for search, do not have a full-text search requirement, and want to move fast, Mongo is the winner
  • If you want to move away from Oracle and do have a full-text search requirement, ElasticSearch is the big brother of Solr, has a little more steam, and is the winner.

Better yet, the best way to find out is to do a prototype with a light architecture definition upfront. The project usually can be done by 2-3 FTEs in 2-4 weeks, assuming a 10GB test data slice, cloud access, data load, some performance tests, and an AJAX UI test harness. If you need help, let us know. The best way to get buy-in on an architecture, beyond definition and rigor, is demonstrating it has legs.