Delivering Open Data in Bulk on the Cloud

Blog post added by Wiki Admin

We just finished some work for a large national government data provider who measures files in the millions, records in the tens to hundreds of millions, and storage in the sub-petabyte range. Below are the obfuscated general requirements for anyone looking to deliver bulk data in the cloud: storage, access, access methods, discovery, communications, and applications.

These requirements have been generalized, redacted in places, or in some cases added to, so that any government Open Data program delivering large public datasets can consider them. These are simply the business requirements; the technologies, vendors, cost models, capacity planning, etc. were addressed separately.

1.    Storage – storage supporting a range of file form factors

Investigate the free public dataset clearinghouses such as http://aws.amazon.com/publicdatasets/ or the equivalents on Azure, etc.

Consider the various form factors of files or services:

  • Gigabyte-size files
  • Medium-size files whose totals exceed the gigabyte-size files
  • Terabyte- or gigabyte-size files that have been broken into medium-size files for transfer
  • Millions of small files, usually delivered in a buffered stream
  • Data-driven file delivery via services
  • Terabyte files only deliverable via sneakernet import/export

2.    User Access – easy access for users to copy files to a target environment

Public read-only users should not be required to pay for access to the end solution (i.e. they should not need a cloud account on the hosted solution).

Internal users will require access to private directories for files that are not, or not yet, publicly released (e.g. emergency response data, licensed data, interim work products).

Internal users will benefit from lower-latency access than public users, via solutions such as cached volumes, integration between on-premise IT and the cloud environment, and secure file transfer.

3.    Multiple Access Methods – Service, Download, Media, Cloud-to-Cloud

Users will look to have data provided in bulk in one of three ways: web service, bulk media, or cloud-to-cloud.

Admins should have access to user traffic statistics for viewing, for exporting statistics logs, and for calling statistics logs from hosted applications.

Web Service: the user pulls a directory, a set of directories, a set of files, or a mix via online web access over HTTP, REST, FTP, UDP, or SCP.

Learn which high-performance file transfer solutions are possible, such as edge-network publishing to move data closer to users, or protocols such as UDT (UDP-based Data Transfer).
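As a minimal sketch of the plain web pull method, the snippet below downloads a published list of files over HTTP in parallel. The file-list URL, its format, and the target directory are hypothetical placeholders, not any provider's actual layout.

```python
# Minimal sketch of a bulk HTTP pull, assuming a plain-text file list is
# published alongside the data. URLs and paths are hypothetical.
import concurrent.futures
import os
import urllib.request

FILE_LIST_URL = "https://data.example.gov/bulk/file-list.txt"  # hypothetical
TARGET_DIR = "bulk_download"

def fetch(url):
    """Download one file into TARGET_DIR, keeping its base name."""
    dest = os.path.join(TARGET_DIR, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)
    return dest

os.makedirs(TARGET_DIR, exist_ok=True)
file_urls = urllib.request.urlopen(FILE_LIST_URL).read().decode().split()

# A handful of parallel workers is usually enough; tune to the host's limits.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for path in pool.map(fetch, file_urls):
        print("downloaded", path)
```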

Bulk Media: for faster and likely larger file requests, the user requests a directory, a set of directories, a set of files, or a mix to be put onto a storage device by the service provider, and the device is delivered back to the user.

Bulk media minimum specifications for external hard drives should be defined.

Cloud-to-Cloud: users who have existing cloud storage accounts or virtual machine processing points in the cloud will request or pull a directory, a set of directories, a set of files, or a mix, with the data pushed to the user's cloud point.

4.    Discovery – increased visibility and discovery of staged products in catalogs and search engines

Data products are usually downloaded via keyword, geospatial, or temporal product discovery applications: users filter a search, create an order, and download the products in small groups.

Public file directory listing should be discoverable and optimized for discovery by search engines

Public collections should be discoverable and optimized for discovery by search engines

Explicitly demonstrate how bulk data registrations will be discoverable and registered in both Sciencebase.gov and data.gov

Catalogs should be able to pull- or push-harvest public FGDC, ISO 19115, or RDF metadata for files in the directories, for transactional or bulk loading into their catalog.

The file directory listing can be queried via an open-standard discovery service to assist in developing a download filter list.
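As a hedged sketch of how a catalog might pull-harvest ISO 19115 metadata for staged files, the snippet below fetches a listing of metadata URLs and extracts a few catalog fields. The listing endpoint, file naming, and the exact elements present in each record are assumptions, not any provider's specification.

```python
# Sketch of pull-harvesting ISO 19115 (gmd) metadata records for catalog loading.
# The listing URL and metadata file layout are hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

LISTING_URL = "https://data.example.gov/bulk/metadata-list.txt"  # hypothetical
GMD = "{http://www.isotc211.org/2005/gmd}"
GCO = "{http://www.isotc211.org/2005/gco}"

def harvest(metadata_url):
    """Pull one ISO 19115 record and extract a few catalog fields."""
    tree = ET.parse(urllib.request.urlopen(metadata_url))
    title = tree.find(f".//{GMD}title/{GCO}CharacterString")
    date = tree.find(f".//{GMD}dateStamp/{GCO}Date")  # may be gco:DateTime in some records
    return {
        "source": metadata_url,
        "title": title.text if title is not None else None,
        "date": date.text if date is not None else None,
    }

urls = urllib.request.urlopen(LISTING_URL).read().decode().split()
records = [harvest(u) for u in urls]  # ready for transactional or bulk catalog load
```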

The National Map can be discoverable in the proposed service provider's catalog, but the catalog reference needs to follow the metadata provided with each file, at minimum presenting source, created date, updated date, title, basic description, and the provided DOI link for the file or directory.

The service provider should support being called via a Digital Object Identifier (DOI).

5.    Publishing – support batch file release updates for thousands of files monthly

Consider whether files within datasets will be published and updated incrementally, which will require service or bulk media methods to update the datasets.

Files published will be stored in original formats.

Updates are expected monthly, touching on average no more than 10% of files or file storage.

Updates to files should be logged to trigger notifications to subscribed users.

File updates should maintain success and parity-check status.

Offline file transfer should support processing of delivered storage devices, with clear instructions.

Online upload transfer per storage unit (i.e. per gigabyte) should not incur transfer charges akin to the transactional charges for the bulk download area.

Online upload should have high-performance data transfer capabilities, such as UDT (UDP-based Data Transfer), between on-premise data and the cloud.

Moving from cloud to cloud, e.g. from a transactional area to the public dataset hosting area, should have very high transmission speeds and should consider location proximity issues.

6.    Notifications – providing ways for users to subscribe to staged product file update notifications

Users can subscribe to changes to a directory, sub-directory, or specific files.

Users can be notified of such changes via push notifications per change, daily digests, RSS updates, or other notification techniques.

Users can use the notifications to request the bulk file updates.
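As a minimal sketch, assuming changes are published as an RSS feed of updated file URLs (the feed address and item layout are hypothetical), a subscriber could turn notifications into bulk update requests like this:

```python
# Sketch of consuming an RSS change feed and re-pulling only the updated files.
# Feed URL and item structure are assumptions for illustration.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://data.example.gov/bulk/changes.rss"  # hypothetical

feed = ET.parse(urllib.request.urlopen(FEED_URL))
for item in feed.findall(".//item"):
    link = item.findtext("link")       # URL of the updated file
    updated = item.findtext("pubDate")
    print("changed", updated, link)
    # here the subscriber would queue `link` for a bulk re-download
```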

7.    Download API – supporting or including applications that help the user download in bulk

Have a download API that can be controlled by api.data.gov: it should uniquely identify API consumers, provide HTTP access via a GET parameter in the URL query, and support an hourly request limit based on API key settings. If the api.data.gov rate limit is exceeded, an HTTP status code of 503 should be returned.
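A rough sketch of a client honoring such a rate-limited, keyed download API follows. The endpoint and parameter names are illustrative only; the back-off on HTTP 503 follows the requirement stated above rather than any published api.data.gov behavior.

```python
# Sketch of a rate-limit-aware client for a keyed download API.
# Endpoint and parameter names are hypothetical.
import time
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://api.example.gov/bulk/download"   # hypothetical
API_KEY = "YOUR_API_KEY"

def request_file(file_id, retries=3):
    query = urllib.parse.urlencode({"api_key": API_KEY, "file": file_id})
    url = f"{API_URL}?{query}"
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url).read()
        except urllib.error.HTTPError as err:
            if err.code == 503:                 # hourly rate limit exceeded
                time.sleep(60 * (attempt + 1))  # back off and retry
            else:
                raise
    raise RuntimeError("rate limit still exceeded after retries")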

3rd party applications should be able to support HTTP, REST, FTP, or SCP calls.

Software Development Kit access (Java, Python, .NET, PHP, etc.) should be allowed as well.

The file download should support multiple file requests, allow parallel downloads, handle restarting partial downloads, and govern anonymous high-volume requests.
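One hedged way to meet the restart requirement is an HTTP Range request, assuming the hosting service honors byte ranges; the URL and local path below are placeholders.

```python
# Sketch of restarting a partial download with an HTTP Range request,
# assuming the server supports byte ranges. URL and path are hypothetical.
import os
import urllib.request

URL = "https://data.example.gov/bulk/tile_0001.zip"   # hypothetical
DEST = "tile_0001.zip"

already = os.path.getsize(DEST) if os.path.exists(DEST) else 0
req = urllib.request.Request(URL, headers={"Range": f"bytes={already}-"})
with urllib.request.urlopen(req) as resp, open(DEST, "ab") as out:
    # A 206 Partial Content response means the server resumed where we left off.
    while chunk := resp.read(1 << 20):
        out.write(chunk)
```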

Peer-to-Peer solution support (i.e. such as BitTorrent) must comply with Federal Regulations.

Identify what user training is available and at what cost, and what sanctioned or third-party consultants are available for software developers.

8.    Applications – support the end-user experience for unzipping files and loading them into a geospatial database

If the user will receive multiple zipped files that require clicking each link to download, unzipping each file, and then manually loading each file into a database using the provided metadata, can this be automated?
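A minimal sketch of automating that flow, assuming the zipped files contain shapefiles and a PostGIS target is reachable via the standard ogr2ogr tool (GDAL installed); the URLs, paths, and connection string are placeholders.

```python
# Sketch of automating: download -> unzip -> load into a geospatial database.
# Assumes shapefile contents and a PostGIS target reachable by ogr2ogr;
# URLs, paths, and the connection string are hypothetical.
import subprocess
import urllib.request
import zipfile
from pathlib import Path

ZIP_URLS = ["https://data.example.gov/bulk/county_001.zip"]  # hypothetical
WORK_DIR = Path("staging")
PG_CONN = "PG:host=localhost dbname=open_data user=loader"

WORK_DIR.mkdir(exist_ok=True)
for url in ZIP_URLS:
    zip_path = WORK_DIR / Path(url).name
    urllib.request.urlretrieve(url, zip_path)          # download
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(WORK_DIR / zip_path.stem)        # unzip
    for shp in (WORK_DIR / zip_path.stem).glob("*.shp"):
        # load each layer; -append keeps adding to the same schema
        subprocess.run(["ogr2ogr", "-f", "PostgreSQL", PG_CONN, str(shp),
                        "-append"], check=True)
```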

The vendor can create premium offerings, whether accelerators, increased access, or additional formats, as part of the delivery if branded separately as a vendor product, as long as one version, clearly marked as the authoritative government version, is published and controlled by the government in its original published form.

 

To do BigData, address Data Quality – People and Processes – Tech Access to information

Blog post added by Wiki Admin

As a follow-on to the “cliffhanger” in BigData is a big deal because it can help answer questions fast, there are three top limitations right now: data quality, people and process, and tech access to information.

Let's jump right in.

Number One and by far the biggest – Data Quality

Climate change isn’t a myth, but it is the first science to ever be presented on a data premise. And in doing so, they prematurely presented models that didn’t take into account the driving variables. Their models have changed over and over again. The resolution of their source data has increased. Their simulations on top of simulations have proven countless theories across various models that can only be demonstrated simply by Hollywood blockbusters. Point being, we are dealing with inferior data for a world-scale problem, and we jump into the political, emotionally driven world with a data report? We will be the frog in slowly warming water, and we will hit that boiling point late. All because we started with a data-justification approach using low-quality data. Are they right that the world is warming? Yes. Do they have enough data to prove the right mitigation, mediation, or policy adjustments? No, and not until we either increase the data quality or take a non-data tack.

People and processes are a generation away.

Our processes in IT have been driven by Defense and GSA business models from the fifties. Put anyone managing 0s-and-1s technology in the back. They are nerds, look goofy, can’t talk, don’t understand what we actually do here, and by the way, they smell funny. That has been the approach to IT since the 50s – nothing has changed, with the exception that there are a few baker's dozens of the hoodie-wearing, Mountain Dew-drinking late-night owls who happen to be loaded now, and there is a pseudo-culture of geek chic. We have not matured our people and talent investment to balance maturity of service, data, governance, design, and product lifecycle, or to embrace that engine culture as core to the business. This means more effective information sharing processes to get the right information to the right people. This also means investing in the right skills – not just feeding Doritos and free soda to hackers – to manage the information sharing and data lifecycle. I am not as worried about this one. As the baby boomer generation retires, it will leave a massive vacuum, as Generation X is too small and we’ll have to groom Generation Y fast. That said, we will mess up a lot and lose a lot to brain drain, but the market will demand relevancy, which will, albeit slowly, create this workforce model in 10-15 years.

Access to Environments 

If you had asked this before hosting environments or the cloud, this would have been limited to massive corporations, defense, intel, and some of the academia co-investing with those groups. If you can manage the strain of shifting to a big data infrastructure, this barrier should be the least of your problems. If you can allow your staff to get the data they need at the speed they need, so they can process in parallel without long wait times, you are looking good. Get a credit card, or if in Government, buy off a cloud GWAC, and get your governance and policies moving, as they are likely behind and not ready. Left alone, they will likely prolong the siloed-information phenomenon. Focus on the I in IT, and let the CTO respond to the technology stack.

Focus on data quality, have a workforce investment plan, and continue working your information access policies

The tipping point that moves you into Big Data is where these factors combine to require you to deal with the complicated enormity at speed, answering questions beyond what MIS and reporting alone can. If you can focus on those things in that order (likely solving them in reverse), you will be able to implement parallelized data discovery.

This will shorten the distance from A to B and create new economies, new networks, and enable your customer or user base to do things they could not before. It is the train, plane, and automobile factor all over again.

And to throw in the shameless plug, this is what we do. This is Why we focus on spatial data science and Why is change so fundamental.

Xentity awarded IT IDIQ from State of Colorado

Blog post edited by Wiki Admin

The State of Colorado’s Governor’s Office of Information Technology (OIT) has awarded Xentity an IDIQ Master Agreement for business services.

This multiple award task order contract (MATOC) is the result of an award under RFP-001-JG-14 for Computer Programming, Consulting Services, and Business Services involving Cloud Solutions.

In the Fall of 2013, The State of Colorado’s Governor’s Office of Information Technology (OIT) sought proposals to identify Implementation Services (“Implementers”) for business services involving cloud solutions by Salesforce.com, Google, and Perceptive Software (Perceptive), and other emerging technologies. 

  • The award is for an Enterprise Agreement, as a multi-contract award IDIQ
  • The base period is 5 years, with 5 consecutive 1-year renewal options
  • The initial maximum contract amount/ceiling is $10 million
  • Task orders can be issued by multiple sponsoring state agencies

Xentity has previously won and supported contracts for the State of Colorado with the Department of State and has worked closely with the Office of Information Technology.

Xentity’s services can be ordered by any of the Colorado agencies via this contract.

Scope includes:

  • Task Order Technical Management
  • Agile Project Management
  • Solution Architecture
  • Architecture & Governance Support
  • Cloud Solution Development / Database Support
  • Portal & Development/Database Support
  • Application Development Support
  • Quality Assurance / Customer Support
  • Transition Support
  • Disaster Recovery/COOP Participation
  • Best Practice Group Support/Participation
  • Outreach Strategy and Support

Positions include: Project Manager, Technical Consultants, Architects, Architecture Analysts, Management Analysts, Solution Architects, Enterprise Architects, and Communications Specialists for branding, communications, design, and strategy.

More to come on how to access Xentity services off this contract.

 

When moving to the cloud, consider changing your discovery approach

Blog post edited by Wiki Admin

As we do not want to pave that cow path (see What cow paths, space shuttles, and chariots have in common or What are some patterns or anti-patterns where architecture and governance can help cover this point), we want not only to save money in moving to an IT commodity utility model, but also to consider: do we just take the MIS architecture and pattern and put it in the cloud, or do we look at new patterns, such as new search indexes, engines, or NoSQL models that allow rapid, near real-time smart discovery on the read side of the solution?

This will increase the relevancy of your data and digital assets as the market demands things that are easier, simpler, and instantly gratifying.

 

Traditional: Keyword Search Matching 

For many new large, cloud-hosted database transaction management solutions, the organization needs fast document, record, object, or content search by facets and keywords, across both metadata and full text, with a quick, pleasant experience that can handle millions of documents, authorities, and lookup lists along with thousands of monthly transactions.

Currently, the architecture clients have invested in is a model that was developed pre “big data”. These models emulate MIS form-based search by trained users, with a supporting search engine that does a full scan on any keyword, or some category or facet filtering, to return ALL matching records weighted by closest keyword match. This can handle full-text search as well as facet search, but it tends to tax computing power heavily to return accurate results, and those results will not be context-aware of popularity, typos, synonyms, etc.

New Search Architecture is not just Faster, but Smarter

NoSQL models take input from a query box as well, but NoSQL engines can have multiple index-like “signals” that the query engine looks up to infer what the user may be looking for. The search engine solution would handle, and carry an increased investment in, interpretive signals (i.e. fuzzy logic support for popular-search weighting, typo recognition, thesaurus and synonym integration, community-based signals, event/trending, business rules, profile favorite patterns, etc.). This could also include researching description framework improvements such as improved category overlap/alignment, schema.org, and a move towards RDFa.
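As a hedged illustration of such “signals”, an Elasticsearch-style query can combine fuzzy matching (typo tolerance) with a popularity boost in a single request. The index name, field names, and the popularity field below are hypothetical, and this is only one of many ways such signals can be expressed.

```python
# Sketch of a "smarter" search call against an Elasticsearch-style engine:
# fuzzy matching handles typos, and a popularity signal boosts common hits.
# Index name, field names, and the popularity field are hypothetical.
import json
import urllib.request

ES_URL = "http://localhost:9200/records/_search"   # hypothetical index

query = {
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "title": {"query": "vintge bike", "fuzziness": "AUTO"}  # typo-tolerant
                }
            },
            "field_value_factor": {"field": "popularity", "missing": 1}  # popularity signal
        }
    },
    "size": 20,
}

req = urllib.request.Request(
    ES_URL, data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"})
hits = json.loads(urllib.request.urlopen(req).read())["hits"]["hits"]
```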

When these solutions do not apply

As the Apache Hadoop project states, Hadoop (and by implication NoSQL more generally) is NOT:

  1. Apache Hadoop is not a substitute for a database – you need something on top for high-performance updates.
  2. MapReduce is not always the best algorithm – if your MR jobs need to know about the previous ones, you lose the parallelization benefits.
  3. Hadoop and MapReduce are not for beginners at Java, Linux, or error debugging – they are open source and still emerging, so many of the technologies built on top smooth this over and are worth the extra layering.

Initial newer search engines are better at metadata search, but not at full text and full results

Google solutions or open-source solutions like MongoDB are fast at addressing these “signals”, but are limited for full-text searches of extremely long documents, which is sometimes required by legal, policy, or other regulations. For instance, on a Craigslist- or Groupon-style search, a user is searching against metadata fields, e.g. “Bike Vintage” between 1950 and 1960, plus most of the text, and what is returned in milliseconds is the top few hundred results; the results are not hitting the raw data or every record, but instead hit these index-like constructs carrying the record ID. From those results, the user can then call a URL to go into transaction mode back in Oracle. If an edit is made, the NoSQL index can be triggered to update immediately, and the full-text update can be made in Oracle reasonably fast, but definitely not as fast as the NoSQL index.
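A hedged sketch of that metadata-field lookup in MongoDB terms follows: the query runs against an index and returns record IDs that the application would then use to call back into the transactional (Oracle) side. The database, collection, and field names are hypothetical.

```python
# Sketch of the metadata-field search ("Bike Vintage" between 1950 and 1960)
# served from a MongoDB index, returning record IDs for a later transactional call.
# Database, collection, and field names are hypothetical.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
listings = client.catalog.listings

# Compound index so the query is served from the index, not a full scan.
listings.create_index([("keywords", ASCENDING), ("year", ASCENDING)])

cursor = (listings
          .find({"keywords": {"$all": ["bike", "vintage"]},
                 "year": {"$gte": 1950, "$lte": 1960}},
                {"record_id": 1})          # project only the record ID
          .limit(100))
record_ids = [doc["record_id"] for doc in cursor]
# record_ids would then drive the URL call back into the transactional (Oracle) side.
```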

There are other solutions among the newer search engine technologies that can address all requirements. For example, a user may have 10,000 results and want to pull all of them into their software, move into transaction/edit mode, and commit those edits in Oracle; the NoSQL index can then be updated immediately and be available almost instantly, with full text available in Oracle or, in some search engine solutions, in the engine itself.

Exploring NoSQL and new signals will yield faster and smarter results

Point being, the improved discovery not only be faster from a query return point of view, but also by returning smarter results. This will also make the discovery process itself faster, to move the user to faster actions on their intended transactions as the search results will be more context aware of language issues, popularity, and user personalized needs.  This can be achieved by technologies such as ElasticSearch, possibly Spinx, or possibly a combination of MongoDB for fast search, and existing Oracle for full-text search.