Delivering Open Data in Bulk on the Cloud - Xentity

Xentity just finished a major project for a large national government data provider that measures its file count in the millions, its data records in the tens to hundreds of millions, and its storage in the sub-petabyte range. If you are looking to deliver your bulk data in the cloud, see the generalized, anonymized requirements below.

The requirements below cover storage, access, methods, discovery, communications, and applications. We generalized, redacted, and expanded them with an audience of open data stewards responsible for large public datasets in mind. This article provides only the business requirements; it does not consider technologies, vendors, cost models, capacity planning, and so on. That effort was completed separately, and we will expand on it in another article.

1. Data Storage – Storage Supporting File Form Factors

As background on data storage, look at free public dataset clearinghouses such as http://aws.amazon.com/publicdatasets/ or similar areas on Azure and other platforms. A useful concept here is the “form factor”, which normally means the size, configuration, and physical arrangement of a computing device. With that definition in mind, review the various form factors of “files” or “services” your project may encompass:

Gigabyte-size files
Medium-size files whose combined volume exceeds that of the gigabyte-size files
Many terabyte- or gigabyte-size files broken into medium-size files for transfer
Millions of small files, usually delivered in a buffered stream
Data-driven file delivery via services
Terabyte-size files deliverable only via sneakernet import/export
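
One way to take stock of these form factors is to bucket a file inventory by size. The sketch below is illustrative only: the thresholds, labels, and function names are assumptions made for the example, not figures from the project.

```python
from collections import Counter

# Hypothetical thresholds; tune them to your own inventory.
SMALL_MAX_BYTES = 10 * 1024**2   # up to 10 MiB counts as "small"
MEDIUM_MAX_BYTES = 1024**3       # up to 1 GiB counts as "medium"
GIGABYTE_MAX_BYTES = 1024**4     # up to 1 TiB counts as "gigabyte"


def form_factor(size_bytes: int) -> str:
    """Return an illustrative form-factor label for one file size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= SMALL_MAX_BYTES:
        return "small"
    if size_bytes <= MEDIUM_MAX_BYTES:
        return "medium"
    if size_bytes <= GIGABYTE_MAX_BYTES:
        return "gigabyte"
    return "terabyte"


def summarize(sizes):
    """Count files and total bytes per form factor."""
    counts, totals = Counter(), Counter()
    for size in sizes:
        label = form_factor(size)
        counts[label] += 1
        totals[label] += size
    return dict(counts), dict(totals)
```

Comparing the per-label totals shows, for example, whether the medium-size files together outweigh the gigabyte-size ones.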

2. User Access – Easy Access For Users To Copy Files To Target Environment

Public read-only users should not have to pay for access to the end solution (i.e. they should not need a cloud account on the hosted solution)
Internal users will require access to private directories for files that are not, or not yet, publicly released (e.g. emergency-response data, licensed data, interim work products)
Internal users will benefit from lower-latency access than public users. Candidate solutions include cached volumes, integration between on-premise IT and the cloud environment, and secure file transfer

3. Multiple Access Methods – Service, Download, Media, Cloud-to-Cloud

Users will look to have data provided in bulk in one of three ways: Web Service, Bulk Media, or Cloud to Cloud
Admins should have access to user traffic statistics for viewing, for exporting statistics logs, and for calling statistics logs via hosted applications
User pulls a directory, set of directories, set of files, or a mix through online web access via HTTP, REST, FTP, UDP, or SCP
Learn which high-performance file transfer solutions are possible, such as edge network publishing to move data closer to users, or high-performance transfer protocols such as UDT (UDP-based Data Transfer protocol)
For faster and likely larger file requests, the user requests that a directory, set of directories, set of files, or a mix be put onto a storage device by the service provider, and the device is then delivered to the user. Define minimum specifications for bulk media such as external hard drives
Users who have existing cloud storage accounts, or who have virtual machine processing points on the cloud, will either pull a directory, set of directories, set of files, or a mix, or request that the data be pushed to their cloud point
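
The choice between online download and bulk media often comes down to simple arithmetic on dataset size and link speed. The following is a rough sketch; the efficiency factor, the five-day shipping estimate, and the function names are assumptions for illustration.

```python
def transfer_days(size_terabytes: float, link_gbps: float,
                  efficiency: float = 0.8) -> float:
    """Estimate the days needed to move a dataset over a network link.

    efficiency is an assumed fraction of the nominal link rate
    that is actually achieved in practice.
    """
    if size_terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    bits = size_terabytes * 1e12 * 8                  # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # effective bits per second
    return seconds / 86_400


def suggest_method(size_terabytes: float, link_gbps: float,
                   shipping_days: float = 5.0) -> str:
    """Suggest 'online' or 'bulk media' by comparing estimated durations."""
    online_days = transfer_days(size_terabytes, link_gbps)
    return "online" if online_days <= shipping_days else "bulk media"
```

Under these assumptions, 100 TB over a 1 Gbps link takes roughly 11.6 days, so shipping a drive is likely faster, while 1 TB takes under three hours and is better served online.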

4. Discovery – increased visibility and discovery of staged products in catalogs and search engines

Data products are usually found through keyword, geospatial, or temporal product discovery applications, in which users filter their search, create an order, and download the products in small groups
Public file directory listing should be discoverable and optimized for discovery by search engines
Public collections should be discoverable and optimized for discovery by search engines
Explicitly demonstrate how bulk data registrations will be discoverable and registered in both Sciencebase.gov and data.gov
Catalogs should be able to harvest, by pull or push, the public FGDC, ISO-19115, or RDF metadata of files in the directories for transactional or bulk loading into their catalog
File directory listings can be queried via an open-standard discovery service to assist in developing a download filter list
The National Map can be discoverable in the proposed service provider catalog, but the catalog reference needs to follow the metadata provided along with each file, presenting at minimum the source, created date, updated date, title, basic description, and the provided DOI link for the file or directory
Service Provider should support calls via a Digital Object Identifier
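
The minimum catalog fields listed above can be captured in a small record type. This is a sketch under stated assumptions: the class and function names are hypothetical, the DOI in the comment is a made-up example, and the link is built with the public doi.org resolver.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List


@dataclass(frozen=True)
class CatalogRecord:
    """Minimum catalog fields named in the requirements above."""
    source: str
    created: date
    updated: date
    title: str
    description: str
    doi: str  # for example "10.1234/example" (hypothetical DOI)


def missing_fields(record: CatalogRecord) -> List[str]:
    """Return the names of text fields that are empty or blank."""
    return [
        f.name for f in fields(record)
        if isinstance(getattr(record, f.name), str)
        and not getattr(record, f.name).strip()
    ]


def doi_url(record: CatalogRecord) -> str:
    """Build a resolver link for the record's DOI via doi.org."""
    return "https://doi.org/" + record.doi.strip()
```

A catalog harvester could reject or flag any record for which `missing_fields` returns a non-empty list.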

5. Publishing – support batch file release updates for thousands of files monthly

Consider whether files within datasets will be published and updated incrementally. Doing so will require service or bulk media methods to update the datasets
Published files can be stored in their original formats
Updates occur monthly and affect, on average, no more than 10% of files or file storage
Updates to files are logged, and each logged update triggers notifications to subscribed users
File updates should record both a success status and a parity check status
Offline file transfer should support processing of delivered storage devices with clear instructions
Online uploads to the bulk download area should not incur per-storage-unit (e.g. per-gigabyte) transfer charges akin to transactional charges
Online upload should have high-performance data transfer capabilities, such as UDT (UDP-based Data Transfer protocol), for moving data between on-premise storage and the cloud
Moving from cloud to cloud, e.g. from a transactional area to a public dataset hosting area, should run at very high transmission speeds and should take location proximity into account
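
To illustrate the success and parity check status mentioned above, the sketch below verifies a published batch against a manifest of expected digests. It substitutes SHA-256 checksums for a literal parity check, and the manifest layout and function names are assumptions for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_batch(root: Path, manifest: dict) -> dict:
    """Compare files under root with their expected digests.

    manifest maps a relative path to the expected SHA-256 hex digest.
    Returns a mapping of relative path to "ok", "mismatch", or "missing".
    """
    status = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            status[rel_path] = "missing"
        elif sha256_of(target) == expected.lower():
            status[rel_path] = "ok"
        else:
            status[rel_path] = "mismatch"
    return status
```

The resulting status map can be written to the update log so that each monthly release carries a per-file verification record.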

6. Notifications – providing ways for users to subscribe to staged product files update notifications

Users can subscribe to changes to directory, sub-directory, or specific files
Users can be notified of such changes through push notifications sent per change, daily change summaries, RSS updates, or other notification techniques
Users can also use the notifications as a way to request the bulk file updates
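
Subscriptions at the directory, sub-directory, or file level can all be handled with one path-prefix check. Here is a minimal sketch, assuming POSIX-style paths and a hypothetical in-memory subscription table.

```python
from pathlib import PurePosixPath


def is_covered(subscription: str, changed_file: str) -> bool:
    """True if changed_file is the subscribed path or lies beneath it."""
    sub = PurePosixPath(subscription)
    changed = PurePosixPath(changed_file)
    return changed == sub or sub in changed.parents


def subscribers_to_notify(subscriptions: dict, changed_file: str) -> list:
    """Return the sorted users whose subscriptions cover changed_file.

    subscriptions maps a user name to a list of subscribed paths.
    """
    return sorted(
        user for user, paths in subscriptions.items()
        if any(is_covered(path, changed_file) for path in paths)
    )
```

Because the comparison works on whole path components, a subscription to `/elev` does not accidentally match files under `/elevation`.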

7. Download API – Supporting applications or including applications that help the user download in bulk

Have a download API controllable by api.data.gov, which can uniquely identify each caller by API key. Also provide HTTP access via a GET parameter in a URL query, and support an hourly limit on the number of requests based on API key settings. If the api.data.gov rate limit is exceeded, an HTTP status code of 503 should be returned
Third-party applications should be able to support HTTP, REST, FTP, or SCP calls
Software Development Kit (SDK) access (Java, Python, .NET, PHP, etc.) should be allowed as well
File download should support multiple-file requests, parallel downloads, restarting of partially completed downloads, and governing (throttling) of anonymous volume requests
Peer-to-peer solution support (e.g. BitTorrent) must comply with Federal Regulations
Identify what user training and what sanctioned or third-party consultants for software developers are available, along with their availability and cost
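
The hourly per-key limit described in the first item of this section can be sketched as a fixed-window counter. This is not how api.data.gov itself is implemented; the class name and in-memory storage are assumptions, and the sketch returns 503 because that is the code named above (many services use 429, Too Many Requests, for the same case).

```python
import time


class HourlyRateLimiter:
    """Fixed-window counter: at most `limit` requests per key per clock hour."""

    def __init__(self, limit: int, clock=time.time):
        self.limit = limit
        self.clock = clock   # injectable so tests can control time
        self.counts = {}     # (api_key, hour_index) -> requests seen

    def status_for(self, api_key: str) -> int:
        """Return 200 if the request is allowed, else 503."""
        window = (api_key, int(self.clock() // 3600))
        seen = self.counts.get(window, 0)
        if seen >= self.limit:
            return 503
        self.counts[window] = seen + 1
        return 200
```

A production version would also need to prune old windows and share counts across servers, both of which this sketch leaves out.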

8. Applications – Support the end user experience for unzip files and load into geospatial database

The user may receive multiple zipped files, which means clicking each link to download and unzip each file, and then manually loading each file into a database using the provided metadata. This workflow can be automated.
The vendor can create premium offerings, such as an accelerator, increased access, or additional formats, as part of the delivery if they are branded separately as a vendor-branded product, and as long as one version remains published and clearly marked as the authoritative government version, published and controlled by the government in its original published form.
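
The automation mentioned in the first item of this section could begin with a batch unzip step. The sketch below handles only extraction; the function name and directory layout are assumptions, and the loader for the geospatial database is omitted because it depends on the target system.

```python
import zipfile
from pathlib import Path


def unzip_all(archive_dir: Path, output_dir: Path) -> list:
    """Extract every .zip in archive_dir into its own folder under output_dir.

    Returns the extracted file paths, sorted, ready for a loader step.
    """
    extracted = []
    for archive in sorted(archive_dir.glob("*.zip")):
        target = output_dir / archive.stem   # one folder per archive
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as bundle:
            bundle.extractall(target)
        extracted.extend(p for p in target.rglob("*") if p.is_file())
    return sorted(extracted)
```

The returned list can then be passed, along with the accompanying metadata, to whatever import tool the database provides.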

The Takeaway

If you are looking to deliver your bulk data in the cloud, then hopefully this article has given you a set of requirements to assist in that process. Again, they cover storage, access, methods, discovery, communications, and applications. This list, when combined with the selection of technologies, vendors, cost models, and capacity planning, can help provide the knowledge you need to make this transition a smooth one. Xentity is more than happy to provide any kind of data-related help with that transition.