We just finished some work for a large national government data provider that measures its files in the millions, records in the tens to hundreds of millions, and storage in the sub-petabyte range. Below are the obfuscated general requirements for anyone looking to deliver bulk data in the cloud: storage, access, methods, discovery, communications, and applications.
These requirements have been generalized, redacted, or in some cases added to, so that any government open data program with large public datasets can consider them. They cover only the business requirements; technologies, vendors, cost models, capacity planning, and the like were addressed separately.
1. Storage – storage supporting file form factors including:
Investigate the free public dataset clearinghouse areas such as http://aws.amazon.com/publicdatasets/ or the equivalents on Azure, etc.
Consider various form factors of files or services
- Gigabyte Size Files
- Medium Size Files, but totals more than Gigabyte Size Files
- Many terabyte or gigabyte files that have been broken into medium-sized files for transfer (see the sketch after this list)
- Millions of small files, usually delivered in a buffered stream
- Data-driven file delivery via services
- Terabyte Files only deliverable via Sneakernet Import/Export
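For the third form factor above, here is a minimal sketch of how a provider might break a very large file into medium-sized parts with a per-part checksum, so integrity can be verified after transfer (this ties into the parity-check requirement in the publishing section). The part size and file name are assumptions for illustration, not part of the original requirements.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 512 * 1024 ** 2  # 512 MiB per part; a hypothetical "medium" size


def split_with_checksums(src: Path, out_dir: Path) -> list[tuple[str, str]]:
    """Split a large file into fixed-size parts and record a SHA-256 per part."""
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    with src.open("rb") as f:
        part_num = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            part_name = f"{src.name}.part{part_num:04d}"
            (out_dir / part_name).write_bytes(chunk)
            manifest.append((part_name, hashlib.sha256(chunk).hexdigest()))
            part_num += 1
    return manifest


if __name__ == "__main__":
    # "elevation_tiles.gdb.zip" is a hypothetical source file name
    for name, digest in split_with_checksums(Path("elevation_tiles.gdb.zip"), Path("parts")):
        print(name, digest)
```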
2. User Access – easy access for users to copy files to target environment
Public read-only users should not be required to pay for access to the end solution (i.e., they should not need a cloud account on the hosted solution).
Internal users will require access to private directories for files that are not, or not yet, publicly released (e.g., emergency response data, licensed data, interim work products).
Internal users will benefit from lower-latency access than public users, through solutions such as cached volumes, integration between on-premise IT and the cloud environment, and secure file transfer.
3. Multiple Access Methods – Service, Download, Media, Cloud-to-Cloud
Users will look to have data provided in bulk in one of three ways: web service, bulk media, or cloud to cloud.
Admins should have access to user traffic statistics for viewing, for exporting statistics logs, and for calling statistics logs via hosted applications.
Users pull a directory, set of directories, set of files, or a mix via online web access over HTTP, REST, FTP, UDP, or SCP.
Learn what high-performance file transfer solutions are possible, such as edge-network publishing to move data closer to users, or protocols such as UDT (UDP-based Data Transfer protocol).
For faster and typically larger file requests, the user requests a directory, set of directories, set of files, or a mix to be put onto a storage device by the service provider, and the device is delivered back to the user. Bulk media minimum specifications for external hard drives should be defined.
Users who have existing cloud storage accounts, or who have virtual machine processing points in the cloud, will make requests or pull a directory, set of directories, set of files, or a mix, with the data pushed to the user's cloud point.
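As a concrete illustration of the cloud-to-cloud option, here is a minimal sketch assuming the provider hosts on AWS S3 (the post only mentions the AWS public dataset program as one possibility); the bucket names and prefix are hypothetical. A server-side copy keeps the data inside the cloud provider's network, which is what makes this path fast.

```python
import boto3  # assumes an AWS-hosted solution; names below are hypothetical

s3 = boto3.client("s3")


def copy_prefix(src_bucket: str, dst_bucket: str, prefix: str) -> None:
    """Server-side copy of every object under a prefix into the user's bucket,
    so the data never leaves the cloud provider's network."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy({"Bucket": src_bucket, "Key": obj["Key"]}, dst_bucket, obj["Key"])


# Example: stage a public collection into the user's own analysis bucket
copy_prefix("agency-public-datasets", "user-analysis-bucket", "national-map/elevation/")
```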
4. Discovery – increased visibility and discovery of staged products in catalogs and search engines
Data products are usually downloaded via keyword, geospatial, or temporal product discovery applications: users filter their search, create an order, and download the products in small groups.
Public file directory listing should be discoverable and optimized for discovery by search engines
Public collections should be discoverable and optimized for discovery by search engines
Explicitly demonstrate how bulk data registrations will be discoverable and registered in both Sciencebase.gov and data.gov
Catalogs should be able to pull- or push-harvest public FGDC, ISO-19115, or RDF metadata for files in the directories, for transactional or bulk loading into their catalog.
File directory listings can be queried via an open-standard discovery service to assist in developing a download filter list (see the sketch after this section).
The National Map can be discoverable in the proposed service provider's catalog, but the catalog reference needs to follow the metadata provided with each file, at minimum presenting source, created date, updated date, title, basic description, and the provided DOI link for the file or directory.
The service provider should support being called via a Digital Object Identifier (DOI).
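To make the discovery-service requirement concrete, here is a small sketch that queries the data.gov CKAN catalog (data.gov is named above as a registration target) to assemble a candidate download list. The keyword and the way results are reduced to titles and resource URLs are illustrative, not a mandated interface.

```python
import requests

CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"


def find_bulk_datasets(keyword: str, rows: int = 10) -> list[dict]:
    """Query the data.gov CKAN catalog and return title plus resource URLs for matches."""
    resp = requests.get(CKAN_SEARCH, params={"q": keyword, "rows": rows}, timeout=30)
    resp.raise_for_status()
    results = resp.json()["result"]["results"]
    return [
        {
            "title": ds["title"],
            "resources": [r.get("url") for r in ds.get("resources", [])],
        }
        for ds in results
    ]


for hit in find_bulk_datasets("national map elevation"):
    print(hit["title"], len(hit["resources"]), "resources")
```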
5. Publishing – support batch file release updates for thousands of files monthly.
Consider whether files within datasets will be published and updated incrementally, which will require service or bulk media methods to update the datasets.
Files published will be stored in original formats.
Updates are expected monthly, affecting on average no more than 10% of files or file storage.
Updates to files should be logged to trigger notifications to subscribed users.
File updates should be able to maintain success and parity check status.
Offline File transfer should support processing of delivered storage devices with clear instructions
Online upload transfer per storage unit (i.e., per gigabyte) should not incur transfer charges akin to the transactional charges of the bulk download area.
Online upload should have high-performance data transfer capabilities, such as UDT (UDP-based Data Transfer protocol), for moving data between on-premise storage and the cloud.
Moving from cloud to cloud (i.e., from a transactional area to the public dataset hosting area) should achieve very high transmission speeds and should consider location proximity issues.
6. Notifications – providing ways for users to subscribe to staged product files update notifications
Users can subscribe to changes to directory, sub-directory, or specific files
Users can be notified of such changes via push notifications, whether per change, as daily digests, RSS updates, or other notification techniques.
Users can use the notifications as ways to request the bulk file updates
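One way the update-logging and subscription requirements could fit together is to diff monthly file manifests and turn each difference into a notification item (per-change pushes, a daily digest, or RSS entries). The CSV manifest format below is an assumption for illustration.

```python
import csv
from pathlib import Path


def load_manifest(path: Path) -> dict[str, str]:
    """Manifest format assumed here: CSV with 'path' and 'sha256' columns."""
    with path.open() as f:
        return {row["path"]: row["sha256"] for row in csv.DictReader(f)}


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify changes between two monthly manifests for subscriber notifications."""
    return {
        "added":   [p for p in new if p not in old],
        "removed": [p for p in old if p not in new],
        "updated": [p for p in new if p in old and new[p] != old[p]],
    }


changes = diff_manifests(load_manifest(Path("manifest_2014_01.csv")),
                         load_manifest(Path("manifest_2014_02.csv")))
for kind, paths in changes.items():
    for p in paths:
        print(f"{kind}: {p}")  # each line could become an RSS item or push notification
```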
7. Download API – Supporting applications or including applications that help the user download in bulk
Have a download API that can be controlled by api.data.gov, which can uniquely identify requests, provide HTTP access via a GET parameter in a URL query, and support an hourly limit on the number of requests based on API key settings. If the api.data.gov rate limit is exceeded, an HTTP status code of 503 should be returned (see the sketch after this section).
3rd party applications should be able to support HTTP, REST, FTP, or SCP calls.
Software Development Kit access (Java, Python, .NET, PHP, etc.) should be allowable as well.
The file download should support multiple-file requests, allow for parallel downloads, handle restarting partial downloads, and govern anonymous volume requests.
Peer-to-Peer solution support (i.e. such as BitTorrent) must comply with Federal Regulations.
Identify what user training and what sanctioned or third-party consulting is available for software developers, along with its availability and cost.
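A minimal client-side sketch of the download API behavior described above: an api.data.gov-style API key, a back-off when the 503 rate-limit status is returned, and resumable partial downloads via HTTP Range requests. The endpoint path and file identifier scheme are hypothetical.

```python
import os
import time

import requests

API_BASE = "https://api.data.gov/example-agency/bulk"  # hypothetical endpoint path
API_KEY = os.environ["DATA_GOV_API_KEY"]


def download_with_resume(file_id: str, dest: str, max_retries: int = 5) -> None:
    """Download one file through the rate-limited API, resuming partial transfers
    and backing off when the 503 rate-limit status described above is returned."""
    for attempt in range(max_retries):
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        headers = {"Range": f"bytes={offset}-"} if offset else {}
        resp = requests.get(
            f"{API_BASE}/files/{file_id}",
            params={"api_key": API_KEY},
            headers=headers,
            stream=True,
            timeout=60,
        )
        if resp.status_code == 503:  # rate limit exceeded: back off and retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        mode = "ab" if offset and resp.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        return
    raise RuntimeError(f"Gave up downloading {file_id} after {max_retries} attempts")
```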
8. Applications – support the end-user experience for unzipping files and loading them into a geospatial database.
If the user will receive multiple zipped files, each requiring the user to click a link to download, unzip the file, and then manually load it into a database using the provided metadata, can this be automated? (A sketch of this automation follows this section.)
The vendor can create premium offerings, whether accelerators, increased access, or additional formats, as part of the delivery, provided they are branded separately as vendor products and at least one version is published that is clearly marked as the authoritative government version, published and controlled by the government in its original published form.
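Here is a sketch of the automation asked about above: download each zipped product, unzip it, and load any shapefiles into a geospatial database. The product URLs, the local "geodata" PostGIS database, and the use of GDAL's ogr2ogr tool are assumptions for illustration, not part of the stated requirements.

```python
import subprocess
import zipfile
from pathlib import Path

import requests

# Hypothetical list of zipped products the user would otherwise click one by one
PRODUCT_URLS = [
    "https://downloads.example.gov/hydrography/HU8_1019.zip",
    "https://downloads.example.gov/hydrography/HU8_1020.zip",
]
WORK_DIR = Path("staging")


def fetch_and_load(url: str) -> None:
    """Download one zipped product, unzip it, and load any shapefiles into PostGIS.
    Assumes GDAL's ogr2ogr is installed and a local 'geodata' database exists."""
    WORK_DIR.mkdir(exist_ok=True)
    archive = WORK_DIR / url.rsplit("/", 1)[-1]
    archive.write_bytes(requests.get(url, timeout=120).content)
    extract_dir = WORK_DIR / archive.stem
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(extract_dir)
    for shp in extract_dir.rglob("*.shp"):
        subprocess.run(
            ["ogr2ogr", "-f", "PostgreSQL", "PG:dbname=geodata", str(shp)],
            check=True,
        )


for url in PRODUCT_URLS:
    fetch_and_load(url)
```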
If you go to any data catalog (academic publication catalogs, government agency open data clearinghouses, federated catalogs, marketing lists, metadata search sites, or even popular sites), most actually have a lot of great data, but it is extremely hard to make sure you are pulling down data that is actually informative, that is truly information, without spending an increasing amount of time.
So, we are on this great open data train. The phrase du jour is "too much information"; I say "too much data." Data is different from information:
Data are values or sets of values representing a specific concept or concepts. Data become “information” when analyzed and possibly combined with other data in order to extract meaning, and to provide context. The meaning of data can vary according to its context (Source: Federal Enterprise Architecture Data Reference Model).
These sites are more like an eHoarder of data hoping to become an information destination. There is a lot of junk, while at the same time it all started because some things had value, and, well, we are losing perspective.
I think part of the problem is that the metadata is bad. But even when it is good, it sits beside data that is bad. The internet “click” folks rely on this and hijack data discovery on search engines for this exact purpose. They hijack typo’d web site names like netfix.com or the like. They hijack keywords. They manipulate with SEO techniques to get their sites higher on search engines.
In closed communities, it’s not intentional manipulation, but there is a lack of incentive to fix discovery.
What are ways we can fix our open catalogs? Here are ten ideas:
- Make searching more fun – Take facets, as in tools like CKAN, and do more along the lines of kayak.com's jQuery filters: time-based controls and charts that pop up with the context of record counts. Look, Kayak is a site scraper hitting APIs and simply re-presenting the results in simple ways, then collecting referral fees (in a nutshell). All because they made it easy and, more important, made travel searching sort of fun.
- Make separate search components from your WebMIS – Stop fronting MIS systems with advanced-form search engines. Keep that if you are required to, or need it for your 5% of users, but instead build a SOLR, NoSQL, or similarly fast search that lets you build in search signals as you get data on users. NodeJS feeds to the search database/index are fast, and millisecond updates are fast enough for 99% of cases (see the indexing sketch after this list).
- Use enterprise search instead of rolling your own – Try to take the search functions of standalone sites in your organization and make them an enterprise service, where each standalone group can control or have input on its search signals.
- Feed schema.org for SEO with a virtual library card – Beyond traditional SEO tuning, broker relationships with, or invest in patterns for, search engines like Google so they can build good signals/rules on top of your data. Do this by putting schema.org tags on your data, which can be extracted from your inputted data.
- Register to be harvested – Get registered on multiple harvesting sites, as they may find ways to make your data more discoverable; when users find you there, they see the details on your site, or your site pushes those details as well. The point being, it is still authoritative.
- Crowdsource and gamify search-signal tuning – Can we get crowdsourcing going to dogfood site usage and help build better search-signal rules? The crowd could come from your own organization with corporate awards, from gamification, or from true external stakeholders. Bonus: more student power – can we get STEM or university systems involved as part of curricula and projects, since a lot of search-signal improvement is really about person-power or machine-to-machine power?
- Make events to force data wrangling – In Colorado, we (our team doing the data side) just did gocode.colorado.gov as a way to get application developers to build apps off OpenData Colorado. The reward was essentially a reverse contract, which made it legal to give a monetary award, create various set-asides, and incent usage. That usage in turn created more opportunities for exposure as a time-based event, which got data suppliers more engaged in putting things up.
- Find ways to share signals? This is more of a speculative theory, but could we feed things like Watson, Google, etc. to build a brain of search patterns, tell it our audience differences by having it scan our data, make some stereo-equalizer-style tweaks, and figure out which rule expression patterns to take from and share as signal libraries?
- Learn more about what our librarians do. Look, our librarians 20 years ago did more than put books back on shelves and give you mean looks on late returns. They also managed what went into the library, helped with complicated inquiries to find information, and even helped curate across other libraries. Our network of organic, capital- or public-sector-driven meta-sites grew out of computer science and IT, not library science. We need to get computer science/MIS/IT and library science to start dating again. Get to know each other again. Remember the good times when we used to be able to find things, and help each other out.
- Can we score OpenData sites? We have watchdogs on making data open, which is great. This helps make sure organizations provide what they are supposed to provide and keep things openGov. But this approach would focus more on scoring the reality of discovering what has been provided. For example, we know the lawyer trick when one side wants to cause problems with discovery: they provide the opposing side with so much information that they are inundated and there is not enough time to do discovery, and yadda, yadda, legal gamesmanship. Can we find ways to score or watchdog sites on data discovery, either as part of transparency or as a different type of consumer report?
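On the "separate search component" idea above, here is a minimal sketch of feeding catalog records into a standalone Solr index over its JSON update API, so search signals (boosts, facets, click data) can be tuned without touching the WebMIS. The post suggests NodeJS feeds; Python is used here only for consistency with the other sketches, and the Solr core name and field names are assumptions.

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/catalog/update"  # core name 'catalog' is hypothetical


def index_records(records: list[dict]) -> None:
    """Push catalog records into a standalone Solr index via the JSON update API."""
    resp = requests.post(
        SOLR_UPDATE,
        params={"commit": "true"},
        json=records,
        timeout=30,
    )
    resp.raise_for_status()


index_records([
    {
        "id": "nhd-hu8-1019",
        "title": "National Hydrography Dataset, HU8 1019",
        "keywords": ["hydrography", "rivers", "Colorado"],
        "updated": "2014-02-01T00:00:00Z",
    },
])
```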
Xentity was recognized on CIO Review's “20 Most Promising Government Technology Solution and Consulting Providers 2013” list.
With the advent of internet technologies, the landscape of business processes related to the Federal Government has changed. But the change hasn't been easy, as it requires constant dedication to move the entire workforce from traditional systems and to get them to adapt seamlessly to modern systems. This transition also includes the role of technology consulting providers, whose sole responsibility is to provide a wide spectrum of services to help federal agencies cope with the changes in the best possible manner.
As customers and business partners increasingly demand greater empowerment, it is imperative for government organizations to seek improved interactions and relationships across their entire business ecosystems by enhancing software capabilities for collaboration, gaining deeper customer and market insight, and improving process management.
In the last few months, we have looked at hundreds of solution providers and consulting companies and shortlisted the ones at the forefront of tackling challenges related to the government industry.
In our selection we looked at the vendor’s capability to fulfill the needs of government companies through the supply of a variety of services that support core business processes of all government verticals, including innovation areas related to advanced technologies and smart customer management. We also looked at the service providers’ capabilities related to the deployment of cloud, Big Data and analytics, mobility, and social media in the specific context of the government business.
We also evaluated the vendors' support for government in bridging the gap between IT and Operational Technology. We present to you CIOReview's 20 Most Promising Government Technology Solution and Consulting Providers 2013.
CIO Review Magazine Full Article on Xentity:
Xentity Corporation: Rapidly Designing The Needed Change In Cost-Cutting Times
By Benita M
Friday, December 6, 2013
“We always try to believe that leaders want to execute positive change and can overcome the broken system. We are just that naïve,” says Matt Tricomi, Founder of Xentity Corporation in Golden, CO, named for “change your entity,” which started on this premise just after 9/11 in 2001. “This desire started in 1999. I was lucky enough to be solution architect on the award-winning re-architecture of united.com. It was a major revenue shift from paper to e-ticket, but the rollout included introducing kiosks to airports. Now that was both simple and impactful.” Xentity found its niche in providing these types of transformations in information lifecycle solutions. Xentity started slow, at first providing embedded CIO and Chief Architect leadership for medium to large commercial organizations.
Xentity progressed, in 2003, into supporting the Federal Government, and soon thereafter international clients, to help IT move from the 40-year-old cost center model to where the commercial world had successfully transitioned: to a service center. “Our first Federal engagement was serendipitous. Our staff was core support on the Department of the Interior (DOI) Enterprise Architecture team,” Matt recalls of how the program went from “worst to first” after over $65 million in cuts. “We wanted to help turn architecture on its head by focusing on business areas, mission, or segments at a time, rather than attack the entire enterprise from an IT-first perspective.” The business transformation approach developed ultimately resulted in being adopted as the centerpiece, or core, of the OMB Federal Segment Architecture Methodology (FSAM) in 2008.
Xentity focuses on the rapid and strategic design, planning, and transformation outreach portion of the technology investment in programs or CIO services. This upfront portion is generally 5 to 10 percent of overall IT spending. Xentity helps address the near-term cost-cutting need while introducing the right multi-year operating concepts and shifts that take advantage of disruptions like geospatial, cloud, big data, data supply chain, visualization, and knowledge transfer. Xentity helped data.gov overcome an eighty percent budget cut this way. “Healthcare.gov is an unfortunate classic example. If acquisition teams had access to experts to help register risks early on, the procurement could have increased the technically acceptable threshold for success.”
One success story of Xentity is at United States Geological Survey (USGS). “After completing the DOI Geospatial Services Blueprint, one of several, the first program to be addressed was the largest: USGS National Mapping Program.” This very respected and proud 125-year old program had just been through major reductions in force, and was just trying to catch its breath. “The nation needs this program. The blueprint cited studies in which spending $1 on common “geo” data can spur $8 to $16 in economic development. Google Maps is one of thousands which use this data.” The challenge was to transition a paper map production program to be a data product and delivery services provider. “The effort affected program planning, data lifecycle, new delivery and service models, and road-mapping the technology and human resource plan. We did architecture, PMO, governance, planning, BPR, branding, etc.” Xentity, with its respected TV production capability, even supported high-gloss video production to deal with travel reduction and support communicating the program value and changes with partners and the new administration. This is definitely different than most technology firms. The National Map got back on the radar, increased usage significantly, and is expanding into more needed open data.
Presently, Xentity is a certified 8(a) small disadvantaged business with multiple GSA Schedules and GWACs (Government Wide Acquisition Contracts). Xentity invested heavily in Federal Business management. Part of providing innovative, pragmatic, and rapid architecture and embedding talent is being able to respond quickly with compliant business management vehicles. Xentity is constantly seeking out the passionate CIOs, Program Directors, Architects, and Managers looking at transformation in this cost-cutting environment. “Sequester, Fiscal Cliff, debt ceiling, continuing resolutions–it’s all tying the hands of the executives who can look at best six months out. They don’t have the time to both re-budget and rapidly design multi-year scenarios to out-year performance drivers and options let alone staff up to speed on the latest disruptions or right innovation. That is where we come in. We start small or as fast or slow as the executive wants or believes their organization can absorb and progress.”