Delivering Open Data in Bulk on the Cloud

Blog post
added by
Wiki Admin

We just finished some work for a large national government data provider that measures its files in the millions, its records in the tens to hundreds of millions, and its storage in the sub-petabyte range. Below are the obfuscated, generalized requirements to consider if you are looking to deliver your bulk data in the cloud: storage, user access, access methods, discovery, publishing, notifications, download APIs, and applications.

These requirements have been generalized, redacted, or in some cases added to, so that any government open data program delivering large public datasets can consider them. These are simply the business requirements; technologies, vendors, cost models, capacity planning, etc. were addressed separately.

1.    Storage – storage supporting a range of file form factors

Investigate the free public dataset clearinghouse areas, such as http://aws.amazon.com/publicdatasets/ on AWS, the equivalent on Azure, etc.

Consider the various form factors of files or services:

  • Gigabyte-size files
  • Medium-size files whose totals exceed the gigabyte-size files
  • Many terabyte- or gigabyte-size files that have been broken into medium files for transfer
  • Millions of small files, usually delivered in a buffered stream
  • Data-driven file delivery via services
  • Terabyte files only deliverable via sneakernet import/export

2.    User Access – easy access for users to copy files to a target environment

Public read-only users should not be required to pay for access to the end solution (i.e. users should not need a cloud account on the hosted solution).

Internal users will require access to private directories for files that are not, or not yet, publicly released (e.g. emergency response data, licensed data, interim work products).

Internal users will benefit from lower-latency access than public users, via solutions such as cached volumes, integration between on-premise IT and the cloud environment, and secure file transfer.

3.    Multiple Access Methods – Service, Download, Media, Cloud-to-Cloud

Users will look to have data provided in bulk in one of three ways: web service, bulk media, or cloud-to-cloud.

Admins should have access to user traffic statistics for viewing, exporting statistics logs, and calling statistics logs via hosted applications.

Users pull a directory, a set of directories, a set of files, or a mix via online web access over HTTP, REST, FTP, UDP, or SCP.

Learn what high-performance file transfer solutions are possible, such as edge network publishing to move data closer to users, or protocols such as UDT (UDP-based Data Transfer protocol).

For faster and likely larger file requests, the user requests a directory, a set of directories, a set of files, or a mix to be put onto a storage device by the service provider, and the device is delivered back to the user. Minimum bulk media specifications for external hard drives should be defined.

Users who have existing cloud storage accounts, or who have virtual machine processing points in the cloud, will make requests or will pull a directory, a set of directories, a set of files, or a mix, with the data pushed to the user's cloud point.

4.    Discovery – increased visibility and discovery of staged products in catalogs and search engines

Data products are usually downloaded via keyword, geospatial, or temporal product discovery applications: users filter their search, create an order, and download the products in small groups.

Public file directory listing should be discoverable and optimized for discovery by search engines

Public collections should be discoverable and optimized for discovery by search engines

Explicitly demonstrate how bulk data registrations will be discoverable and registered in both Sciencebase.gov and data.gov

Catalogs should be able to pull- or push-harvest public FGDC, ISO-19115, or RDF metadata for the files in the directories, for transactional or bulk loading into their catalog.
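To make the harvesting idea concrete, here is a minimal sketch that pulls a single ISO-19115 (19139 XML encoding) metadata record and extracts a title and abstract for catalog loading. The URL and the simplified element lookups are illustrative assumptions, not the provider's actual layout.

```python
# Minimal sketch: harvest one ISO-19115/19139 metadata record and pull out
# a few catalog fields. The metadata URL is a hypothetical placeholder.
import urllib.request
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

METADATA_URL = "https://example.gov/bulk/dataset1/metadata.xml"  # hypothetical

with urllib.request.urlopen(METADATA_URL) as resp:
    root = ET.fromstring(resp.read())

# In the 19139 encoding, the citation title and abstract are the usual
# catalog fields; this grabs the first occurrence of each.
title = root.find(".//gmd:title/gco:CharacterString", NS)
abstract = root.find(".//gmd:abstract/gco:CharacterString", NS)

print("Title:   ", title.text if title is not None else "(not found)")
print("Abstract:", abstract.text if abstract is not None else "(not found)")
```

A harvester would loop this over the public directory listing and upsert the extracted fields into the catalog, in bulk or per transaction.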

The file directory listing can be queried via an open-standard discovery service to assist in developing a download filter list.

The National Map can be discoverable in the proposed service provider's catalog, but the catalog reference needs to follow the metadata provided along with each file, presenting at minimum the source, created date, updated date, title, basic description, and the provided DOI link for the file or directory.

The service provider should be able to support being called via a Digital Object Identifier (DOI).
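For illustration, here is a minimal sketch of what calling the hosted data via a DOI might look like on the client side, assuming the provider registers DOIs that resolve through doi.org; the DOI shown is a hypothetical placeholder.

```python
# Minimal sketch: resolve a dataset DOI to its current landing/download URL
# by following the doi.org redirect. The DOI is a hypothetical placeholder.
import urllib.request

doi = "10.5066/EXAMPLE"  # hypothetical DOI for illustration
req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")

with urllib.request.urlopen(req) as resp:
    # urlopen follows redirects, so geturl() is the resolved location,
    # which stays stable for users even if the hosting provider changes.
    print("DOI resolves to:", resp.geturl())
```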

5.    Publishing – support batch file release updates for thousands of files monthly.

Consider whether files within datasets will be published and updated incrementally, which will require service or bulk media methods to update the datasets.

Files published will be stored in original formats.

Updates are expected monthly, affecting on average no more than 10% of files or file storage.

Updates to files should be logged to trigger notifications to subscribed users.

File updates should be able to maintain a success and parity (checksum) status.

Offline file transfer should support processing of delivered storage devices, with clear instructions.

Online upload transfer per storage unit (i.e. per gigabyte) should not incur transfer charges akin to the transactional charges of the bulk download area.

Online upload should have high-performance data transfer capabilities, such as UDT (UDP-based Data Transfer protocol), between on-premise data and the cloud.

Moving from cloud to cloud, i.e. from a transactional area to the public dataset hosting area, should have very high transmission speeds, and location proximity issues should be considered.
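As one hedged example of the cloud-to-cloud case: if both the transactional area and the public dataset hosting area happened to be S3 buckets, a server-side copy keeps the transfer entirely inside the cloud. The bucket and key names below are hypothetical placeholders.

```python
# Minimal sketch: server-side cloud-to-cloud copy, assuming both areas are
# S3 buckets in this example. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

source = {"Bucket": "agency-transactional-staging", "Key": "elevation/tile_001.tif"}

# copy() performs a managed, in-cloud transfer (multipart when needed),
# so the data never round-trips through the machine running this script.
s3.copy(source, "agency-public-datasets", "elevation/tile_001.tif")
```

Keeping source and destination in the same region (or at least the same provider) is what makes the proximity concern above mostly disappear.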

6.    Notifications – providing ways for users to subscribe to update notifications for staged product files

Users can subscribe to changes to a directory, a sub-directory, or specific files.

Users can be notified of such changes via push notifications delivered per change, as daily digests, as RSS updates, or through other notification techniques.

Users can use the notifications to trigger requests for the bulk file updates.
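A minimal sketch of the subscriber side, assuming the provider exposes updates as an RSS feed (one of the notification techniques listed above); the feed URL is a hypothetical placeholder.

```python
# Minimal sketch: poll an RSS feed of file-update notifications and surface
# new entries since the last check. The feed URL is a hypothetical placeholder.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.gov/bulk/elevation/updates.rss"  # hypothetical
seen_links = set()  # in practice, persist this between runs

def check_feed():
    with urllib.request.urlopen(FEED_URL) as resp:
        root = ET.fromstring(resp.read())
    # Standard RSS 2.0 layout: channel/item with title and link children.
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link and link not in seen_links:
            seen_links.add(link)
            print(f"Updated: {title} -> {link}")  # could queue a bulk re-download here

check_feed()
```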

7.    Download API – supporting or including applications that help the user download in bulk

Provide a download API that can be fronted by api.data.gov, which can uniquely identify callers, provide HTTP access via GET parameters in a URL query, and support an hourly limit on the number of requests based on API key settings. If the api.data.gov rate limit is exceeded, an HTTP status code of 503 should be returned.
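A minimal sketch of what a client call through such a gateway might look like, including backing off on the 503 rate-limit response described above; the endpoint, parameter names, and key are hypothetical placeholders, not api.data.gov's actual contract for this service.

```python
# Minimal sketch: call a download API fronted by an api.data.gov-style gateway,
# passing an API key and backing off when the hourly rate limit (HTTP 503) is hit.
# The endpoint, parameters, and key are hypothetical placeholders.
import time
import urllib.error
import urllib.parse
import urllib.request

API_KEY = "DEMO_KEY"  # hypothetical key for illustration
BASE_URL = "https://api.data.gov/example-agency/bulk/v1/files"  # hypothetical

def fetch_file_listing(directory, retries=3):
    query = urllib.parse.urlencode({"directory": directory, "api_key": API_KEY})
    url = f"{BASE_URL}?{query}"
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 503 and attempt < retries - 1:
                # Rate limit exceeded per the requirement above: back off and retry.
                time.sleep(60 * (attempt + 1))
            else:
                raise

listing = fetch_file_listing("elevation/1m")
print(len(listing), "bytes of listing returned")
```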

Third-party applications should be able to support HTTP, REST, FTP, or SCP calls.

Software Development Kit access (Java, Python, .NET, PHP, etc.) should be allowed as well.

The file download should be able to support multiple file requests, allow for parallel downloads, handle restarting partial download requests, and govern anonymous volume requests.
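To illustrate the restart requirement, here is a minimal sketch that resumes a partial download with an HTTP Range request, assuming the hosting solution honors Range headers; the file URL is a hypothetical placeholder.

```python
# Minimal sketch: resume a partial bulk download using an HTTP Range request.
# The file URL is a hypothetical placeholder; the server must support Range,
# otherwise it will return the whole file again.
import os
import urllib.request

URL = "https://example.gov/bulk/elevation/tile_001.zip"  # hypothetical
DEST = "tile_001.zip"

existing = os.path.getsize(DEST) if os.path.exists(DEST) else 0
req = urllib.request.Request(URL)
if existing:
    # Ask the server for only the bytes we are missing.
    req.add_header("Range", f"bytes={existing}-")

with urllib.request.urlopen(req) as resp, open(DEST, "ab") as out:
    while chunk := resp.read(1024 * 1024):
        out.write(chunk)

print("Downloaded", os.path.getsize(DEST), "bytes total")
```

Parallel downloads are the same idea run over several files (or several ranges of one large file) at once, with the anonymous-volume governor enforced server-side.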

Peer-to-peer solution support (e.g. BitTorrent) must comply with Federal regulations.

Identify what user training and what sanctioned or third-party consulting for software developers is available, along with its availability and cost.

8.    Applications – support the end-user experience for unzipping files and loading them into a geospatial database

If the user receives multiple zipped files that require clicking each link to download, unzipping each file, and then manually loading each file into a database using the provided metadata, can this be automated?
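A minimal sketch of what that automation could look like, assuming the files are zipped shapefiles and GDAL's ogr2ogr is available to load them into PostGIS; the URLs, paths, and connection string are hypothetical placeholders.

```python
# Minimal sketch of automating the click-download-unzip-load cycle described
# above: fetch each zipped file, extract it, and load the shapefile into a
# PostGIS database with GDAL's ogr2ogr. All names here are hypothetical.
import subprocess
import urllib.request
import zipfile
from pathlib import Path

FILE_URLS = [
    "https://example.gov/bulk/hydro/watershed_01.zip",  # hypothetical
    "https://example.gov/bulk/hydro/watershed_02.zip",  # hypothetical
]
WORKDIR = Path("downloads")
PG_CONN = "PG:dbname=gisdb user=loader"  # hypothetical connection string

WORKDIR.mkdir(exist_ok=True)
for url in FILE_URLS:
    archive = WORKDIR / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)           # download
    extract_dir = WORKDIR / archive.stem
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(extract_dir)                      # unzip
    for shp in extract_dir.glob("*.shp"):
        # Load each shapefile into a PostGIS table named after the file.
        subprocess.run(
            ["ogr2ogr", "-f", "PostgreSQL", PG_CONN, str(shp), "-nln", shp.stem],
            check=True,
        )
```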

The vendor can create premium offerings, whether accelerators, increased access, or additional formats, as part of the delivery, provided they are branded separately as a vendor product and there is one version published and clearly marked as the authoritative government version, published and controlled by the government in its original published form.

 

I always wanted to be a data janitor

Blog post
edited by
Wiki Admin

You know the "data wrangling" field is becoming more mainstream when the NYTimes is covering it at this level: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights (NYTimes.com).

The article emphasizes the point that the amount of time to get the data – sensors, web services feeds, corporate databases, smartphone data observations – prepared and ready to be consumed is still a huge effort. 
"Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

It hits home the point we love to belabor and that the software companies do not: be it ERP, GIS, MIS, or analytics, they show great demos, but with data already prepared, or with some of your data, but only the easy stuff. It works great in their environment, and as long as you give it good data, it performs and rocks! But the demos continually understate what it takes to wrangle, clean up, and get that data prepared.
As a huge proponent of the view that good data is a, or the, major problem or barrier, it's nice to see investment beginning to move into startups that help – ClearStory, Trifacta, Paxata, and other start-ups in the field. In the meantime, we need to make sure we always stay on top of the best, approved approaches, and the team will bring them to the table, using various techniques: from browser-based network applications in NodeJS, to NoSQL databases, to ETL, to simply keeping your Excel skills up.
Poking at the O'Reilly book “Doing Data Science” and its discussions of data wrangling/janitor work, it's not a bad read to pick up. A great quote in the NYT article from the Doing Data Science author is:
“You prepared your data for a certain purpose, but then you learn something new, and the purpose changes,”

But the article does tend to focus on big data, and not big data in an open data world.

Even if the data is scrubbed for machine-to-machine processing, as the article emphasizes, it still doesn't address the fact that, with the push for data catalogs, data curators and creators HATE – with a passion – metadata creation. It's one more burdensome step at the end of a long cycle. Furthermore, there is currently a major lack of true incentive, other than it being the right thing to do, to assure the data is tagged properly.

Let's take the metadata needed to tag data for proper use, with a real-world example recently discussed with a biological data analytics expert.

A museum wants to get some biological data to support an exhibit. They have a student collect 100 scientific observation records on larvae using well-calibrated sensors. The data gets published. An academic study on climate change sees the data. The data shows that there is a lower larvae count than historically, and demonstrates heat changes impacting such. The data shows the sensors and tools used were scientifically accurate. The data is sourced and used. This is a great example of misuse of data. True, the data was gathered using great sensors and the right techniques, but it was not intended as a complete study of the area. When the student hit 100, they were out of there. An observation is not just an observation. Its intended use is vitally important to denote.

 

Another great example, from a slightly different use angle. A major government records organization wanted to take its inventory of records data, which is meticulously groomed: the data is entered very accurately and goes through a rigorous workflow to make sure its metadata, and access to the actual data record, is kept accurate in real time. The metadata on data use is perfect and accurate, and lawyers and other FOIA experts are well-versed in proper use. But the actual metadata to help discover something in the billions of records was not prepared to suit the more natural ways people would explore the data to discover and look up the records.

Historically, armies of librarians would be tasked with searching the records, but they have been replaced with web 1.0 online search systems that do not have programmed in the natural language interpretative skills (or NLP) and signals the librarian would apply. Even if they do, they are not tuned, and the data has not been prepared with the thousands of what Google calls search signals, or what we call interpretative signals, which we discussed back in 2013.

This is another great example of overlooked data preparation: "Publish its metadata in the card catalog standard, and my work is done – let the search engine figure it out." Once again though, the search engine will just tell you the museum larvae record matches your science study need.

 

Go Code Colorado recognized with a StateScoop 50 award

Blog post
edited by
Wiki Admin

Colorado has done something quite innovative and has been recognized with a StateScoop 50 award for State Innovation of the Year. 

And Xentity was glad to be a part of it.
Colorado’s Office of the Secretary of State invented and delivered Go Code Colorado, an app development competition for using public data to address challenges identified by the business community.  Three teams built winning apps and business plans and will have their technology licensed by the state for a year.  These teams have formed businesses; some are in discussions with investors and some have left their day jobs to focus on building their new venture.
Here’s the genius of Go Code Colorado: the State took funds from the filing fees that businesses pay to register their businesses and GAVE IT BACK to the business community through Go Code Colorado… AND incubated three new promising businesses in the process.
Plans are to build on this first year and run Go Code Colorado for a second year… on a larger scale and improving on many aspects based on lessons learned from year one.
Xentity had the pleasure of participating in Go Code Colorado as the State’s data management provider — sourcing, curating, improving and publishing 38 datasets from state agencies from Higher Ed to Labor and Employment.
You can see more about Go Code and Xentity’s support in our April blog entry: Go Code Colorado Open Data Effort is going into its final weeks. Also, check out the photo gallery of the final event.

Go Code Colorado Open Data Effort is going into its final weeks

Blog post
edited by
Wiki Admin

States all around have gotten into open data movements. Colorado has as well, and its recent Go Code Colorado effort is a unique entry into this foray (http://gocode.colorado.gov/).

Go Code Colorado was created to help Colorado companies grow, by giving them better and more usable access to public data. Teams will compete to build business apps, creating tools that Colorado businesses actually need, making our economy stronger.

 

The following is a great video that summarizes the event as produced by the State and one of Xentity’s colleagues, Engine7 Media.




Xentity is very proud to be supporting this innovative Government Solution

Xentity was awarded IT consulting support for the Business Intelligence Center platform and data catalog, which supports the now-branded Go Code Colorado initiative. Xentity's consultants have provided the data and technology resources to manage and advise the publication of public sector data to the Colorado Information Marketplace, and to provide technical support to developers who participate in the Challenge.

Xentity primarily has provided data platform support. We have provided data readiness analysis, data architecture guidance, project management, and the data analysts to “wrangle” the data (aka ETL) to get the datasets onto the platform. We also have provided the IT and data support on-site at the multiple locations and events to assure the challenge participants and finalists are getting the support they need to be successful in accessing and using the data and services. Finally, we are supporting the technical review of applications to assure these applications can have a life beyond the “hackathon” stage.

The final stages are coming in the first 10 days of May. The 10 finalists have demonstrated very viable solutions toward the goal of helping make our economy stronger.

Some more background and detail on how we got here

(The following is from the State as guidance to this effort)

 

Colorado government agencies possess large volumes of public business and economic data. This data can help businesses with strategic planning, but it exists in so many different places and formats that it is difficult for most businesses to use. The Secretary of State’s office will address this problem through the creation of the Business Intelligence Center (BIC). BIC seeks to aggregate and analyze data available to the business community.

This effort is led by the Colorado Secretary of State. The Secretary of State’s office interacts with hundreds of thousands of business entities, charities, and nonprofits in the state. The Secretary of State’s office collects, manages, and disseminates large amounts of basic data about those organizations and wanted to make the data useful to Colorado businesses. 

The Department sought to make this data more useful and collaborated with the Leeds School of Business at the University of Colorado to publish the Quarterly Business and Economic Indicator Report. This report combines Department data with other economic data collected by the Leeds School to provide meaningful economic information to the business community. For instance, new business filings are a leading indicator of job creation. With this and other information provided in the report, the business community can make smarter decisions that will grow the Colorado economy.

Since first publishing the report in 2012, the Secretary of State received comments from many members of the business community asking to see more detailed data regarding economic trends in order to better understand the distribution of commerce in Colorado. This includes access to the location, size, vibrancy, and concentration of key business nodes. While this level of detail would be tremendously helpful, the Department cannot provide the information because multiple state agencies collect the desired data and it is not readily available in a common place or even a common format.

A central data collection point is needed. During meetings with other government agencies, Department staff concluded that these data requests could be met by aggregating all the information spread throughout various agencies and databases into a single tool by breaking down agency silos and better cataloging existing resources. Department staff also concluded that access and availability to the data is not enough. In order to make the raw data useful to the vast majority of business owners, data analysis and visualization tools are needed. These conclusions led to the Business Intelligence Center project.

The Business Intelligence Center consists of a centralized data catalog that combines public data into a meaningful tool for businesses. 

The vision for this project is two-fold. First, it consolidates public data relevant to businesses on a single platform. Second, it gives businesses the tools to make the data useful. The second goal is achieved through a civic apps challenge—the Colorado Business Innovation Challenge—that will give financial incentives to the technology community to build web and mobile applications that use state and other data to solve existing business challenges.

The data platform is akin to an information clearinghouse. It will make data sources currently dispersed over multiple government departments and agencies accessible in a common location. This platform will offer Colorado businesses unprecedented access to public data that is validated and relevant to short- and long-term needs. Besides enhancing businesses’ access to state data, the BIC will also contribute to economic growth. The creation of the BIC will make data available to all Colorado businesses at no additional cost. Currently, only large entities with the time, staff, and budget to engage in detailed statistical analysis can use these data sets. Providing this data to businesses of every type and size in Colorado provides a unique opportunity to contribute to economic development. The BIC will nurture key industry networks and lay the foundation for a digital infrastructure that will continue to expand and improve over time.

The Colorado Business Innovation Challenge is an innovative way to create solutions and ensure the BIC is useful to Colorado businesses.

Simply making the data available is insufficient to most business owners. To truly help the vast majority of businesses—especially small businesses—tools must be developed to present the data in a useful and consumable form. Normally government agencies develop tools to fill this information vacuum, but historically the government has not been successful developing highly useful and effective tools. A new approach is needed—that approach is the Colorado Business Innovation Challenge.

Modeled after the “civic apps” challenges that have been run in multiple cities across the United States and internationally, the Challenge presents the software development community with problem questions and then asks that community to create possible solutions. At the end of the challenge, the Secretary of State will license the most innovative and implementable web or mobile application. The best design will receive a contract with the Secretary of State to make the application available to the public on the Business Intelligence Center platform. The Department will also pursue partnerships with the Colorado technology and startup industry to provide additional incentives, such as mentoring, hosting, and office space, to the Challenge winners. The long-term intent of the program is not only to create an environment for fostering community involvement through the Challenge, but also to sustain the tools developed in the Challenge.

Exploring Public Domain Maps and Imagery – Historic Denver West

Blog post
edited by
Matt Tricomi

Doing some light examination of Denver West, the following covers just two aspects of free public domain data. There is so much more beyond imagery and maps, and even more maps and more imagery (e.g. MODIS, LANDSAT, and other NASA products).

Historic Maps

USGS Topo Map 1-meter Quads are clockwise:

(Download this KML file to see all options, or click the years listed with each quad below to grab that historic map.)

Golden – Download – 1939, 1942, 1944, 1957

6 different dates for 15 quads, 1939-1965

Arvada – Download – 1941, 1944, 1950, 1957, 1965

5 different dates for 12 quads, 1941-1965

Morrison – Download – 1938, 1942, 1947, 1957, 1965

5 different dates for 11 quads, 1938-1965

Fort Logan – Download – 1941, 1948, 1957, 1965

4 different dates for 8 quads, 1939-1965

Historic Imagery

Each entry below lists the historic imagery type, its year, and key metadata (actual files range from 20-200 MB).

Digital Orthophoto Quadrangles (DOQs) – 1994
  • Entity ID: DI00000000912683
  • Acquisition Date: 23-SEP-94
  • Map Name: FORT LOGAN
  • State: CO

National Aerial Photography Program (NAPP) – 1988
  • Entity ID: NP0NAPP001033224
  • Coordinates: 39.8125, -105.03125
  • Acquisition Date: 10-JUL-88

National High Altitude Photography (NHAP) – 1983
  • Coordinates: 39.671135, -105.043794
  • Acquisition Date: 25-JUN-80
  • Scale: 120400
  • Download ~30MB file

Space Acquired Photography

Single Frame Records (black-and-white, natural color, and color infrared aerial photographs; 400 or 1,000 dpi) – 1978
  • Entity ID: AR1VEQP00010142
  • Coordinates: 39.688989, -105.053705
  • Acquisition Date: 01-SEP-78
  • Scale: 78000

Aerial Photo Mosaics (used when creating early/mid USGS Topo Maps) – 1953
  • Entity ID: ARDDA001260930776
  • Coordinates: 39.5, -105.5
  • Acquisition Date: 25-SEP-53
  • Scale: 63299
  • Download the 30MB file

High Resolution OrthoImagery (corrected; generally 0.3 meter, color) – 2002
  • HistoricOrthoCoverageAreas.kml

Declass 1 (1996) Stereo Images – 1965
  • Entity ID: DS1027-1015DA011
  • Coordinates: 39.69, -104.516
  • Camera Resolution: Stereo Medium
  • Acquisition Date: 10-DEC-65

Declass 2 (2002) Stereo Images – 1966
  • Entity ID: DZB00403500080H001015
  • Coordinates: 42.49, -103.45
  • Acquisition Date: 10-DEC-66
  • Camera Resolution: 2 to 4 feet