Washington just can’t catch a break, eh?. Debt Ceiling. Sequester. Shutdown. And now an epic fail of a single portal solution that provides the primary access to the new healthcare access law.
Love it or hate the law, the model for the electronic access provision is definitely an epic fail of a rollout.
What else is an architect to do if we can’t pile on the fiasco, offer opinion, while at same time, we are hoping to catch someone’s attention or get some sense someone is thinking about it from the design point of view, as it appears the media, hearings, and threads are not hitting on, at least, where we were hoping the discusion would land. Its a bad design. I’m not commenting on the policy part – the pundits do enough of that. I’m commenting on the architecture itself. It appears a poor design.
The following captures our internet sleuthing, colleague discussions, and our current deductions.
This is an evolving blog, as its more of a case study than a daily diary. Its a bit in draft form, but a way to begin to pin the story together. Apologies for the language typos and pre-publish ready state, but figured this is so fast moving and important, that I wanted to add to the dialog, and not simply leech after with 20/20 hindsight.
UX and Web Design is fine – Its not the front end
Now much fanfare has gone to the “web site” glitches. It was written about back in June by Alex Howard in the Atlantic
. I had the pleasure of connecting with Alex during data.gov work back in 2010 and off and on, we correspond on social media occasionally. I have a respect for his writing, what he follows, and have found generally that he is spot on with bringing collaboration across traditional boundaries into the world of information and technology. That said, I may be a bit biased.
Here are just a handful of articles done on the healthcare.gov performance:
Its the same thing as above – minify, compress, order your scripts/CSS, cache. Yep, the same things back in 1999, just a MUCH more powerful scripting and style processing capability.
Point being, this client-side tweaking is all REALLY good practice, especially for large sites with large traffic. Every little bit of debris cleanup helps. But, whatever fix is done, there needs to be a balance of the true timeline issues. It appears that the problem is primarily a server-side architecture problem . There were/are definitely several issues with the front-end performance as the above articles suggest. Its easy for the web technophiles to do, since most of the web processing is on the client side – in your browser and tools (i.e. hit F12 on a PC in Chrome or do YSlow plugin for Firefox) are easy to use or websites like what Google Analytics does can analyze performance, content safety, usage statistics and so much more. Point being, are you going to put more resources to fix the leaky faucet or the gushing, gaping hole in the water main first?
Now, Alex only reported on the developments of the front end and UX component. There is deserved high praise on Prose.io, Jekyll, open source concepts, garage organizations breaking beltway development stereotypes, and persona development as a way to develop the navigation for this brand new pattern.
But, the point is, the UX and web design part is fine and and dandy. Alex was right in June and still is. Its the same architecture I did for united.com back in ’99, just different tech. Have a CMS, cache it, distributed over 4 servers west/east. When 9/11 hit, my site was only airline site (check internetwayback machine) and call center that stayed up. So, this model for hc.gov is fine. The UX caching is fine.
I felt pretty bad for him on the article, and generally as a citizen, embarrassed, so like many of us architect weenies, I dug into it as many other colleagues have. Hey, regardless of politics, we all want a working country.
Why the logic for real-time data aggregation architecture?
Just like the Tacoma Narrows Bridge
didn’t fall because the construction contractor failed, it was the architecture forgot one piece of logic – the wind in this river was very strong. There was nothing to dampen the flow, and when the wind force blew, it blew the bridge passed and near its natural frequency (think rubbing your finger on a wine glass), those vibrations shattered the bridge. It was built to specification by the contractors, but the design was horrible because it had flawed logic.
OK, where is the parallel. So if its the back-end problem for healthcare.gov, where is it? My background and recent work on “hero” architecture was also excited to hope it was a minor performance issues that could be fixed by some horizontal scaling of servers. They did that, no major fix. Maybe it could be some technical server or software tuning – no luck. That only leaves bad logic.
Now it appears possibly the contractor did have faulty construction in that their wasn’t enough foresight to do more parallel testing, load testing, integration testing and the “7 steps of doneness“. The architects came out and said it today. Though that sounds a little passive aggressive now, as an architect of a building, would you say that after the bridge collapsed? Sounds like either buckpassing or gag order or droopy dog, no on is listening to me.
But even that would have been able to be reported out by now. So, aside from the obvious lack of discipline for engineering failures which is the civil engineering equivalent to the Tacoma Narrows Bridge, what logic am I talking about.
I believe it comes down to the architecture logic in this case, likely multiple areas. They treated the architecture like a controlled real-time MIS distributed system on data that has not been standardized or proven to be easily integrated through the test of time
Bad Logic: It appears they are responding to business rules that ask for real-time queries?
Why are they doing this? Is Healthcare.gov is following the KAYAK.com or Orbitz.com data aggregation model? In airlines, rates could change any minute, airlines have proven APIs over a decade of time-tested improvements, and those smaller airlines get screenscraped and KAYAK spends dollars keeping those scrapers up to date – just like when mainframes used to be scraped for client-server integration. Airline is a huge industry, they put their paper ticket to eticket on the line and it took a decade to get to this model. Then again, USA Today put out an article noting that Healthcare.gov is not alone on high-tech blunders:
United, Continental merge computer systems
March and August 2012
United Airlines had problems with its reservations system in early March after it switched to Continental’s computer system as the two airlines merged operations. Passengers complained as United struggled for several days to fix problems. In late August, the airline’s computer system and website went down causing problems with reservations, ticketing and check-ins.
This of course made me smile a little bit since they took down my architecture I did for united.com since they decided to move to Continental’s toolsets mostly because United needed to get out of Apollo mainframe (so says word on the street). So when they through [my] proverbial baby, then 12 years old and still well respected and award-winning out, with that, it did make me smile. But, point being, even big moves in private industry will happen in this type of architecture.
While I appreciate the importance of up-to-date comparisons, cant this stuff be pre-loaded? Why are we bucket brigading this information? Did they do this because they got caught up with the fanfare of the neat front-end – which again is slick, but its a small part of the project. Its the book on the cover that gets people there and comfortable and to have a good interior design user experience. But if you cant get a cup of coffee in Starbucks, does that matter?
I wonder if it needed to be this way. When I heard about the queries real-time crossing statelines, which meant that all the quality validation would require quality handshake, in real-time from each source where each time a different provider, state, or other governing data source, that would require a whole different set of rule base to validate. Each state, provider, etc. operates differently and to expect that real-time is quite an aggressive and bold feat.
At same time, is that required? Imagine doing a bucket brigade through 10 points for one bucket to put out a fire, some water will be spilled, but if you simply put more people/power in between, the spillage doesnt matter. That is why it works for normal web service models which has made Twitter, facebook, and many other Services used in other apps work so well. Its a simple single source of data.
Now, pretend there are 50 different types of twitter imitators, all with different approaches to tweeting, different data about the user or the tweet (aka metadata), different ways they setup the service call, and under different state laws to share that information. This is healthcare.gov.
HealthCare.gov faces challenges like aggregating volumes of data and building an efficient system to meet citizens’ needs. An agile database like MongoDB could have helped HealthCare.gov to scale, remove redundancy, and potentially reduce the cost (estimated to be at $292 million so far) (WaPo article) of both creating the site and dealing with the fallout of its failure.
I haven’t reached a point of validation yet, but it sure seems like the separately developed service handshaking across these disparate sources would cause integrating data that was created, formed, provided, and published under such different bucket brigade handlers, that it would be like having a bucket bridge coming out of a lake and an ocean, and expect only fresh water and the extra salt water magically goes away.
Why not pre-stage the queries?
I guess I’m asking still – why do they need to aggregate real-time? Why can’t they pre-stage the calls? Are they really updated every minute?
You make pre-staged products, where in between each organization, you validate the quality, get it ahead of time, or design-time, so then your run-time call can call the validated source which can be updated every week, day, 15 minutes, etc. and different of each feed. Then when new feeds come in, the separate pre-built maps or indices are auto-updated to make and optimize the comparison experience too, and even setup the possibility to help inform what the comparison mean or simply make matter of fact statements about the comparisons (not advice, but make obvious the differences). Google has proven this model. Its just reading content. And with the knowledge snippets in Google now on the right, you can ask it basic factual questions now. They can compare and its all hitting a local, validated, appropriately up to index that can scale, elastically on the cloud, that is proven fast – Google has proven that.
We recently did this same thing for a must be remained anonymous major Government agency. They had their search calls call the traditional RDBMS which is better at searching on a specific record and returning in simple, non-high computational queries. Instead, we asked, can you move 90% of the search into a NoSQL solution. It can load hundreds of millions of records in minutes, do all the pre-calculations to make for a smarter search – like how google knows what you are typo before you do, it can handle typo, many facets, drilldowns, etc. I believe KAYAK has moved this direction as well, but can’t validate, to optimize its search experience.
This is why the twitter, facebook, and other popular highly used service APIs work. They designed their SiM, and now there is a massive ecosystem of sub-applications, sub-markets, aftermarkets. HC.gov did not do that, it took a YAGP (yet another government portal) architecture technique, parted the scoping out like for a battleship. There wasnt enough consideration for patterns like more design-time pattern integration vs. run-time pattern integration (which is something I recently architecture prototype for NARA adding a NoSQL “index”, if you will, in front, so the query part was fast, but the transaction part got passed to the traditional RDBMS, then if updated, it did a millisecond sync back with NoSQL.
Now for the Blame Game: In contracting, investing in architecture is still not a requirement, so its a liability to bid it that way.
It was divvied to over 50 contractors, and outside of a PMO, there appears to be no enterprise service integration patterns concepts are part of the leadership team. I saw a PMO, but they usually manage the production, not the architects of what needs to be produced. We can throw CGI Federal under the bus all we want, and whomever did the PMO, but it sure seems like the requirements and team did not have a solution or architecture integrator as one of the roles. Someone(s) overseeing the Service Integration Model – call it Enterprise Service Architect, Sr. Solution Architect, Architecture Review Board/Governance – to advise on design risks, maintain risk weights and let the PMO know where risk is at escalation points.
I do know in contracts, if you are believing you could win if you could shave 3% off the contract, the first thing to go is usually the higher end rates. Those rates are usually quality focused. Those typically are those on architecture, design, strategy. Typically government contracts do not write that in, or if it is, it is written in a more compliant way. Contracting Officers are not in a position to review whether a subjective concept such as a proposed architecture is better than another and Contracting process review boards for IT have not adopted concepts like in civil engineering architecture concept review boards. Given that, having a higher quality architecture solution component is not seen as differentiator, since it doesnt check a box, typically integrators will drop that high rate position, and wallla, you have just undercut competition by 2-4% on the bid.
By the way, don’t get hard on the contractors only. The same goes for the writing of the contracts. Contracting Officers do not have a way of knowing if the technical requirements is asking for are sound or the best. And they have stated on occasion they like to leave it open to allow the contractor to come back and tell them “how”. While this is fair to let private industry offer best solutions, there should be architecture principles that are put in the contract to guide how they can answer, and thus less subjectively how they can assure robust architecture, for instance, without just saying “it will be a robust architecture”.
But as we saw, when you don’t buy back the risk up front, and the design changes, the cost balloons, which it did. Today, the reports are saying contractors are blaming the government after a few weeks of government throwing “greedy” contractors under the bus. I say its not greed, but a broken procurement process. We have blogged for years, and built our practice around this principle – We can approach architecture for other implementers .
For a large majority of consulting companies both design and implement for ALL projects. Though profitable for many firms, the best design can end up biased towards the agenda of the implementer which may be to sell more components, get more bodies. Now, we have the capability to implement architecture, but our end goal is not to design an architecture that is for us to implement, but an architecture that is implementable.
Many times, the client knows that the implementer will design with a bias, so the client chooses to or must design blind without considering the maturity of what an implementer can provide. In those cases, we can come in, architect, and be a third party to help do the concept, design, and design the requirements and performance work statement basis.
This approach with these services buy-back risk to your implementation and increase the likelihood of achieving your metrics and goals.
11/1 Note: A colleague forwarded me two WSJ articles that discusses Fixing Procurement Process Is Key to Preventing Blunders Like Healthcare.gov
and Procurement Process is Government’s IT Albatross
which only adds fuel to the fire. These blogs uses Clay Johson’s Fix Procurement Manifesto
as its guide. I agree with many points in Mr. Johnson’s manifesto and his points, but I don’t agree with the author’s understanding. The author boiled it down to make procurement easier to get new tech. That sounds well and good, but as a colleague who was formerly in the OMB said a few weeks back, if you add any tech to a broken process, it will only make it worse. And it did. So, I think it will not only take time to fix Agency Leadership as Johnson notes, and the author cites Former OMB CIO Kundra as a key, but also, the general public understanding of how technology and business transformation works. The author and agency leadership compare healthcare.gov, a solution that involves data exchange, policy oversight, and manages millions of transactions to an Operating System flaw in an iPad which requires fix once, deply many. These are very apples and oranges conversations – one is commodity technology, and other is automating multiple decades of policies on top of policies. Until we can help educate the differences in complexity, we won’t be able to achieve the aspects of Johnson’s manifesto as well as help understand the right architecture to invest in PRIOR to letting a contract.