Klick Health

Healthcare.gov – Doing it wrong was the easy choice

SVP, Technology

So, the story is well known by now: Healthcare.gov, the CMS-sponsored health insurance exchange, failed miserably on launch. Users simply couldn’t get past the registration stage to see anything inside the $500 million site.

Fewer than 27,000 of the 3.7 million Americans who attempted to register the first week after the marketplace’s Oct. 1 launch were successful. Seven million Americans are expected to be insured under the Affordable Care Act, and they must enroll using the marketplaces by Dec. 15 in order to have coverage start on Jan. 1. Administration officials estimate the total number of health insurance applications received through the exchanges (including the state-run exchanges) at 106,000.

Now we’re three weeks into the launch and performance problems still plague the system. If you read this blog consistently, you probably know that software development is hard. I’m not saying that just to be self-serving: ever since the (in)famous “Chaos Report” in 1999 from the Standish Group, the industry has struggled to reconcile the “iron triangle” of project management: cost, quality, and schedule.

As with all major IT fiascos, fingers are pointing. The different players are:

Testing needs to scale to project

Testing is not a “nice to have”: If you want to have millions of people on your website, it needs to be able to support millions of people. Testing with only a fraction of that will not guarantee a robust experience when the site is launched. Apparently, the programmers of the system were worried about this but their concerns were ignored in favor of meeting the deadline.
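
To make the idea of testing at launch scale concrete, here is a minimal sketch of a load-test harness in Python. It is an illustration only, not how Healthcare.gov or any Klick system was tested. The names `run_load_test`, `LoadResult`, and `stub_register` are hypothetical, and `stub_register` is a placeholder that rejects one user in ten; in a real test it would be replaced by a call to a staging sign-up endpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadResult:
    attempted: int
    succeeded: int

    @property
    def success_rate(self) -> float:
        return self.succeeded / self.attempted if self.attempted else 0.0


def run_load_test(register: Callable[[int], bool], users: int, concurrency: int) -> LoadResult:
    """Call `register` once per simulated user, `concurrency` users at a time."""

    def attempt(user_id: int) -> bool:
        try:
            return bool(register(user_id))
        except Exception:
            return False  # any error counts as a failed registration

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(attempt, range(users)))
    return LoadResult(attempted=users, succeeded=sum(outcomes))


def stub_register(user_id: int) -> bool:
    # Hypothetical stand-in for a real request to a staging sign-up endpoint:
    # it simply rejects every tenth simulated user.
    return user_id % 10 != 0


result = run_load_test(stub_register, users=100, concurrency=8)
print(f"{result.succeeded}/{result.attempted} registrations succeeded")
```

The point of the sketch is the `users` parameter: a run with a few hundred simulated users says little about a launch that expects millions, so the test size should be driven by the expected audience rather than by what is convenient.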

Our motto is “testing is too important to leave to the end.” It’s vital that testing be ‘front-loaded’ into the early parts of the development cycle to prevent it from being skipped at the end should a schedule overrun occur. When we build business-critical systems we create multi-level testing plans that include:

We don’t create these large (and expensive) test plans for all sites, just the most critical and complex ones. A typical branded or unbranded information site does not need such a deep test plan, so we scale it back to a level that makes sense.

Log collection and monitoring

The contractors discuss a failure to centralize log collection and monitoring. This is a sysadmin 101 mistake and would fail a PCI-DSS (credit card standard) or NIST SP 800-66 (HIPAA data standard) audit. No matter what you think of the politics surrounding this issue, the fact that the federal government is visibly flunking its own standards is embarrassing.

One concern, listed as “severe,” warned, “CGI does not have access to necessary tools to manage envs in test, imp, and prod. Specifically (1) we don’t have access to central log collection / view (2) we don’t have access to monitoring tools. We have repeatedly asked CMS and URS but have not been granted this access.” — CNN Article

No technology or technical group is perfect, but the path to high quality and continual improvement is paved with communication and transparency. We can’t expect the federal government to exhibit this with the partisan politics that overshadow this technical failure, but the fact that the internal technical “experts” didn’t figure out the log collection issues brings their competence into question.
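
For readers who have not set up central log collection before, the following is a small Python sketch of the idea using the standard `logging` module. It is an assumption-laden illustration, not the setup CGI or CMS used. `CentralCollector` is a hypothetical in-memory stand-in for a real log server, and the `exchange.*` logger names are invented; only the environment names (test, imp, prod) come from the quote above.

```python
import logging


class CentralCollector(logging.Handler):
    """Hypothetical in-memory stand-in for a central log server."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


def attach_environment(env: str, collector: logging.Handler) -> logging.Logger:
    """Return a logger for one environment that forwards to the shared collector."""
    logger = logging.getLogger(f"exchange.{env}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger's handlers
    if collector not in logger.handlers:
        logger.addHandler(collector)
    return logger


collector = CentralCollector()
collector.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

# Every environment writes to the same place, so one view covers all of them.
for env in ("test", "imp", "prod"):
    attach_environment(env, collector).info("registration service started")

attach_environment("prod", collector).error("registration timed out")

# Monitoring can then be a simple query over the collected lines.
errors = [line for line in collector.lines if " ERROR " in line]
print(errors)
```

In a real deployment the in-memory handler would be swapped for one that ships records off the machine, such as the standard library's `logging.handlers.SysLogHandler`, so that every team involved can see the same logs.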

Vendor management

The CNN article quoted above indicates that CGI was aware of the need for these controls, but was unable to get its partner companies to cooperate, probably because CMS was working as the general contractor. This highlights a risk that our clients face: every vendor added to the mix introduces complexity and increases the need for coordination. For example:

There needs to be a “single throat to choke” (a quote often heard from our President and CEO, Leerom Segal) who is accountable for making the project work, and the other vendors need to be accountable to that organization. This is a service we provide where we can act as the general contractor on behalf of the client. This type of project structure can help remove the finger-pointing that so often happens between technology contractors (to be transparent, it still happens; it is just that Klick handles it and it never gets to the client).

Scope creep

Late changes can wreak havoc: The main performance problem was with the secure signup technology. The signup technology was designed and configured under the condition that users would be able to shop on a public, non-registered site and only hit the registration site when they were ready to commit. Insisting that the registration happen before seeing any web content expanded the number of people hitting the registration technology by orders of magnitude.
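
A rough back-of-the-envelope calculation shows how a change like this multiplies load. The Python sketch below uses the 3.7 million first-week registration attempts cited earlier in this article; the 1% commit rate is purely an assumed figure for illustration, not a number from any report, and `registration_load` is a hypothetical helper.

```python
def registration_load(visitors: int, commit_rate: float, register_first: bool) -> int:
    """Estimate how many visitors reach the sign-up system."""
    if register_first:
        return visitors  # everyone must register before seeing any content
    return round(visitors * commit_rate)  # only visitors ready to commit register


VISITORS = 3_700_000        # first-week registration attempts cited in this article
ASSUMED_COMMIT_RATE = 0.01  # hypothetical: 1 in 100 browsers is ready to sign up

designed = registration_load(VISITORS, ASSUMED_COMMIT_RATE, register_first=False)
actual = registration_load(VISITORS, ASSUMED_COMMIT_RATE, register_first=True)
multiplier = actual / designed

print(f"designed for {designed:,} sign-ups, received {actual:,} ({multiplier:.0f}x)")
```

Under that assumed rate the sign-up system sees a hundred times the traffic it was sized for. The real ratio depends on how many visitors were only browsing, which is not known here.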

What happens is that these changes made late in the process push the coding into the testing phase. The project team often thinks they can still recover because they can use the testing time for development. There are many forces pushing them toward this decision:

The problem with this approach is that the testing time is not optional. Without comprehensive testing against a code-complete product there can be no assurance that the product will perform to specifications. In the words of our QA lead: a code freeze means “stop coding dammit!”

Off-shoring and H-1B visas

There is some evidence that this project was, at least in part, off-shored and that H-1B (temporary foreign worker) visas were used extensively. This seems to have led to decisions such as the code supporting the obscure Indian Gujarati language and comments being written in a style consistent with offshore programmers. These language issues for the critical in-code documentation will make maintenance and bug fixes much more challenging for the Obama administration’s “alpha teams.”

Also, having a good chunk of your team 12 time zones away and thinking in a different language makes good vendor management exponentially harder. The H-1Bs have been badly abused to bring in foreign workers at very low cost to replace domestic workers. This brings along a lot of overhead that is rarely accounted for:

To a large extent, you get what you pay for. When you pay less than a living wage for a North American worker, you get less.

Full disclosure: Klick Health employs programmers in Toronto, Ontario, Canada, so some in the US might consider us “nearshore.” What we object to is the reduction in programming skill that comes with the reduction in programmer wages, not the use of offshore workers per se.

Size of codebase

The size of the codebase of Healthcare.gov is obscene. I have seen 500 million lines of code listed. For an excellent illustration of just how large that is, you can look at this visualization of different program sizes. Writing code is all about elegance and efficiency; something this large is, by definition, using inefficient and poorly designed architectures.

The Healthcare.gov website is certainly bloated. Some comparisons:

There is no reason for a website to be that large, even one that is reasonably complex and allows users to compare insurance plans. If you’re building software that big, you’re doing it wrong, period.

One reviewer of the public JavaScript code (the code that gets loaded in your browser when you visit the site) sums it up:

What I am seeing in this code is nothing short of jaw-dropping. As people are now saying, this code is “CRAAAAAZY!” You almost can’t even call it Javascript code. If you sat down 100 monkeys in front of 100 typewriters and told them to start banging away, I’m confident at least one of them would come up with something far better than the Healthcare.gov Javascript code. — from Obamacare computer code riddled with typos, Latin filler text, desperate programmer comments and disastrous architecture

Clean-up is harder than doing it right

Clean-up is more work than doing it right: Apparently the administration has “alpha teams” trying to fix the application as it’s running. Learning code is not trivial, and these teams will not be as efficient as the people who coded the original systems. They may even inject more bugs than they fix.

The link above shows that the current estimate is that 5 million lines of code are required just to make the current system work. Adding more lines of code will only make the site slower, not faster.

Learn from Healthcare.gov and make sure your projects are handled right from the beginning so you don’t end up with a bloated, un-maintainable, failing piece of software.

More About the Author

Alfred Whitehead

Alf is responsible for the Systems Administration, Quality Assurance, and Security practices at Klick Health. He brings 8 years of experience in software development and high-performance computing to the Klick team, combining his scientific background with an appreciation of the craft of code-writing to pioneer innovative practices.
