So, the story is well known by now: Healthcare.gov, the CMS-sponsored health insurance exchange, failed miserably on launch. Users simply couldn’t get past the registration stage to see anything inside the $500 million site.
Fewer than 27,000 of the 3.7 million Americans who attempted to register the first week after the marketplace’s Oct. 1 launch were successful. Seven million Americans are expected to be insured under the Affordable Care Act, and they must enroll using the marketplaces by Dec. 15 in order to have coverage start on Jan. 1. Administration officials estimate the total number of health insurance applications received through the exchanges (including the state-run exchanges) at 106,000.
Now we’re three weeks into the launch and performance problems still plague the system. If you read this blog consistently, you probably know that software development is hard. I’m not saying that just to be self-serving: ever since the (in)famous “Chaos Report” in 1999 from the Standish Group, the industry has struggled to reconcile the “iron triangle” of project management: cost, quality, and schedule.
As with all major IT fiascos, the fingers are pointing. The different players are:
- CGI Federal: the primary contractor on the job. CGI blames the Obama administration, under HHS head Kathleen Sebelius, for a lack of testing, and blames its own subcontractors, especially Optum/QSSI, for not raising the issue early or loudly enough.
- Health and Human Services: the overarching group that contains the Centers for Medicare and Medicaid Services (CMS). This group blames the contractors, of course, because it spent $500 million and didn’t get a website that performed.
- Optum/QSSI: the vendor that supplied the secure login technology. They had initially scoped their solution on the assumption that registration would happen after visitors had viewed the information on the different plans, a decision that was changed very late in the project.
Testing needs to scale to the project
Testing is not a “nice to have”: if you want millions of people on your website, it needs to be able to support millions of people. Testing with only a fraction of that load will not guarantee a robust experience at launch. Apparently, the programmers of the system were worried about this, but their concerns were ignored in favor of meeting the deadline.
Our motto is “testing is too important to leave to the end.” It’s vital that testing be “front-loaded” into the early parts of the development cycle so it can’t be skipped at the end should a schedule overrun occur. When we build business-critical systems we create multi-level test plans that include:
- Unit tests on developers’ environments
- Automated functional tests developed in parallel with the application so that they cover all the functionality
- Comprehensive automated and manual end-to-end testing
We don’t create these large (and expensive) test plans for all sites, just the most critical and complex ones. A typical branded or unbranded information site does not need such a deep test plan, so we scale it back to a level that makes sense.
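To make the layering concrete, here is a minimal sketch of the first and last levels of such a plan. Everything in it is illustrative: `register_user` is a hypothetical stand-in for a registration endpoint, not anything from the Healthcare.gov codebase, and a real end-to-end load test would hit the deployed site, not an in-process function.

```python
# Hypothetical sketch of multi-level testing. register_user is a
# stand-in for a registration endpoint -- not real Healthcare.gov code.
import concurrent.futures


def register_user(username: str) -> bool:
    """Stand-in for a registration call; succeeds for any non-empty name."""
    return len(username) > 0


def test_unit() -> None:
    # Level 1: unit tests run on the developer's own environment.
    assert register_user("alice")
    assert not register_user("")


def test_concurrent_smoke(n_users: int = 1000) -> int:
    # Level 3, scaled down: fire many registrations in parallel and count
    # successes. A real load test would drive traffic at the deployed site
    # at (and beyond) the expected concurrency.
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(register_user,
                                (f"user{i}" for i in range(n_users))))
    return sum(results)


test_unit()
print(test_concurrent_smoke())  # 1000 successes out of 1000 attempts
```

The point is not the code itself but the shape: the same behavior is exercised at increasing levels of realism and scale, so a capacity problem surfaces long before launch day.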
Log collection and monitoring
The contractors describe a failure to centralize log collection and monitoring. This is a sysadmin-101 mistake and would fail a PCI-DSS (credit card standard) or NIST SP 800-66 (HIPAA data standard) audit. No matter what you think of the politics surrounding this issue, the fact that the federal government is visibly flunking its own standards is embarrassing.
One concern, listed as “severe,” warned, “CGI does not have access to necessary tools to manage envs in test, imp, and prod. Specifically (1) we don’t have access to central log collection / view (2) we don’t have access to monitoring tools. We have repeatedly asked CMS and URS but have not been granted this access.” — CNN Article
No technology or technical group is perfect, but the path to high quality and continual improvement is paved with communication and transparency. We can’t expect the federal government to exhibit this with the partisan politics that overshadow this technical failure, but the fact that the internal technical “experts” didn’t sort out the log collection issue calls their competence into question.
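What “central log collection” means in practice can be sketched in a few lines. This is a toy, not a deployment recipe: the in-memory list stands in for a real aggregation backend (syslog, an ELK stack, or similar), and the component names are invented for illustration.

```python
# Minimal sketch of centralized log collection. The in-memory list is a
# stand-in for a real aggregation backend (syslog, ELK, etc.); the
# component names are illustrative, not from Healthcare.gov.
import logging

central_buffer = []  # stand-in for the central, searchable log store


class CentralHandler(logging.Handler):
    def emit(self, record):
        # Every component's records land in one place, so operators can
        # correlate failures across subsystems instead of hunting per box.
        central_buffer.append(self.format(record))


handler = CentralHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

# Each subsystem keeps its own logger but shares the central handler.
for component in ("registration", "plan-compare", "enrollment"):
    logger = logging.getLogger(component)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

logging.getLogger("registration").error("login service timeout")
logging.getLogger("enrollment").info("application submitted")

print(central_buffer)
```

The design point is the single aggregation target: without it, each contractor’s components log to their own machines, and nobody can see a cross-system failure as it develops, which is exactly the situation the contractors described.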
The CNN article quoted above indicates that CGI was aware of the need for these controls but was unable to get its partner companies to cooperate, probably because CMS was acting as the general contractor. This highlights a risk that our clients face: every vendor added to the mix introduces complexity and increases the need for coordination.
There needs to be a “single throat to choke” (a phrase often heard from our President and CEO, Leerom Segal) who is accountable for making the project work, and the other vendors need to be accountable to that organization. This is a service we provide: we can act as the general contractor on behalf of the client. This type of project structure helps remove the finger-pointing that so often happens between technology contractors (to be transparent, it still happens; it’s just that Klick handles it and it never reaches the client).
Late changes can wreak havoc
The main performance problem was with the secure signup technology, which was designed and configured under the assumption that users would shop on a public, non-registered site and only hit the registration system when they were ready to commit. Insisting that registration happen before seeing any web content expanded the number of people hitting the registration technology by orders of magnitude.
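The back-of-envelope arithmetic makes the impact obvious. The 3.7 million first-week visitors figure is from the reporting above; the 1% browse-to-signup conversion rate is purely an assumption for illustration, not a real Healthcare.gov number.

```python
# Back-of-envelope sketch: forcing registration before browsing multiplies
# the load on the signup system. The conversion rate is an assumed,
# illustrative figure -- not actual Healthcare.gov data.
visitors = 3_700_000      # reported first-week visitors
signup_pct = 1            # assumed: 1% of browsers go on to register

# Original design: only committed shoppers ever hit registration.
planned_signup_load = visitors * signup_pct // 100   # 37,000

# Late change: every visitor must register before seeing anything.
actual_signup_load = visitors                        # 3,700,000

print(actual_signup_load // planned_signup_load)     # 100x the planned load
```

Under these assumptions the registration system sees one hundred times the traffic it was sized for; even a far more generous conversion assumption still leaves a system provisioned for a small fraction of its real load.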
Changes made this late in the process push coding into the testing phase. The project team often thinks it can still recover by using the testing time for development. There are many forces pushing them toward this decision:
- No project is perfect, and the development team may feel like they need to “go above and beyond” to make up for parts of the project that went slowly, even though that is the nature of software development
- The project sponsor, in this case CMS, wants the project done perfectly, on budget, and on time, so they will not be willing to accept a project delay without a fight, even when change is communicated late in the process
- If the project has been grueling, the project team may just want it to be over
The problem with this approach is that the testing time is not optional. Without comprehensive testing against a code-complete product there can be no assurance that the product will perform to specifications. In the words of our QA lead: a code freeze means “stop coding dammit!”
Off-shoring and H-1B visas
There is some evidence that this project was, at least in part, off-shored and that H-1B (temporary foreign worker) visas were used extensively. This seems to have led to oddities such as code supporting the Gujarati language of India and comments written in a style consistent with offshore programmers. These language issues in the critical in-code documentation will make maintenance and bug fixes much more challenging for the Obama administration’s “alpha teams.”
Also, having a good chunk of your team 12 time zones away and thinking in a different language makes good vendor management exponentially harder. The H-1B program has been badly abused to bring in foreign workers at very low cost to replace domestic workers. This carries a lot of overhead that is rarely accounted for:
- Language comprehension problems
- Lack of a standardized skill set
- Having to rotate workers in and out as their visas expire
To a large extent, you get what you pay for. When you pay less than a living wage for a North American worker, you get less.
Full disclosure: Klick Health employs programmers in Toronto, Ontario, Canada, so some in the US might consider us “nearshore.” What we object to is the reduction in programming skill that comes with the reduction in programmer wages, not the use of offshore workers per se.
Size of codebase
The size of the Healthcare.gov codebase is obscene. I have seen 500 million lines of code listed. For an excellent illustration of just how large that is, look at this visualization of different program sizes. Writing code is all about elegance and efficiency; something this large is, by definition, built on inefficient and poorly designed architectures.
The Healthcare.gov website is certainly bloated. Some comparisons:
- 7x the size of the entire Linux operating system (the whole thing, *not* just the kernel!)
- 10x the size of the entire Large Hadron Collider software root (the most complex piece of scientific software ever written)
There is no reason for a website to be that large, even one that is reasonably complex and allows users to compare insurance plans. If you’re building software that big, you’re doing it wrong, period.
Clean-up is harder than doing it right
Apparently the administration has “alpha teams” trying to fix the application as it’s running. Learning an unfamiliar codebase is not trivial, and these teams will not be as efficient as the people who wrote the original systems; they may even inject more bugs than they fix.
The link above shows that the current estimate is that 5 million lines of code need to be rewritten just to make the current system work. Adding more lines of code will only make the site slower, not faster.
Learn from Healthcare.gov and make sure your projects are handled right from the beginning so you don’t end up with a bloated, un-maintainable, failing piece of software.