The big data revolution in the enterprise arguably began when Yahoo! open sourced its production distribution of Hadoop in 2009. This is roughly the same time that bleeding-edge health care IT organizations, including Cerner, began applying the new technology to classes of problems that had historically been cost-prohibitive. That chance encounter with Hadoop turned out to be an important turning point for an industry ripe for innovation.


The health care system in the U.S. is undeniably the poster child for an industry fraught with waste, spending more than $2 trillion annually. It is also an industry of closed systems and fragmented databases, where you and I, as patients, often suffer the consequences of that disarray. Reasons aside, most people familiar with the business of health care would agree that technology serves not only as a catalyst for change but perhaps as the only plausible antidote.

The digital workhorse of modern health care is the electronic health record (EHR), found in virtually every U.S. hospital (94 percent, to be precise, according to HIMSS). These predominantly closed systems contain vast amounts of clinical, financial and operational data begging to be liberated, if nothing else, for the purpose of making systemic improvements to the delivery and quality of health care.

Hadoop can play an instrumental role as the “great aggregator” of large and heterogeneous collections of data sources. Consider how fragmented the health record of an individual becomes across the many islands of EHRs that a person might encounter over a lifetime. By combining these partial collections of data, a more comprehensive longitudinal record can be stitched together. Blending such diverse data lets organizations solve classes of problems, such as population health management, that would be impossible to tackle through traditional methods of integrating disparate systems.
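To make the stitching idea concrete, here is a toy Python sketch that combines per-patient encounter lists from two hypothetical EHR extracts into a single chronological record. The field names are illustrative, and the sketch assumes patient identifiers are already matched across sources; at scale this would run as a distributed job (e.g., MapReduce or Spark) over Hadoop, with probabilistic patient matching in place of shared IDs.

```python
# Toy EHR extracts: partial encounter histories keyed by patient ID.
# Assumes IDs are already matched across sources (a big assumption;
# real systems need probabilistic patient matching).
ehr_a = {"p123": [{"date": "2014-03-01", "event": "ED visit"}]}
ehr_b = {"p123": [{"date": "2012-07-15", "event": "annual physical"}],
         "p456": [{"date": "2015-01-09", "event": "lab panel"}]}

def stitch_longitudinal(*sources):
    """Merge per-patient encounter lists from many sources into a
    single chronologically ordered record per patient."""
    combined = {}
    for source in sources:
        for patient_id, encounters in source.items():
            combined.setdefault(patient_id, []).extend(encounters)
    for encounters in combined.values():
        encounters.sort(key=lambda e: e["date"])  # ISO dates sort lexically
    return combined

records = stitch_longitudinal(ehr_a, ehr_b)
# records["p123"] now spans both sources, oldest encounter first
```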
A rather straightforward use case that has financial implications for hospitals is the prevention of patient readmission within 30 days of being discharged. The Centers for Medicare and Medicaid Services will deny reimbursement to a health system for patients readmitted to the hospital for the same condition inside a 30-day window. This means that care providers must operate proactively and take preventative action for those patients at highest risk of readmission.
This problem is ideally solved using a predictive model constructed from machine learning techniques whose training data is drawn from various EHR sources. Hadoop turns out to be an excellent platform for aggregating these various sources and materializing the relevant bits into features that can be used for both development and deployment of the predictive model.
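As a rough illustration of what “materializing the relevant bits into features” might look like, the sketch below derives a few plausible readmission-risk features from a patient’s aggregated encounter history. The encounter schema and feature names are hypothetical, not drawn from any real EHR or from Cerner’s model.

```python
from datetime import date

# Hypothetical feature extraction for a 30-day readmission model.
# Encounter fields ("type", "discharge") are illustrative only.
def readmission_features(encounters, as_of):
    """Turn a patient's aggregated encounter history into a flat
    feature dictionary suitable for model training or scoring."""
    admissions = [e for e in encounters if e["type"] == "inpatient"]
    last_discharge = max((e["discharge"] for e in admissions), default=None)
    return {
        "n_prior_admissions": len(admissions),
        "days_since_last_discharge":
            (as_of - last_discharge).days if last_discharge else None,
        "n_ed_visits": sum(1 for e in encounters if e["type"] == "ED"),
    }

history = [
    {"type": "inpatient", "discharge": date(2015, 5, 1)},
    {"type": "ED"},
]
feats = readmission_features(history, as_of=date(2015, 5, 20))
# feats -> {"n_prior_admissions": 1, "days_since_last_discharge": 19,
#           "n_ed_visits": 1}
```

The same function can serve both halves of the workflow: run in batch over Hadoop to build training data, and again at discharge time to score individual patients.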

Another compelling use case with life-saving potential is predicting the onset of sepsis. Sepsis, the body's extreme and elusive response to infection, can become fatal if not detected during its early stages. In fact, the chances of death approach 50 percent once the condition reaches a severe stage. To make matters worse, sepsis can be difficult to diagnose early because its symptoms mimic those of more common medical conditions.

Similar to the readmission prevention use case, sepsis detection is best addressed with a predictive model whose data ideally flows from multiple venues of care, and hence possibly from multiple EHR sources. The detection algorithm, however, is time-sensitive and must be deployed into a near real-time monitoring environment. Again, Hadoop is quite attractive as an underlying platform, particularly as the ecosystem expands to incorporate emerging technologies, such as Apache Kafka and Spark Streaming, that make scalable, near real-time processing possible. In fact, Cerner has deployed a centralized Hadoop-based sepsis detection system, built around a highly tuned decision tree classifier, that actively monitors more than one million lives on a daily basis.
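The per-patient decision logic in such a monitor can be pictured with a small stand-in classifier. The sketch below hand-codes thresholds loosely based on the common SIRS screening criteria; it is emphatically not Cerner's actual model, and in production the same per-observation function would be applied inside a streaming job (e.g., Spark Streaming consuming a Kafka topic of vital signs).

```python
# Illustrative stand-in for a sepsis-screening classifier. Thresholds
# loosely follow common SIRS criteria -- NOT Cerner's actual model.
def sirs_flags(vitals):
    """Count how many SIRS-style criteria a single observation meets."""
    return sum([
        vitals["temp_c"] > 38.0 or vitals["temp_c"] < 36.0,
        vitals["heart_rate"] > 90,
        vitals["resp_rate"] > 20,
        vitals["wbc"] > 12.0 or vitals["wbc"] < 4.0,
    ])

def sepsis_alert(vitals):
    """Per-observation decision, shaped like the function a streaming
    job would apply to each incoming vitals record."""
    return sirs_flags(vitals) >= 2

obs = {"temp_c": 38.6, "heart_rate": 104, "resp_rate": 18, "wbc": 9.1}
alert = sepsis_alert(obs)  # fever + tachycardia -> alert fires
```

A learned decision tree would replace the hand-coded thresholds, but the operational shape, a cheap pure function evaluated against every new observation in the stream, is the same.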

The concept of data-driven, or evidence-based, medicine has been around for decades, but the historical costs of acquiring and analyzing such data limited its reach beyond research. Hadoop is effectively shedding those cost barriers and democratizing access, allowing virtually any organization to exploit those benefits in ways that positively impact health care. And these efforts are not just about the application of sophisticated analytics; they are also about simple steps toward using data, rather than intuition, to make decisions.

An important and unavoidable journey of trial and error often unfolds when health care IT organizations undergo a cultural transformation from decisions based on intuition to those grounded in data. Fortunately, the ecosystem surrounding Hadoop has matured appreciably since 2009 and is rapidly becoming much more enterprise-ready, which means organizations are spending less time tinkering with the technology.

However, it turns out that the data itself is usually the bigger challenge, especially when the units on the measuring stick are labeled in petabytes. Because these systems were historically closed, health care data is remarkably messy once aggregated. As such, cleansing and normalization become essential steps, not only for applying any sort of reasoning at scale, but also for making the data approachable to and consumable by a larger audience.
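A tiny example of the kind of cleansing and normalization involved: aggregated sources often disagree on units and vocabulary, so a normalization pass maps them onto one canonical form. The conversion table and field names below are illustrative assumptions, not any standard schema.

```python
# Hypothetical normalization pass over aggregated records: map
# inconsistent units and vocabulary onto one canonical form.
UNIT_TO_KG = {"lb": 0.453592, "kg": 1.0}
SEX_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}

def normalize(record):
    """Return a cleaned copy: weight in kilograms, sex spelled out."""
    out = dict(record)
    factor = UNIT_TO_KG.get(record.get("weight_unit"))
    if factor is not None:
        out["weight_kg"] = round(record["weight"] * factor, 1)
    out["sex"] = SEX_MAP.get(str(record.get("sex", "")).strip().lower())
    return out

row = normalize({"weight": 180, "weight_unit": "lb", "sex": " M "})
# row["weight_kg"] == 81.6 and row["sex"] == "male"
```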

As Hadoop and other big data technologies continue to mature, the health care industry will inevitably see a pronounced uptick in adoption, along with a shift toward more intelligent, data-driven systems. Arguably, that inflection point on the proverbial “hockey stick” is already playing out. But regardless of timing, the inescapable liberation of data currently trapped in these closed systems will almost certainly have a profound effect on the quality and delivery of health care in the future.