Marc Sturm, Director-Data Analytics, NewYork-Presbyterian Hospital
A little less than two years ago, the data analytics team at NewYork-Presbyterian Hospital (NYP) in New York City decided to dive into Big Data. Big Data was not just “hype”: our organization had a Volume, Velocity and Variety (VVV) problem. Our real-time data analytics platform couldn’t handle the spikes in our data flow (Velocity), we needed to do text mining on clinical notes and documents (Variety), and we were shredding our data because of its abundance (Volume). Big Data seemed like a natural solution. Our CIO, Aurelia Boyer, strongly encourages exploring new ideas and technologies, so after we defined our first use case and the plan, she gave us the green light and the Big Data team at NYP was born.
A small team was assembled with two nimble, eager-to-learn developers and one manager. To keep costs low, we would use commodity servers and VMs. After considering a variety of NoSQL databases, we settled on Hadoop, the most recognizable name in the world of Big Data technology. We chose the Open Source Apache version, a strategy to keep vendors at bay.
Our use case was an NLP project developed at Columbia University’s Department of Biomedical Informatics. The project plan was simple: learn the technology, build a Proof of Concept and, if successful, implement the solution in a semi-production Hadoop cluster in collaboration with our colleagues at Columbia. We had a couple of other use cases in mind that we would work on in parallel. They would serve dual purposes: as insurance that our solution was not tailored only to the main use case, and as a fallback if the Proof of Concept was unsuccessful.
Our first step was downloading Hadoop and testing it on our workstations. After a few runs, the results proved impressive. We spent a lot of time on research to fully understand various aspects of the technology, including MapReduce, splitting, shuffling, etc. We read the main Hadoop technical books and relevant blogs and attended meet-ups. At this point, we were comfortable that NLP is a good use case for Big Data and Hadoop.
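For readers new to these terms, here is a minimal single-process Python sketch of the split, map, shuffle and reduce steps applied to a word count. It is an illustration only: it is not Hadoop code and not our NLP project, and the function names and sample notes are made up for the example.

```python
from collections import defaultdict


def split_input(notes, n_splits):
    """Splitting: divide the input records into n_splits chunks."""
    return [notes[i::n_splits] for i in range(n_splits)]


def map_phase(chunk):
    """Map: emit a (term, 1) pair for every word in every note of a chunk."""
    return [(word.lower(), 1) for note in chunk for word in note.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(notes, n_splits=2):
    """Run the four steps in order on a list of text notes."""
    chunks = split_input(notes, n_splits)
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


# Hypothetical sample notes, invented for this illustration.
sample_notes = ["patient reports chest pain", "chest pain resolved"]
counts = word_count(sample_notes)
```

In a real cluster, each chunk would be mapped on a different node and the shuffle would move data across the network. Here everything runs in one process, so only the order of the steps carries over.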
The first decisions were the most important ones: What environment do we need? How do we model our data? How do we get the data in and out of the cluster? Our Proof of Concept was implemented on a hybrid environment. The data processing was done in batch in Hadoop and the user-facing application was implemented with traditional technology. We quickly completed it, validating the value of Hadoop. However, a production environment would be completely different. The hybrid solution was not sustainable and we needed to move to the second phase of the project.
The second phase was to design and build an environment that can not only support our initial use case in real-time, but is also optimal for other use cases such as integrating EMR data with Biomedical data and building prediction models. This second phase is currently underway and will be completed in 2014.
The results were impressive and the technical challenges matched our expectations. But the real lessons were elsewhere:
We now think of data differently. The data concern switched from having too much to not having enough. We used to model and manage our data to save storage and cost, keeping only what we need. Now that storage is unlimited and inexpensive, we need to generate more electronic data, bring it into our platform and integrate it with other data.
At the beginning of this adventure, we thought Big Data would allow us to keep data as it is, run complex queries and integrate it at query time. We quickly realized that with the growing volume of data, data modeling and governance are increasingly important. Every IT project must now have a data strategy component.
Shift in Skills & Resource Requirements
We now look at skill sets and recruiting differently. The range of skills we need to analyze our data is becoming broader. While we will still need to build traditional dashboards, reports and OLAP cubes, an increasing number of our users will need to mine the data for discovery and insights using machine learning and statistical techniques. As a result, especially as a healthcare organization, we will need to bring users and their tools to where the data is, so they can take advantage of the technology hosting the data. We need data scientists who can intimately understand the data, master the technology to write complex scripts and interpret the results.
Complex Infrastructure Decisions
Big Data doesn’t make technology simple, but it allows a data analyst to concentrate on the data without worrying about the infrastructure. A Hadoop cluster manages its resources and monitors its processes. The cluster can automatically kill and restart processes without the user noticing it. The IT department must still build a reliable infrastructure, and with Big Data, infrastructure decisions are more challenging and complex. The choices of technology in the Big Data space are vast.
New and Constantly Changing
We realized that real experts in the field of Big Data are still rare. We sent one of our developers to training, and he knew as much as the instructor, if not more. The technology is changing fast, and we need to be ready for change. During the course of our project, we integrated the upgrade from Hadoop V1 to Hadoop V2. The Open Source community is great, and you will often get the right answer, but if not, you can look at the source code yourself.
The driving force of our Big Data project was our main use case and the users waiting for the application. Having a small dedicated project team who were eager to learn was a big plus. Having the time and flexibility to try different technologies and solutions allowed us to really learn on our own, and make the right decisions. We knew that the first iteration of what we build would be temporary, but it would be part of our education, and would change quickly with the evolution of the technology.
There is more to do at NewYork-Presbyterian with Big Data. Our short-term priority is to consolidate and validate what we built. We are still learning new technologies, and we have other use cases on deck. Our long-term goal is to integrate our Big Data environment with our other NYP data environments, creating one platform that allows users and applications to search, mine and retrieve the information they need. Access to data fuels innovation, and continuously improving access to data at NewYork-Presbyterian is the primary priority of our data analytics team.