Bill Piper, VP of Hardware Engineering, Wells Fargo
What is big data?
Big data may be one of the hottest IT industry terms today despite the lack of consensus on its exact meaning. The word data is relatively straightforward; big is somewhat subjective. To complicate the issue further, some believe there is more to big data than just being big, and that it refers to software specifically designed to manage and analyze large data sets. Some view big data as the next generation of business intelligence/analytics, while others argue that it is not traditional business intelligence but a more exploratory approach. Personally, I view all of these as reasonable definitions, and define big data as all types of large, rapidly growing data, both structured and unstructured.
Rapid Data Growth
While the term big data has become popular in recent years, the trend of rapid data growth began at the dawn of the digital age. The hard disk drive was invented in the 1950s, with capacity measured in single-digit megabytes. These original disk drives used platters over twenty inches in diameter. Today we have 8-terabyte disk drives in a 3.5-inch form factor. This represents a capacity improvement of more than a million times over the last sixty years, and does not take into account the significant reduction in the size of the devices. To put this into perspective, the amount of storage in a common cell phone today is larger than that of an entire room of disk drives just a couple of decades ago. Data growth is the primary driver behind the innovation we have seen in the storage technology space. Outpacing Moore’s law (doubling every 24 months) is not a challenge for the faint of heart.
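The "over a million times" figure can be checked with simple arithmetic. The sketch below assumes a starting capacity of 5 megabytes (an illustrative value within the "single-digit megabytes" range, not a figure from this article) and decimal units; the variable names are hypothetical.

```python
import math

# Assumed starting point: 5 MB, an illustrative single-digit-megabyte value.
EARLY_DRIVE_BYTES = 5 * 10**6
# Modern drive from the text: 8 TB (decimal units assumed).
MODERN_DRIVE_BYTES = 8 * 10**12

# How many times larger the modern drive is.
improvement_factor = MODERN_DRIVE_BYTES / EARLY_DRIVE_BYTES

# How many successive doublings that factor corresponds to.
doublings = math.log2(improvement_factor)

print(f"Improvement factor: {improvement_factor:,.0f}x")
print(f"Equivalent doublings: {doublings:.1f}")
```

Under these assumptions the factor comes out well above one million, and it describes a single drive only; it says nothing about how many drives an organization runs.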
Transformation of Storage Driven By Big Data
Over the last twenty years, nearly all organizations transitioned to centralized or shared storage technologies to deal with rapidly growing storage capacities and workloads. From an organizational perspective, this led to the creation of storage teams and/or departments. Physically separating compute and storage resources makes their utilization and capacity (space and performance) management functions independent. This independence has a profound impact on our day-to-day ability to run an optimized infrastructure. Adding storage capacity no longer requires additional physical servers, and adding compute power no longer requires additional physical storage devices.
Modern storage technologies have become efficient through over-provisioning and pooled resources. Over-provisioning is the allocation of more resources than are physically installed in the system. This is possible because, on average, only 40-50 percent of the storage allocated to a system is actually consumed. Pooled resources combine multiple resources such as disk drives, CPU, memory, etc. into a single pool. Resource utilization can then be managed at the pool level rather than the individual component level. Together these techniques have raised average storage utilization to the 60-70 percent range. This is roughly a 50 percent relative increase in utilization, which translates directly to lower CAPEX and OPEX by reducing the amount of physical equipment being purchased and maintained. The downside of a pooled resource model is the complexity of prioritizing high-value requests. In the storage world, this translates to ensuring mission-critical systems are given priority over less critical systems such as test and development systems. While there are many methods for handling this prioritization, many of them run counter to the pooled resource model and the efficiencies it drives.
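The relationship between over-provisioning and pool-level utilization can be illustrated with a small model. This is a minimal sketch, not a description of any particular storage product: the `StoragePool` class, the volume names, and the capacity figures are all hypothetical, chosen so that each volume consumes roughly 40-50 percent of its allocation.

```python
from dataclasses import dataclass, field


@dataclass
class StoragePool:
    """Hypothetical pooled storage with over-provisioned volumes."""
    physical_tb: float
    # Maps volume name -> (allocated TB, consumed TB).
    volumes: dict = field(default_factory=dict)

    def add_volume(self, name: str, allocated_tb: float, consumed_tb: float) -> None:
        self.volumes[name] = (allocated_tb, consumed_tb)

    @property
    def allocated_tb(self) -> float:
        return sum(allocated for allocated, _ in self.volumes.values())

    @property
    def consumed_tb(self) -> float:
        return sum(consumed for _, consumed in self.volumes.values())

    def overprovision_ratio(self) -> float:
        # Values above 1.0 mean more was promised than is physically installed.
        return self.allocated_tb / self.physical_tb

    def utilization(self) -> float:
        # Share of the installed capacity that actually holds data.
        return self.consumed_tb / self.physical_tb


# Illustrative numbers only: 100 TB installed, 150 TB promised to three systems.
pool = StoragePool(physical_tb=100)
pool.add_volume("erp", allocated_tb=50, consumed_tb=20)
pool.add_volume("email", allocated_tb=50, consumed_tb=25)
pool.add_volume("test-dev", allocated_tb=50, consumed_tb=20)

print(f"Over-provisioning ratio: {pool.overprovision_ratio():.2f}")
print(f"Pool utilization: {pool.utilization():.0%}")
```

In this example each system uses less than half of what it was allocated, yet the pool as a whole runs at 65 percent of installed capacity because 150 TB has been promised against 100 TB of physical disk. The model deliberately leaves out prioritization between volumes, which is the harder problem described above.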
With data continuing to grow rapidly, the aforementioned techniques are being pushed to the limit. Enterprise storage devices can be upgraded online, but there are physical limitations to how large they can grow. Many organizations place soft limits on the size of these devices to manage the risk of concentrating too much data in a single system. One approach to this challenge has been the rise of scale-out storage technologies. These technologies grow by adding resources, often termed nodes, as demands increase. This approach eliminates many physical constraints associated with the traditional monolithic approach. However, there are still physical constraints, as the connection between nodes requires high bandwidth and low latency to maintain high performance levels.
Storage devices are not the only technologies leveraging scale-out architectures today. There are similar trends within the compute or server space. Many virtualization, cloud, big data, and grid computing platforms today leverage scale-out architectures. This leads to the natural question: Should storage and compute scale out together? This approach works well for platforms with known workloads, growth patterns, and storage functionality (replication, encryption, tiering, caching, etc.).
How Do We Manage All This Data?
In many industries data is doubling every 18-36 months with no signs of slowing down. The majority of this growth is driven by the creation of new data, but growth can be curtailed by purging data that no longer has business value or is no longer subject to a compliance requirement. When was the last time you cleaned out your inbox, documents, downloads, or any other data you manage? You likely can’t recall, but you probably remember that it was time consuming. Managing data throughout its entire lifecycle can be labor intensive. Data management is an area of technology with a lot of opportunity that, I believe, will see plenty of innovation in the near future.
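To see what a doubling period of 18-36 months means for capacity planning, the growth can be projected with a simple exponential formula. This is a rough sketch: the `projected_size_tb` function and the 100 TB starting footprint are hypothetical, and it assumes steady doubling with no purging.

```python
def projected_size_tb(current_tb: float, doubling_months: float, years: float) -> float:
    """Project data size assuming it doubles every `doubling_months` months."""
    return current_tb * 2 ** (years * 12 / doubling_months)


# Illustrative starting footprint of 100 TB, projected three years out.
fast_growth = projected_size_tb(100, doubling_months=18, years=3)
slow_growth = projected_size_tb(100, doubling_months=36, years=3)

print(f"Doubling every 18 months: {fast_growth:.0f} TB after 3 years")
print(f"Doubling every 36 months: {slow_growth:.0f} TB after 3 years")
```

At the two ends of the stated range, the same starting footprint ends up differing by a factor of two after only three years, which is why the doubling period matters so much when sizing purchases.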
From a business perspective, data analytics is an amazing opportunity. In today’s world we regularly deal with very complex business decisions requiring data from multiple sources. Analytics provide the ability to not only pull all of this data together, but also provide statistical analysis (correlations, regressions, simulations, optimization, etc.). This provides management teams the data they need to make business decisions more intelligently. When leveraged properly, these decisions directly impact business outcomes such as costs, revenue, risk, customer service, etc. The reporting aspect of analytics helps capture business value.
The future for analytics in the workplace is very bright. There are already real-world examples of this technology automating decisions. Customers who receive a phone call from Wells Fargo asking about questionable transactions are directly benefiting from analytics, which are able to detect and prevent fraud before a human even realizes it has occurred. In this example, analytics are minimizing cost; in other examples, this technology optimizes profit. The automation of profit optimization challenges the notion of IT as a pure cost center.