Big Data – How Big is Big?
While the term Big Data was coined some 20 or more years ago – the true genesis of the term is widely debated – it has only been in the past few years that the term has become one of the everyday ‘hype’ words of the Information Technology (IT) industry. More importantly, it’s only been in the past two or three years that Big Data analytics has moved downmarket. Whereas Big Data analytics was previously relegated to Fortune 100 class companies due to cost and complexity, several recent developments make Big Data relevant, affordable, accessible – and maybe even imperative – to much smaller companies:
- Ever-increasing processing power of servers and density of storage subsystems, with ever-decreasing cost per MIP and per bit
- Open source Big Data architectures and file systems such as Apache Hadoop 2.0 and the Hadoop Distributed File System (HDFS)
- Big Data analytics databases and applications such as Cloudera, Hortonworks, MapR, NoSQL databases and others

Many ski resorts now use RFID chips to authenticate skiers as well as track their activity on the mountain. The ski pass is scanned automatically on every lift ride and may also be used for food and gear purchases. The resorts, and subsequently sporting gear manufacturers and other affiliated resort operators, use this information about users’ recreational activity: visit frequency, risk levels, implicit fitness levels, and even food preferences, inferred from the number and difficulty of the runs taken throughout the day and the meals eaten on the mountain. This information may be used for targeted advertising as well as for determining where improvements at the resort might yield even higher revenues.
The world continues to generate and collect massive amounts of data every day. Recent studies by various IT companies and industry analyst firms estimate that:
- The world generates 2.5 quintillion bytes of information per day – the equivalent of roughly 78 million 32GB iPads!
- 90% of the world’s data was generated in just the past two years.
- The number of Internet-connected devices will grow from ~6 billion at the end of 2013 to more than 18 billion devices by the end of this decade. Each connected device continues to get faster and generate even more data with each subsequent generation.
What kinds of devices? Certainly our laptops, smartphones and tablets are all connected, but add to that videogame machines, televisions, Blu-ray players, fitness wristbands, RFID-enabled identification cards and even our cars and refrigerators!
To put some additional perspective on the scale of Big Data, today’s workloads commonly measure in the tens of terabytes and even up to petabytes of data. Big Data databases routinely measure in the billions or even trillions of records. All of this data must be stored, indexed, queried, retrieved and analyzed to deliver useful results in time frames relevant to the pace of business – typically within a matter of minutes or tens of minutes. This requires an infrastructure built using high performance servers, fast storage arrays and a high bandwidth, low latency interconnect.
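A quick back-of-envelope calculation shows why parallelism and a fast interconnect matter at this scale. The node counts and per-node disk bandwidth below are illustrative assumptions, not measurements of any particular cluster:

```python
# Back-of-envelope scan-time estimate for a Big Data workload.
# All figures here are illustrative assumptions, not measured values.

def scan_time_seconds(dataset_bytes, nodes, per_node_mb_per_s):
    """Time for `nodes` servers reading in parallel to scan the dataset once."""
    aggregate_bytes_per_s = nodes * per_node_mb_per_s * 1_000_000
    return dataset_bytes / aggregate_bytes_per_s

PETABYTE = 10**15

# A 100-node cluster at an assumed 500 MB/s of disk bandwidth per node:
small = scan_time_seconds(PETABYTE, nodes=100, per_node_mb_per_s=500)
# A 1,000-node cluster with the same per-node bandwidth:
large = scan_time_seconds(PETABYTE, nodes=1000, per_node_mb_per_s=500)

print(f"100 nodes:  {small / 3600:.1f} hours")   # about 5.6 hours
print(f"1000 nodes: {large / 60:.1f} minutes")   # about 33 minutes
```

Under these assumptions, only the larger cluster can scan a full petabyte within the "tens of minutes" window, and that is before any network shuffle traffic, which is where interconnect bandwidth and latency come into play.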
At several busy intersections, surveillance cameras continuously capture video imagery of automobile traffic. If an accident were to occur at, say, 8:00 AM on a Tuesday, law enforcement can retrieve the video files for a few minutes before and after the incident – still a lot of image data to analyze. Using character recognition algorithms, license plate information for all vehicles passing through the intersection may be discerned, thus converting unstructured data into structured data, i.e., the license plate numbers themselves. This data in turn can be correlated with DMV records to determine the make, model and owner information for the vehicles at the site of the accident, in order to identify potential witnesses to the event.
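The final step of that pipeline, after character recognition has produced raw text, can be sketched in a few lines. The plate format, OCR output lines, and the in-memory stand-in for a DMV lookup table are all invented for illustration:

```python
import re

# Hypothetical sketch: turning unstructured OCR output into structured records.
# The plate pattern, OCR lines, and "DMV" table below are invented examples.

PLATE_RE = re.compile(r"\b[A-Z]{3}-\d{4}\b")  # assumed plate format

ocr_lines = [
    "08:01:13 cam2 ... ABC-1234 partial glare ...",
    "08:01:15 cam2 ... no plate visible ...",
    "08:01:17 cam2 ... XYZ-9876 ...",
]

# Toy stand-in for a DMV records table keyed by plate number.
dmv = {
    "ABC-1234": {"make": "Honda", "owner": "J. Smith"},
    "XYZ-9876": {"make": "Ford", "owner": "R. Jones"},
}

# Extract structured plate numbers from the unstructured text...
plates = [m for line in ocr_lines for m in PLATE_RE.findall(line)]
# ...then join against the records table to identify potential witnesses.
witnesses = [{"plate": p, **dmv[p]} for p in plates if p in dmv]
print(witnesses)
```

The point of the sketch is the shape of the transformation: free-form text goes in, rows with named fields come out, and those rows can then be joined against existing structured databases.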
Structured vs. Unstructured Data
Numerous database applications have analyzed structured data for decades. However, some of the more interesting contemporary Big Data applications now extract useful information from unstructured data. Structured data is that which fits neatly into the rows and columns of a relational database – for example, the call detail records (CDR) from the phone company, which include the called number, calling number, date and time of the call, call duration and location information. Unstructured data is more amorphous. Simple examples include digital photos, video, and audio files. Big Data applications employing image processing algorithms, speech and voice recognition, and keyword search capabilities can extract useful information from huge volumes of unstructured data. This type of analysis would have been prohibitively expensive just a few years ago, but with the ever-decreasing cost of CPU processing power, unstructured data analytics has become commonplace.
Real World Big Data Applications
So now we’ve collected huge volumes of data, but that data in and of itself is quite useless. Only if data can be analyzed and transformed into useful, actionable information does it become relevant to business. Big Data analytics helps organizations optimize and improve decision making in virtually every industry segment. The chart below lists just a few examples of the simple and complex problems that companies apply Big Data analytics and Hadoop infrastructure to solve. And there are many more.
Big Data Principles and MapReduce
HADOOP CLUSTER INTRODUCTION
A key characteristic of Big Data analytics is parallel processing of the workloads. This methodology is quite similar to HPC cluster architectures where tens to thousands (or even hundreds of thousands) of processors operate in parallel as a single large machine to solve a complex problem. In many ways, Hadoop and HPC clusters are the opposite of the server virtualization model where a single machine is carved up into multiple independent virtual machines.
In a Hadoop cluster, large amounts of data are stored and processed across many servers, physical and/or virtual. Typically, storage is directly attached to the servers, although there are models using centralized storage as well. HDFS creates a single logical data store for the analytics application, and the architecture optionally creates redundant copies of the data to avoid loss in the event of server failures (see the Extreme Networks Big Data Solution Guide for more details). A Master Node, often referred to as a Resource Manager, manages the applications and the Worker Nodes in the cluster that perform the parallel processing of the workload.
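The value of those redundant copies can be shown with a toy model. The round-robin placement policy below is a simplification for illustration, not HDFS's actual replica placement algorithm, but it demonstrates the property that matters: with a replication factor of three, no single server failure loses data.

```python
# Toy model of block replication across a small cluster.
# The placement policy is illustrative, not HDFS's actual algorithm.

NODES = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3  # each block is stored on 3 distinct nodes

def place_blocks(num_blocks):
    """Round-robin each block's replicas onto REPLICATION distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = {NODES[(b + i) % len(NODES)] for i in range(REPLICATION)}
    return placement

placement = place_blocks(20)

# Simulate the failure of each node in turn and check that every
# block still has at least one surviving replica.
for failed in NODES:
    survivors = {b: replicas - {failed} for b, replicas in placement.items()}
    assert all(survivors[b] for b in survivors)

print("no single-node failure loses data")
```

With three replicas, the cluster in this model can in fact tolerate two simultaneous node failures; the trade-off is that raw storage consumption triples.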
A SIMPLE INTRODUCTION TO MAPREDUCE
One of the more popular Big Data processing programming models is MapReduce. With MapReduce, the workload (for example a search task or even a mathematical operation on a large distributed dataset) is segmented into much smaller tasks and parallelized across multiple processors in the cluster. The diagram below demonstrates a very simple example of MapReduce.
A customer searching an online sporting goods site for ski boots types in a query; the search is mapped across the cluster’s worker nodes in parallel, and the partial results are reduced into a single response.
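The same map/shuffle/reduce flow can be sketched in a few lines of Python. This single-process word count is a minimal sketch of the programming model only; in a real Hadoop job, the framework distributes the map and reduce tasks across the cluster's worker nodes and performs the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain

# Minimal single-process sketch of the MapReduce model (word count).

def map_phase(split):
    # Map: emit a (key, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine each key's values into a final result.
    return key, sum(values)

splits = ["ski boots and ski poles", "boots for skiing"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["ski"], counts["boots"])  # 2 2
```

Because each map task touches only its own input split, and each reduce task only its own keys, both phases parallelize naturally across as many worker nodes as the cluster provides.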
Interconnect Challenges and Requirements
Big Data means storing, processing and moving massive amounts of data across multiple servers. That same data is usually replicated across multiple servers to provide inherent redundancy within the Hadoop cluster; a node can fail without any data being lost, and in fact the Big Data analytics applications continue to operate, albeit with slightly reduced performance. The interconnect must meet a number of challenges and satisfy a number of requirements to deliver useful results. These include:
- Scalability
- Performance
- Manageability
- High Availability
- Security
Big Data interconnects require an elastic architecture that scales in port count and bandwidth as application demand increases. Extreme Networks offers high performance top of rack (ToR) switches and modular switches that support Hadoop clusters ranging from a dozen nodes to several thousand nodes. The two figures below represent two sample architectures; one built solely from ToR switches for modest sized yet extremely high performance clusters and a second scaling to thousands of compute nodes.
Hadoop interconnect built from fixed form-factor switches for both server connect and the spine interconnect in a Fat Tree configuration, scaling to hundreds of 10GbE connected servers.
Core – Edge Hadoop interconnect using Top of Rack switches to connect servers, with 40GbE uplinks to a high performance core switch. This architecture can scale out to support thousands of processing nodes.
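Sizing either topology comes down to simple port arithmetic. The leaf and spine port counts below are assumptions chosen for the sketch, not the specifications of any particular Extreme Networks switch model:

```python
# Illustrative port-count arithmetic for a leaf/spine (Fat Tree) Hadoop
# interconnect. The port counts below are assumed example values, not
# the specifications of any particular switch model.

LEAF_10GBE_PORTS = 48   # server-facing 10GbE ports per ToR/leaf switch
LEAF_40GBE_UPLINKS = 6  # 40GbE uplinks per leaf toward the spine

servers_per_leaf = LEAF_10GBE_PORTS
downlink_gbps = LEAF_10GBE_PORTS * 10   # bandwidth entering the leaf
uplink_gbps = LEAF_40GBE_UPLINKS * 40   # bandwidth toward the spine

# Oversubscription ratio: server bandwidth in vs. spine bandwidth out.
# 1:1 is non-blocking; Hadoop's shuffle-heavy traffic favors low ratios.
oversubscription = downlink_gbps / uplink_gbps
print(f"{servers_per_leaf} servers per leaf, "
      f"{oversubscription:.1f}:1 oversubscription")  # 48 servers, 2.0:1

# Scale-out: a spine of switches with 32 x 40GbE ports each could attach
# 32 leaves, i.e. roughly this many 10GbE-connected servers:
print(32 * servers_per_leaf, "servers")  # 1536
```

The same arithmetic, run in reverse from a target node count and an acceptable oversubscription ratio, determines how many leaf and spine switches a given cluster design requires.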
A Big Data analytics application must return a result within a time span that is relevant to the business process it supports. Examples might range from a product search performed by a potential online customer, to a report on transaction activity for the accounting department, to an early indicator of potential credit card fraud, or many other types of information extracted from large amounts of stored data. Typically, a job must complete and return results within a few seconds to a few minutes. Both high port speeds and low latency considerations drive network decisions for Hadoop interconnects, and the majority of new Hadoop clusters now connect with 10GbE ports. Extreme delivers high performance, low latency switches in multiple form factors supporting 10GbE, 40GbE and 100GbE ports. A full complement of Datacenter Bridging (DCB) features provides lossless Ethernet, delivering both performance and availability.
End users simply want the Big Data application to work. As Big Data analytics move downstream to smaller organizations, IT resources will likely be stretched thin. While it’s easy to focus on server, storage and even network performance in designing a Big Data cluster, careful evaluation of the management applications can pay big dividends in reduced operational expenditures (OPEX) and faster “time to repair” if a disruption occurs.
A user-friendly Network Management System is easier to learn and easier to operate. Features such as dynamic topology generation, auto-provisioning, and CLI scripting accelerate troubleshooting as well as help automate configuration changes and the introduction of new hardware or software features. Finally, a responsive technical support partner not only helps resolve issues more quickly if they occur, but can also help to more quickly implement new features and capabilities into the infrastructure.
As businesses depend more and more on Big Data to operate and optimize, the infrastructure must be architected to minimize disruptions due to unplanned events. We saw earlier how the Hadoop cluster may be designed to tolerate the failure of an individual server without suffering data loss. Similarly, the network interconnect must be designed with redundancy and resiliency. Selecting network elements with proven records of high hardware MTBF and high software quality will minimize interruptions in business-critical applications.
Early Big Data clusters were isolated and often kept physically separated from the rest of IT infrastructures. However, as Big Data becomes more readily adopted in smaller enterprises, security becomes a relevant concern. The data itself is usually valuable and sometimes sensitive (for example, credit card data or health records), therefore the threat of leaks or security breaches must be addressed. Protection at the network layer through server and end user authentication is important as well as protecting the network itself from intrusions and DoS attacks.
For More Information…
Some estimates have claimed that less than 1% of data collected is actually analyzed and converted into useful, actionable information. Recent technological advances as well as the ever decreasing costs for processing and storage now make it possible for mid-tier and even small enterprises to harness the value of Big Data. Extreme Networks and our partners can help with the design and implementation of high performance yet cost-effective Big Data solutions for your organization. To learn more about Big Data and Big Data solutions, contact your Extreme Networks partner or sales representative.