The World: One Big Data Problem
We all use smartphones, but have you ever wondered how much data they generate in the form of texts, phone calls, searches, music, and more? We might not even realize how much our digital footprints contribute to the world's Big Data.
Smartphone users collectively generate approximately 40 exabytes of data every month.
Amazon Prime has 150 Million subscribers.
Netflix has nearly 183 Million subscribers.
Netflix collects user behavior data from those more than 183 million customers. This data helps Netflix understand what each individual customer wants to see; based on that analysis, it recommends movies and TV shows the viewer is likely to love.
Facebook has over 2.7 Billion users.
I probably don't need to tell anyone how Facebook uses Big Data. The name "Mark Zuckerberg" says it all!
Jokes apart, their Open Storage Challenge is worth reading about if you're interested.
Google processes over 3.5 Billion searches per day.
YouTube has about 1.3 Billion users.
300 hours of video are uploaded to YouTube every minute!
Be it banking, communication, healthcare, media, advertising, manufacturing, transportation, or retail, Big Data can be used everywhere, and that is why more and more businesses are trying to harness its power.
Data: Any facts and figures that can be stored in digital format can be termed data. All the text, numbers, images, videos, and audio stored on our phones and computers are examples of data.
Big Data
Big Data is data too, just at a huge scale. The term describes collections of data that are huge in volume and still growing exponentially with time. In short, such data is so enormous and complex in structure that traditional data management tools cannot store or process it efficiently and effectively.
The amount of data generated every day by industries, large organizations, and research institutes is growing at a blistering pace. These huge volumes of data must be kept not just for analytical purposes, but also to comply with laws and service-level agreements that require data to be protected and preserved.
Storage and management are major concerns in this era of big data. Storage devices must scale to keep pace with data growth while also improving access times and data transfer rates, which is demanding and challenging. These factors, to a considerable extent, determine the overall performance of data storage and management. Big data storage requirements are very complex, so a holistic approach is needed to mitigate their challenges.
With this explosive amount of data being generated, storage capacity and scalability have become a major challenge. The storage demand for big data at the organizational level is reaching petabytes (PB) and even beyond.
Storing and maintaining large data sets over time, at this rate of growth, can be difficult. An ideal storage solution must balance capacity, performance, throughput, cost, and scalability. In addition, storage devices play an important role in mitigating big data challenges. Reliability is an equal concern for big data storage. Reliability means the retrieval of data in its original form without any loss, and it must account for both internal and external system failures and vulnerabilities. At this scale, the probability of losing some data during retrieval can be very high. Large data-intensive applications such as Google Maps and Facebook require high Input/Output Operations Per Second (IOPS) to maintain the performance they need to stay in business.
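To get a feel for why reliability degrades at scale, here is a back-of-the-envelope sketch in Python. The fleet size and per-drive annual failure rate are illustrative assumptions, not measured figures.

```python
# Back-of-the-envelope: probability that at least one drive in a
# fleet fails within a year, assuming independent failures.
# Both numbers below are illustrative assumptions.

annual_failure_rate = 0.015   # ~1.5% AFR per drive (assumed)
fleet_size = 10_000           # drives in the cluster (assumed)

# P(no drive fails) = (1 - AFR) ^ fleet_size
p_no_failure = (1 - annual_failure_rate) ** fleet_size
p_at_least_one = 1 - p_no_failure

print(f"P(at least one failure per year): {p_at_least_one:.6f}")
# With these numbers the probability is effectively 1.0, which is
# why replication and redundancy are mandatory at this scale.
```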
The IT departments of most large organizations face strict, tight budgets, which limits their ability to manage the huge amounts of data at their disposal effectively. With limited funds, organizations must design techniques that fall within their budget. Maximizing performance and capacity while minimizing cost has become a headache for organizations operating at big data scale, and a research focus in academia. Their existing storage systems are inadequate to meet the stringent requirements of big data storage and management.
Storage Medium Issues
Mechanical hard disk drives (HDDs) and solid-state drives (SSDs) are the major storage media today, with hard disk drives forming the basis of the bulk of big data storage. Organizations mostly use HDDs and SSDs as their storage devices, with capacity density expected to increase at a rate of roughly 20 percent a year. HDDs' characteristics are completely different from SSDs': I/O subsystems designed for HDDs in traditional storage systems do not work well for SSDs. Reliability risks such as overheating and magnetic faults, along with disk access overhead, have made HDDs less desirable for big data storage, though their price per gigabyte is relatively low. SSDs, on the other hand, can service I/O requests at a much faster rate than HDDs: with no mechanical parts, access times drop and I/O rates rise. SSDs are also more resistant to physical shock, and hence more reliable. The problem with SSDs is the price per gigabyte; the cost of replacing all mechanical drives with SSDs in a big data store is unreasonably high.
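To make that trade-off concrete, here is a rough cost comparison in Python. The per-gigabyte prices are illustrative assumptions rather than current market figures.

```python
# Rough cost comparison for provisioning 1 PB of raw capacity.
# Prices per GB are illustrative assumptions only.

capacity_gb = 1_000_000                    # 1 PB expressed in GB
price_per_gb = {"HDD": 0.02, "SSD": 0.10}  # assumed $/GB

for medium, price in price_per_gb.items():
    total = capacity_gb * price
    print(f"{medium}: ${total:,.0f} for 1 PB")

# HDD: $20,000 for 1 PB
# SSD: $100,000 for 1 PB
# The gap is why many systems tier data: hot data on SSD for IOPS,
# cold data on HDD for raw capacity.
```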
Challenges
With the rate of data explosion, the storage systems of organizations and enterprises face significant challenges from the vast quantities of data already held and the ever-increasing rate at which new data is generated. Data, irrespective of its size, plays a vital role in industry, and value can be created from large data sets. For example, Facebook increases its ad revenue by mining its users' personal preferences and building profiles, showing advertisers which products each user is most interested in. Google likewise uses data from Google Search, Google Hangouts, YouTube, and Gmail accounts to profile users' behavior.
In spite of the numerous benefits that can be gained from large data sets, big data's demand for storage and processing poses a major challenge. The total size of data generated by the end of 2015 was estimated at 7.9 zettabytes (ZB), and by 2020 it was expected to reach 35 ZB. It is clear that big data has outgrown its infrastructure and is pushing the limits of storage capacity and storage networks. Existing traditional techniques cannot support effective analysis at this scale of data.
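A quick calculation shows how steep that implied growth rate is; the sketch below simply plugs in the two estimates quoted above.

```python
# Implied compound annual growth rate (CAGR) from the estimates
# quoted above: 7.9 ZB at the end of 2015 growing to 35 ZB by 2020.

start, end, years = 7.9, 35.0, 5
cagr = (end / start) ** (1 / years) - 1
print(f"Implied growth rate: {cagr:.1%} per year")  # ~34.7% per year
```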
Solution
Due to the massive increase in, and heterogeneous nature of, application data, one main challenge of big data is effectively managing the petabytes (PB) of data being generated daily. Storage management encompasses the technologies and processes organizations use to improve data storage performance. Big data requires technologies that can process large quantities of data within an acceptable time frame, and a wide range of techniques and technologies have been developed and adapted to manipulate, analyze, and visualize it. Technologies such as massively parallel processing (MPP) databases, data mining grids, distributed file systems, cloud computing platforms, and scalable storage systems are highly desirable. The deployment of Map-Reduce, along with Yahoo's Pig and Facebook's Cassandra, has gotten the industry's attention. The Google File System (GFS) is designed to meet the increasing demands of big data, such as scalability, reliability, and availability. GFS is composed of clusters, each made up of hundreds of storage servers supporting several terabytes of disk space, which addresses big data's scalability problem.
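To make the Map-Reduce idea concrete, here is a minimal word-count sketch in plain Python. It imitates the map, shuffle, and reduce phases in a single process; a real Hadoop or GFS-backed job would run each phase across many machines.

```python
from collections import defaultdict

# Minimal single-process sketch of the Map-Reduce word-count pattern.

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big storage", "storage systems for big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 3, 'data': 2, 'needs': 1, 'storage': 2, 'systems': 1, 'for': 1}
```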
Hadoop is a free, open-source implementation of Map-Reduce from the Apache Software Foundation. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware; it can store data across thousands of servers. All data in HDFS is split into block-sized chunks, distributed across different nodes, and managed by the Hadoop cluster.
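The sketch below imitates, in Python, what HDFS does conceptually: split a file into fixed-size blocks and place each block on several data nodes. The 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement here is a simplification, not HDFS's actual rack-aware policy.

```python
import itertools

# Simplified illustration of HDFS-style block placement.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # each block stored on 3 nodes

def place_blocks(file_size, nodes):
    """Split a file into blocks and assign each block to nodes
    round-robin. Real HDFS placement is rack-aware; this is not."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(nodes)
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [next(ring) for _ in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
for block, replicas in place_blocks(400 * 1024 * 1024, nodes).items():
    print(f"block {block} -> {replicas}")
# block 0 -> ['node1', 'node2', 'node3']
# block 1 -> ['node4', 'node1', 'node2']
# block 2 -> ['node3', 'node4', 'node1']
# block 3 -> ['node2', 'node3', 'node4']
```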
Distributing data across different data nodes in this way enhances the performance of the entire Hadoop system. Storage vendors such as NetApp, EMC, Hitachi Data Systems, and many more offer storage management solutions to big-data-inclined companies. EMC VPLEX, for instance, makes storage area networks manageable through a virtual storage infrastructure that consolidates heterogeneous storage devices.
Thank you for reading this far. Don't forget to clap if you liked the blog!