An Introduction to Big Data
What is Big Data?
Big data is a buzzword, or catch-phrase, used to describe collections of data, structured and unstructured (text, images, audio, video, log files, emails, simulations, 3D models, military surveillance, e-commerce records and so on), so massive that they are difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. In other words, big data is data so large and complex that collecting, storing, sorting, processing, retrieving and analyzing it with existing relational database management tools or traditional data-processing techniques becomes slow, tiresome or impractical. Big data usually involves data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.
Some examples of Big Data:
- An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
- Twitter has over 500 million registered users.
- The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
- 79% of US Twitter users are more likely to recommend brands they follow.
- 67% of US Twitter users are more likely to buy from brands they follow.
- 57% of all companies that use social media for business use Twitter.
How fast data is increasing:
Consider the widely shared infographics depicting "what happens in every 60 seconds on the internet". They show how much data is generated in a second, a minute, a day or a year, and how exponentially that volume is growing. According to an analysis by TechNewsDaily, we might generate more than 8 zettabytes of data by 2015.
Characteristics of Big Data:
Data scientists break big data into four dimensions: volume, variety, velocity and veracity.
- Volume: The sheer size of the data, which can amount to hundreds of terabytes or even petabytes of information.
- Velocity: The increasing rate at which data flows into an organization.
- Variety: A common theme in big data systems is that the source data is diverse and doesn’t fall into neat relational structures.
- Veracity: The biases, noise and abnormality in data. Is the data being stored and mined actually meaningful to the problem being analyzed?
Big Data Problems:
Traditional systems built within a company to handle relational databases may not be able to support or scale as data is generated with ever higher volume, velocity and variety.
- Volume: As an example, the terabytes of posts generated on Facebook or the roughly 400 billion tweets sent annually could mean big data. This enormous amount of data must be stored somewhere so it can be analyzed to produce data-science reports for different solutions and problem-solving approaches.
- Velocity: Big data requires fast processing, and the time factor plays a crucial role in many organizations. For instance, the stock market generates millions of records that need to be stored and processed at the same speed at which they enter the system.
- Variety: Big data has no specific format. It can come in any form: structured or unstructured text, images, audio, video, log files, emails, simulations, 3D models, etc. Until now we have mostly worked with structured data, and it can be difficult to handle the quality and quantity of the unstructured or semi-structured data we generate on a daily basis.
How big data frameworks handle the above problems:
- Distributed File System (DFS): In a DFS, a large set of data files is divided into smaller blocks that are loaded onto multiple machines, ready for parallel processing. For example, if we have 1 terabyte of data to read on 1 machine with 4 input/output channels, each channel reading at 100 MB/sec, the whole 1 TB will be read in roughly 45 minutes. On the other hand, if we have 10 machines, we can divide the 1 TB of data across them and read it in parallel, which reduces the total time to roughly 4.5 minutes.
- Parallel Processing: When data resides on N servers, the combined power of those N servers can process it in parallel for analysis, which reduces the time the user waits for the final report or analyzed data.
- Fault Tolerance: Fault tolerance is one of the main reasons for using a big data framework such as Hadoop. Even when running jobs on a large cluster where individual nodes or network components experience high rates of failure, these frameworks can guide jobs to successful completion because the data is replicated across multiple nodes.
- Use of Commodity Hardware: Most big data tools and frameworks run on commodity hardware, which reduces the cost of the total infrastructure and makes it easy to add more machines as the data size increases.
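The replication idea can be sketched as follows. This is an illustrative toy, not Hadoop's actual block-placement policy: each block is simply assigned to three distinct nodes round-robin, so the failure of any single node still leaves two copies of every block.

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    ring = itertools.cycle(range(len(nodes)))
    return {block: [nodes[next(ring)] for _ in range(replication)]
            for block in blocks}

placement = place_replicas(["blk_1", "blk_2"],
                           ["node-a", "node-b", "node-c", "node-d"])

# If any single node holding a replica fails, at least two copies survive.
for replicas in placement.values():
    for failed in replicas:
        assert len(set(replicas) - {failed}) >= 2

print(placement["blk_1"])  # ['node-a', 'node-b', 'node-c']
```

Real systems add refinements such as rack awareness (spreading replicas across racks so a whole-rack failure cannot lose a block), but the survival argument is the same.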