The American Association for Public Opinion Research defined big data as an “imprecise description of a rich and complicated set of characteristics, practices, techniques, ethical issues, and outcomes all associated with data”.
Big Data as a Concept and Related Terms
Chen added a more technical definition of big data: “big data processes the datasets which cannot be interpreted, stored, managed, and processes by traditional software or hardware tools”. In other words, the term covers capturing, storing, managing, analyzing and processing a huge amount of data that cannot be handled by a traditional database management system such as an RDBMS. As data sizes increase, complexity increases too, which makes the data harder to interpret and manage. Consequently, enterprises now require big data analytics tools with real-time or near real-time capabilities to process such complex data sets.
Characteristics of big data
Zikopoulos and Eaton identified the characteristics of big data as the 3Vs: volume, variety and velocity. “In Lomotey and Deters, this model has been extended into 5V, by adding: Value for understanding the cost and value of data and Veracity to check the accuracy of the data and data cleaning”.
The primary attribute of big data is volume, which corresponds to how big the data is. The volume of big data can reach terabytes (TB) or petabytes (PB) because it comes from many different sources, including logs, clickstreams, social media and sensor data. Storing and processing data sets of this size is therefore not feasible with traditional storage systems.
Variety corresponds to the types of data. Big data includes unstructured data of many different types, whereas traditional data is structured and can be stored in relational databases. With changing technology, big data comes from many different sources with different data types, such as structured, unstructured, geographic, real-time media, natural language, time series, event, network and linked data.
- Unstructured Data: audio, video, emails, social media, logs, IoT, clickstreams
- Semi-structured Data: JSON, XML
- Structured Data: Excel spreadsheets, delimited TXT files, relational databases (RDBMS)
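The difference between the three categories above can be illustrated with a minimal Python sketch. The sample records, field names and the keyword check are all hypothetical and invented for illustration; they do not come from any cited source.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
structured_csv = "customer_id,city,amount\n1,Leeds,20.5\n2,York,13.0\n"
semi_structured_json = '{"customer_id": 3, "city": "Bath", "tags": ["new", "mobile"]}'
unstructured_text = "Customer 4 called from Hull and complained about a late delivery."

# Structured: every row follows the same fixed set of columns.
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: keys describe the data, but fields may vary or nest.
record = json.loads(semi_structured_json)

# Unstructured: no schema at all; here only a crude keyword search is possible.
mentions_delivery = "delivery" in unstructured_text.lower()

print(rows[0]["city"], record["tags"], mentions_delivery)
```

The structured rows can be queried by column name straight away, the JSON record needs its shape inspected first, and the free text needs some form of text analysis before it yields anything comparable.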
Velocity corresponds to the speed at which data arrives and the speed at which it changes. High-velocity data requires distributed processing techniques with real-time and non-real-time capabilities. Veracity is the quality of the data and deals with its correctness. Value is the ability to turn huge amounts of data into valuable resources; it helps to improve the efficiency of the organisation and the effectiveness of maintenance. “It is all well and good having access to big data but unless we can turn into value it is become useless. It becomes very costly to implement IT infrastructure systems to store big data, and business are going to require a return on investment.”
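As a rough illustration of velocity, the sketch below counts how many events arrived within a trailing time window as each new event comes in. It is a single-machine toy under stated assumptions, not a distributed implementation: the function name `events_in_window`, the millisecond timestamps and the one-second window are all hypothetical.

```python
from collections import deque


def events_in_window(timestamps_ms, window_ms):
    """Yield (timestamp, number of events in the trailing window).

    Assumes timestamps arrive in non-decreasing order.
    """
    window = deque()
    for ts in timestamps_ms:
        window.append(ts)
        # Drop events that are at least one full window older than the newest one.
        while window[0] <= ts - window_ms:
            window.popleft()
        yield ts, len(window)


# Hypothetical arrival times in milliseconds.
arrivals = [0, 500, 1000, 1200, 3000, 3100]
counts = [count for _, count in events_in_window(arrivals, window_ms=1000)]
print(counts)
```

Because each event is handled once as it arrives and old events are discarded, the memory used depends on the window size rather than on the total volume of data, which is the basic idea behind real-time stream processing.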
Advantages and disadvantages of big data
Ciklum explains the advantages of big data, which are: easy and quick identification of the root causes of failures; fraud detection; catching errors quickly and responding to upcoming failures; supporting innovation; increasing revenue; defining pain points; learning customer needs; creating customer value; increasing customer satisfaction and loyalty; tracking customer movements; knowing customers better; predicting customer trends; and increasing operational effectiveness in areas such as customer service.
Unstructured data is captured from IoT devices, humans and online machines, which provide rich, varied data that can be used to understand user requirements. Examples of unstructured data are social media, keyword searches, clickstreams and YouTube videos; this kind of real-time data enables companies to create real-time advertisements.
Besides these advantages, big data has disadvantages: the cost of deploying and managing a big data platform; complexity, which requires proper training and the hiring of experienced employees; the difficulty of choosing the correct platform; and the possibility that collected data is misused when private data is shared between organisations. Regulations try to protect customers from misuse, but organisations must be more careful on this topic.
Big Data Tools
Big data can be integrated and implemented using different tools such as Hadoop, Spark, MapReduce, Pig, Hive, Cassandra and Kafka. Companies choose the most effective tool for their specific requirements after evaluating the advantages and disadvantages of each. In Figure, Towards Data Science presents big data tools:
Hadoop is widely used and very popular in big data implementations, and it can handle huge volumes of data in both structured and unstructured formats.
Apache Spark is a computing engine that processes data on computer clusters using parallel processing.
Apache Cassandra is a database offering capabilities such as scalability and high availability, and it is a good choice for web and mobile applications.
Kafka is another successful tool, with low-latency and real-time processing functionality.
Apache Storm is a free, open-source real-time processor with complex event processing capabilities.
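The parallel processing model behind tools such as MapReduce and Spark can be sketched in plain Python. The example below is only an illustration of the map-then-reduce idea on a single machine; it does not use the Hadoop or Spark APIs, and the partitions and function names (`map_partition`, `merge_counts`) are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count the words within one partition of the data."""
    return Counter(word.lower() for line in lines for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine the partial counts of two partitions."""
    return left + right


# Hypothetical partitions; on a real cluster each would sit on a different node.
partitions = [
    ["big data needs big tools"],
    ["data tools process data"],
]

# Each partition is mapped independently, so this step could run in parallel.
partial_counts = [map_partition(p) for p in partitions]

# The partial results are then merged into one final answer.
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(3))
```

The key design point is that the map step never needs to see more than its own partition, which is what allows the work to be spread across many machines when the data is too large for one.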