In the data mining stage, algorithms are run on the cleaned data set with the help of data mining tools, and hidden patterns and models are discovered as a result. This section presents our research on the two data mining libraries used to analyze our data set. It gives general information about Weka and Apache Spark before we explain our analysis process.
Weka is a data mining tool that contains algorithms for data analysis. You can analyze your data through the desktop application or by calling the Weka library from your own code. Weka's use cases are data pre-processing, regression, classification, clustering, association rules and visualization.
Previous studies have found that Weka achieves a lower error rate than RapidMiner and Orange. A low error rate is the most important concern in data analysis. However, Weka is known not to work well on large data sets. For this reason, we searched for another data mining library, and as a result of our research we came across Apache Spark.
Spark is a general-purpose data processing engine, suitable for use in a wide range of circumstances. It was designed to run in memory, which allows it to process data much faster than disk-based approaches such as Hadoop MapReduce. Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley.
Spark has extensive libraries and APIs and supports programming languages such as Java, Python, R, SQL and Scala. Spark often uses the Hadoop Distributed File System (HDFS) for data storage, but it can also integrate with HBase, Amazon's S3, MongoDB, Cassandra and MapR-DB. Well-known companies such as IBM and Huawei, the Chinese search engine Baidu, the e-commerce company Alibaba Taobao, the social networking company Tencent and the pharmaceutical company Novartis use Apache Spark.
The main reasons for using Spark in data processing are:
- Simplicity: Spark's capabilities are accessible through a collection of APIs, which are designed and documented to be easy for developers to use.
- Speed: Spark was developed to shorten data processing time. It won the Daytona GraySort benchmarking challenge in 2014 by processing 100 terabytes of data in 23 minutes, which demonstrated its speed.
- Support: Spark supports many programming languages and storage systems. The Apache Spark community is also large, active and international.
Apache Spark Use Cases
Spark is an API-driven tool that can be integrated into applications to analyze data quickly. Its main use cases are:
- Stream processing: It is very difficult for developers to deal with data streams such as log files and sensor data. This data usually arrives as a continuous flow from multiple sources. Although storing and processing it on disk is difficult, extracting meaningful information from it is very important.
- Machine learning: Machine learning has become indispensable as data sizes have grown. Spark's ability to keep data in memory and respond quickly to repeated queries makes it well suited to machine learning workloads.
- Data integration: Data generated by different systems needs to be combined for reporting and analysis. However, because the data is often dirty and inconsistent, combining it is not easy. Data integration is therefore performed first; it includes collecting, cleaning and standardizing the data so that it can be analyzed.
- Interactive analytics: Spark can respond and adapt quickly enough to support interactive querying.
Apache Spark Usage Techniques
Apache Spark is open source software and requires at least Java version 6 and Maven version 3.0.4. You can use Spark through the Spark shell or through its APIs.
We used Apache Spark in a Java Spring Boot project by adding it as a dependency through Apache Maven. Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), which is defined in pom.xml, Maven can manage a project's build, reporting and documentation from a central piece of information.
Pom.xml dependency for adding Apache Spark.
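A Maven dependency on Spark Core generally takes the following form. This is an illustrative sketch only: the Scala suffix in the artifactId (`_2.11`) and the `spark.version` property are assumptions, and both have to match the Spark release that is actually used.

```xml
<!-- Illustrative sketch: the artifact suffix and the version property are assumptions -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <!-- spark.version is a hypothetical property defined in the <properties> section -->
    <version>${spark.version}</version>
</dependency>
```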
Recently, many big data technologies have entered our lives. In our work, we chose Apache Spark because of the speed of its data analysis, and we gave information about Apache Spark in this part. Like other big data technologies, however, Spark is not the best choice for every data processing task.
Apache Spark Architecture
Spark is an open source project developed for use in various architectures and with various programming languages. The Spark project stack contains Spark Core and four libraries that are optimized to meet the requirements of four different use cases. Applications typically require Spark Core and at least one of these libraries. The figure shows the project stack of Apache Spark.
- Spark Core: The foundation of Spark, responsible for management functions such as task scheduling. Spark Core implements and depends upon a programming abstraction known as Resilient Distributed Datasets (RDDs).
- Spark SQL: Designed to work with structured data. It supports the Hive project and the HiveQL query language, can be integrated with SQL databases, data warehouses and business intelligence tools, and supports JDBC and ODBC connections.
- Spark Streaming: The purpose of this module is to process streaming data.
- MLlib: The machine learning library of Spark. It contains classification, regression, clustering, correlations and hypothesis testing, and principal component analysis.
- GraphX: Designed to support the analysis of graph data; it includes many graph algorithms.
- SparkR: Developed so that data scientists and statisticians who use the R programming language can benefit from Spark.
Resilient Distributed Datasets (RDDs)
The RDD (Resilient Distributed Dataset) is a central concept in Spark. It is designed to support in-memory data storage. RDDs are distributed collections of objects that are divided into partitions, each of which can be computed on a different node. Efficiency is increased by running operations in parallel across multiple nodes in the cluster and by minimizing data replication between these nodes. Two basic kinds of operations are performed on the data in an RDD:
- Transformations: Create a new RDD from an existing one with operations such as mapping and filtering.
- Actions: Compute a result from the data, such as a count, without changing the data.
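The difference between the two kinds of operations can be sketched in plain Java. This is a minimal illustration and does not use Spark: `PartitionedData` and its methods are hypothetical names, and parallel threads stand in for cluster nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for an RDD: an immutable list of partitions.
// Plain Java only; none of these names belong to the Spark API.
final class PartitionedData<T> {
    private final List<List<T>> partitions;

    private PartitionedData(List<List<T>> partitions) {
        this.partitions = Collections.unmodifiableList(partitions);
    }

    // Split the input into partitions of at most partitionSize elements.
    static <T> PartitionedData<T> of(List<T> items, int partitionSize) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < items.size(); i += partitionSize) {
            int end = Math.min(i + partitionSize, items.size());
            parts.add(new ArrayList<>(items.subList(i, end)));
        }
        return new PartitionedData<T>(parts);
    }

    // Transformation: returns a NEW data set and leaves this one untouched.
    // Partitions are mapped by parallel threads, standing in for nodes.
    <R> PartitionedData<R> map(Function<T, R> f) {
        List<List<R>> mapped = partitions.parallelStream()
                .map(p -> p.stream().map(f).collect(Collectors.toList()))
                .collect(Collectors.toList());
        return new PartitionedData<R>(mapped);
    }

    // Action: computes a single value from the data.
    long count() {
        return partitions.stream().mapToLong(List::size).sum();
    }

    // Action: gathers all elements into one list, in their original order.
    List<T> collect() {
        return partitions.stream().flatMap(List::stream).collect(Collectors.toList());
    }

    int partitionCount() {
        return partitions.size();
    }
}

final class TransformationActionDemo {
    public static void main(String[] args) {
        PartitionedData<Integer> numbers = PartitionedData.of(Arrays.asList(1, 2, 3, 4, 5), 2);
        PartitionedData<Integer> squares = numbers.map(x -> x * x); // transformation

        System.out.println(squares.collect()); // action, prints [1, 4, 9, 16, 25]
        System.out.println(squares.count());   // action, prints 5
        System.out.println(numbers.collect()); // source is unchanged, prints [1, 2, 3, 4, 5]
    }
}
```

In this sketch, `map` plays the role of a transformation, while `count` and `collect` play the role of actions.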
The original RDD does not change during this process. The transformations applied to an RDD are logged, so if data is lost on a cluster node, it can be rebuilt by reapplying them. Transformations are evaluated lazily: they are not executed until a later step needs their result, which increases efficiency because unnecessary data processing is avoided. RDDs remain in memory wherever possible, which dramatically improves performance for repeated queries and processes.
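Lazy evaluation, in-memory caching and rebuilding lost data from the logged transformations can be modeled in a few lines of plain Java. This is a simplified sketch and not Spark code: `LazyDataset`, `cache` and `evict` are hypothetical names, and the invocation counter exists only to show when the mapping function actually runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical model of lazy evaluation with lineage; not the Spark API.
final class LazyDataset<T> {
    private final Supplier<List<T>> lineage; // recipe for (re)building the data
    private boolean cacheRequested;
    private List<T> cached;                  // null while nothing is held in memory

    private LazyDataset(Supplier<List<T>> lineage) {
        this.lineage = lineage;
    }

    static <T> LazyDataset<T> of(List<T> source) {
        List<T> copy = Collections.unmodifiableList(new ArrayList<>(source));
        return new LazyDataset<T>(() -> copy);
    }

    // Transformation: only records the step; f is not called here.
    <R> LazyDataset<R> map(Function<T, R> f) {
        return new LazyDataset<R>(
                () -> this.collect().stream().map(f).collect(Collectors.toList()));
    }

    // Ask for the result of the next action to be kept in memory.
    LazyDataset<T> cache() {
        cacheRequested = true;
        return this;
    }

    // Simulates losing the in-memory copy, for example after a node failure.
    void evict() {
        cached = null;
    }

    // Action: replays the recorded steps unless a cached copy exists.
    List<T> collect() {
        if (cached != null) {
            return cached;
        }
        List<T> result = lineage.get();
        if (cacheRequested) {
            cached = result;
        }
        return result;
    }
}

final class LazyEvaluationDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyDataset<Integer> doubled = LazyDataset.of(Arrays.asList(1, 2, 3))
                .map(x -> { calls.incrementAndGet(); return x * 2; })
                .cache();

        System.out.println(calls.get());       // prints 0: nothing has run yet
        System.out.println(doubled.collect()); // prints [2, 4, 6]
        doubled.collect();                     // served from the cached copy
        System.out.println(calls.get());       // prints 3
        doubled.evict();                       // simulate data loss
        System.out.println(doubled.collect()); // prints [2, 4, 6], rebuilt from the lineage
        System.out.println(calls.get());       // prints 6
    }
}
```

The counter stays at zero until the first action runs, does not grow while the cached copy is available, and grows again only when the lost data has to be rebuilt.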