The data mining process analyzes data to find useful information that has not yet been discovered, revealing previously unknown patterns and relationships between attributes hidden in the data. The purpose of data mining is to extract information from a data set and make it understandable for further use.
History of the Data Mining Process
Manual knowledge extraction from data has been performed for centuries. Early examples of knowledge extraction from data include Bayes’ theorem in the 1700s and regression analysis in the 1800s. Data mining first appeared as a set of primitive methods used by economists and statisticians, and from the 1960s this kind of analysis came to be called data fishing. The term "data mining" itself was first used in 1983 by Michael Lovell.
Even today, a huge amount of information sits idle and unexplored. The similarities and differences within these data remain hidden. However, as the amount and complexity of data have grown, efforts have moved from manual knowledge extraction toward automatic knowledge extraction systems. Using data mining techniques, meaningful patterns can be extracted from data sets that are too large to be examined by humans. In addition, the system is self-feeding, so the results can improve over time.
Data Mining Process
The main reason data mining has become so popular and effective is that it can identify similar trends and patterns among groups of data. In this way, the data mining process can be automated. Quick decisions can be made at critical points with the help of data mining. Classification or decision-making work that would otherwise take months or years can be completed in seconds thanks to data mining techniques, and a great deal of labor can be saved. The data mining process can be examined in five steps, as visualized in the figure.
Selection is the first step to be taken before data mining can start. Not all of the data in a large data warehouse is meaningful. Data that carries no meaning for the task can interfere with the operation of the algorithms and lead to worse or meaningless outcomes. At this stage, the significant databases and data records, those that will lead to the intended results, are selected.
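As a minimal sketch of the selection step, the snippet below keeps only the rows and fields relevant to a question. The records, the field names, and the helper select are hypothetical and serve only as an illustration.

```python
# Hypothetical warehouse records; every field name here is an assumption.
records = [
    {"customer_id": 1, "region": "north", "amount": 120.0, "note": "call back"},
    {"customer_id": 2, "region": "south", "amount": 80.0, "note": ""},
    {"customer_id": 3, "region": "north", "amount": 45.5, "note": "vip"},
]


def select(rows, columns, predicate):
    """Keep the rows that satisfy `predicate`, projected onto `columns`."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


# Select only northern customers and only the fields needed for the analysis.
selected = select(
    records,
    columns=["customer_id", "amount"],
    predicate=lambda row: row["region"] == "north",
)
print(selected)
```

In practice this selection would usually be expressed as a database query, but the idea is the same: irrelevant rows and columns are left out before any algorithm sees the data.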
The second step is preprocessing, because the selected data may be noisy. After the selection process, the extracted data should be examined to decide whether it can be used for decision support or data mining. Data sets should be combined and organized. The data should be cleaned and simplified according to the patterns that need to be exposed.
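The combining and cleaning described above might look like the following sketch. The two sources, their fields, and the helper clean are hypothetical, and treating a None value as "missing" is an assumption of this example.

```python
def clean(rows, required):
    """Drop rows with a missing required field, then drop rows that
    repeat an earlier row on all of the required fields."""
    seen = set()
    cleaned = []
    for row in rows:
        # A value of None is treated as missing (an assumption of this sketch).
        if any(row.get(field) is None for field in required):
            continue
        key = tuple(row[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Two hypothetical sources that are combined before cleaning.
source_a = [{"id": 1, "age": 34}, {"id": 2, "age": None}]
source_b = [{"id": 1, "age": 34}, {"id": 3, "age": 51}]

combined = source_a + source_b
cleaned = clean(combined, required=["id", "age"])
print(cleaned)
```

The row with the missing age and the repeated row for id 1 are both removed, so only usable, unique records move on to the next step.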
The third step of data mining is transformation. In this stage, the data cleaned in the previous step is transformed into a structure supported by the data mining algorithm. In other words, the transformation step prepares the data for mining. After this step, all of the data is in numeric form, as required by the algorithms that are going to be used.
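Two common ways of turning data into numbers are integer encoding of categories and min-max scaling of numeric values. The sketch below illustrates both; the function names and sample values are hypothetical, and other encodings may suit a given algorithm better.

```python
def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values], codes


def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


colors = ["red", "green", "red", "blue"]
encoded, mapping = encode_categories(colors)
scaled = min_max_scale([10.0, 20.0, 40.0])

print(encoded)  # [0, 1, 0, 2]
print(mapping)  # {'red': 0, 'green': 1, 'blue': 2}
print(scaled)
```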
The mining step is the main step of learning and knowledge extraction. Mining can be split into six stages.
Anomaly detection: The data is examined to see whether it contains any anomalies. If an anomaly is found, it is examined again to decide whether it can be disregarded. These checks are made in this stage.
Association rule learning: In this stage, relationships between the features of the data are investigated to determine whether they are connected. If there is a connection, the related features can be used.
Clustering: In the clustering stage, groups and structures in the data are investigated to see whether there are similarities.
Classification: This is the stage of applying the extracted information to the data. The results are exposed at this stage.
Regression: This is the stage of finding the function that estimates the results with the least error.
Summarization: This is the last stage of data mining. The outputs are visualized and the success rate is reported.
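Two of the stages above, anomaly detection and regression, are easy to illustrate in a few lines. The sketch below flags anomalies with a z-score rule and fits a straight line by ordinary least squares. Both are only one possible choice of technique, and the threshold of 2.0, the function names, and the sample data are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical sensor readings with one obvious anomaly.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)
print(outliers)  # [50]

# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

The least squares fit is "the function that estimates the results with the least error" in the sense that it minimizes the sum of squared differences between the predicted and the observed values.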
Interpretation or Evaluation
Results can be misleading in data mining. Although the results may seem meaningful at first, they may not hold up in future situations. This is usually due to a lack of appropriate testing or to the selection of inappropriate features. It can be partly prevented by separating the test set from the training set. However, splitting the sets alone may not be enough.
The final step of data mining is to verify that the patterns generated by the mining algorithms also occur in the wider data set. Not all of the patterns will be correct, so the patterns that are thought to be wrong and that harm the results are eliminated. Performing tests with a data set that was not used in the training stage is very important at this point.
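Testing on data that was not used in training can be sketched with a simple holdout split. The toy data, the threshold "model", and the helper names below are hypothetical, and the 25% test fraction and the fixed seed are arbitrary choices for the illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of (x, label) pairs for which predict(x) equals label."""
    correct = sum(1 for x, label in rows if predict(x) == label)
    return correct / len(rows)


# Toy data: the label is True when x is at least 50.
data = [(x, x >= 50) for x in range(100)]
train, test = train_test_split(data)

# "Training": place a threshold midway between the largest negative
# and the smallest positive example seen in the training set only.
largest_negative = max(x for x, label in train if not label)
smallest_positive = min(x for x, label in train if label)
threshold = (largest_negative + smallest_positive) / 2


def predict(x):
    return x >= threshold


# The score that matters is the one measured on rows the model never saw.
test_accuracy = accuracy(predict, test)
print(len(train), len(test), test_accuracy)
```

Because the test rows play no part in choosing the threshold, the accuracy measured on them is a fairer estimate of how the pattern would behave on future data than the accuracy on the training rows.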