We can define text mining as the discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases.
As research in all areas of life continues, many fields will become so overwhelmed with information that processing everything on a particular topic will become practically impossible for any one person. Data repositories on intranets and the Internet hold a vast amount of unstructured business information, in the form of both documents and Web pages.
In fact, it is estimated that 80% of a company’s information, such as emails, memos, customer correspondence, and reports, is contained in text documents. The ability to distill this untapped source of information, free-text documents, gives a company a substantial competitive advantage in the era of a knowledge-based economy.
Since managing texts by human effort has become both inadequate and too expensive to perform and maintain for the majority of the available data, the use of automatic methods, algorithms, and tools for dealing with these large amounts of textual data has become necessary.
The Text Mining (TM) field was born to address the huge demand for mining large amounts of text automatically. It is inherently interdisciplinary, borrowing heavily from neighbouring fields such as data mining and computational linguistics.
In any case, a scheme can be built to include all possible steps that characterize text mining strategies. The scheme below describes the steps:
Text Mining Objectives
Most text mining objectives fall under nine categories of operations: entity extraction, text-base navigation, search and retrieval, clustering, categorization, summarization, trends analysis, associations, and visualizations.
Entity extraction deals with finding particular pieces of information within a text. Its task is to determine which noun phrases denote a person, place, organization, or other distinct object. This operation should also include term extraction and a count of the number of times each term appears in the analyzed text (keyword frequency).
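The keyword frequency part of this operation can be illustrated with a minimal Python sketch. The function name keyword_frequency, the regular-expression tokenizer, and the stopword set are assumptions made for illustration only. Recognizing persons, places, or organizations would require additional linguistic resources that are not shown here.

```python
import re
from collections import Counter

def keyword_frequency(text, stopwords=frozenset()):
    """Count how often each lowercase alphabetic token appears in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

sample = "Text mining extracts information from text. Mining text is useful."
# The stopword set is a hypothetical example, not a standard list.
freq = keyword_frequency(sample, stopwords={"from", "is"})
print(freq.most_common(2))  # [('text', 3), ('mining', 2)]
```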
Text-base navigation enables the text miner to see related terms in context and connect important relationships between them.
The search and retrieval operation allows the user to search for and retrieve relevant information based on pre-specified search criteria.
Clustering groups similar documents so that the degree of association between two documents is maximal if they belong to the same group and minimal otherwise.
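One possible way to measure the degree of association between two documents is cosine similarity over word counts. The sketch below combines it with a simple greedy, single-pass grouping. Both the similarity measure and the threshold of 0.3 are assumptions chosen for illustration; the text does not prescribe a particular clustering algorithm.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Represent a document as a mapping from word to count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def greedy_cluster(docs, threshold=0.3):
    """Put each document into the first cluster whose seed document is similar enough."""
    vectors = [bag_of_words(d) for d in docs]
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found, start a new one
    return clusters

docs = [
    "stock markets fell sharply today",
    "the football team won the match",
    "stock prices and markets recovered",
    "the team lost the football final",
]
clusters = greedy_cluster(docs)
print(clusters)  # [[0, 2], [1, 3]]
```

The result depends on the order of the documents and on the threshold, which is why practical systems usually rely on more robust clustering methods.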
Categorization is the process of using content-mining technologies to identify similar pieces of raw data and organize them into a pre-defined set of topics for analysis.
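In contrast to clustering, the set of topics is fixed in advance. A deliberately simple sketch, assuming a hand-written keyword list per topic (the TOPIC_KEYWORDS table and the categorize function are hypothetical), assigns a text to the topic with the most matching keywords:

```python
import re

# Hypothetical topic vocabulary; in practice it would be derived from labeled data.
TOPIC_KEYWORDS = {
    "finance": {"stock", "market", "bank", "price"},
    "sports": {"team", "match", "goal", "player"},
}

def categorize(text, topics=TOPIC_KEYWORDS, default="other"):
    """Return the pre-defined topic sharing the most keywords with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {topic: len(tokens & words) for topic, words in topics.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matches at all.
    return best if scores[best] > 0 else default

print(categorize("The bank raised the stock price"))  # finance
print(categorize("It rained all day"))                # other
```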
Summarization describes the content of a document while reducing the amount of text a user must read.
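One common family of approaches is extractive summarization, which selects the most representative sentences instead of generating new text. The sketch below scores each sentence by the average document-wide frequency of its words. This scoring rule and the summarize function are illustrative assumptions, not a method taken from the text.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = "Text mining finds patterns in text. The weather was nice. Mining text needs tools."
summary = summarize(text, n_sentences=1)
print(summary)  # Text mining finds patterns in text.
```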
Trends analysis is used for discovering trends from time-dependent textual data.
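As a minimal sketch of this idea, assuming each document carries a period label such as a year, the occurrences of a term can be counted per period. The term_trend function and the sample data are hypothetical.

```python
import re
from collections import defaultdict

def term_trend(dated_docs, term):
    """Map each period label to how often `term` occurs in that period's documents."""
    counts = defaultdict(int)
    for period, text in dated_docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[period] += tokens.count(term.lower())
    return dict(sorted(counts.items()))

# Hypothetical (year, text) pairs used only to exercise the function.
dated_docs = [
    (2019, "cloud adoption is slow"),
    (2020, "cloud cloud everywhere"),
    (2020, "the cloud grows"),
    (2021, "edge computing"),
]
trend = term_trend(dated_docs, "cloud")
print(trend)  # {2019: 1, 2020: 3, 2021: 0}
```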
Association analysis relates one extracted pattern to another pattern found.
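A simple starting point for association analysis is to count how often two terms of interest appear in the same document. The sketch below assumes a small, hand-picked vocabulary. The cooccurrence function and the example sentences are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(docs, vocabulary):
    """Count, for each pair of vocabulary terms, the documents containing both."""
    pairs = Counter()
    vocab = set(vocabulary)
    for text in docs:
        present = sorted(set(re.findall(r"[a-z]+", text.lower())) & vocab)
        # Sorting makes each pair appear in one canonical (alphabetical) order.
        pairs.update(combinations(present, 2))
    return pairs

docs = [
    "aspirin reduces headache",
    "aspirin may cause nausea",
    "headache relieved by aspirin",
]
pairs = cooccurrence(docs, {"aspirin", "headache", "nausea"})
print(pairs[("aspirin", "headache")])  # 2
```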
Visualizations utilize feature extraction and key term indexing in order to build a graphical representation that can help the user identify the main topics or concepts by their importance in the representation. Additionally, a graphical document representation makes it easy to discover the location of specific documents.
Text mining techniques can range from simple ones (e.g., arithmetic averages) to those of intermediate complexity (e.g., linear regression, clustering, and decision trees) and highly complicated ones such as neural networks.
In the following subsections, the relationships between text mining and data mining, between text mining and natural language processing (NLP), and between text mining and information retrieval (IR) are discussed.
Text Mining and Data Mining
Text mining, also known as text data mining or knowledge discovery from textual databases (KDTD), is the process of finding useful or interesting patterns, models, directions, trends, or rules in unstructured text. The term is used to describe the application of data mining techniques to the automated discovery of knowledge from text.
Text mining has been viewed as a natural extension of data mining, or sometimes as the task of applying the same data mining techniques to the domain of textual information. This reflects the fact that the advent of text mining relies to a great degree on the burgeoning field of data mining. However, because text is the most natural form of storing information, text mining is believed to have a higher commercial potential than data mining.
Text mining, however, is also a much more complex task (than data mining) as it involves dealing with text data that are inherently unstructured and fuzzy.
Although text mining and data mining are related in that both are mining processes, they differ in the following respects:
- Text mining deals with unstructured or semi-structured data, such as text found in articles, documents, etc., whereas data mining deals with structured data from large databases. Another characteristic of text mining is the amount of textual data. The concepts contained in a text are usually rather abstract and can hardly be modelled by using conventional knowledge representation.
- Furthermore, the occurrence of synonyms (different words with the same meaning) or homonyms (words with the same spelling but with distinct meanings) makes it difficult to detect valid relationships between different parts of the text.
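The effect of synonyms on matching can be illustrated with a short sketch. It assumes a hand-made synonym table (the SYNONYMS mapping is hypothetical; a real system would rely on a thesaurus resource) and compares the word overlap of two sentences before and after normalization.

```python
# Hypothetical hand-made synonym table, mapping variants to one canonical word.
SYNONYMS = {"automobile": "car", "auto": "car", "physician": "doctor"}

def normalize(tokens, synonyms=SYNONYMS):
    """Replace each token by its canonical form if one is listed."""
    return [synonyms.get(t, t) for t in tokens]

a = "the physician bought an automobile".split()
b = "the doctor bought a car".split()

raw_overlap = set(a) & set(b)                                 # exact string matches only
normalized_overlap = set(normalize(a)) & set(normalize(b))    # matches after mapping

print(sorted(raw_overlap))         # ['bought', 'the']
print(sorted(normalized_overlap))  # ['bought', 'car', 'doctor', 'the']
```

A lookup table of this kind does not help with homonyms, since a word such as "bank" has one spelling for several meanings and can only be resolved from its context.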
Text Mining and Natural Language Processing
Natural language processing is the study of computer processing of human language. It is the ability to automatically process written text based on language constructs (words, phrases, sentences, etc.) and different parts of speech (nouns, adjectives, verbs, etc.).
Natural Language Processing has developed various techniques that are typically linguistically inspired, i.e., text is typically syntactically parsed using information from a formal grammar and a lexicon, and the resulting information is then interpreted semantically and used to extract information about what was said.
Natural Language Processing may be deep (parsing every part of every sentence and attempting to account semantically for every part) or shallow (parsing only certain passages or phrases within sentences or producing only limited semantic analysis), and may even use statistical means to disambiguate word senses or multiple parses of the same sentence.
It tends to focus on one document or piece of text at a time and to be rather computationally expensive. It includes techniques such as word stemming (removing suffixes), the related technique of lemmatization (replacing an inflected word with its base form), and part-of-speech (POS) tagging (assigning each word a category that elaborates on noun, verb, preposition, etc.).
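Stemming by suffix removal can be sketched in a few lines. The suffix list and the minimum stem length below are illustrative assumptions and do not reproduce the rule set of any established stemmer.

```python
# Illustrative suffix list, checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("walking", "walked", "walks", "is", "went"):
    print(w, "->", naive_stem(w))
```

The three inflected forms of "walk" are reduced to the same stem, while "went" is left unchanged. Mapping "went" to "go" is the kind of case that requires lemmatization with a dictionary of base forms.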
Text mining appears to comprise the whole of automatic natural language processing and, perhaps, far more besides, for example, analysis of linkage structures such as citations in the academic literature and hyperlinks in the Web literature, both useful sources of information that lie outside the traditional domain of natural language processing.
But, in fact, most text mining efforts deliberately avoid the deeper aspects of classic natural language processing in favor of shallower techniques more similar to those used in practical information retrieval.
Text mining uses techniques primarily developed in the fields of information retrieval, statistics, and machine learning. Its aim typically is not to understand all or even a large part of what a given speaker/writer has said, but rather to extract patterns across a large number of documents.
The simplest form of text mining could be considered information retrieval, which is what typical search engines do. More properly, however, text mining consists of areas such as automatic text classification according to some fixed set of categories, text clustering, and automatic summarization.
While information retrieval and other forms of text mining frequently make use of word stemming, more sophisticated techniques from Natural Language Processing have rarely been used.
Text Mining and Information Retrieval
It is important to differentiate between text mining and information access (or information retrieval (IR), as it is more widely known).
Information retrieval is the finding of documents which contain answers to questions, not the finding of the answers themselves. In order to achieve this goal, statistical measures and methods are used for the automatic processing of text data and its comparison with the given question.
Even though the definition of information retrieval is based on the idea of questions and answers, systems that retrieve documents based on keywords, i.e., systems that perform document retrieval like most search engines, are frequently also called information retrieval systems. In information retrieval, the problem is not that the desired information is not known, but rather that it coexists with many other valid pieces of information.
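Keyword-based document retrieval with a statistical measure can be sketched as follows. The example ranks documents by a TF-IDF score for the query terms. The choice of TF-IDF, the smoothing constant, and the rank_documents function are assumptions made for illustration, since the text does not name a specific measure.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def rank_documents(query, docs):
    """Return indices of matching documents, best TF-IDF score first."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for i, counts in enumerate(doc_counts):
        score = 0.0
        for term in query_terms:
            df = sum(1 for c in doc_counts if term in c)  # document frequency
            if df == 0:
                continue
            # Adding 1.0 keeps terms that occur in every document from scoring zero.
            idf = math.log(n_docs / df) + 1.0
            score += counts[term] * idf
        scores.append((score, i))
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]

docs = [
    "text mining discovers patterns in text",
    "search engines retrieve documents",
    "mining gold is hard work",
]
ranking = rank_documents("text mining", docs)
print(ranking)  # [0, 2]
```

Note that the system only returns existing documents in ranked order; it does not derive any information that is not already written in them.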
While the goal of text mining is to discover or derive new information from data, an information retrieval system merely returns documents that contain the information a user requested. This implies that no genuinely new information is found, i.e., no new discovery is being made: the information had to have already been known to the author of the text; otherwise the author could not have written it down.