3. A PLANNING APPROACH TO TEXT MINING IN A KNOWLEDGE BASE FRAMEWORK
Extensive research is being conducted in the field of information technology, especially in artificial intelligence. These advances are reinforced by the creation of a wide range of software and of the universal and specialized computer equipment that runs it. All this is accompanied by the rapid and intensive formation of new scientific terminology that is not always commonly acknowledged. Therefore, to avoid ambiguity, we give the basic definitions of the concepts used in this work.
The notion of 'knowledge' remains the most common and yet the least clearly defined. In this work, we assume that knowledge is useful information. Unlike information, which is measured by the media volume needed for its storage, knowledge should be measured by the benefit of using the corresponding information. Knowledge acquisition can in principle take place only when and where there is a (potential) carrier of knowledge: an intelligent agent that has information about its current state and about its desired state as a goal, a problem of achieving that goal, motivation, and a strategy (plan) to achieve its goals. Only then can the information be used to solve these problems and thus serve as knowledge for its carrier, the intelligent agent.
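The notion of a knowledge carrier described above can be sketched as a minimal data structure. This is an illustrative assumption only; the names `Agent`, `current_state`, `goal_state`, and `is_useful`, and the crude usefulness criterion, are hypothetical and not part of any particular framework.

```python
from dataclasses import dataclass, field

# A minimal, hypothetical sketch of the knowledge carrier described above:
# an agent holds information about its current and desired (goal) states
# and a plan -- an ordered sequence of actions -- for reaching the goal.
@dataclass
class Agent:
    current_state: set = field(default_factory=set)  # facts believed now
    goal_state: set = field(default_factory=set)     # facts defining the goal
    plan: list = field(default_factory=list)         # actions toward the goal

    def is_useful(self, fact: str) -> bool:
        # Information counts as knowledge only insofar as it helps close
        # the gap between current and goal states (a crude illustration
        # of "knowledge is useful information").
        return fact in self.goal_state and fact not in self.current_state

agent = Agent(current_state={"at(home)"},
              goal_state={"at(home)", "report_done"})
print(agent.is_useful("report_done"))  # → True: this fact advances the goal
print(agent.is_useful("at(home)"))     # → False: already known
```

The point of the sketch is only that usefulness, and hence knowledge, is defined relative to the agent's goal, not as a property of the information itself.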
We consider an ontology to be a formal explicit representation of the common terminology of a certain subject domain and of its logical interdependencies. An ontology formalizes the intension of the domain, e.g. as a set of rules in terms of formal logic, while its extension is defined in the knowledge base as a set of facts about instances of concepts and the relationships between them. The process of filling the knowledge base is called knowledge markup (further KM), or ontology population (further OP); the methods and tools for automatic (semi-automatic) development of the ontology structure are called ontology learning (OL). OL methods in turn are based on the methods of natural language processing (NLP) and machine learning (ML). Far less attention has been paid to the approaches developed in the field of automated planning (AP). Knowledge acquisition (KA), in particular from text documents using NLP methods, is perhaps the only way to construct an ontology automatically; it cannot, however, replace OL as a scientific discipline, remaining instead its key instrument. The role of the ontology makes a fundamental difference between OL and KA, because for OL the ontology is not only a tool but also a target and a performance criterion for the methods and tools developed in OL research.
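The distinction between intension (rules) and extension (instance facts) can be illustrated with a toy knowledge base. This is a minimal sketch under simplifying assumptions, not any specific ontology language: rules with a single variable `x`, ground facts as strings, and a naive forward-chaining loop; the predicate and individual names are invented for illustration.

```python
# Intension: logical rules of the domain, here as (premises, conclusion)
# pairs over one variable x, e.g. "every professor is a person".
rules = [
    ({"Professor(x)"}, "Person(x)"),
    ({"Person(x)", "Employs(uni, x)"}, "Staff(x)"),
]

# Extension: ground facts about instances, stored in the knowledge base.
facts = {"Professor(ivanov)", "Employs(uni, ivanov)"}

def apply_rules(facts, rules):
    """Naive forward chaining: add rule conclusions until a fixed point."""
    # Collect candidate individuals from the last argument of each fact.
    individuals = {f[f.index("(") + 1:-1].split(", ")[-1] for f in facts}
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            for ind in individuals:
                bound = {p.replace("x", ind) for p in premises}
                concl = conclusion.replace("x", ind)
                if bound <= facts and concl not in facts:
                    facts.add(concl)
                    changed = True
    return facts

print(apply_rules(set(facts), rules))
```

Given the two instance facts, chaining derives `Person(ivanov)` and then `Staff(ivanov)`: the intension stays fixed while ontology population grows the extension.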
A method of recognizing the logical content of a natural language text document, i.e. natural language understanding (NLU), is based on the information technology of semantic text analysis, i.e. text mining (TM), which in turn can be loosely defined as the process of identifying information that can be useful for solving certain tasks. On the other hand, the TM research area includes NLU in the part where the latter acts as a scientific discipline: a linguistic tool for translating natural language (NL) documents into formal knowledge representation languages for further formal analysis. NLU in turn is considered a section of NLP (Fig. 1).
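The translation step that NLU performs, from NL sentences to a formal representation, can be sketched in a toy form. The single pattern below ("X is a Y") and the output notation are assumptions for illustration only; real NLU pipelines involve full syntactic and semantic analysis.

```python
import re

# Toy illustration of NLU as translation from NL into a formal knowledge
# representation: one hypothetical pattern maps "X is a Y" sentences to
# logical facts of the form Concept(instance).
PATTERN = re.compile(r"^(\w+) is an? (\w+)\.?$", re.IGNORECASE)

def to_facts(sentences):
    facts = []
    for s in sentences:
        m = PATTERN.match(s.strip())
        if m:
            instance, concept = m.groups()
            facts.append(f"{concept.capitalize()}({instance.lower()})")
    return facts

print(to_facts(["Socrates is a man.", "GATE is a framework."]))
# → ['Man(socrates)', 'Framework(gate)']
```

Once in this formal shape, the facts can be handed to a knowledge base for the kind of logical analysis discussed above.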
Currently TM is an actively developing scientific discipline. It does not yet have well-developed institutions (no universally recognized textbooks, lecture courses, or university chairs on the subject) and is therefore in a somewhat uncertain status. Some experts interpret TM too broadly and include in it all IT techniques that deal with NL texts; others understand TM too narrowly, as a particular case of statistical data analysis, i.e. data mining (DM); yet others tend to regard this discipline as an extension or replacement of information retrieval (IR). In most cases, the statement of the problem depends on the subject domain in which it is formulated. TM includes very different approaches to this problem in terms of the level of analysis: from simple classification using a syntactic parser and identification of specific values of semantic structures, as provided, for example, in the quite advanced GATE project, up to complex predictive analysis aimed at finding a solution to the problem. It is the second approach that is considered in this work: intelligent recognition (understanding) of the content of NL text as the main source of knowledge today.
In 2013, more than 6 billion mobile devices such as smartphones and tablets were in use worldwide. Their users, in particular, exchange information about themselves through questionnaires and registration forms of about 2.5 billion accounts in the three most popular social networks alone. Every 24 hours, about 2.5 quintillion (2.5×10¹⁸) bytes of new information are generated. Much of it is contained in text documents that are tailored to an informed reader.
At present, the text document is the most common and most widely used means of formal information exchange. However, only an adequately informed reader can find relevant knowledge, important to him, in the information contained in a text document. For other potential readers, who cannot compare the received message with the information they have already processed, the document has no value. Therefore, the value of the information contained in text documents makes sense only in relation to a given consumer of that information and is a relative rather than an absolute measure. The coordinate system, the metric, that can give a numerical evaluation of information value must be bound to a specific consumer, who can be represented by an information model in the form of, for example, the knowledge base of an intelligent agent. The reaction of the model to the submitted information can then be used to assess its value to the consumer. For this we must be able to implement a model of an information consumer in the form of an intelligent agent, extract potentially useful information from natural language text documents, and simulate the process of text analysis by means of the developed intelligent agent. This may be useful for automating information retrieval on the Internet, building highly efficient adaptive spam filters, developing fundamentally new agent-oriented operating systems, controlling the content of Internet traffic, automating distance learning, and other applied problems in the field of information technology. This approach provides a broad platform for the implementation of intelligent software and thus can be the basis for so-called intelligent technologies.
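The idea of measuring information value by the consumer model's reaction can be sketched as follows. This is a deliberately crude assumption made for illustration: value is counted simply as the number of new, goal-relevant facts a message adds to the agent's knowledge base; the fact names are invented.

```python
# Illustrative sketch: the value of a message is assessed by the reaction
# of the consumer's model (an agent's knowledge base plus its goal), not
# as an absolute property of the message itself.
def information_value(knowledge_base: set, goal: set, message: set) -> int:
    # Count facts that are new to the consumer AND relevant to its goal.
    new_relevant = (message - knowledge_base) & goal
    return len(new_relevant)

kb = {"deadline(friday)"}
goal = {"deadline(friday)", "room(305)", "topic(TM)"}

print(information_value(kb, goal, {"room(305)", "topic(TM)"}))  # → 2: valuable
print(information_value(kb, goal, {"deadline(friday)"}))        # → 0: known
print(information_value(kb, goal, {"weather(sunny)"}))          # → 0: irrelevant
```

The same message scores differently against different knowledge bases, which is exactly the relativity of information value argued for above.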