Knowledge Discovery (KD)
Knowledge discovery (KD) is an interdisciplinary area that focuses on
methodologies for extracting useful knowledge[2]* from data. The ongoing rapid
growth of online data due to the Internet and the widespread use of databases
have created an immense need for KD methodologies. The challenge of extracting
knowledge from data draws upon research in statistics, databases, pattern
recognition, machine learning, data visualization, optimization, and
high-performance computing to deliver advanced business intelligence and web
discovery solutions [1].
KD is a widely used data mining technique: a process that includes data
preparation and selection, data cleansing, incorporating prior knowledge about
the data sets, and interpreting accurate solutions from the observed results so
that they can be applied to benefit the business [3]. Data mining is considered
the core step of the knowledge discovery process [6]**.
The terms data mining and knowledge discovery are often used interchangeably.
Other terms in common use are data or information harvesting, data archeology,
functional dependency analysis, knowledge extraction and data pattern
analysis [4].
The purpose of KD is
to access historical data and to identify relationships which have a bearing on
a specific issue, and then extrapolate from these relationships to predict
future performance or behavior. The human analyst plays an important role in that
only they can decide whether a pattern, rule or function is interesting,
relevant and useful to an enterprise [4].
Traditionally, data mining and knowledge discovery were performed manually. As
time passed, the amount of data in many systems grew beyond a terabyte and
could no longer be handled manually. Moreover, discovering underlying patterns
in data came to be considered essential for any business to succeed. As a
result, several software tools were developed to discover hidden patterns in
the data and make assumptions about it, which formed a part of artificial
intelligence [5].
KD today encompasses many different approaches to discovery, including
inductive learning, Bayesian statistics, semantic query optimization, knowledge
acquisition for expert systems and information theory. The ultimate goal is to
extract high-level knowledge from low-level data [5].
KD is not a fully automatic method of analysis. It requires interaction with a
user or domain expert, so the user is an important element of the KD process.
The user decides which task and algorithms to apply and which selections to
make during preprocessing.
The knowledge discovery process involves the following steps [5][7][8]:
Goal Identification:
Identify the goal of
the KD process from the customer’s perspective.
Domain Understanding:
Understand the application domains involved and the knowledge that is
required. You need a clear understanding of the application domain and of your
objectives, whether that is to improve sales, predict the stock market, and so
on. You should also know whether you are going to describe your data or predict
information.
Selection of data set:
Data mining is done on your current or past records. You should therefore
select a data set or a subset of data (in other words, data samples) on which
to perform the analysis and obtain useful knowledge. You need a sufficient
quantity of data to perform data mining. For example, if firmographic
attributes are the most important criteria, then only the records that meet the
minimum threshold for annual income or revenue would be selected. If
psychographic data matter more, then records might be selected for specific
interests such as camping, concerts or social causes.
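
The two selection criteria above can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`annual_revenue`, `interests`) and the revenue threshold are all made up for the example.

```python
# Hypothetical records; the field names and values are illustrative only.
records = [
    {"company": "Acme", "annual_revenue": 5_000_000, "interests": ["camping"]},
    {"company": "Birch", "annual_revenue": 250_000, "interests": ["concerts"]},
    {"company": "Cedar", "annual_revenue": 1_200_000,
     "interests": ["camping", "social causes"]},
]


def select_by_revenue(rows, min_revenue):
    """Firmographic selection: keep rows at or above a revenue threshold."""
    return [r for r in rows if r["annual_revenue"] >= min_revenue]


def select_by_interest(rows, interest):
    """Psychographic selection: keep rows that list the given interest."""
    return [r for r in rows if interest in r["interests"]]


firmographic_sample = select_by_revenue(records, 1_000_000)
psychographic_sample = select_by_interest(records, "camping")
```

Either function returns a subset of the original records, which then becomes the input for the later steps.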
Data Cleaning:
Also known as data hygiene. In this step you cleanse and preprocess the data by
deciding on strategies for handling missing fields and altering the data as
required. In other words, data cleaning is the step where noise and irrelevant
data are removed from the large data set. This is a very important
preprocessing step because your outcome depends on the quality of the selected
data. As part of data cleaning, you might have to remove duplicate records,
enter logically correct values for missing records, remove unnecessary data
fields, standardize the data format, update data in a timely manner and so on.
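
As a rough sketch of these cleaning tasks, the Python function below standardizes a format, fills in a missing field, drops an unnecessary field and removes duplicates. Using the email address as the record key, the field names and the `"unknown"` default are assumptions made for the example, not a recommended strategy.

```python
def clean_records(rows, default_country="unknown"):
    """Standardize formats, fill missing values, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Standardize the format of the key field.
        email = (row.get("email") or "").strip().lower()
        if not email:
            continue  # no usable key: treat the record as noise
        if email in seen:
            continue  # duplicate record
        seen.add(email)
        # Copy only the fields we need, so unnecessary ones are dropped.
        cleaned.append({
            "email": email,
            "name": (row.get("name") or "").strip().title(),
            "country": row.get("country") or default_country,  # fill missing
        })
    return cleaned


# Hypothetical raw input with inconsistent case, a duplicate and gaps.
raw = [
    {"email": " Ann@Example.com ", "name": "ann lee", "country": "US", "fax": "n/a"},
    {"email": "ann@example.com", "name": "Ann Lee", "country": None},
    {"email": "bob@example.com", "name": "BOB RAY", "country": None},
    {"email": "", "name": "no key"},
]
cleaned = clean_records(raw)
```

Note that the first occurrence of a duplicate wins here; a real project would choose that rule deliberately.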
Data Integration:
Simplify the data sets by removing unwanted variables. Then analyze useful
features that can be used to represent the data, depending on the goal or task.
This step is about combining more than one set of data, such as customers and
prospects or leads that are in various stages of the demand waterfall. You may
also want to aggregate prospects from more than one source, including both
purchased and rented lists. Although there are several steps involved in data
integration, the most important is de-duplicating the records. This can
eliminate a tremendous amount of waste, but you must establish rules that
define which source is preferred when duplicates are found.
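
A source-preference rule for de-duplication might look like the following Python sketch. The source names, their ranking and the choice of email as the matching key are assumptions made for the example.

```python
# Lower number = more trusted source. Names and ranks are assumptions.
SOURCE_PRIORITY = {"customers": 0, "purchased_list": 1, "rented_list": 2}


def integrate(*sources):
    """Merge (source_name, rows) pairs, keeping one record per email."""
    best = {}
    for source_name, rows in sources:
        for row in rows:
            key = row["email"].lower()
            candidate = dict(row, source=source_name)
            current = best.get(key)
            # Keep the candidate if the key is new or its source is preferred.
            if (current is None or
                    SOURCE_PRIORITY[source_name] < SOURCE_PRIORITY[current["source"]]):
                best[key] = candidate
    return list(best.values())


customers = [{"email": "ann@example.com", "phone": "555-0100"}]
rented = [
    {"email": "Ann@example.com", "phone": "555-0199"},
    {"email": "cy@example.com", "phone": "555-0111"},
]
merged = integrate(("rented_list", rented), ("customers", customers))
```

The order in which the sources are passed does not matter: when the same email appears twice, the record from the higher-priority source is the one that survives.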
Data Transformation:
With the help of
dimensionality reduction or transformation methods, the number of effective
variables is reduced and only useful features are selected to depict data more
efficiently based on the goal of the task. Data is transformed into a uniform
set and optimized for use in a marketing program or campaign. All the fields
must be consolidated, merged and purged so that they will be easy to index. In
short, the data is transformed into an appropriate form, ready for the data
mining step.
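
To make this concrete, here is a minimal Python sketch of two simple transformations: dropping columns that carry no information (a basic form of feature selection, standing in for the more general dimensionality reduction methods mentioned above) and rescaling the remaining columns to a uniform range. The data and the variance threshold are invented for the example.

```python
from statistics import pvariance


def drop_low_variance(rows, min_variance):
    """Keep only the columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    return kept, [[row[i] for i in kept] for row in rows]


def min_max_scale(rows):
    """Rescale each column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1 for c in columns]  # avoid division by zero
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]


# Hypothetical numeric data: the middle column is constant, so it is useless.
data = [
    [10.0, 1.0, 200.0],
    [20.0, 1.0, 400.0],
    [30.0, 1.0, 600.0],
]
kept, reduced = drop_low_variance(data, min_variance=0.0)
scaled = min_max_scale(reduced)
```

After these two calls only the first and third columns remain, and both are on the same 0 to 1 scale.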
Data Mining:
In this step, appropriate tasks are applied in order to extract patterns from
the data. These tasks are classification, clustering, association rule
discovery, sequential pattern discovery, regression and deviation detection.
You can choose any of these tasks based on whether you need to predict
information or describe information.
This step can be done by:
- Matching KD goals with data mining methods to suggest hidden patterns. This involves searching the various fields of the database for specific attributes, which are then used to identify trends that can be matched against the predictive models.
- Choosing data mining algorithms to discover hidden patterns. This includes deciding which models and parameters might be appropriate for the overall KD process, by:
  - Selecting appropriate method(s) for finding patterns in the data.
  - Choosing the model and parameters that might be appropriate for the method (searching for patterns of interest in a particular representational form). Some popular data mining methods are decision trees and rules, relational learning models, example-based methods, etc.
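
As one example of these tasks, the Python sketch below performs a very small form of association rule discovery: it computes support and confidence for every ordered pair of items and keeps the rules that pass both thresholds. The transactions and the threshold values are invented, and the brute-force search is only suitable for tiny data sets.

```python
from itertools import permutations


def pair_rules(transactions, min_support, min_confidence):
    """Return rules (a, b, support, confidence) meaning 'a implies b'."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in permutations(items, 2):
        count_a = sum(1 for t in baskets if a in t)
        count_ab = sum(1 for t in baskets if a in t and b in t)
        support = count_ab / n  # share of baskets containing both items
        confidence = count_ab / count_a if count_a else 0.0  # P(b | a)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical shopping baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rules = pair_rules(transactions, min_support=0.5, min_confidence=0.6)
```

Here the model is "rules between pairs of items" and the parameters are the two thresholds; changing either one changes which patterns are reported.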
Pattern Evaluation:
Interpret essential
knowledge from the mined patterns and relationships. The patterns that emerge
during the data mining process must be evaluated to determine which are
relevant to the model and which aren’t. If the pattern evaluated is not useful,
then the process might again start from any of the previous steps, thus making
KD an iterative process.
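
One way to support this evaluation is to score each pattern with an objective measure before a human analyst reviews it. The Python sketch below uses lift for association-style patterns; the choice of lift, the threshold and the candidate values are assumptions for illustration, and the final judgment of relevance still rests with the analyst.

```python
def lift(support_ab, support_a, support_b):
    """Lift above 1 suggests a and b co-occur more often than by chance."""
    return support_ab / (support_a * support_b)


def evaluate(candidates, min_lift=1.0):
    """Split candidate patterns into interesting and rejected ones."""
    interesting, rejected = [], []
    for name, s_ab, s_a, s_b in candidates:
        target = interesting if lift(s_ab, s_a, s_b) > min_lift else rejected
        target.append(name)
    return interesting, rejected


# Hypothetical candidates: (name, support of both, support of a, support of b).
candidates = [
    ("butter -> bread", 0.50, 0.50, 0.75),  # lift is about 1.33
    ("bread -> milk", 0.50, 0.75, 0.75),    # lift is about 0.89
]
interesting, rejected = evaluate(candidates)
```

If too many patterns end up rejected, that is the signal to return to an earlier step, for instance to select different data or to change the mining parameters.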
Knowledge Presentation:
This is the final step in KD. The knowledge discovered is consolidated and
presented to the user in a simple, easy-to-understand format. Visualization
techniques are mostly used to help users understand and interpret the
information. This step allows you to use the knowledge and incorporate it into
another system for further action. Moreover, you can document it and produce
reports for interested parties.
Note:
*Knowledge extraction is the creation of knowledge from structured (relational
databases, XML) and unstructured (text, documents, images) sources. The
resulting knowledge needs to be in a machine-readable and machine-interpretable
format and must represent knowledge in a manner that facilitates inferencing.
It requires either the reuse of existing formal knowledge (reusing identifiers
or ontologies) or the generation of a schema based on the source data [2].
**The knowledge discovery process is the process that leads to finding new
knowledge in some application domain. It is also called knowledge discovery in
databases. The process defines a sequence of steps (with possible feedback
loops) that should be followed to discover knowledge in data. Each step is
usually realized with the help of available commercial or open-source software
tools [6].
References: