Saturday, 18 April 2015

Knowledge Discovery (KD)

Knowledge Discovery (KD) is an interdisciplinary area focusing on methodologies for extracting knowledge[2]* (useful knowledge) from data. The ongoing rapid growth of online data due to the Internet, together with the widespread use of databases, has created an immense need for KD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing to deliver advanced business intelligence and web discovery solutions [1].

KD is a widely used process that includes data preparation and selection, data cleansing, incorporating prior knowledge about the data sets, and interpreting accurate solutions from the observed results so they can be applied for the benefit of the business [3]. Data mining is considered the core step of the knowledge discovery process [6]**.

Data Mining and Knowledge Discovery are terms used interchangeably. Other terms often used are data or information harvesting, data archeology, functional dependency analysis, knowledge extraction and data pattern analysis [4].

The purpose of KD is to access historical data, identify relationships that have a bearing on a specific issue, and then extrapolate from these relationships to predict future performance or behavior. The human analyst plays an important role, since only they can decide whether a pattern, rule or function is interesting, relevant and useful to an enterprise [4].

Traditionally, data mining and knowledge discovery were performed manually. As time passed, the amount of data in many systems grew beyond terabyte scale and could no longer be maintained manually. Moreover, discovering underlying patterns in data came to be considered essential to the successful existence of any business. As a result, several software tools were developed to discover hidden patterns in data and draw inferences from them, work that became part of artificial intelligence [5].

Modern KD houses many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems and information theory. The ultimate goal is to extract high-level knowledge from low-level data [5].

KD is not a fully automatic form of analysis. It should interact with a user or domain expert, so the user is an important element of the KD process. The user decides which task and algorithms to choose and what to select during preprocessing.

The knowledge discovery process involves the following steps [5][7][8]:

Goal Identification:
Identify the goal of the KD process from the customer’s perspective.

Domain Understanding:
Understand the application domain involved and the knowledge required, based on your requirements. You need a clear understanding of the application domain and your objectives, whether that is to improve your sales, predict the stock market, etc. You should also know whether you are going to describe your data or predict information.

Selection of data set:
Data mining is done on your current or past records. Thus, you should select a data set, or a subset of the data (in other words, data samples), on which you need to perform data analysis to get useful knowledge. You should have a sufficient quantity of data to perform data mining. For example, if firmographic attributes are the most important criteria, then only the records that meet a minimum threshold for annual income or revenue would be selected. If psychographic data matter more, then records might be selected for specific interests such as camping, concerts or social causes.
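
The selection described above can be sketched in a few lines. This is a minimal illustration, not part of any specific tool; the field names ("annual_revenue", "interests") and the threshold are hypothetical:

```python
# A minimal data-selection sketch: keep records that pass a firmographic
# threshold or match at least one psychographic interest.
def select_records(records, min_revenue=None, interests=None):
    selected = []
    for rec in records:
        if min_revenue is not None and rec.get("annual_revenue", 0) >= min_revenue:
            selected.append(rec)                      # firmographic criterion
        elif interests and interests & set(rec.get("interests", [])):
            selected.append(rec)                      # psychographic criterion
    return selected

customers = [
    {"id": 1, "annual_revenue": 250_000, "interests": ["camping"]},
    {"id": 2, "annual_revenue": 40_000, "interests": ["concerts"]},
    {"id": 3, "annual_revenue": 30_000, "interests": ["cooking"]},
]

sample = select_records(customers, min_revenue=100_000, interests={"concerts"})
print([r["id"] for r in sample])  # → [1, 2]
```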

Data Cleaning:
Also known as data hygiene. In this step, the data is cleansed and preprocessed by deciding on strategies to handle missing fields and to alter the data as per the requirements. In other words, data cleaning is the step where noise and irrelevant data are removed from the large data set. This is a very important preprocessing step, because your outcome will depend on the quality of the selected data. As part of data cleaning, you might have to remove duplicate records, enter logically correct values for missing records, remove unnecessary data fields, standardize data formats, update data in a timely manner and so on.
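
A minimal data-hygiene sketch of three of those chores, removing duplicates, standardizing a date format, and filling a missing field. The field names and formats are illustrative, not from any particular system:

```python
from datetime import datetime

def clean(records):
    seen, cleaned = set(), []
    for rec in records:
        key = rec["email"].strip().lower()     # normalize before de-duplicating
        if key in seen:
            continue                           # drop duplicate record
        seen.add(key)
        rec = dict(rec, email=key)
        if rec.get("signup"):                  # standardize date to ISO 8601
            rec["signup"] = datetime.strptime(rec["signup"], "%d/%m/%Y").date().isoformat()
        rec.setdefault("country", "unknown")   # fill a missing field with a default
        cleaned.append(rec)
    return cleaned

raw = [
    {"email": "Ana@x.com ", "signup": "18/04/2015"},
    {"email": "ana@x.com", "signup": "18/04/2015"},
]
print(clean(raw))
# → [{'email': 'ana@x.com', 'signup': '2015-04-18', 'country': 'unknown'}]
```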

Data Integration:
Simplify the data sets by removing unwanted variables. Then analyze the useful features that can represent the data, depending on the goal or task. You can say that this step is about combining more than one set of data, such as customers and prospects, or leads that are in various stages of the demand waterfall. You may also want to aggregate prospects from more than one source, including both purchased and rented lists. Although there are several steps involved in data integration, the most important is de-duplicating the records, which can eliminate a tremendous amount of waste. You must, however, establish rules that define which source is preferred when duplicates are found.
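
The source-preference rule can be made concrete. Below is a sketch, assuming two hypothetical sources ("purchased" and "rented") matched on an email key, where the earlier source in the preference list wins when duplicates clash:

```python
PREFERENCE = ["purchased", "rented"]   # earlier source wins on conflict

def integrate(*sources):
    merged = {}
    for name, records in sources:
        rank = PREFERENCE.index(name)
        for rec in records:
            key = rec["email"].lower()           # de-duplication key
            if key not in merged or rank < merged[key][0]:
                merged[key] = (rank, dict(rec, source=name))
    return [rec for _, rec in merged.values()]

purchased = [{"email": "a@x.com", "name": "Ana"}]
rented = [{"email": "A@X.com", "name": "A. Lee"}, {"email": "b@x.com", "name": "Bo"}]
result = integrate(("rented", rented), ("purchased", purchased))
print(result)  # two records; the purchased version of a@x.com is kept
```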

Data Transformation:
With the help of dimensionality reduction or transformation methods, the number of effective variables is reduced and only useful features are selected to depict the data more efficiently, based on the goal of the task. The data is transformed into a uniform set and optimized for use in a marketing program or campaign. All the fields must be consolidated, merged and purged so that they are easy to index. In short, the data is transformed into an appropriate form, making it ready for the data mining step.
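
One common transformation is scaling numeric fields into a uniform range so that variables measured on different scales become comparable. A minimal min-max normalization sketch (the "revenue" field is illustrative):

```python
# Min-max scale one numeric field of a record list into [0, 1].
def min_max_scale(records, field):
    values = [rec[field] for rec in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1                 # avoid division by zero
    for rec in records:
        rec[field] = round((rec[field] - lo) / span, 3)
    return records

data = [{"revenue": 100}, {"revenue": 300}, {"revenue": 500}]
min_max_scale(data, "revenue")
print([rec["revenue"] for rec in data])  # → [0.0, 0.5, 1.0]
```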

Data Mining:
In this step, appropriate tasks are applied in order to extract data patterns. These tasks are classification, clustering, association rule discovery, sequential pattern discovery, regression and deviation detection. You can choose any of these tasks based on whether you need to predict information or describe information. This step can be done by:
  • Matching KD goals with data mining methods to suggest hidden patterns.
    • This involves searching the various fields of the database for specific attributes, which are then used to identify trends that can be matched against the predictive models.
  • Choosing data mining algorithms to discover hidden patterns. This includes deciding which models and parameters might be appropriate for the overall KD process, which can be done by:
    • Selecting appropriate method(s) for looking for patterns in the data.
    • Choosing the model and parameters that might be appropriate for the method (searching for patterns of interest in a particular representational form). Some popular data mining methods are decision trees and rules, relational learning models, example-based methods, etc.

Pattern Evaluation:
Interpret essential knowledge from the mined patterns and relationships. The patterns that emerge during the data mining process must be evaluated to determine which are relevant to the model and which are not. If an evaluated pattern is not useful, the process might start again from any of the previous steps, making KD an iterative process.

Knowledge Presentation:
This is the final step in KD. The knowledge discovered is consolidated and presented to the user in a simple, easy-to-understand format. Mostly, visualization techniques are used to help users understand and interpret the information. This step allows you to use the knowledge and incorporate it into another system for further action. Moreover, you can document it and produce reports for interested parties.
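
The simplest presentation is a plain-text report, a stand-in here for richer visualization. The rules and their confidences below are invented for illustration:

```python
# Render discovered rules as a readable plain-text report.
def render_report(rules):
    lines = ["Discovered patterns", "-" * 19]
    for antecedent, consequent, confidence in rules:
        lines.append(f"IF {antecedent:<10} THEN {consequent:<10} ({confidence:.0%} confidence)")
    return "\n".join(lines)

rules = [("camping", "concerts", 0.67), ("bread", "milk", 0.80)]
print(render_report(rules))
```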

*Knowledge extraction: is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data [2].

**Knowledge discovery process: the process that leads to finding new knowledge in some application domain. It is also called knowledge discovery in databases. The process defines a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge in data. Each step is usually realized with the help of available commercial or open-source software tools [6].