International Student Assignment: Decision Research on Data Mining Technology in the Insurance Industry (Term Paper)

1 Introduction

With the rapid development of database technology and the widespread use of database management systems, every industry is accumulating more and more data. A great deal of important information lies hidden behind this rapidly growing data, and people want to analyze it at a higher level in order to make better use of it. Current database systems can efficiently handle data entry, query, and statistics, but they cannot discover the relationships and rules that exist in the data, nor can they predict future trends from existing data. The lack of means to mine the knowledge hidden behind the data has led to the phenomenon of "data explosion but knowledge poverty".

With the development of computer and network technology, obtaining data about a particular industry has become practical. For data that is large in volume and broad in scope, however, traditional statistical methods based on simple summaries and analysis against predefined models cannot complete the analysis. An intelligent information analysis technology, "data mining", therefore came into being.

Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. It works by mining the large volumes of data stored in a data warehouse to discover meaningful new association patterns and trends. As a new business information processing technology, data mining extracts, transforms, analyzes, and models the large amounts of business data held in commercial databases in order to derive the key data that supports business decisions, helping enterprises gain an edge in fierce market competition. In the insurance industry, there is currently broad market demand for it.

2 Project Description
The project developed the "Insurance Industry Decision System V1.0". The main operating interface is programmed in ASP and provides data preprocessing, analysis of customers' insurance purchases, analysis of customer buying habits, and output of results. The back-end database is implemented with the SQL Server 2005 network database, and the mining tool is SPSS Clementine 11.0. In the experimental research stage, the project addressed two major drawbacks of the Apriori algorithm, "storage complexity" and "a large number of redundant rules", by using a pattern tree structure to reduce the storage complexity of the Apriori algorithm while reducing the appearance of redundant rules.
The system consists of four main functional modules: data preprocessing, customer insurance purchase analysis, customer buying habits analysis, and analysis result output.

(1) The "data preprocessing" module includes upload, data platform, data processing, statistics, and data set generation functions.
● Upload: uploads the data of all branches under the insurance company.
● Data platform: allows a data platform to be selected before the data is uploaded.
● Data processing: cleans the data, converts formats, and performs other operations.
● Statistics: analyzes the preprocessed data and extracts the valid data.
● Data set generation: generates a data set from the valid data extracted by the statistics step, providing a higher-quality data source for data mining.

(2) The "customer insurance purchase analysis" module includes data import, parameter setting, and result analysis functions.
● Data import: in this interface, the user selects a data platform and imports the data set generated by "data preprocessing".
● Parameter setting: in this interface, the user sets parameters such as "support" and "confidence", together with the value ranges used to filter data records, so that the data set can be analyzed effectively.
● Result analysis: this interface displays the final results of the analysis as reports and charts. The results identify customers who buy several of the company's insurance (sub)products, giving the industry a basis for decisions on "winning customers".
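
The "support" and "confidence" parameters set in this module can be illustrated with a minimal sketch. The transactions, product names, and function names below are hypothetical and are not taken from the system; the sketch only shows how the two measures are conventionally computed for an association rule X → Y.

```python
from typing import FrozenSet, List

# Hypothetical policies bought by four customers (illustrative data only).
transactions: List[FrozenSet[str]] = [
    frozenset({"life", "health"}),
    frozenset({"life", "health", "auto"}),
    frozenset({"life", "auto"}),
    frozenset({"health"}),
]


def support(itemset: FrozenSet[str], db: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)


def confidence(lhs: FrozenSet[str], rhs: FrozenSet[str],
               db: List[FrozenSet[str]]) -> float:
    """Conventional rule confidence: support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)


# Rule "life -> health": customers who buy life insurance also buy health insurance.
rule_support = support(frozenset({"life", "health"}), transactions)
rule_confidence = confidence(frozenset({"life"}), frozenset({"health"}), transactions)
```

A rule would be kept only when both values reach the thresholds chosen in the parameter setting interface.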

(3) The "customer buying habits analysis" module includes data import, parameter setting, and result analysis functions.
● Data import: the same as "Data import" in module (2), "customer insurance purchase analysis".
● Parameter setting: sets the "input parameters" (basic customer information such as age, gender, and occupation) and the "output parameters" (information on the insurance the customer buys).
● Result analysis: this interface presents the results of the customer buying habits analysis, giving the industry a basis for decisions on "retaining customers".

(4) The "analysis result output" module prints the results of the "customer insurance purchase analysis" and the "customer buying habits analysis".

3 Improved Algorithm of the Project
The Apriori algorithm has two major defects: high time and space complexity, and a large number of redundant rules. This project therefore uses a pattern tree structure to reduce the storage complexity of the Apriori algorithm while reducing the number of redundant rules.
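
For contrast, the following is a minimal sketch of the classic level-wise Apriori procedure, not the project's code. The function name, the `min_count` parameter, and the sample data are assumptions made for illustration. It shows where the cost comes from: every level rescans the whole database and keeps a full candidate set in memory.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori(db: List[FrozenSet[str]], min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least `min_count` transactions."""
    frequent: Dict[FrozenSet[str], int] = {}
    # Level 1 candidates: every single item that appears in the database.
    candidates = {frozenset({item}) for t in db for item in t}
    k = 1
    while candidates:
        # One full scan of the database per level.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and drop any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {
            a | b
            for a in level
            for b in level
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent


# Hypothetical sample data.
db = [frozenset(t) for t in (
    ["life", "health"],
    ["life", "health", "auto"],
    ["life", "auto"],
    ["health"],
)]
result = apriori(db, min_count=2)
```

With this sample, the procedure scans the database twice and generates six candidate itemsets to find five frequent ones; the candidate set grows quickly as the number of items increases.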

3.1 Pattern Tree Structure
The pattern tree consists of a root node labeled "null", a set of item prefix subtrees as the children of the root, and an item header table. Each node in the tree contains four fields: user_id, count, node_link, and node_next. Here user_id is the user tag, which uniquely identifies a user; count is the number of paths that reach the node from its parent; node_link points to the next node in the tree with the same user_id, and is null when no such node exists; node_next points to the node's child nodes in the tree. Each entry in the item header table contains three fields: user_id, count, and head of node. user_id has the same meaning as in the tree; count is the sum of the counts of all tree nodes with that user_id; head of node is a pointer to the first node in the tree with the same user_id value.
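
The node and header table layout described above can be sketched as two small record types. This is an illustrative sketch only: the class names `Node` and `HeaderEntry`, the use of `None` for the "null" root label, and the sample values are assumptions, while the field names follow the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    user_id: Optional[str]               # None stands in for the "null" root label
    count: int = 0                       # number of paths reaching this node
    node_link: Optional["Node"] = None   # next tree node with the same user_id
    node_next: Optional["Node"] = None   # child node of this node


@dataclass
class HeaderEntry:
    user_id: str
    count: int = 0                       # sum of counts over nodes with this user_id
    head_of_node: Optional[Node] = None  # first tree node with this user_id


# Hypothetical example: two nodes for the same user, chained through node_link.
root = Node(user_id=None)
first = Node("u1", count=2)
second = Node("u1", count=1)
root.node_next = first
first.node_link = second
header: Dict[str, HeaderEntry] = {
    "u1": HeaderEntry("u1", count=first.count + second.count, head_of_node=first)
}
```

Following `head_of_node` and then `node_link` visits every node for a given user without traversing the whole tree.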

3.2 Creating the Pattern Tree
The algorithm is as follows.
Let the transaction database be A, and let Ai be one of its itemsets.
Algorithm: Patterntree (T, p), constructs the pattern tree
Input: the user transaction database A
Output: the user pattern tree
Procedure Patterntree (T, p)
{
    Create_tree (T);    // create the root node of the pattern tree, marked "null"
    while A <> null do
    {
        read an itemset Ai from the transaction database; let p be its first item
        t = T;    // t is the current node; each itemset starts at the root
        while p != null do
        {
            if p.user_id == n.user_id for some ancestor n of t
            then
            {
                n.count = n.count + 1;
                t = n;
            }
            elseif p.user_id == c.user_id for some child c of t
            then
            {
                c.count = c.count + 1;
                t = c;
            }
            else
                insert_Patterntree (T, p);    // insert p into the tree as a new child of the current node
            p = p.next;
        }
    }
}
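
The construction procedure above can be sketched in Python as follows. This is one reading of the pseudocode under stated assumptions, not the project's implementation: a `children` dictionary stands in for node_next, a `parent` pointer is added so that ancestors can be searched, the header table is a plain dictionary with a hypothetical `tail` helper field, and a newly inserted node becomes the current node.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self, user_id: Optional[str], parent: Optional["Node"] = None):
        self.user_id = user_id
        self.count = 0
        self.parent = parent                     # assumption: used to walk ancestors
        self.children: Dict[str, "Node"] = {}    # assumption: stands in for node_next
        self.node_link: Optional["Node"] = None  # next node with the same user_id


def find_ancestor(t: Node, user_id: str) -> Optional[Node]:
    """Return the nearest proper ancestor of t with this user_id, if any."""
    n = t.parent
    while n is not None and n.user_id is not None:
        if n.user_id == user_id:
            return n
        n = n.parent
    return None


def build_pattern_tree(db: List[List[str]]) -> Tuple[Node, Dict[str, dict]]:
    root = Node(None)                            # root labeled "null"
    header: Dict[str, dict] = {}
    for itemset in db:
        t = root                                 # each itemset starts at the root
        for uid in itemset:
            n = find_ancestor(t, uid)
            if n is not None:                    # matches an ancestor of t
                n.count += 1
                t = n
            elif uid in t.children:              # matches a child of t
                t = t.children[uid]
                t.count += 1
            else:                                # insert as a new child of t
                c = Node(uid, parent=t)
                c.count = 1
                t.children[uid] = c
                entry = header.setdefault(uid, {"count": 0, "head": None, "tail": None})
                if entry["head"] is None:
                    entry["head"] = c
                else:
                    entry["tail"].node_link = c  # chain nodes sharing a user_id
                entry["tail"] = c
                t = c
            header[uid]["count"] += 1            # header count sums all node counts
    return root, header


# Hypothetical sample data.
root, header = build_pattern_tree([["a", "b", "c"], ["a", "b", "d"], ["a", "c"]])
```

The database is read once, and shared prefixes such as "a", "b" are stored as a single path whose count records how many itemsets passed through it.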

3.3 Pruning the Pattern Tree
Once the pattern tree has been built, it may contain a large number of redundant branches. To ensure that the data mining results are not affected by the noise these branches generate, the tree must be pruned to remove the noise information.
Algorithm: SPT (Tree, a), prunes the pattern tree
// SPT is the supported pattern tree, i.e. Supported Access Pattern Tree; a is the item header table
Input: the pattern tree PatternTree, Min_Sup (the minimum support of the pattern tree)
Output: the pruned supported pattern tree SPT, patterns B = {bi | i = 1, 2, 3, ..., n}
SPT (Tree, a)
{
    i = 1;
    while (ai != null)    // for each entry ai in the item header table a
    {
        if (ai.count >= Min_Sup)
        then
        {
            pattern bi = ai.head of node;
            p = ai.head of node;    // p points to the position of ai in the pattern tree
            while (p != null and ai.count >= Min_Sup)
            {
                find the prefix base of p and join p with its prefix base to form pattern b;
                if (bi.count >= Min_Sup)
                then
                {
                    // bi.count is the minimum count of p and the prefix base of p in pattern b
                    keep p and its prefix base in pattern bi;
                    bi = bi.node_link;
                }
                else
                {
                    according to p and its prefix base in pattern b, delete the corresponding nodes of PatternTree, reconnect the child nodes to the parent node, and modify the item header table entry ai;
                    p = p.node_next;    // p points to the next position in the pattern tree
                }
            }
        }
        else
        {
            modify the value of the item header table entry ai;
            delete the corresponding nodes and their prefix bases from the pattern tree, and reconnect the child nodes;
        }
        i++;    // move on to the next header table entry in both cases
    }
}
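
The pruning step can be approximated by the following simplified sketch. It is not the SPT procedure itself: it assumes a plain prefix tree with a `children` dictionary and a `parent` pointer instead of node_link and node_next, removes every node whose user_id has a total count below Min_Sup, reconnects the removed node's children to its parent, and then reports each path whose minimum count reaches Min_Sup. All function names and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


class Node:
    def __init__(self, user_id: Optional[str], count: int = 0,
                 parent: Optional["Node"] = None):
        self.user_id = user_id
        self.count = count
        self.parent = parent
        self.children: Dict[str, "Node"] = {}


def add_path(root: Node, path: List[str]) -> None:
    """Plain prefix-tree insert, used only to build the demo tree."""
    t = root
    for uid in path:
        if uid not in t.children:
            t.children[uid] = Node(uid, 0, t)
        t = t.children[uid]
        t.count += 1


def totals(root: Node) -> Dict[str, int]:
    """Header-table counts: sum of node counts per user_id."""
    out: Dict[str, int] = {}
    stack = list(root.children.values())
    while stack:
        n = stack.pop()
        out[n.user_id] = out.get(n.user_id, 0) + n.count
        stack.extend(n.children.values())
    return out


def attach(parent: Node, child: Node) -> None:
    """Hang child under parent, merging with an existing child of the same user_id."""
    existing = parent.children.get(child.user_id)
    if existing is None:
        child.parent = parent
        parent.children[child.user_id] = child
    else:
        existing.count += child.count
        for grandchild in list(child.children.values()):
            attach(existing, grandchild)


def prune(node: Node, weak: Set[str]) -> None:
    """Remove nodes whose user_id is in weak and reconnect their children."""
    for child in list(node.children.values()):
        prune(child, weak)                       # clean the subtree first
        if child.user_id in weak:
            del node.children[child.user_id]
            for grandchild in list(child.children.values()):
                attach(node, grandchild)


def supported_patterns(root: Node, min_sup: int) -> List[Tuple[List[str], int]]:
    """List each path whose minimum node count is at least min_sup."""
    out: List[Tuple[List[str], int]] = []

    def walk(n: Node, prefix: List[str], low: int) -> None:
        path = prefix + [n.user_id]
        low = min(low, n.count)
        if low >= min_sup:
            out.append((path, low))
        for c in n.children.values():
            walk(c, path, low)

    for c in root.children.values():
        walk(c, [], c.count)
    return out


# Hypothetical sample data with Min_Sup = 2.
root = Node(None)
for path in (["a", "b", "c"], ["a", "b", "d"], ["a", "x", "c"], ["a", "b", "c"]):
    add_path(root, path)
min_sup = 2
weak = {uid for uid, n in totals(root).items() if n < min_sup}
prune(root, weak)
patterns = supported_patterns(root, min_sup)
```

In this sample, "d" and "x" fall below Min_Sup and are removed, and the "c" node that sat under "x" is reconnected directly to "a".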

Building the pattern tree avoids repeated scans of the transaction database. The count field also retains the occurrence counts of itemsets, which avoids generating a large number of frequent itemsets and helps reduce time and space complexity. The tree structure likewise avoids a large number of redundant rules.
Pruning removes the many redundant branches produced while the pattern tree is generated, which reduces space complexity. Rules can then be produced from the output patterns B, which avoids generating a large number of frequent itemsets and reduces time complexity.

4 Conclusion
The project improves the Apriori algorithm with a pattern tree structure to make up for its defects. The method improves on the Apriori algorithm in both time complexity and space complexity, and also avoids generating intermediate rules. The study shows that using a pattern tree structure to reduce the storage complexity of the Apriori algorithm, while reducing the appearance of redundant rules, is an effective way to improve the algorithm.