Exploiting Symbiosis between Data Mining and OLAP for Business Insights

2002-1-13 21:50:20 [Author] 畅享网
Keywords: theoretical discussion

By Bhooshan Kelkar, Ph.D.

Certain fundamental principles stand the test of time, in the old economy or the new, brick and mortar or dot-com (read dot-gone of 2001). What still holds true is "garbage in, garbage out." It clearly means we need good data, but is that enough? It is not: you also need the right set of tools to harness the prolific data in the business environment. Future strategies will have to bridge retrospective reporting technologies such as online analytical processing (OLAP) and proactive technologies such as data mining. Using one in place of the other does not satisfy the needs of business growth; a tighter integration between these technologies is necessary.

IDC estimates e-commerce will grow from $111 billion in 1999 to $1.3 trillion in 2003. This growth rate and the complexity that follows from it underscore the need for appropriate, faster and more efficient e-commerce solutions. There is a clear and ever-growing need to move from the paradigm of retrospective and dynamic data delivery at the record level (RDBMS and SQL) to dynamic data delivery at multiple levels (OLAP) and proactive knowledge discovery (data mining). Solving modern business problems such as market analysis and financial forecasting requires query-centric database schemas that are array oriented and multidimensional. This is because these business problems are characterized by the need to retrieve large numbers of records from very large data sets (hundreds of gigabytes and even terabytes) and summarize them quickly.

The problem is multidimensional because the analyst looks at the data from multiple points of view simultaneously. The key driver for OLAP technology is this aspect of multidimensional views. For example, a business analyst might want to get monthly (time dimension) sales data for a set of products (products dimension) across all market segments (market dimension) and across all customer groups (customer dimension) to analyze buying patterns. This is a multidimensional expression of a real business problem.

Relational databases are optimized for transaction processing and record-level queries, while OLAP is optimized for multidimensional queries and reporting. OLAP cubes store precomputed results at different levels of aggregation, so multidimensional query results are returned faster.
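To make the idea of precomputed results concrete, here is a minimal sketch in Python. The fact rows, the dimension names and the `build_cube` helper are all hypothetical, and a real OLAP engine stores and indexes its aggregates far more compactly. The sketch only shows that once totals exist for every combination of dimensions, a summary query becomes a lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (month, product, market, sales)
FACTS = [
    ("Jan", "A", "East", 10),
    ("Jan", "B", "East", 5),
    ("Jan", "A", "West", 7),
    ("Feb", "A", "East", 12),
    ("Feb", "B", "West", 3),
]
DIMENSIONS = ("month", "product", "market")


def build_cube(facts, dimensions):
    """Precompute sales totals for every subset of the dimensions."""
    cube = {}
    positions = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(positions, size):
            totals = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in subset)
                totals[key] += row[-1]  # last field is the sales measure
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


cube = build_cube(FACTS, DIMENSIONS)

# Each summary query is now a dictionary lookup instead of a scan.
jan_sales = cube[("month",)][("Jan",)]
east_a_sales = cube[("product", "market")][("A", "East")]
grand_total = cube[()][()]
```

The trade-off is storage: the number of precomputed groupings doubles with every added dimension, which is one reason the choice of dimensions matters.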

OLAP is a computing technique for summarizing, consolidating, viewing, applying formulae to and synthesizing data according to multiple dimensions. OLAP software enables users, such as analysts, managers and executives, to gain insight into performance of an enterprise through rapid access to a wide variety of data views that are organized to reflect the multidimensional nature of the enterprise performance data. An increasingly popular data model for OLAP applications is the multidimensional database (MDDB), which is also known as the data cube.

Data mining, on the other hand, is more of an exploration tool and uses several techniques.

  • Association is exploring relationships between attributes, including non-obvious and sometimes even counter-intuitive ones, to find interesting rules. An example of an association rule is "a sale of vodka implies a sale of lemonade."
  • Principal component analysis is identifying significant attributes for a given metric.
  • Clustering is grouping data based on similarity in one or many criteria. An example would be that of customer segmentation based on age, gender, etc.
  • Classification is segregation of data into predetermined groups, such as high-risk customers and low-risk customers.
  • Prediction is identifying trends and future values.

Analysts can use these algorithms individually or in combination for gaining insight into businesses and discovering interesting patterns and relationships for strategic and tactical decisions.
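As an illustration of the association technique listed above, the vodka and lemonade rule can be scored with the usual support, confidence and lift measures. This is a rough sketch: the baskets are invented, and `rule_metrics` is a hypothetical helper, not part of any particular mining suite.

```python
# Hypothetical market baskets, one set of items per sale.
BASKETS = [
    {"vodka", "lemonade", "chips"},
    {"vodka", "lemonade"},
    {"vodka", "ice"},
    {"lemonade", "bread"},
    {"bread", "milk"},
]


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    n = len(baskets)
    with_a = sum(1 for b in baskets if antecedent <= b)
    with_c = sum(1 for b in baskets if consequent <= b)
    with_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = with_both / n                        # share of all baskets
    confidence = with_both / with_a if with_a else 0.0
    lift = confidence / (with_c / n) if with_c else 0.0
    return support, confidence, lift


support, confidence, lift = rule_metrics(BASKETS, {"vodka"}, {"lemonade"})
```

A lift above 1 suggests the two items sell together more often than independent purchases would explain, which is what makes a rule "interesting" in this sense.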

In data mining projects, as much as 80 percent of the time is spent on preparing data. The remainder does require in-depth knowledge of machine learning and statistical techniques. However, as the data mining suites become more and more user friendly, it is easier for the end users in the data management community with domain knowledge and good business questions to successfully exploit data mining. Thus, the myth that for even dabbling in data mining one needs to have a Ph.D. and sit in an ivory tower is quickly crumbling. Data mining is no longer an esoteric activity but is an integral part and an embedded function in a variety of corporate software products.

While OLAP is by its nature retrospective, data mining is proactive. OLAP is driven by experts and is deductive in nature. Data mining is driven by the data itself and is inductive in nature. It is easy to imagine that these two technologies complement each other in business analytics. (See Figure 1.)


Figure 1: OLAP and Data Mining: A Complementary Relationship

Since both techniques require cleansed, consistent and integrated data, the same high quality data warehouse can provide data to both data mining and OLAP.

OLAP has the inherent ability to present data in many views and at different granularities by virtue of OLAP operations such as slicing, dicing, drill-down, pivoting and filtering. OLAP cubes are also being successfully used for presenting data mining results to business analysts. Similarly, various mining algorithms run on the clean data warehouse can stage the data and help evolve a more meaningful metadata model for building an OLAP cube.
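The OLAP operations named above can be sketched on a toy cube. Everything here is illustrative: the cell values, the axis names and the three helper functions are assumptions, and a production OLAP server exposes these operations through its own query interface.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (quarter, region, product) -> sales
AXES = ("quarter", "region", "product")
CELLS = {
    ("Q1", "CA", "A"): 40, ("Q1", "CA", "B"): 20,
    ("Q1", "TX", "A"): 30, ("Q1", "TX", "B"): 10,
    ("Q2", "CA", "A"): 50, ("Q2", "CA", "B"): 25,
    ("Q2", "TX", "A"): 35, ("Q2", "TX", "B"): 15,
}


def slice_cube(cells, axis, value):
    """Slice: fix one dimension to a single value and drop it from the key."""
    i = AXES.index(axis)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == value}


def dice_cube(cells, allowed):
    """Dice: keep only cells whose coordinates fall in the allowed sets."""
    return {
        k: v for k, v in cells.items()
        if all(k[AXES.index(axis)] in values for axis, values in allowed.items())
    }


def roll_up(cells, keep):
    """Roll up: sum away every dimension not listed in keep."""
    positions = [AXES.index(axis) for axis in keep]
    totals = defaultdict(int)
    for k, v in cells.items():
        totals[tuple(k[i] for i in positions)] += v
    return dict(totals)


q1 = slice_cube(CELLS, "quarter", "Q1")
ca_a = dice_cube(CELLS, {"region": {"CA"}, "product": {"A"}})
by_region = roll_up(CELLS, ["region"])
by_region_quarter = roll_up(CELLS, ["region", "quarter"])
```

Moving from `by_region` to `by_region_quarter` is a drill-down, the reverse of a roll-up, and listing the axes in a different order in `keep` gives a simple pivot.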

The possible areas of integration of OLAP and data mining can be conveniently grouped as pre-processing (cube-building stage) and post-processing (cube-analysis stage). The conceptual application of data mining algorithms for pre- and post-processing for OLAP is depicted in Figure 2.


Figure 2: Pre- and Post-Processing of OLAP Cubes Based on Data Mining

Selecting appropriate dimensions (dimension reduction) and deciding at what level data needs to be aggregated are important decisions that require significant insight when building an OLAP cube. Dimension reduction is an area where data mining techniques such as clustering or principal component analysis can be of great assistance to the analyst. As suggested in Figure 2, principal component analysis, clustering or association, either individually or in combination with each other, may be employed to evolve the cube structure.

OLAP cubes can be over 100GB in size and may be very difficult to interactively analyze. Post-processing of cubes can be done using data mining tools such as modeling or prediction using neural networks or fuzzy logic. Analysis of interesting areas of the OLAP cube is also a very important application of post-processing. Sophisticated statistical methods can be employed for the discovery of interesting areas in the cubes.

Data Mining for Pre-Processing of OLAP Cubes

Let us use an example of building OLAP cubes for Web-based business analytics. For such analysis, customer segmentation has an important role to play in making online campaigns more targeted and hence more effective. With this goal in mind, an analyst can run a clustering algorithm on the customer information in the data warehouse. An example of the results of this data mining activity is depicted in Figure 3. It shows that the customer group has been broken down into three big clusters labeled on the lifetime value (LTV) scale, a very popular and useful index for grouping customers. It shows that 10 percent of the customers belong to the high LTV cluster, 35 percent to medium LTV and 55 percent to low LTV.
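The article does not say which clustering algorithm or customer attributes lie behind Figure 3. As one possible sketch, assuming plain k-means on a single LTV score, the segmentation could look like the following. The LTV values are invented and chosen so that the split matches the 10/35/55 proportions described above.

```python
# Hypothetical lifetime-value scores for 20 customers.
LTV = [
    100, 120, 130, 150, 160, 170, 180, 190, 200, 210, 220,  # low
    900, 950, 1000, 1000, 1050, 1100, 1150,                 # medium
    5000, 5400,                                             # high
]


def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, k, iterations=20):
    """Plain k-means on one attribute with a deterministic start."""
    ordered = sorted(values)
    # Spread the initial centers evenly across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(values, centers)
        updated = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
        if updated == centers:  # converged: no center moved
            break
        centers = updated
    return centers, assign(values, centers)


centers, clusters = kmeans_1d(LTV, k=3)
sizes = [len(c) for c in clusters]
shares = [round(100 * len(c) / len(LTV)) for c in clusters]
```

In practice the clustering would run on many attributes at once, and the ranking of attributes within each cluster, discussed next, comes from the mining tool rather than from this sketch.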

The real value of data mining is further shown by the attributes that are ranked by their importance. As an example, the cluster of high LTV illustrates that the most important attribute for that cluster is quality. The most important attribute for the low LTV cluster is price. This tells the analyst that the high LTV customer cluster is more sensitive to quality than price and the opposite is true for the low LTV cluster of customers. This information can now be used to devise different customer-specific campaigns.

This information can be used to select the dimensions that are used when building the OLAP cube. This may be required when many dimensions are available from the relational database, but due to size and performance constraints only the most significant dimensions can be selected.

Data mining can assist the analyst not only in deciding on dimension reduction, but also in determining how many cubes to build. This helps the analyst balance the cost and business value of maintaining large cubes for different functional areas of the business. If three separate cubes are built for the low, medium and high LTV clusters, then dimensions can be selected for each accordingly. The three cubes built from the same data may not have the same set of dimensions.


Figure 3: Clustering-Based Pre-Processing for Customer OLAP Cube

Another, much wider application of data mining is to formalize the metadata and structure of the OLAP cubes. As an example, results from association can be fed into the OLAP cube-building process, thus bringing additional intelligence.

Using Data Mining to Analyze OLAP Cubes

In OLAP cubes, typically the entries are numerical data or nominal data. Since OLAP cubes present logically grouped views and aggregations at different levels, selection of a subcube for various business questions is intuitive. Once the subcube is selected, there are several mining algorithms that can synergistically work with OLAP cubes for post-processing.

Association on a data warehouse requires aggregation to be performed at different levels, which can be slow. Since an OLAP cube has precomputed results, performing association on an OLAP cube is much faster.

Statistics-based algorithms like correlation analysis can also work with an OLAP subcube. An example would be that of analyzing various market segments with respect to each other. Another application of statistical tools on the subcube is trending. Time-series analysis is a specific type of trending. An example would be that of predicting sales of a product line based on historical records for a particular market.
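Both ideas can be sketched on a small subcube. In the following, the monthly sales for the two market segments are hypothetical, the correlation is the standard Pearson coefficient, and the trend is an ordinary least-squares line. This is a deliberately simple stand-in for whatever time-series method a real tool would apply.

```python
from math import sqrt

# Hypothetical monthly sales for two market segments from a subcube.
EAST = [100, 110, 120, 130, 140, 150]
WEST = [80, 85, 95, 100, 110, 115]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def linear_trend(ys):
    """Least-squares line through (0, y0), (1, y1), ...: (slope, intercept)."""
    n = len(ys)
    mean_x, mean_y = (n - 1) / 2, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    return slope, mean_y - slope * mean_x


r = pearson(EAST, WEST)
slope, intercept = linear_trend(EAST)
next_month = slope * len(EAST) + intercept  # extrapolate one period ahead
```

A coefficient near 1 indicates the two segments move together. The straight-line forecast ignores seasonality, so it is only a starting point for real sales prediction.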

OLAP data cubes are also being increasingly used for interactive exploration of regions of anomalies in data, which are also referred to as exceptions. Problem areas and/or new opportunities may be identified when an anomaly is located. A novel algorithm to find exceptions based on advanced statistical prediction has been developed at IBM. This is a good example of mining working with OLAP in order to identify problem areas in large business data repositories.

An example of an exception in a cube is given in Figure 4. The value "6" is an exception in both the Region and Time dimensions.


Figure 4: Example of an Exception in the Sales Cube

Automatic identification of an exceptional value such as this can be a warning that is very useful for preventive maintenance. For example, an analyst might want to find why the sales in Florida (FL) dropped so dramatically compared to other states such as California (CA) or Texas (TX). The analyst can drill down into the cube to explore details of this exception and zero in on the root causes which can be useful in tactical decisions of the business. In other cases, an exception may help identify an opportunity.
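The IBM algorithm itself is not described here, so the following is only a simplified stand-in that conveys the idea. It assumes a simple additive model in which each cell is expected to equal its row mean plus its column mean minus the grand mean, and it flags cells whose residual is large relative to the overall spread of residuals. The sales table is hypothetical and does not reproduce Figure 4.

```python
from math import sqrt

# Hypothetical sales by region (rows) and quarter (columns).
REGIONS = ["CA", "TX", "FL"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]
SALES = [
    [52, 55, 58, 60],
    [48, 50, 53, 55],
    [50, 52, 6, 56],
]


def find_exceptions(table, threshold=2.0):
    """Return (row, col) positions whose value deviates strongly from an
    additive row-plus-column model. A simplified illustration only."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    residuals = [
        [table[i][j] - (row_means[i] + col_means[j] - grand)
         for j in range(n_cols)]
        for i in range(n_rows)
    ]
    # Root mean square of the residuals serves as the yardstick.
    spread = sqrt(
        sum(r * r for row in residuals for r in row) / (n_rows * n_cols)
    )
    return [
        (i, j)
        for i in range(n_rows)
        for j in range(n_cols)
        if spread > 0 and abs(residuals[i][j]) / spread > threshold
    ]


exceptions = [(REGIONS[i], QUARTERS[j]) for i, j in find_exceptions(SALES)]
```

Because the outlier also distorts the row and column means it is compared against, this naive version can miss weaker anomalies. More robust estimates would be needed on real cubes.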

As e-commerce grows in size and reach, business analytics is expected to assume even greater importance. As a wider spectrum of businesses realizes the need to be more customer centric, effective use of customer data becomes a necessity. Business analytics is, therefore, an essential tool and provides the means not only to stay afloat amid fierce competition, but also to excel in the market. In current and future marketplaces, where business moves at an accelerated speed, seeing the shifts in customer patterns and choices before your competitors and acting on them is crucial. This is where the integration of data mining technology and OLAP is bound to play a pivotal role.


 

Dr. Bhooshan Kelkar is an advisory software engineer in the IBM business intelligence group. Kelkar earned his B.Tech. in engineering from IIT, Bombay and M.S. and Ph.D. in artificial intelligence from Scotland. His 10-year career in applied machine learning includes engineering applications and, more recently, data management. Kelkar can be reached at bkelkar@us.ibm.com.

