Document Warehousing & Content Management: Poor Search Quality in Your Enterprise Information Portal?

2002-1-13 10:59:01 | Author: 畅享网
Keywords: theoretical discussion

By Dan Sullivan

As enterprise information portals (EIPs) grow to truly enterprise levels, finding relevant information becomes more challenging. Searches of large content repositories, whether we are talking about the Web or an EIP, often suffer two related problems. First, users' queries often return irrelevant information because the chosen keywords have multiple meanings. A second problem with keyword searching is that it assumes users know the terms that reflect the information they are looking for, a questionable assumption especially when researching new topics. There is no single solution to providing high-quality universal searches across an enterprise. We will do better to look for improvements by combining tools rather than trying to squeeze marginal improvements out of a single technique. Taxonomy-generation tools, for example, complement search engines and should be considered the second step toward providing high-quality searches in an enterprise information portal.

Taxonomy-generation tools create a Yahoo!-like directory structure for navigating content in a portal or intranet. The process of categorizing content with a taxonomy can start with a predefined set of categories such as those found in an industry thesaurus or an internal organization structure. Some tools, such as Verity's K2 Enterprise, provide automated methods for initial taxonomy construction based on hierarchical clustering, metadata extraction and other techniques. Key terms are associated with categories and provide the link between content and its place in the taxonomy. The final, and recurring, step is analyzing content to determine the most relevant terms and placing each document into the appropriate place in the taxonomy.
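The link between key terms and categories can be illustrated with a minimal sketch. It is not how Verity's K2 Enterprise or any other product works; the category names, key terms and scoring rule (the share of a document's words that are key terms of a category) are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category path -> key terms (illustrative only).
TAXONOMY_TERMS = {
    "Manufacturing/Quality Control": {"defect", "inspection", "tolerance", "sampling"},
    "Marketing/Promotions": {"discount", "campaign", "brochure", "offer"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text, taxonomy=TAXONOMY_TERMS):
    """Return (category, score) pairs sorted by descending score.

    The score is the fraction of the document's tokens that are key terms
    of the category. Categories with no matching term are omitted.
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return []
    scores = []
    for category, terms in taxonomy.items():
        hits = sum(counts[term] for term in terms)
        if hits:
            scores.append((category, hits / total))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(categorize("Inspection found a defect outside tolerance"))
```

A real categorizer would weight terms and handle word variants. The sketch only shows the basic mechanism: terms attached to categories decide where a document lands.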

With a taxonomy in place, users will have an easier time finding information. First, users do not have to come up with keywords to find information. Someone looking for best practices in quality control will not need to know specific terms about a manufacturing process or statistical measures; he or she only needs to know enough to drill down through a set of choices to find what he or she needs. A second benefit is that taxonomies can categorize all content in a portal or intranet. There are limits to the throughput of categorizers, but crawling the most important areas of an intranet will ensure the taxonomy indexes the most relevant documents and content. This, in turn, will improve the chance that users will find what they are looking for in their searches.

It's now time for some truth in advertising. Automatic taxonomy-generation tools are not so automatic. Categorizations vary in accuracy. Scalability is a concern. Improvements in the quality of the taxonomy will take time and require methodical evaluations and adjustments. Let's look at these individually.

Creating a taxonomy requires initial investment in defining the basic taxonomy structure. With the structure in place, the terms that link categories to documents will almost certainly require editing. For example, documentation about a product could end up in the same place as promotional material about the item even though the taxonomy has separate categories for technical documents and marketing material. To prevent this type of misclassification, the categorization rules need to be revised to include filter terms that discriminate between technical and marketing information.
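The role of filter terms can be sketched as follows. This is a toy rule format, not the rule syntax of any particular tool; the `CategoryRule` class, the product name "widget" and every term list are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CategoryRule:
    """A category with key terms that attract documents and filter terms
    that keep documents out (hypothetical rule format)."""
    name: str
    key_terms: set
    filter_terms: set = field(default_factory=set)

    def matches(self, words):
        # A single filter term disqualifies the document for this category.
        if words & self.filter_terms:
            return False
        return bool(words & self.key_terms)


# Both categories share the product name as a key term; only the filter
# terms tell technical documents apart from marketing material.
RULES = [
    CategoryRule("Technical Documents", {"widget"},
                 filter_terms={"discount", "sale", "offer"}),
    CategoryRule("Marketing Material", {"widget"},
                 filter_terms={"installation", "specification", "troubleshooting"}),
]


def classify(text, rules=RULES):
    """Return the names of all categories whose rule accepts the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [rule.name for rule in rules if rule.matches(words)]


if __name__ == "__main__":
    print(classify("Widget installation guide"))
    print(classify("Widget sale this week"))
```

A document that mentions only the product name still matches both categories, which is the kind of ambiguity that editing the rules is meant to reduce.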

Categorization is rarely a black or white proposition. Weights associated with assigned categories reflect the relative confidence that the content actually falls into that category. This information is essential for establishing business rules for managing a taxonomy. For example, any document placed into a category with a weight greater than 0.8 is automatically published, anything below 0.5 is rejected and everything in between is sent to a human for a decision. One approach to improving the quality of results is to explicitly support business rules and workflow in the categorization process. Another technique is to use multiple categorization algorithms to avoid the particular shortcomings of individual algorithms.
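The example rule above can be written down directly. The sketch below assumes weights lie between 0 and 1 and that the boundary values 0.5 and 0.8 themselves go to human review, since the rule only speaks of "greater than 0.8" and "below 0.5"; the function and constant names are illustrative.

```python
PUBLISH_ABOVE = 0.8  # weights strictly above this are published automatically
REJECT_BELOW = 0.5   # weights strictly below this are rejected


def route(weight, publish_above=PUBLISH_ABOVE, reject_below=REJECT_BELOW):
    """Decide what happens to a document given its category weight.

    Returns "publish", "reject" or "review" (send to a human).
    Assumes the weight is a confidence value in the range [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError(f"weight must be between 0 and 1, got {weight}")
    if weight > publish_above:
        return "publish"
    if weight < reject_below:
        return "reject"
    return "review"


if __name__ == "__main__":
    for w in (0.95, 0.8, 0.65, 0.5, 0.3):
        print(w, route(w))
```

Keeping the thresholds as parameters lets them be tuned per category as evaluation results come in.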

Any enterprise-class tool must scale, and taxonomy tools are no exception. One possible bottleneck is the clustering algorithm. Fortunately, a new breed of algorithms based upon support vector machines (SVMs) is offering much faster categorizations than some of the more traditional techniques. In addition, taxonomy tools should allow for incremental additions of documents without having to reanalyze content already categorized.

Finally, we need to remember Deming's warning about quality: if we do not measure it, we cannot improve it. The quality of a taxonomy is measured by how well it categorizes content compared to a knowledgeable human. To improve the quality of the taxonomy and its categorization of content, we need to methodically sample and evaluate what has been automatically categorized, especially after significant changes to the categorization rules.
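One simple way to carry out such an evaluation is to draw a random sample of categorized documents, have a knowledgeable person label them, and compute how often the two agree for each category. The sketch below is one possible measure, not a standard prescribed by the article; the function names and the data layout (dictionaries mapping document IDs to category names) are assumptions.

```python
import random


def sample_documents(doc_ids, sample_size, seed=0):
    """Draw a reproducible random sample of document IDs for human review."""
    rng = random.Random(seed)
    ids = sorted(doc_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def agreement_by_category(auto_labels, human_labels):
    """For each automatically assigned category, return the fraction of
    reviewed documents where the human reviewer chose the same category.

    auto_labels:  doc_id -> category assigned by the tool
    human_labels: doc_id -> category assigned by the reviewer (the sample)
    """
    totals = {}
    agreed = {}
    for doc_id, human in human_labels.items():
        auto = auto_labels[doc_id]
        totals[auto] = totals.get(auto, 0) + 1
        if auto == human:
            agreed[auto] = agreed.get(auto, 0) + 1
    return {cat: agreed.get(cat, 0) / n for cat, n in totals.items()}


if __name__ == "__main__":
    auto = {"d1": "Technical", "d2": "Technical", "d3": "Marketing"}
    human = {"d1": "Technical", "d2": "Marketing", "d3": "Marketing"}
    print(agreement_by_category(auto, human))
```

Running the same sample through the categorizer before and after a rule change shows whether the change helped or hurt.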

The taxonomy-generation market is relatively young, but companies such as Semio, Stratify, Quiver and SmartLogik are offering products with a range of functionality. Portal and search vendors, such as Verity, are incorporating taxonomy tools into flagship products. It is not clear what this market will look like a year from now. However, one thing is certain: taxonomies are as essential as search engines in enterprise information portals.
