Taxonomies, ontologies, and autoclassification
Historically, people have classified documents and books by hand to help users organize and consistently retrieve information. Libraries use the Dewey Decimal System to shelve and retrieve books, but as everyone who has ever had to find a book on a shelf knows, having a system does not make it easy to use. Many a time I’ve stood in an aisle looking for something like 970.81 and gotten frustrated with the process. (Is anyone else out there as old-fashioned as I am and still using a library?) Consistency matters, and humans are not consistent, even among themselves.
But we look far beyond libraries for information these days, and valuable knowledge and insights exist in an organization’s own information: e-mails, memos, reports, field notes, blog posts, and social media content. People cannot categorize all of this information by hand, and even if they could, we already know that people classify documents inconsistently. Anyone who has had to look through folder trees on a shared drive for a specific document has felt the truth of those words. Consequently, we are moving from manual classification to computer-based auto-classification systems. Such systems help organize and retrieve content as well as understand what is in the content, much like an index at the back of a book.
Apply a classification model
So the question becomes how to apply a classification model to the kind of large document set you’re likely to find inside an organization. Other approaches, such as text mining and entity extraction, pull facts from an article and complement a classification model, but I’ll focus on the model in this post.
Classification models allow organizations to define what is meaningful to their particular line of business, thereby providing context to terms in the model. For example, a news company might define a “drug maker” as a criminal, whereas a pharmaceutical company would likely define the same term entirely differently.
Classification models also often use synonyms, which broaden the system’s ability to identify entities and concepts and take away the need for everyone to think of something in the same way. For example, a taxonomy of health issues might direct its classifier to find content about H1N1 when a searcher types “swine flu.” In this way, the computer understands a link that the user might not think to look for or might not understand even exists.
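To make the synonym idea concrete, here is a minimal sketch in Python. It is my own illustration rather than how any particular product works: the taxonomy fragment, the `HEALTH_TAXONOMY` name, and both helper functions are hypothetical, and a real classifier would use far richer rules than whole-phrase matching.

```python
import re

# Hypothetical taxonomy fragment: each preferred term maps to its synonyms.
HEALTH_TAXONOMY = {
    "H1N1": ["swine flu", "swine influenza", "H1N1 virus"],
    "Hypertension": ["high blood pressure"],
}


def build_synonym_index(taxonomy):
    """Map every label (preferred term or synonym), lowercased, to its preferred term."""
    index = {}
    for preferred, synonyms in taxonomy.items():
        for label in [preferred, *synonyms]:
            index[label.lower()] = preferred
    return index


def find_concepts(text, index):
    """Return the set of preferred terms whose labels appear in the text as whole phrases."""
    found = set()
    for label, preferred in index.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.add(preferred)
    return found


index = build_synonym_index(HEALTH_TAXONOMY)
print(find_concepts("Is swine flu still spreading?", index))    # {'H1N1'}
print(find_concepts("New H1N1 cases reported in March", index))  # {'H1N1'}
```

Both the searcher’s phrase and the document’s wording resolve to the same preferred term, which is what lets the system connect them even though the strings never match directly.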
Once a document is classified, it can be labeled with metadata, also known as tags, describing the core concepts found in the text. These tags can then be stored for later recall by search engines and used by workflow applications to automate enterprise processes. In addition, SharePoint users can add the classification results to their farms to bolster SharePoint’s native abilities and add deeper taxonomy relationships. This keeps information better organized and makes it more findable for the user. Most importantly, it also takes the burden of applying terms off the user, which makes metadata more consistent.
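The tagging step can be sketched along the same lines. This is only a rough illustration under my own assumptions: the `RULES` table, the `TaggedDocument` class, and the document IDs are all made up, and the plain substring matching used here is far cruder than what a production classifier would do (it would, for instance, match “agreement” inside “disagreement”).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaggedDocument:
    doc_id: str
    text: str
    tags: list = field(default_factory=list)


# Hypothetical rules: concept name -> terms that signal the concept.
RULES = {
    "Safety Incident": ["injury", "near miss"],
    "Contract": ["agreement", "statement of work"],
}


def tag_document(doc_id, text, rules=RULES):
    """Attach every concept whose signal terms occur in the text (case-insensitive)."""
    lowered = text.lower()
    tags = sorted(
        concept
        for concept, terms in rules.items()
        if any(term in lowered for term in terms)
    )
    return TaggedDocument(doc_id, text, tags)


def build_tag_index(documents):
    """Store tags for later recall: map each tag to the IDs of documents carrying it."""
    index = defaultdict(list)
    for doc in documents:
        for tag in doc.tags:
            index[tag].append(doc.doc_id)
    return dict(index)


docs = [
    tag_document("memo-1", "Field note: near miss at the loading dock."),
    tag_document("email-7", "Attached is the signed agreement."),
]
tag_index = build_tag_index(docs)
print(tag_index["Safety Incident"])  # ['memo-1']
```

Because the same rules run over every document, the tags come out the same way each time, and a search engine or workflow tool can look documents up by tag instead of relying on whatever terms an individual author happened to apply.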
Systems that use models for auto-classification help organize and examine information for use in a variety of ways. They look through volumes of content in a fraction of the time that humans would, and they apply more consistent metadata to it. They drive discovery of links between documents and reveal patterns within text, and they save knowledge workers time in searching for and recreating content.
Go here to see the differences between systems that use machine learning and those that use the type of rulebase system I talk about below.