Iterative taxonomy building yields more reliable classification

Creating a taxonomy, ontology, or other classification scheme for manual classification is a different process from creating one for auto-classification. A taxonomy or ontology that people consult when manually labeling content can be structured in almost any way that makes sense to the people using it.

TCA manual taxo.PNG

In Fig. 1, we see a good example of a manual tagging system that a Bacon Tree Consulting client used to tag articles before they were distributed across a news wire. The first item in the list is 50 Plus, also referred to as Baby Boomers. Editors selected articles to tag with the 50 Plus label and were able to choose from a broad range of topics that appeal to Baby Boomers, including travel, fitness, fashion and health insurance. That diversity is a good example of how manual tagging can force content into different categories.

However, the content itself drives auto-classification. The language in a piece of content, how the text is formatted, who its author is, and when it was created can all be used in auto-classification.

The client who used a taxonomy to manually classify Baby Boomer articles also had an automated system that processed about 90 percent of the client's content. That required mapping human concepts to words, phrases and terms that computers recognize, rather than having a person decide on the classification.

classification snippet.png

Fig. 2 illustrates how auto-classification software searches a news article for words and phrases that match terms in a taxonomy or ontology. It uses a combination of seemingly unrelated phrases to identify and label the article with a college football team name.
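The matching idea can be sketched in a few lines of Python. This is a minimal illustration, not the software shown in the figure: the taxonomy entries, the team name "Example State football", and the `min_hits` threshold are all assumptions made up for the example. A label is applied only when several of its cue phrases appear together, which is how individually ambiguous phrases can add up to a confident match.

```python
import re

# Hypothetical taxonomy: each label maps to cue phrases. The labels' phrases
# are invented for illustration and do not come from the client's system.
TAXONOMY = {
    "Example State football": [
        "example state",
        "quarterback",
        "head coach",
        "bowl game",
    ],
    "50 Plus": ["retirement", "baby boomers", "medicare"],
}


def matched_phrases(text, phrases):
    """Return the phrases that occur in text as whole words, ignoring case."""
    lowered = text.lower()
    return [
        phrase
        for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    ]


def classify(text, taxonomy, min_hits=2):
    """Map each label to its matched phrases when at least min_hits
    distinct phrases for that label were found in the text."""
    labels = {}
    for label, phrases in taxonomy.items():
        hits = matched_phrases(text, phrases)
        if len(hits) >= min_hits:
            labels[label] = hits
    return labels


article = (
    "The head coach said his quarterback is healthy again, "
    "and the team expects a bowl game invitation."
)
print(classify(article, TAXONOMY))
# {'Example State football': ['quarterback', 'head coach', 'bowl game']}
```

Raising `min_hits` makes the classifier more conservative; a single cue phrase such as "retirement" is not enough to apply the 50 Plus label in this sketch.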

Both manual and auto-classification taxonomies can identify amorphous topics, but they require different approaches. Subject matter experts build manual taxonomies, and they are also the only ones who can apply them successfully, because one has to understand the subject to link content to a taxonomy.

However, a taxonomy for auto-classification maps the knowledge of subject matter experts in a way that machines can apply consistently across large quantities of content. A taxonomy for auto-classification also benefits most from an iterative approach.

This is due, in part, to what auto-classification software looks for when working: the more conceptual the term, as in our Baby Boomers example, the more information the software needs to identify it. Another example may help: if software is trying to identify wars, it may simply be able to look for the names of wars. However, if it needs to identify causes of war, it will need to look for terminology around many different subjects and identify them all as contributing to war.

People can manually tag poverty, political unrest, border invasions, declining GDP, religious differences, and other concepts as causes of war, but it is much harder for a computer to make those decisions unless the words are actually in the text. Accomplishing this requires a cycle that includes mining content for word use and language patterns, adding those terms to a taxonomy, classifying content, and testing that classification.

An iterative approach to term development

The iterative approach to building a taxonomy or ontology for auto-classification delivers timely, regular and reliable business value through short cycles that emphasize collaboration and feedback. Those cycles can be any length, but for ontology building, one- or two-week cycles give a team time to accomplish larger tasks while allowing for adjustments as needed. The following processes make up the iterative term build cycle. They build on one another, with each step contributing to the success of the others as well as to the entire project.

term life cycle development.PNG

Scope – Decide on the term or term set your team will be working on, investigate the resources available for mining of terms and for subject expertise, and set a time range for developing those terms.

Design – Plan how terms will fit into your existing taxonomy, determine whether those terms relate to other concepts already in your taxonomy, and decide where new terms will be added in your structure.

Collaboration – Work with teammates to deliver the highest-priority tasks or goals. Collaboration helps a team deal with shifting goals, because people who work together are more likely to understand what a shift entails and the value it needs to bring. It also helps team members working on related terms understand how those terms work together to auto-classify content.

Feedback – Hand-in-hand with collaboration is feedback. In taxonomy building, feedback should exist

  • among team members,

  • between the team and other stakeholders, which makes it easier to adjust to shifting business needs, and

  • from your content. This helps create a model that truly reflects the content you have instead of the content you think you have.

Delivery – Deliver topic sets to stakeholders along the way, as each set is completed.

Begin your cycle by adding a broad taxonomy of terms and synonyms; add only one or two levels. This will prevent you from developing the taxonomy you think should work and help you create a taxonomy that will work with very little adjustment.

Scope your term development by auto-classifying content using your broad terms, finding content that was classified under a particular term, and looking at a subset of that content to identify trends. Decide how the concepts in those trends fit into your taxonomy and add the broadest available terms to it. For example, add "education" as a broad term. Classify content, pick out all content that was classified as education, and identify themes, such as "early childhood education" and "adult education." Add this second level of terms to your taxonomy. Assign further development of similar terms to different team members who will collaborate on developing them.
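One way to surface candidate themes from content that was classified under a broad term is to count the modifiers that appear in front of it. The sketch below assumes the documents have already been classified as "education"; the function name `candidate_subterms`, the stopword list, and the sample sentences are hypothetical. The output is a list of candidates for a person to review, not a finished second level of the taxonomy.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def candidate_subterms(docs, broad_term, max_modifiers=2, min_docs=2):
    """Count phrases of the form '<modifier(s)> broad_term' by the number
    of documents they appear in, keeping those seen in min_docs or more."""
    counts = Counter()
    for doc in docs:
        words = re.findall(r"[a-z']+", doc.lower())
        seen = set()  # count each phrase once per document
        for i, word in enumerate(words):
            if word != broad_term:
                continue
            for n in range(1, max_modifiers + 1):
                start = i - n
                # Skip phrases that run off the start or begin with a stopword.
                if start < 0 or words[start] in STOPWORDS:
                    continue
                seen.add(" ".join(words[start:i + 1]))
        counts.update(seen)
    return {phrase: n for phrase, n in counts.items() if n >= min_docs}


# Made-up snippets standing in for content classified as "education".
education_docs = [
    "Funding for early childhood education rose this year.",
    "The district expanded early childhood education programs.",
    "Enrollment in adult education classes is growing.",
    "Adult education centers reopened in the fall.",
    "State leaders debated early childhood education standards.",
]

for phrase, n in sorted(candidate_subterms(education_docs, "education").items()):
    print(f"{phrase}: {n} documents")
```

Here "early childhood education" and "adult education" both clear the threshold, along with the fragment "childhood education", which a reviewer would fold into the longer phrase before adding it to the taxonomy.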

Team members will test how those terms perform in auto-classification against your content. Each broader concept will catch similar content that your team then examines for more specific themes. Those themes become the next, narrower set of concepts. Deliver each broader set of concepts to your business stakeholders and users as it is completed. Repeat the process for each tier until you develop the specificity you need.
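The article does not prescribe a test method. One common option, shown here as an assumption, is to compare machine labels against a small sample that editors have tagged by hand and to compute precision and recall for each term. The document ids and labels below are made up for illustration.

```python
def precision_recall(predicted, expected, term):
    """Compare machine labels with hand-applied labels for one term.

    predicted and expected each map a document id to the set of terms
    assigned to that document. Returns (precision, recall).
    """
    true_pos = false_pos = false_neg = 0
    for doc_id in set(predicted) | set(expected):
        machine = term in predicted.get(doc_id, set())
        human = term in expected.get(doc_id, set())
        if machine and human:
            true_pos += 1
        elif machine:
            false_pos += 1   # machine applied the term, editor did not
        elif human:
            false_neg += 1   # editor applied the term, machine missed it
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical hand-tagged sample and the matching machine output.
editor_labels = {
    "a": {"education"}, "b": {"education"}, "c": {"education"},
    "d": {"travel"}, "e": {"education"},
}
machine_labels = {
    "a": {"education"}, "b": {"education"}, "c": {"education"},
    "d": {"education"}, "e": set(),
}

precision, recall = precision_recall(machine_labels, editor_labels, "education")
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.75 recall=0.75
```

Low precision suggests a term's cue phrases are too broad, while low recall suggests that phrases used in the content are still missing from the taxonomy. Either result feeds the next cycle.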

To sum up: add, classify, test, repeat

Contrary to the common belief that one needs a complete, fully formed, robust model before classification can begin, an iterative process lets you develop a solid taxonomy or ontology that delivers value early and often. Building a taxonomy is a process of discovery: you will generally not know the destination of your effort when you begin.
