Text analytics, taxonomy and auto-classification terms

Nov 7

A running list of vocabulary used in the text analytics field.

Authority list

A set of preferred or authorized terms. For example, there may be an authority list for sports team names to encourage the use of formal names instead of team nicknames.
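As a rough illustration, an authority list can be sketched as a lookup table from known variants to the one preferred term. The team names, the table itself and the function name preferred_term are all hypothetical choices made for this example, not part of any particular product.

```python
from typing import Optional

# Hypothetical authority list: every known variant maps to one preferred term.
AUTHORITY_LIST = {
    "niners": "San Francisco 49ers",
    "49ers": "San Francisco 49ers",
    "san francisco 49ers": "San Francisco 49ers",
    "pats": "New England Patriots",
    "new england patriots": "New England Patriots",
}


def preferred_term(name: str) -> Optional[str]:
    """Return the authorized form of a name, or None if it is not listed."""
    return AUTHORITY_LIST.get(name.strip().lower())
```

Calling preferred_term("Niners") returns the formal team name, while a name that is not on the list returns None so that an editor can review it.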

Controlled vocabulary

Some people use controlled vocabulary as a generic term for classification models. Others define it as a predefined set of terms for classifying content, where content can be labeled only with terms from that set.
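Under the stricter definition, the key behavior is that labels outside the vocabulary are refused. Here is a minimal sketch of that check; the four-term vocabulary and the function name split_labels are assumptions made up for the example.

```python
# Hypothetical controlled vocabulary for a news site.
CONTROLLED_VOCABULARY = {"business", "politics", "sports", "weather"}


def split_labels(proposed):
    """Separate proposed labels into those allowed by the vocabulary and the rest."""
    accepted, rejected = [], []
    for label in proposed:
        normalized = label.strip().lower()
        if normalized in CONTROLLED_VOCABULARY:
            accepted.append(normalized)
        else:
            rejected.append(label)
    return accepted, rejected
```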

Data analytics

Examining raw numerical data to find useful information in it. The process often includes cleaning and normalizing the data to get it into a digestible format.

Lemmatization

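Lemmatization is similar to word stemming but uses a more rigorous procedure, typically a dictionary lookup combined with the word's part of speech, to return the dictionary form of a word (its lemma) with greater accuracy. For example, flew becomes fly.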
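To make the contrast with stemming concrete, here is a toy sketch in which lemmatization is a dictionary lookup keyed on the word and its part of speech. The six-entry dictionary and the function name lemmatize are hypothetical; real lemmatizers rely on a full vocabulary and morphological analysis.

```python
# Tiny hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("flew", "verb"): "fly",
    ("flying", "verb"): "fly",
    ("flies", "verb"): "fly",
    ("flies", "noun"): "fly",
    ("better", "adjective"): "good",
    ("mice", "noun"): "mouse",
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())
```

Irregular forms such as flew or mice are the cases where a lookup like this succeeds and simple suffix stripping does not.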

Model

A model is a generic term for any group of terms used to define a domain – no matter how it is arranged. People often combine it with the word classification to form classification model. Physical models help you visualize the end product. Think of a model house in a subdivision or an airplane model. The term has its basis in math, where it is used to describe the general characteristics of a process, device or concept. Computer science is based on math, so the term has migrated to many technology fields.

Named entity extraction

Named entity extraction finds and labels things such as people, places, organizations (including companies), dates and amounts of money in text.
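A very simple way to picture this is a list of known names plus a couple of patterns for dates and money. The sketch below takes that approach; the names in GAZETTEER, the two regular expressions and the function name extract_entities are all assumptions for illustration, and production systems generally use much larger resources or statistical models.

```python
import re

# Hypothetical gazetteer (list of known names) and patterns.
GAZETTEER = {
    "London": "PLACE",
    "John Smith": "PERSON",
    "Acme Corp": "ORGANIZATION",
}
PATTERNS = {
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "DATE": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? "
        r"\d{1,2}(?:, \d{4})?"
    ),
}


def extract_entities(text):
    """Return (label, matched text) pairs in the order they appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), label, match.group()))
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sort on the start offset, then drop it from the result.
    return [(label, value) for _, label, value in sorted(found)]
```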

Ontology

The term has its origins in philosophy, where it refers to the study of the nature of being. I believe that still holds true in computer science: ontologies define the nature of a thing in relation to other things in the same area. They consist of terms and the relationships among those terms.

Part-of-speech tagging

Using software to label each word with its part of speech, such as noun, verb, adjective or adverb.
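The simplest possible tagger just looks each word up in a lexicon. The sketch below does that, with a made-up seven-word lexicon and a hypothetical function name tag; unknown words fall back to noun. Real taggers also use the surrounding words, because many words can be more than one part of speech.

```python
# Hypothetical mini-lexicon; unknown words default to "noun" as a fallback.
LEXICON = {
    "the": "determiner",
    "a": "determiner",
    "quick": "adjective",
    "bakes": "verb",
    "baker": "noun",
    "cake": "noun",
    "quickly": "adverb",
}


def tag(sentence: str):
    """Pair each whitespace-separated word with a part-of-speech label."""
    words = sentence.lower().rstrip(".").split()
    return [(word, LEXICON.get(word, "noun")) for word in words]
```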

Predictive analytics

A branch of data analytics concerned with predicting future conditions from current or past conditions.

Relationship extraction

Relationship extraction identifies how things mentioned in text are related. Two examples would be “Bakers make cakes,” which relates a maker to a product, and “John lives in London,” which relates a person to a place.
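One way to picture the output is as a (subject, relation, object) triple. The sketch below pulls such triples from the two example sentences with hand-written patterns; the patterns, the relation names and the function name extract_relation are hypothetical, and real systems handle far more varied sentences.

```python
import re

# Hypothetical relation patterns: each maps a verb phrase to a relation name.
RELATION_PATTERNS = [
    (re.compile(r"^(\w+) makes? (\w+)\.?$"), "makes"),
    (re.compile(r"^(\w+) lives? in (\w+)\.?$"), "lives_in"),
]


def extract_relation(sentence: str):
    """Return a (subject, relation, object) triple, or None if no pattern fits."""
    for pattern, relation in RELATION_PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            return (match.group(1), relation, match.group(2))
    return None
```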

Schema

A structured framework or plan, so “classification schema” is a generic term that refers to any set of terms used to classify content.

Sentiment analysis

Text analytics used to classify the mood of a text as positive, negative or neutral.
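A common starting point is to count words from a list of positive and negative terms. The sketch below does only that; the two word lists and the function name sentiment are assumptions for the example, and the approach ignores negation, sarcasm and context.

```python
import re

# Hypothetical sentiment lexicon; real lexicons hold thousands of scored words.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}


def sentiment(text: str) -> str:
    """Label text positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```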

Synonym ring

A synonym ring links a term to a list of alternate ways of referring to it. For example, the term Affordable Care Act will also have a listing for Obamacare, since the law is known by both names.
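In search, a synonym ring is often used to expand a query so that any member of the ring finds content labeled with any other. Here is a small sketch of that idea; the rings themselves (including the abbreviation ACA and the soda ring) and the function name expand_query are hypothetical additions for the example.

```python
# Hypothetical synonym rings: every term in a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"affordable care act", "obamacare", "aca"},
    {"soda", "pop", "soft drink"},
]


def expand_query(term: str):
    """Return the sorted ring containing the term, or just the term if none does."""
    normalized = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return sorted(ring)
    return [normalized]
```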

Taxonomy

Taxonomy is a way of describing a particular area of knowledge in a simple hierarchy. Most people were probably first exposed to taxonomies as schoolchildren, when they learned the Dewey Decimal System for organizing library books or learned how animals and plants are sorted in the scientific world: Kingdom, Phylum, Class, Order, Family, Genus, Species. In that scheme, a lion is a feline (Family), which is a mammal (Class), which is an animal (Kingdom). In such a hierarchical classification, each level helps define the one above it, as well as the one below it.
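The lion example can be sketched as a set of child-to-parent links, which is one simple way to store a hierarchy. The function names lineage and is_a are hypothetical; the scientific names are the standard ones for the lion.

```python
# Child-to-parent links for the lion's place in the Linnaean hierarchy.
PARENT = {
    "Panthera leo": "Panthera",   # Species -> Genus
    "Panthera": "Felidae",        # Genus -> Family
    "Felidae": "Carnivora",       # Family -> Order
    "Carnivora": "Mammalia",      # Order -> Class
    "Mammalia": "Chordata",       # Class -> Phylum
    "Chordata": "Animalia",       # Phylum -> Kingdom
}


def lineage(term: str):
    """Walk up the hierarchy from a term to the root, broadest term last."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_a(term: str, ancestor: str) -> bool:
    """True if ancestor appears anywhere above term in the hierarchy."""
    return ancestor in lineage(term)[1:]
```

Because each term has exactly one parent, content tagged with a narrow term such as Panthera leo can also be found under every broader term above it.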

Text analytics

Text analytics extracts high-quality information from unstructured text to distill its meaning. It goes beyond counting words to extract meaning and give that meaning some context. It can often tell you why something occurred, whereas data analytics tells you what occurred.

Text mining

Text mining and text analytics are sometimes used interchangeably, but some experts say there is an important distinction: text mining treats text as words to be counted, extracting the number of words within documents and the number of kinds of words within kinds of documents.
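The counting described here can be sketched with nothing more than a word counter. The function names word_counts and corpus_counts and the simple word pattern are assumptions for the example; a real pipeline would usually add steps such as stop-word removal.

```python
import re
from collections import Counter


def word_counts(document: str) -> Counter:
    """Count how often each lowercased word appears in a document."""
    return Counter(re.findall(r"[a-z']+", document.lower()))


def corpus_counts(documents):
    """Total word counts across documents, plus how many documents contain each word."""
    totals, document_frequency = Counter(), Counter()
    for document in documents:
        counts = word_counts(document)
        totals.update(counts)
        document_frequency.update(counts.keys())
    return totals, document_frequency
```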

Word sense disambiguation

Word sense disambiguation is the ability to determine the intended meaning of a word that has multiple definitions, often from context. An example is the word well, which can refer to a source of water, to good health or to something done skillfully.
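One classic idea, loosely in the spirit of the Lesk algorithm, is to pick the sense whose typical companion words overlap most with the sentence. The sketch below applies that to well; the three senses, their signature words and the function name disambiguate are hypothetical and far smaller than a real sense inventory.

```python
import re

# Hypothetical sense inventory for "well": each sense has a few signature words.
SENSES = {
    "water source": {"water", "dig", "bucket", "deep", "drink", "pump"},
    "in good health": {"feel", "feeling", "sick", "doctor", "health", "recover"},
    "in a good manner": {"played", "done", "performed", "written", "works"},
}


def disambiguate(sentence: str):
    """Pick the sense whose signature words overlap most with the sentence."""
    context = set(re.findall(r"[a-z']+", sentence.lower()))
    best = max(SENSES, key=lambda sense: len(SENSES[sense] & context))
    # With no overlapping words at all there is no evidence, so return None.
    return best if SENSES[best] & context else None
```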

Word stemming

Word stemming produces the root of a word by applying heuristics, such as stripping common suffixes; for example, flying becomes fly. A search that uses a word stemmer will then match both forms in text.
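The heuristics can be as simple as stripping a few common suffixes. The sketch below is a deliberately crude stemmer built on four made-up suffix rules and a hypothetical function name stem; it is not the Porter algorithm or any other published stemmer.

```python
# Hypothetical suffix rules, tried in order; not the Porter algorithm.
SUFFIXES = ["ing", "ed", "es", "s"]


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With these rules flying becomes fly, but flies becomes fli, which shows that a stem is not always a real word and that heuristics can miss the connection between related forms.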

Evelyn Kent
