A set of preferred or authorized terms. For example, there may be an authority list for sports team names to encourage the use of formal names instead of team nicknames.
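As a rough illustration, an authority list can be modeled as a lookup from variant names to the authorized form. The team entries and the `authorized_name` helper below are made up for this sketch and are not taken from any real authority file.

```python
# Hypothetical authority list: nickname or variant -> authorized formal name.
AUTHORITY_LIST = {
    "niners": "San Francisco 49ers",
    "49ers": "San Francisco 49ers",
    "pats": "New England Patriots",
}


def authorized_name(term: str) -> str:
    """Return the authorized form of a term, or the term unchanged if unlisted."""
    return AUTHORITY_LIST.get(term.strip().lower(), term)


print(authorized_name("Niners"))
```

Unlisted terms pass through unchanged here; a stricter system might flag them for review instead.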
Some people use the term controlled vocabulary as a generic term for classification models. Others define it as a predefined set of terms used to classify content, where content can be labeled only with terms in the controlled vocabulary.
This moves beyond text alone to cover video, audio, and images.
Looking at raw numerical data to find useful information from the data. The process often includes cleaning and normalizing the data to get it into a digestible format.
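The cleaning and normalizing step might look something like the following sketch. The sample readings, the `clean` and `min_max_normalize` helpers, and the choice of min-max scaling are all assumptions for illustration; real pipelines vary widely.

```python
def clean(raw):
    """Convert raw entries to floats, skipping anything that is not a number."""
    values = []
    for item in raw:
        try:
            values.append(float(str(item).replace(",", "").strip()))
        except ValueError:
            continue  # drop entries such as "n/a"
    return values


def min_max_normalize(values):
    """Rescale values to the range 0..1 (all zeros if every value is equal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


readings = clean(["1,000", " 250 ", "n/a", 500])
print(min_max_normalize(readings))
```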
Lemmatization is similar to word stemming but uses a more rigorous procedure, typically a vocabulary and morphological analysis, to return the dictionary form of a word (its lemma) with greater accuracy than a stemmer's heuristics.
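A toy sketch of the idea: a lemmatizer can resolve irregular forms that suffix-stripping cannot reach. The `LEMMAS` table and `lemmatize` function are hypothetical and hand-written; production lemmatizers rely on large dictionaries and part-of-speech information.

```python
# Hypothetical, hand-written lemma table for illustration only.
LEMMAS = {
    "flew": "fly",
    "flying": "fly",
    "flies": "fly",
    "mice": "mouse",
    "better": "good",
}


def lemmatize(word: str) -> str:
    """Look up the dictionary form of a word; fall back to the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)


print(lemmatize("Flew"), lemmatize("mice"))
```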
Model is a generic term for any group of terms used to define a domain, no matter how the terms are arranged. People often combine it with the word classification to form “classification model.”
Named entity extraction finds and labels things in text such as people, places, organizations, companies, dates, and amounts of money.
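A deliberately simple sketch of entity labeling follows, assuming a tiny hand-made name list plus two regular expressions. The `GAZETTEER` entries, the patterns, and the label names are hypothetical; real extractors are usually statistical or neural rather than rule-based like this.

```python
import re

# Hypothetical name list: surface form -> entity label.
GAZETTEER = {"John": "PERSON", "London": "PLACE", "Acme": "ORGANIZATION"}

# Simple patterns for US-style dollar amounts and ISO dates.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def extract_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]


print(extract_entities("Acme paid John $1,200.50 in London on 2021-03-04."))
```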
At first glance an ontology is similar to a taxonomy because it has a hierarchy of terms. However, in an ontology, those terms are related to one another and the relationships have specific names and definitions.
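One common way to picture named relationships is as subject, relationship, object triples. The sketch below assumes that representation; the facts, the relationship names, and the `related` helper are invented for illustration.

```python
# Hypothetical ontology fragment stored as (subject, relationship, object) triples.
TRIPLES = [
    ("lion", "is_a", "feline"),
    ("lion", "eats", "zebra"),
    ("lion", "lives_in", "savanna"),
    ("zebra", "lives_in", "savanna"),
]


def related(subject: str, relationship: str):
    """Return every object linked to the subject by the named relationship."""
    return [o for s, r, o in TRIPLES if s == subject and r == relationship]


print(related("lion", "eats"))
```

A plain taxonomy would hold only the `is_a` links; the other named relationships are what make this an ontology-style structure.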
Using software to label each word with its part of speech, such as noun, verb, adjective, or adverb.
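In its most basic form, tagging can be sketched as a dictionary lookup. The `LEXICON` and `tag` function below are hypothetical; real taggers also use the surrounding words, since many words can take more than one part of speech.

```python
# Hypothetical mini-lexicon mapping words to a single part of speech.
LEXICON = {
    "the": "determiner",
    "quick": "adjective",
    "dog": "noun",
    "barks": "verb",
    "loudly": "adverb",
}


def tag(sentence: str):
    """Pair each whitespace-separated word with its tag, or 'unknown'."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in sentence.split()]


print(tag("The quick dog barks loudly"))
```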
A branch of data analytics concerned with predicting future conditions from current or past conditions.
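As one small example of predicting a future value from past values, the sketch below fits a straight line by ordinary least squares and extends it one step. The monthly figures and the `fit_line` and `predict` helpers are assumptions for illustration; linear regression is only one of many predictive techniques.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict(xs, ys, next_x):
    """Predict y at next_x from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return slope * next_x + intercept


# Hypothetical past observations: month number -> units sold.
months = [1, 2, 3, 4]
units = [3, 5, 7, 9]
print(predict(months, units, 5))
```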
Relationship extraction identifies how things are related. Two examples would be “Bakers make cakes” and “John lives in London.”
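The two example sentences can be turned into subject, relationship, object triples with a small pattern. This is only a sketch: the `RELATION_PATTERN` and `extract_relation` names are hypothetical, and the pattern covers just these two relationship phrases.

```python
import re

# Matches "<subject> <relationship> <object>" for two hand-picked relationships.
RELATION_PATTERN = re.compile(r"^(\w+) (makes|make|lives in|live in) (\w+)\.?$")


def extract_relation(sentence: str):
    """Return (subject, relationship, object), or None if nothing matches."""
    match = RELATION_PATTERN.match(sentence.strip())
    return match.groups() if match else None


print(extract_relation("Bakers make cakes"))
print(extract_relation("John lives in London"))
```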
A structured framework or plan. “Classification schema” is therefore a generic term for any set of terms used to classify content.
Text analytics used to determine the mood of text as positive, negative or neutral.
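A minimal lexicon-based sketch of this idea counts positive and negative words and compares the totals. The word lists and the `sentiment` function are hypothetical; practical systems also handle negation, intensity, and sarcasm, which this does not.

```python
import re

# Hypothetical sentiment word lists.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}


def sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("I love this great phone"))
```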
A synonym ring adds a list of alternate ways of referring to a term. For example, the term Affordable Care Act will also have a listing for Obamacare, since the law is known by both names.
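A synonym ring can be sketched as a set of equivalent terms, with a search expanded to every member of the ring. The ring contents (including the abbreviation) and the `expand` helper are illustrative assumptions.

```python
# Hypothetical synonym rings; each set holds terms treated as equivalent.
SYNONYM_RINGS = [
    {"affordable care act", "obamacare", "aca"},
]


def expand(term: str) -> set:
    """Return every term in the same ring, or just the term if it has no ring."""
    key = term.lower()
    for ring in SYNONYM_RINGS:
        if key in ring:
            return set(ring)
    return {key}


print(sorted(expand("Obamacare")))
```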
Taxonomy is a way of describing a particular area of knowledge in a simple hierarchy. Most people were probably first exposed to taxonomies as schoolchildren, when they learned the Dewey Decimal System for organizing library books or learned how animals and plants are sorted in the scientific world: Kingdom, Phylum, Class, Order, Family, Genus, Species. In that scheme, a lion is a feline (Family), which is a mammal (Class), which is an animal (Kingdom). In such a hierarchical classification, each level helps define the one above it, as well as the one below it.
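The lion example can be sketched as a child-to-parent map that is walked upward. The `PARENT` table and `ancestors` function are hypothetical, and the intermediate ranks (Phylum, Order) are left out to keep the sketch short.

```python
# Hypothetical taxonomy fragment: each term points to its broader term.
PARENT = {
    "lion": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def ancestors(term: str):
    """Walk up the hierarchy and return the broader terms, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


print(ancestors("lion"))
```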
Seth Grimes makes a distinction between text analytics and text analysis, saying, “So we have text analytics on the one hand — text as data, fueling quantitative methods communicate business-required insights — and text analysis on the other, techniques that characterize and describe a text itself.”
Extracts quality information from unstructured data to distill meaning from the text. It goes beyond counting words to extract meaning and give some context to that meaning. It can often tell you why something has occurred, whereas data analytics can tell you what has occurred.
Text mining and text analytics are sometimes used interchangeably, but some experts see an important distinction. Text mining looks at text as words and extracts the number of words within documents and the number of kinds of words within kinds of documents.
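The word-counting view of text mining can be sketched with the standard library's `Counter`. The sample sentence and the `word_counts` helper are assumptions for illustration.

```python
import re
from collections import Counter


def word_counts(document: str) -> Counter:
    """Count how often each lowercase alphabetic word appears in a document."""
    return Counter(re.findall(r"[a-z]+", document.lower()))


counts = word_counts("The cat saw the other cat.")
print(counts.most_common(2))   # the most frequent words
print(len(counts))             # the number of distinct words
```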
Word sense disambiguation is the ability to determine which meaning of a word with multiple definitions is intended, usually from context. An example is the word “well,” which can name a source of water or describe how something is done.
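One simple way to use context is to pick the sense whose typical neighboring words overlap most with the sentence. The sketch below assumes that approach; the `SENSES` clue words and the `disambiguate_well` function are hypothetical, and ties fall to the first sense listed.

```python
import re

# Hypothetical clue words that tend to appear near each sense of "well".
SENSES = {
    "water source": {"water", "dig", "drew", "bucket", "deep"},
    "in a good manner": {"plays", "sings", "did", "works", "very"},
}


def disambiguate_well(sentence: str) -> str:
    """Pick the sense of 'well' sharing the most clue words with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))


print(disambiguate_well("She drew water from the well"))
print(disambiguate_well("He plays the piano very well"))
```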
Word stemming produces the root form of a word by applying heuristics; for example, “flying” becomes “fly.” A search that uses a word stemmer will therefore match both forms in text.
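A crude suffix-stripping sketch shows the heuristic flavor of stemming. The suffix list, the minimum stem length, and the `stem` and `find_forms` helpers are assumptions for illustration; established stemmers such as the Porter algorithm apply many more rules.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_forms(target: str, text: str):
    """Return the words in text that share a stem with the target word."""
    return [w for w in text.split() if stem(w) == stem(target)]


print(stem("flying"))
print(find_forms("fly", "birds fly while flying insects flee"))
```

Heuristics like these can produce stems that are not real words, which is one reason lemmatization is sometimes preferred.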