Techniques and language in text analytics

This was first published in Business Information Review, 2014, Vol. 3 (i) 50-53

The traditional definition of text analytics is the process of deriving high-quality information from text by turning text into data through the use of machines and software.

At its core, text analytics is breaking a stream of text into meaningful words or phrases, but meaningful is a relative term – how does one decide or discover just what information is important or meaningful? Some say that one way is to use text mining, which counts and groups words in various ways and looks at the pattern of word use within documents. This can tell you something about the document, but none of that has anything to do with the meaning of those words, and text analytics deals with the meaning of words.

The use of text analytics is particularly difficult when language is so fluid – words may have many different meanings, and there are many ways to describe the same thing. Consumer sites deal with this reality every day when handling customer feedback that comes from surveys and websites. Sometimes, customers don't know how to articulate their views, and they often use different language to describe the same issue. This word use becomes a problem when trying to track issues across channels. Some organizations address this by using a combination of computer help and human investigation – the software gives clues as to where to look, but people still need to do searches to find the information they need.

Oftentimes, connections in text become meaningful only when a person asks the right question and then looks to text plus data to answer it.

Just how does text analytics do that?

The text analytics industry seems to have stalled at this computer-plus-human equation. There is a trade-off between software that is easy to use but limited in how accurate it can become, and software that demands substantial startup and development time but can achieve higher accuracy.

Generally speaking, the software that is easiest to operate uses natural language processing, which, as its name would indicate, tries to understand and interpret human language for computers. Natural language processing is based on machine learning, in which the system works to achieve an end through an adaptive process. It looks at how well something is defined using a pre-set measure, then it takes the data from a trusted, verified source and uses it to "train" itself to work better. Machine learning can be responsive to exactly the parameters that designers intended, but it is responsive only to that which the designers anticipated. In other words, it works until something unanticipated changes, and then the machine needs new input to start learning again. When it fails, it is not always easy to discover why.


Technology behind the processing

Lemmatization is similar to word stemming but uses a more sophisticated and rigorous procedure that finds a word's dictionary form, or lemma, with greater accuracy.

Named entity extraction looks for and labels things such as people, places, organizations, dates, monetary amounts, and companies in text.

Part-of-speech tagging uses software to assign to each word a label indicating its part of speech, such as noun, verb, adjective, or adverb.

Relationship extraction identifies how things are related. Two examples would be "Bakers make cakes" and "John lives in London."

Word sense disambiguation is the ability to determine the meaning of a word with multiple definitions, often from context; an example is the word "well."

Word stemming produces the semantic root of a word by applying heuristics, for example flying becomes fly. A word stemmer will look for both forms in text.
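Two of the simpler operations above, suffix-stripping stemming and part-of-speech tagging, can be sketched in a few lines of Python. The suffix rules and the tiny part-of-speech lexicon here are invented for illustration; production systems use trained models and full dictionaries rather than hand-written lookups.

```python
# Toy illustrations of two text-analytics building blocks.
# The suffix rules and the mini lexicon are invented for this
# sketch; real systems use trained models and full dictionaries.

def stem(word):
    """Heuristic suffix stripping: 'flying' -> 'fly', 'cakes' -> 'cake'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-made part-of-speech lexicon (illustrative only).
POS_LEXICON = {"bakers": "NOUN", "make": "VERB", "cakes": "NOUN"}

def tag(tokens):
    """Label each token with a part of speech from the lexicon."""
    return [(t, POS_LEXICON.get(t.lower(), "UNK")) for t in tokens]

print(stem("flying"))                    # fly
print(tag(["Bakers", "make", "cakes"]))
```

A stemmer this crude will make mistakes, which is exactly why lemmatization, with its more rigorous dictionary-based procedure, exists.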


On the other side are rule-based systems, which tend to achieve a higher degree of accuracy under certain conditions but require more work from users, who have to write rules and often develop a taxonomy or ontology. Once built, though, they tend to be simple to use. In rule-based systems, people define a world, or domain, and once it is defined, the system can identify concepts in text that are meaningful to the domain. Rule-based systems are often used to auto-classify content, and the output of that classification is metadata describing the core concepts in the text, which can be stored for later use. Broad domains can be difficult to define, and many academics say that extracting true semantic knowledge with rule-based systems requires limiting the domain.

Rule-based systems also have a problem accounting for concepts that the user has not anticipated, but it is easier to understand why a rule-base failed, and it is easier to adjust the rules than to retrain a natural language processing system on new documents.
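A rule-based auto-classifier of the kind described above can be sketched very simply: each rule maps keywords to a domain concept, and matching rules become metadata tags on the text. The domain, the rules, and the sample document below are all invented for illustration; real rule-bases are far larger and are usually built against a curated taxonomy.

```python
# Minimal rule-based auto-classifier: each rule maps keywords to a
# domain concept, and matching rules become metadata tags on the text.
# The domain and rules here are invented for illustration.

RULES = {
    "claims":  ["claim", "adjuster", "injury"],
    "billing": ["invoice", "payment", "refund"],
}

def classify(text):
    """Return the domain concepts whose keywords appear in the text."""
    words = text.lower().split()
    return sorted(
        concept
        for concept, keywords in RULES.items()
        if any(k in words for k in keywords)
    )

doc = "The adjuster reviewed the claim and requested a refund"
print(classify(doc))  # ['billing', 'claims']
```

Because every rule is explicit, it is easy to see which rule fired, or failed to fire, which is precisely the transparency advantage rule-based systems hold over trained models.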

What do you call it?

Systems based on both natural language processing and rule-bases can be used to pull information out of documents of all types – e-mails, texts, Tweets, blog posts – but, increasingly, what you do with that information is defining what you call the process of extracting it.

Some make a distinction between text analytics and text analysis, which is the difference between finding the pertinent information in a piece of text versus using the text to give you answers about other things.

Predictive analytics uses statistics and other mathematical techniques to look at past data to predict future events. Generally, analysts look at many known data sets when trying to predict future events. So how does predictive analytics tie in with text analytics? The modern information-rich world is awash with text. The role of text analytics is to make sense of this text and, at its heart, to produce quantitative data from it. In this way, text analytics produces the data points that are used as inputs for a predictive analytics model.
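As a sketch of how text becomes quantitative input, the snippet below turns short texts into bag-of-words count vectors, the kind of numeric data points a predictive model could consume. This is a minimal illustration; real pipelines add normalization, weighting schemes such as TF-IDF, and much larger vocabularies.

```python
# Convert texts into fixed-length word-count vectors, a minimal
# example of text analytics producing data for downstream models.
from collections import Counter

def bag_of_words(texts):
    """Return a shared vocabulary and one count vector per text."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = [
        [Counter(t.lower().split())[w] for w in vocab]
        for t in texts
    ]
    return vocab, vectors

vocab, vectors = bag_of_words(["good service", "bad service bad food"])
print(vocab)    # ['bad', 'food', 'good', 'service']
print(vectors)  # [[0, 0, 1, 1], [2, 1, 0, 1]]
```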

Among the many inputs into predictive analytics processes are metrics of customer satisfaction derived from text. The development and assessment of these metrics is sentiment analysis, which uses text analytics to determine the mood expressed in text such as social media posts, movie reviews, survey responses, customer e-mails, and so on. Businesses use sentiment analysis for an array of reasons, including understanding what drives consumer behavior, tracking how clients respond to new products, creating marketing schemes, and driving agile product development.

Unfortunately for businesses, sentiment is difficult to determine. Irony and sarcasm can be hard to detect in spoken language, and machines have an even more difficult time of it. IBM and other software vendors say that they are making strides in this area. However, machines still have a difficult time with neutral language. Some language with no sentiment connotations can still be seen as positive by people – something machines would miss. For example, noting that you had just seen a movie could be read as positive by people but scored as neutral by sentiment analysis software. Advances are being made in sentiment analysis, however, and much of the work revolves around looking at the placement of words in a statement instead of simply counting negative and positive words.
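The shift from simply counting positive and negative words to considering word placement can be illustrated with one toy rule: a sentiment word directly preceded by "not" has its polarity flipped. The lexicon and scoring here are invented and tiny; commercial sentiment systems rely on trained models and far richer context.

```python
# Toy lexicon-based sentiment scorer with one word-placement rule:
# a polarity word directly preceded by "not" is flipped.
# The lexicon and scoring are invented for illustration.

LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def sentiment(text):
    """Classify text as positive, negative, or neutral."""
    words = text.lower().split()
    score = 0
    for i, w in enumerate(words):
        polarity = LEXICON.get(w, 0)
        if polarity and i > 0 and words[i - 1] == "not":
            polarity = -polarity  # placement rule: negation flips polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("the movie was not bad"))  # positive
print(sentiment("I just saw a movie"))     # neutral
```

Note how the second example lands on "neutral": the scorer sees no sentiment words at all, illustrating the neutral-language blind spot described above.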

In addition to sentiment analysis, organizations are using text analytics to enhance search and to turn text into data. Finding the critical information in unstructured text and tagging it for later use and recall allows organizations to put that text to work in a variety of applications.


Analytics terms

Data analytics: Looking at raw numerical data to find useful information from the data. The process often includes cleaning and normalizing the data to get it into a digestible format.

Text mining: Text mining and text analytics are sometimes used interchangeably, but experts say there is an important distinction. Text mining treats text as words, counting how often words occur within documents and which kinds of words occur within which kinds of documents.

Text analytics: Extracts quality information from unstructured data to distill meaning from the text. It goes beyond counting words to extract meaning and give some context to that meaning. It often can give you the why of something that has occurred, where data analytics can tell you what has occurred.

Text analysis: Seth Grimes makes a distinction between text analytics and text analysis, saying, "So we have text analytics on the one hand – text as data, fueling quantitative methods that communicate business-required insights – and text analysis on the other, techniques that characterize and describe a text itself."

Sentiment analysis: Text analytics used to determine the mood of text as positive, negative or neutral.

Predictive analytics: A branch of data analytics concerned with predicting future conditions from current or past conditions.

Content analytics: This moves us beyond text alone to look at video, audio, and images.


The real world need for text analytics

Organizations are finding many ways to use text analytics – and discovering the "important" information in the text is often the goal. UNC Health Care is using natural language processing from IBM to sift through unstructured data, such as doctors' notes, registration forms, discharge summaries, and phone call notes, pairing it with structured data and using it to target high-risk patients and design prevention programs for them. The health potential is huge. Accident Fund Insurance Company of America is using text analytics to search claim adjusters' notes to identify health issues that were aggravating workers who had suffered injuries on the job. The United States government is using text analytics to examine some of NASA's unstructured data to find potential problems with flights by looking at historical pilot reports and mechanics' logs and to scan social media for evidence of terrorism and biological threats.

The future of text analytics

It is clear that organizations are using text analytics to help people find insights into and solutions for real world issues. Human analysts simply cannot process the amount of data that we produce today, and machine assistance helps us uncover information we did not know we had. But what of the future of text analytics?

Industry watchers say that we need more text analytics workers to create and program software, to understand how to turn text into data and to open doors for text analytics at businesses and organizations. As text analytics grows, some speculate, we'll see more vendors specializing in industry verticals, such as healthcare, insurance, oil and gas, and manufacturing. Along with that, or perhaps preceding it, we're likely to see more finished products and less technology that users have to manipulate themselves. We'll also see analytics spread to additional types of content, including video, voice, and photos.

In the near term, it seems sure that organizations will continue to use text analytics and data to fuel discovery and growth. To make sense of data and text, it seems increasingly likely that we need to search across data types stored in different places and formats to find the information that is so valuable. This will let us act on it for marketing and business decisions.

In addition, we are seeing an increase in the need to analyze text across data silos and formats so that organizations can find content quickly and discover new connections with data.

It is easy to see what we want text analytics to become; all we have to do is look to the sci-fi genre to find computers that understand what we say and that can converse with us. The state of the art is not there, but as enterprises see the deep benefits of text analytics, it is possible that we'll get closer.
