Transportation organization improves search results through text analytics

We conducted a pilot project to show that we could make content more findable for a transportation client’s employees, contractors, and the public through the use of text analytics.

The client faced several challenges to effective search, the first of which was diversity of content and language – project documents vary greatly in word use, form, and format. That is to say, documents from different groups within the organization use different words to refer to the same thing; the documents look different and may contain a variety of fields across the same forms from different groups; and content has many different formats including Word, PDF as images and PDF as text.

The client encourages users to store content in SharePoint, but many struggle to find documents they know they have added to their libraries. As a result, many users also keep content in mailboxes, on shared drives and in paper files. This wreaks havoc on document management policies and has long-term effects on the client’s ability to conduct business. By increasing findability of content, we hoped to increase confidence in SharePoint’s ability to store and return content and, consequently, decrease dependencies on other content storage practices. A longer-term goal is to provide a resource for searching across content platforms so that users can conduct one search and find all content pertaining to it.

To achieve these ends, we needed to increase SharePoint’s native abilities, provide a common language for client content, and overcome some of the format challenges.

Put taxonomies and synonyms to work with structured data

We began by creating a common language for the client using the industry-standard practice of a hierarchical taxonomy with synonyms for terms. For example, the client had three unique identifiers for each project. We picked one of those identifiers as primary and added the other two as synonyms. Now, when users look for one of these identifiers, they will find all documents related to a project, even if only one of the identifiers is in the document.
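The identifier example can be illustrated with a minimal sketch. It assumes a simple in-memory synonym table; the identifiers, file names, and function names are hypothetical and do not come from the client's taxonomy software.

```python
# Hypothetical primary identifier mapped to its two synonym identifiers.
SYNONYMS = {
    "PRJ-1001": ["C-55-A", "FIN-2020-17"],
}

# Hypothetical documents, each mentioning only one identifier.
DOCUMENTS = {
    "bid.pdf": "Bid tabulation for contract C-55-A",
    "invoice.docx": "Invoice charged to FIN-2020-17",
    "memo.docx": "Unrelated memo about PRJ-2002",
}


def build_lookup(synonyms):
    """Map every identifier (primary or synonym) to its primary term."""
    lookup = {}
    for primary, alternates in synonyms.items():
        lookup[primary.lower()] = primary
        for alternate in alternates:
            lookup[alternate.lower()] = primary
    return lookup


def expand_query(term, synonyms):
    """Return every identifier that should match a search for `term`."""
    primary = build_lookup(synonyms).get(term.lower())
    if primary is None:
        return [term]  # unknown term: search for it as typed
    return [primary] + synonyms[primary]


def search(term, documents, synonyms):
    """Return names of documents containing the term or any of its synonyms."""
    variants = [v.lower() for v in expand_query(term, synonyms)]
    return sorted(
        name for name, text in documents.items()
        if any(v in text.lower() for v in variants)
    )


print(search("PRJ-1001", DOCUMENTS, SYNONYMS))
```

Searching for any one of the three identifiers returns the same two documents, even though each document contains only one of them.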

But it quickly became clear that users sometimes need to know information that is not present in a document. To address this, we added known facts to documents, giving users information that was not in the document when it was created and stored. For example, when a user searches for a project, they also see who worked on it, the dollar amount awarded for the project, the type of project and where it was – even if none of that information is in the document. This expanded the classification model to a true ontology, giving users a 360-degree view of each project and allowing them to find connected information easily.
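As a rough sketch of adding known facts to a document, the example below attaches project facts as metadata keyed by the project identifier. The fact table, field names, and values are assumptions made for illustration, not the client's actual data model.

```python
from dataclasses import dataclass, field

# Hypothetical project facts, standing in for records held outside the documents.
PROJECT_FACTS = {
    "PRJ-1001": {
        "project_type": "bridge repair",
        "location": "District 4",
        "award_amount": 1250000,
        "staff": ["A. Rivera", "J. Chen"],
    },
}


@dataclass
class Document:
    name: str
    text: str
    metadata: dict = field(default_factory=dict)


def enrich(doc, project_id, facts):
    """Attach known project facts as metadata without changing the text."""
    known = facts.get(project_id, {})
    doc.metadata = {"project_id": project_id, **known}
    return doc


doc = enrich(
    Document("inspection.pdf", "Deck inspection notes."),
    "PRJ-1001",
    PROJECT_FACTS,
)
print(doc.metadata["location"])
```

The location is now searchable through the document's metadata even though the document text never mentions it.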

In addition, we conducted human analysis of content to discover trends in concepts around work issues. The client wanted to see if we could identify problems in certain types of reports as an early warning system of problems that extend contracts and cost time and money. We discovered those concepts and mapped them in the ontology.

Human parsing of content is key

We began by assessing both the client’s information management needs and how text analytics could answer those.

To do so, we conducted in-person interviews with regular producers and users of content and sat down with IT to make sure we understood how content is handled in the client’s environment and how our work might help find content across content stores that use a variety of software technologies.

We mined 10,000 documents stored in SharePoint for terms commonly used by the client’s groups and divided those terms into categories such as equipment, materials, contractors, and suppliers. We also mined databases for information so that we could supplement discovered terms with known facts.

We then combed thousands of documents for how each term was related to others and to discover synonyms for terms. Finally, we performed manual text analytics to discover related concepts in documents and mapped those concepts to one another.

We loaded all of this information into taxonomy/metadata management software, mapping synonyms, adding known information to project information, and creating relationships among terms as we went along. We named those relationships so that users can see at a glance how terms and concepts work together – allowing them to search more quickly and driving discovery of unknown information.
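One common way to represent named relationships is as subject, relationship, object triples. The sketch below assumes that representation; the terms and relationship names are hypothetical, and the taxonomy software used in the project may store them differently.

```python
from collections import defaultdict

# Hypothetical triples: (subject, relationship name, object).
TRIPLES = [
    ("PRJ-1001", "awarded to", "Acme Paving"),
    ("PRJ-1001", "uses material", "asphalt"),
    ("Acme Paving", "supplied by", "Quarry Co."),
]


def build_graph(triples):
    """Index the triples by subject so related terms can be listed quickly."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def describe(term, graph):
    """Return readable statements showing how `term` relates to other terms."""
    return [f"{term} {relation} {obj}" for relation, obj in graph.get(term, [])]


graph = build_graph(TRIPLES)
for statement in describe("PRJ-1001", graph):
    print(statement)
```

Because each relationship carries a name, a user browsing a project can see at a glance which contractor and material it is connected to, and why.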

We use all of this information to analyze each document and add metadata to SharePoint, enhancing its term store and metadata management capabilities.

Generated metadata create facets for search

We use metadata to increase the accuracy of search results beyond a simple keyword search. In addition, users can narrow search results through facets, which help users find exactly the information they need.

Users can begin a search in two ways: with a free text search familiar to all of us who search the Internet, or with an ontology-driven search. Users simply begin typing and the system will automatically suggest any terms that match in the ontology. The second option allows users to search for any document that is about a term in the ontology instead of one that only mentions a term. For example, a document might mention a contractor’s physical address, but the project the contractor is working on is in another location. A simple text search on the address’s location would return this document because it mentions that location; an ontology-driven search would return only documents about projects in that location.
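The difference between a document that mentions a term and one that is about it can be sketched as follows. This assumes each document carries ontology tags in an "about" field, alongside its raw text; the documents, place names, and field names are hypothetical.

```python
# Hypothetical documents: `text` is raw content, `about` holds ontology tags.
DOCS = [
    {
        "name": "contract.pdf",
        "text": "Contractor office: 12 Main St, Springfield. Work site: Route 9.",
        "about": {"location": "Riverton"},
    },
    {
        "name": "survey.pdf",
        "text": "Survey of the Springfield interchange.",
        "about": {"location": "Springfield"},
    },
]


def text_search(term, docs):
    """Return documents whose text mentions the term anywhere."""
    return [d["name"] for d in docs if term.lower() in d["text"].lower()]


def ontology_search(facet, value, docs):
    """Return documents tagged as being about the given ontology value."""
    return [d["name"] for d in docs if d["about"].get(facet) == value]


print(text_search("Springfield", DOCS))
print(ontology_search("location", "Springfield", DOCS))
```

The text search returns both documents, including the contract that only lists a Springfield office address. The ontology search returns only the survey, which is tagged as being about Springfield.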

Users may also narrow search results by facets, which we have defined in the ontology. Many retail sites use this function, so it may be familiar to users of Amazon.com, Netflix and so on. These sites let you search for an item, then narrow the results. So you may search for horror movies, then narrow your results by star ratings.

The same principle applies here. Users can start by searching for all projects within a group, then narrow results by projects that involve a particular road or use a specific material. The more facets a user selects, the narrower the results.
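Facet narrowing can be sketched as successive filters over metadata fields. The project records and facet names below are invented for illustration.

```python
from collections import Counter

# Hypothetical project records with facet values stored as metadata.
PROJECTS = [
    {"id": "PRJ-1001", "group": "Highways", "road": "Route 9", "material": "asphalt"},
    {"id": "PRJ-1002", "group": "Highways", "road": "Route 9", "material": "concrete"},
    {"id": "PRJ-1003", "group": "Highways", "road": "Route 12", "material": "asphalt"},
    {"id": "PRJ-2001", "group": "Bridges", "road": "Route 9", "material": "steel"},
]


def narrow(results, **facets):
    """Keep only the results that match every selected facet value."""
    return [r for r in results if all(r.get(k) == v for k, v in facets.items())]


def facet_counts(results, facet):
    """Count how many results fall under each value of a facet."""
    return Counter(r[facet] for r in results)


highways = narrow(PROJECTS, group="Highways")
on_route_9 = narrow(highways, road="Route 9")
asphalt_only = narrow(on_route_9, material="asphalt")

print(len(highways), len(on_route_9), len(asphalt_only))
print(facet_counts(highways, "material"))
```

Each added facet shrinks the result set, and the counts show the user how many results remain under each facet value before they select it.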

Text analytics improves search results

By the end of our project, we were able to demonstrate distinct improvements in findability of documents. By drawing on related information from outside the documents, the combination of structured (i.e., linked through the project identifier) and unstructured information allows the user to conduct rule-based searches that would not be possible in most search environments.

In addition, the use of synonyms in the rule-based search provides greater precision and recall in a number of searches. For example, a rule-based search containing only one project identifier is able to find documents that contain any of that project’s identifiers.

The previous search was adequate for searches of terms that do not have many synonyms or different contexts. However, terms that apply to multiple contexts are better suited to a rule-based search. For example, a search for a specific road may return results using the number in a different context; a vanilla search for projects in a specific district may find contractor addresses in a city with the same name. The rule-based search is able to provide some structure to these searches to improve result precision.

The rule-based search interface can prevent misspellings by suggesting search terms from the ontology, resulting in greater search recall and precision. Similarly, the rule-based search interface can suggest terms to further filter within facets, allowing a user to add specificity to a search through name recognition.
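A minimal sketch of term suggestion, assuming simple case-insensitive prefix matching against a list of ontology terms. The term list and the matching strategy are assumptions; the actual interface may rank or match terms differently.

```python
# Hypothetical ontology terms available for suggestion.
ONTOLOGY_TERMS = ["asphalt", "aggregate", "Acme Paving", "Route 9", "Route 12"]


def suggest(prefix, terms, limit=5):
    """Return ontology terms that start with what the user has typed so far."""
    typed = prefix.casefold()
    if not typed:
        return []
    matches = [t for t in terms if t.casefold().startswith(typed)]
    return sorted(matches, key=str.casefold)[:limit]


print(suggest("a", ONTOLOGY_TERMS))
print(suggest("rou", ONTOLOGY_TERMS))
```

Because users pick from terms that already exist in the ontology, a misspelled or unknown term is less likely to reach the search engine.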