The Technology Underpinning Codeit Professional - Damien Gouriet, Head of AI

13 June 2018

Over a year ago, we set ourselves a design goal and mission for Codeit Professional. The coding tools available to market researchers for handling open-ended verbatim text seemed to us woefully inadequate for the 21st century! Research also showed that, whilst verbatim comments were known to contain genuine and actionable insight, the cost, slow turnaround and management overhead of coding meant that very little coding of these texts was actually undertaken.

Guiding Principles

We decided to build a radically new coding platform, from scratch, based on the latest technology breakthroughs in Artificial Intelligence, Text Analytics and Machine Learning. In particular, we were excited by the promise of Artificial Intelligence being able to dramatically enhance the quality and speed of coding. We can see the impressive progress being made by these techniques in various areas, including Computer Vision and Natural Language Processing – especially when AI is combined with Machine Learning.

Applying these new technological techniques to a specific function (free text coding) within a specific business area (market research) of course threw up many challenges – many of which were not solvable through a simple Machine Learning “add on”. It also became clear that we needed to factor in the important role human coders play. Our initial work indicated – rather quickly – that currently available techniques were some distance away from automatically coding open-ended responses in the required volumes and with the required quality.

We determined that the best way to tackle this problem was to design a complete Artificial Intelligence system, combining the latest Machine Learning advances with sophisticated, more traditional NLP approaches such as Text Matching. Enabling the coder to interact with the system in Codeit not only makes coding faster but has also proven to make it more consistent.

Three Technology Components

1. Text Matching

In principle, the easiest way to categorize open-ended responses in an automated fashion is to match new texts with known coded verbatims. Such a system is particularly helpful for Brand Coding: the goal is to categorize unknown brands mentioned by the user into a fixed known brand list. Using past examples, it is easy to quickly find the correct brand entered by the user. (Examples are Make & Model of a Car, Name of a Drug Prescribed, Make & Model of a Mobile Phone.)
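As a rough illustration of the idea, text matching can be sketched as a lookup of normalised past verbatims. This is a minimal sketch, not Codeit's implementation: the normalisation rules, the example brands and the names `normalise`, `CODED_EXAMPLES` and `text_match` are all assumptions made for the example.

```python
import re

def normalise(text: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

# Hypothetical past coded verbatims (raw text -> brand code), invented for illustration.
CODED_EXAMPLES = {
    "Toyota Corolla": "TOYOTA_COROLLA",
    "toyota  corolla!": "TOYOTA_COROLLA",
    "Ford Fiesta": "FORD_FIESTA",
}

# Index the examples by their normalised form so lookups are a dictionary hit.
LOOKUP = {normalise(text): code for text, code in CODED_EXAMPLES.items()}

def text_match(verbatim: str):
    """Return the code of a past verbatim that is identical after normalisation, else None."""
    return LOOKUP.get(normalise(verbatim))

print(text_match("TOYOTA corolla"))  # differs only in case, so it matches a past example
print(text_match("Toyta Corola"))    # misspelt, so no past example matches
```

The second call shows why exact matching alone struggles: a misspelt brand is only matched if that exact misspelling has already been coded.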

However, the limitations of this technique became apparent very quickly! Firstly, accounting for all the different variations of a brand (misspellings, typos...) required more training examples than would be feasible for most projects. Secondly, we identified a lack of consistency in the training data, both between coders and over time. And finally, outside of Brand Coding, the number of matches is (too) often quite small, so any benefits of automated coding end up being quite limited.

2. Regular Expressions

The next component we utilise is a rule-based system based on regular expressions. A regular expression allows the user to define a pattern of interest in a text. For example, the regular expression “regex|regular” will match if the text contains “regex” or “regular”; the vertical bar symbol “|” means “or” in the language of regex. The user can specify regular expressions directly in the Codeit interface, and any new verbatim matching a regular expression will then automatically be coded appropriately. Whilst this solves the problems of text matching noted above, the setup and maintenance can be complex (and potentially costly), and the resulting regular expressions can be difficult to understand. However, it is certainly a more powerful technique than text matching alone, and it improves coder productivity significantly.
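A rule layer of this kind might look roughly like the following, using Python's standard `re` module. The first pattern is the “regex|regular” example from the text; the second pattern, the code names and the function `regex_code` are hypothetical and only there to make the sketch runnable.

```python
import re

# Hypothetical rules: (pattern, code). Only "regex|regular" comes from the article.
RULES = [
    (re.compile(r"regex|regular", re.IGNORECASE), "CODE_REGEX"),
    (re.compile(r"\b(cheap|low[- ]cost|inexpensive)\b", re.IGNORECASE), "LOW_COST"),
]

def regex_code(verbatim: str) -> list:
    """Return the codes of every rule whose pattern occurs anywhere in the verbatim."""
    return [code for pattern, code in RULES if pattern.search(verbatim)]

print(regex_code("I use regular expressions"))  # matched by the first rule
print(regex_code("A very low-cost option"))     # matched by the second rule
print(regex_code("irregular opening hours"))    # also matched by the first rule
```

The last call hints at the maintenance cost mentioned above: an unanchored pattern such as “regex|regular” also matches inside “irregular”, so rules often need word boundaries (`\b`) and further refinement as new verbatims arrive.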

3. Machine Learning

To overcome the limits of the first two layers, the Codeit Artificial Intelligence uses a Machine Learning layer. The algorithm is trained on past examples and uses that learning to automatically code new open-ended responses. The technology never requires the user to intervene to set up the algorithm, or to maintain it. It is a much more effective technique than the simple Text Match process, but it isn’t fully customizable by a user in the way that the Regular Expression layer is. Crucially, the Machine Learning doesn’t use simple text or word matching but rather calculates how semantically “close” two verbatims are to each other. Based on that calculation, the system can infer codes from past training examples. For example, if “the service was great” was categorized as Code1, a verbatim “awesome service” will be calculated as semantically close, and therefore automatically coded by the Machine Learning layer as Code1. The better the quality and quantity of past coded examples, the more accurately the system performs.
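The notion of semantic closeness can be illustrated with a toy nearest-example classifier over word vectors. This is only a sketch of the general idea and says nothing about the model Codeit actually uses: the three-dimensional vectors below are invented by hand (a real system would learn them from data), and `WORD_VECTORS`, `TRAINING`, `embed`, `cosine`, `ml_code` and the 0.8 threshold are all assumptions.

```python
import math
import re

# Hand-made toy vectors, invented for illustration; real systems learn these from data.
WORD_VECTORS = {
    "service": (1.0, 0.0, 0.0),
    "great": (0.0, 1.0, 0.0),
    "awesome": (0.1, 0.9, 0.0),
    "price": (0.0, 0.0, 1.0),
    "high": (0.0, -0.2, 0.6),
}

# Hypothetical past coded examples.
TRAINING = [
    ("the service was great", "Code1"),
    ("the price is too high", "Code2"),
]

def embed(text):
    """Sum the vectors of the known words in the text; unknown words are ignored."""
    vec = [0.0, 0.0, 0.0]
    for word in re.findall(r"[a-z]+", text.lower()):
        for i, value in enumerate(WORD_VECTORS.get(word, (0.0, 0.0, 0.0))):
            vec[i] += value
    return vec

def cosine(a, b):
    """Cosine similarity of two vectors, or 0.0 if either one is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm

def ml_code(verbatim, threshold=0.8):
    """Return the code of the most similar training example, or None below the threshold."""
    query = embed(verbatim)
    best_code, best_score = None, threshold
    for text, code in TRAINING:
        score = cosine(query, embed(text))
        if score >= best_score:
            best_code, best_score = code, score
    return best_code

print(ml_code("awesome service"))  # closest to "the service was great"
```

Here “awesome” and “great” never match as text, but their vectors point in nearly the same direction, which is what lets the verbatim inherit Code1.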

Completing the Picture

By combining all these layers in a single process flow, Codeit Professional provides great flexibility to massively increase coding throughput, with excellent consistency. Typically, in real-life scenarios, we see a job that might have taken 10 hours of human coding completed in only 4 hours using Codeit Professional (a 60% reduction), crucially with no loss of quality against human coding. In ongoing and tracking studies, we see even larger performance gains.

In a sense we have the perfect virtuous circle here – as the system is given more and more examples at the different layers, those layers will become better and better at automatically coding future material. 
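One way to picture the combined flow and its feedback loop is a cascade in which each layer either returns a code or passes the verbatim on, with anything left over going to a human coder whose decision becomes a new example. The control flow, the class `CodingPipeline`, its method names and the stand-in Machine Learning function are assumptions made for this sketch, not a description of Codeit's internals.

```python
import re

class CodingPipeline:
    """Toy cascade: text match, then regex rules, then an ML layer, then a human."""

    def __init__(self, regex_rules, ml_layer):
        self.exact = {}  # normalised verbatim -> code, grows as humans code
        self.regex_rules = [(re.compile(p, re.IGNORECASE), c) for p, c in regex_rules]
        self.ml_layer = ml_layer  # any callable returning a code or None
        self.review_queue = []    # verbatims waiting for a human coder

    @staticmethod
    def _norm(text):
        return " ".join(text.lower().split())

    def code(self, verbatim):
        """Return (code, layer_name); code is None when a human is needed."""
        key = self._norm(verbatim)
        if key in self.exact:
            return self.exact[key], "text_match"
        for pattern, code in self.regex_rules:
            if pattern.search(verbatim):
                return code, "regex"
        code = self.ml_layer(verbatim)
        if code is not None:
            return code, "machine_learning"
        self.review_queue.append(verbatim)
        return None, "human"

    def add_human_code(self, verbatim, code):
        """Feed a human decision back in as a new text-match example."""
        self.exact[self._norm(verbatim)] = code

# Stand-in for a trained model, purely for demonstration.
def fake_ml(verbatim):
    return "SERVICE" if "service" in verbatim.lower() else None

pipeline = CodingPipeline([(r"\bcheap\b|low[- ]cost", "LOW_COST")], fake_ml)
print(pipeline.code("It's less expensive"))   # no layer knows it yet
pipeline.add_human_code("It's less expensive", "LOW_COST")
print(pipeline.code("it's  less expensive"))  # now handled automatically
```

In this sketch the human decision only feeds the text-match layer; in a fuller system the same examples would presumably also be used to retrain the Machine Learning layer.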

The Machine Learning layer is particularly important and takes the process well beyond the constraints of relatively straightforward Natural Language Processing (NLP). This is the layer where the system can perform semantic coding, which NLP simply cannot do. By semantic coding, we mean capturing the similarity of meaning or ideas expressed in language, even when the texts do not contain the same words. For example, imagine a code “Low Cost”. If someone actually says “Low Cost”, this is clearly the appropriate code. But if they say “It’s less expensive”, the code “Low Cost” is still the right choice, even though neither word appears. It is the Machine Learning/Artificial Intelligence layer that performs this function, by learning from examples provided by human coders.

Interestingly, the technology behind machine learning is improving at an astonishing pace and promises even greater performance gains in the medium to long term.
