Open-source NLP tools for information extraction from free-form text

By: Shaun Mathew, Michael Guerzhoy, & Ujash Joshi

A substantial portion of the information available about patients in Electronic Health Records (EHR) is stored as free-form dictated notes. Even the diagnosis will often only be found in the patient’s progress notes. Natural Language Processing (NLP) can be used to extract patient information such as diagnoses, smoking status, or prescribed medication.

At the Li Ka Shing Centre for Healthcare Analytics, Research and Training (LKS-CHART) we are developing our own NLP tool in order to streamline the process of information extraction from clinical notes. In this post, we review two currently available open-source tools for information extraction from free-form medical text notes – Canary and NLPReViz.


Canary is an open-source natural language processing tool developed at Brigham and Women’s Hospital, a Harvard Medical School teaching affiliate. Canary is a Graphical User Interface (GUI) for developing and running rule-sets for information extraction. If a fragment of the text being processed matches one of the rules, it is extracted, and can be processed further. Rules are developed by classifying words within documents using customizable ontologies and matching phrases using regular expressions.

The Canary Pipeline information extraction pipeline consists of the following steps:

  1. Data preprocessing
  2. Word class definition
  3. Phrase structure definition
  4. Flag creation
  5. Output criteria creation
  6. Post-processing

1. Data Preprocessing

Canary takes in plain text files as input. The format of the files is very simple: documents need to be separated by a pre-defined separator, and numbered using a unique id. It is straightforward to convert your input data into Canary’s format.

Canary provides a convenient interface for preprocessing input texts. Texts can be normalized by, for example, replacing every instance of “can’t” with “cannot.” This simplifies rule creation later on. Conveniently, a list of pre-made text replacement rules comes with Canary, which of course can be extended.

2. Word class creation

Word classes are core to Canary’s functionality. Word classes are defined using lists of words or regular expressions. When performing text extraction, Canary will try to resolve each word in a sentence to a word class.

Word classes such as DIAGNOSIS will be task specific. (For example, if we are interested in knowing whether the patient has Deep Vein Thrombosis (DVT), we’ll try to match on “Deep Vein Thrombosis” and “DVT”). Canary comes with some pre-made word classes that can be re-used for any task (the NEGATION class is an example).


Canary screenshot 2.png

3. Phrase structure creation

After defining word classes, we can define phrase structures. Phrase structures are defined recursively using other phrase structures, or using word classes. For example, if we can have a level-1 phrase structure DIAGNOSIS which simply matches a word class, we can then define a higher-level phrase, “NEGATEDDIAGNOSIS,” by matching on the pattern NEGATION DIAGNOSIS

4. Flag creation

A detected phrase can be “flagged” in order to extract information from the detected phrase (for example, if the phrase contains a blood pressure measurement), or in order to classify the document.

5. Output generation

The user can specify which detected phrases the tool outputs.

6. Output generation

The output that Canary generates generally needs to be post-processed further using a custom script.

Canary Pros & Cons


  • Canary’s system of recursively-defined phrases, combined with regular expressions is powerful enough to cover a lot of use-cases
  • Canary comes with useful pre-made definitions of word classes and phrase classes


  • Canary is not really point-and-click: you will need to preprocess your data to conform to Canary’s input format, you will need to post-process the output to get what you need, and there is a learning curve as well
  • There will be some snags when defining your rule-sets: if multiple phrase classes can match to the same sequence of words, only the first phrase class will be matched. This is particularly problematic when we’d like to potentially match both negated and non-negated phrases. Matching phrases that don’t have a pre-defined length is also tricky.
  • Errors when creating the project pipeline are not easily visible. Canary also does not prevent the user from executing a project with errors, which causes the program to crash.

Overall: Clinicians and researchers with some computer programming savvy will find Canary useful for a wide range of problems, though some coding will need to be done before and after using the tool, and the tool won’t solve every one of your NLP needs.


NLPReViz is a GUI tool aimed at clinicians and clinical researchers. The tool is aimed at building Support Vector Machine (SVM)-based document classifiers. The user can interactively label documents, review the current system output and retrain the SVM model. The features the SVM model uses (word presence/absence) can be defined interactively using the GUI.

NLPReViz tries to address the problem that plagues anyone who has tried to use NLP for medical texts: the training set always seems too small. The user can increase the size of the training set as they go.

For a clinician who is able to abstract medical documents on the fly, the concept of NLPReViz is very appealing.

NLPReViz Pros & Cons


  • Interactive labeling is a very appealing concept (although clinical expertise is necessary to take full advantage of it)
  • There is no need to manually create rule-sets: the systems trains an SVM instead


  • NLPReViz is not really a mature product: installing it can be quite tricky, and using it on new datasets requires tweaking the Python code
  • SVM won’t be able to solve every one of your problems: heuristic rules will always be necessary for training sets that are not very large.

Overall: the concept behind NLPReViz is exciting, but the software is not quite mature enough to be easily adopted.


View more articles & publications…

Comments are closed.

Create a website or blog at

Up ↑

%d bloggers like this: