By: Greg Arbour
November 23rd, 2018
Introduction
A large amount of data available to CHART data scientists comes in the form of raw text. Common examples include: admission notes, post-operative notes, nursing notes, and discharge notes. These notes offer a lot of information that isn’t captured in structured Electronic Health Records (EHR) data. Making use of these notes involves plenty of preproccessing. For example, these notes are commonly typed into a text box and will frequently contain spelling mistakes. To take full advantage of the information in the notes correcting misspelled words is an important preprocessing step. The following blog will demonstrate how to correct misspelled words using the R programming language. We will explore the popular hunspell package, and show how to add custom dictionaries to deal with medical data.
The code for the blog post can be found in the following github repository.
The Hunspell Package
The hunspell package is an extension of R to work with Hunspell, the spell checker used by Libre office and Rstudio among others.
Spell checkers work by:
- Identifying misspelled words (i.e. words that aren’t contained in the dictionary being used)
- Suggesting replacement words
The code below shows the basic usage of the hunspell package
A problem arises when trying to correct misspelled words that aren’t contained in the dictionary. For instance, medical terms often do not appear in the default hunspell dictionary.
Medications like “Olanzapine” and acronyms like “COPD” are flagged as incorrect. As such, it is impossible to identify and replace misspellings of these words; e.g. “Olazapine”, “Olanzapinee”.
The solution is to install a medical dictionary. This doesn’t overwrite the current dictionary. It merely adds to it. There are a number of medical dictionaries available, though the version i have found that works easily with hunspell in R is:
https://github.com/glutanimate/hunspell-en-med-glut-workaround
Clone the github repository or download and extract the folder. Then set the DICPATH
argument using Sys.setenv() to the location that you have set the dictionary. I have placed the dictionary files in the github repository for this blog.
You can verify that the folder was successfully added by listing all the folders that hunspell will search.
Use the
dictionary()
function to install dictionaries found in the DICPATH. Here we install the dictionary titled ‘en_US’ (from the downloaded github repo).
Now we’ll rerun the spell check on our words vector.
It now recognizes the medical terms and is able to suggest them as corrections to misspelled words.
Adding custom words to a dictionary
In cases where there are still terms missing from the dictionary, you can manually add them to the en_US.dic file. Ipen up the file in Word (for Windows, ensuring to maintain Unicode(UTF-8) endcoding) or in text editor (for Macs). Copy/paste whichever words you wish to add into the dictionary. Use one line per word. You do not need to make any edits to the rest of the document or to the accompanying .aff file. I have saved the custom dictionary to a new file (“/custom_dictionary/en_US_custom.dic”)
Load the dictionary as before with the dictionary()
function. We can then pass this new dictionary as an argument to the hunspell_check()
and hunspell_suggest()
functions
Conclusions
We’ve seen how the hunspell can be used to check spelling and suggest correct spellings of text data in R. In addition we showed how you can extend husnpell by adding a dictionary to correct the spelling of medical words and how to add your own words to a dictionary.
Checking the quality of raw text data is an instrumental part of data preprocessing. If you are using some kind of frequency cut-off or tf idf to identify words to use in a bag of words model or in topic modeling, you may miss important terms that are misspelled. By using the methods above you can identify misspelled words and potentially correct misspellings. It is important to remember not to blindly accept suggested words as then can be incorrect and have unintended consequences. Preprocessing raw text data is a time consuming task, but one that will reward you with a rich set of data that is not often found in EHR records.