Natural Language Processing in Clinical Text Analysis and Healthcare Knowledge Extraction Systems

“Data”, “Information”, “Knowledge” and “Wisdom” are the keywords of today’s data-driven society. Most of you know the literal meaning of these terms, many of you are aware of the concept behind them, and some of you will know the context in which they are used together. They form a hierarchical pyramid in which wisdom, at the top, is attained from raw data, at the bottom. In information systems this is generally referred to as the DIKW pyramid. Any data analysis system relies heavily on raw data to derive information from it. NLP systems go one step further: they not only derive information from data but also extract knowledge from the derived information. With proper and precise knowledge, wisdom may be achieved.

Here is a question: is it possible for a computer program to extract meaningful knowledge from an unstructured medical text that a doctor has written with a ballpoint pen on a piece of paper? Not every computer program can do this, but special-purpose programs based on machine learning and NLP algorithms can perform such tasks with a high level of accuracy and efficiency.

Today, natural language processing (NLP) and information extraction systems are widely used in biomedicine. The PMC (PubMed Central) database of the NCBI (National Center for Biotechnology Information) holds a large body of literature on the use of NLP in biomedicine and health care. I was keen to know how fast the literature on NLP in biomedicine is growing, so I searched the PMC database, which has the largest collection of biomedical literature; below is the trend for the last 20 years (excluding the present year). The trend shows a tremendous increase in publications after 2006, and for the last 2 years more than 6,000 articles have been published every year in the clinical field alone.

[Figure: number of NLP-related publications in PMC per year over the last 20 years]

What is NLP?

Natural languages are the languages in which humans communicate, such as English, Hindi, Tamil, Punjabi, Chinese and French. Different human languages are prevalent in different regions. Artificial languages, on the other hand, are the languages in which computers and computer programs communicate, such as Java, Python and Scheme. Artificial languages have a specific syntax and grammar, and dedicated compilers are built to process them. NLP is the technique that enables computers to understand natural language.

In other words, Natural Language Processing (NLP) is an automated computational technique by which machines interpret human language. It takes unstructured text (human language) as its input and produces a structured and meaningful representation of that text as its output.
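To make the idea of "unstructured text in, structured representation out" concrete, here is a deliberately tiny sketch. It uses plain regular expressions rather than a real NLP pipeline, and the note text, the field names and the `extract_vitals` function are all made up for illustration.

```python
import re

def extract_vitals(note):
    """Pull a few structured fields out of a free-text note (toy example)."""
    # Hypothetical patterns; real clinical notes vary far more than this.
    patterns = {
        "blood_pressure": r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
        "temperature_f": r"\btemp(?:erature)?\s*(\d{2,3}(?:\.\d)?)\s*F\b",
        "pulse_bpm": r"\bpulse\s*(\d{2,3})\b",
    }
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, note, flags=re.IGNORECASE)
        if match:
            groups = match.groups()
            # Blood pressure has two captured numbers; the others have one.
            record[field] = "/".join(groups) if len(groups) > 1 else groups[0]
    return record

note = "Pt c/o headache. BP 140/90, pulse 88, temp 99.1 F. Advised rest."
print(extract_vitals(note))
```

A production system would replace the hand-written patterns with tokenization, concept recognition and mapping to a standard vocabulary, but the input and output shapes stay the same: free text goes in, a structured record comes out.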

It is best defined, in a more technical way, as a range of computational techniques for analysing and representing naturally occurring text at one or more levels of linguistic analysis to achieve human-like language processing for a range of applications. It is a beautiful amalgamation of computer science, statistics, artificial intelligence and computational linguistics. These computational techniques can be used to analyse naturally occurring texts of any language. The texts can be oral or written. In addition, the text to be analysed should ideally be gathered from actual usage and should not be modified or altered for the NLP system, so that the system achieves better accuracy.

Problems associated with NLP

From the above description, NLP may look like an easy task, but it is the other way round. Interpreting human language by machine involves overcoming several problems. Two in particular make natural languages difficult to process and call for techniques different from those used to build compilers for artificial languages: (i) the level of ambiguity that exists in natural languages and (ii) the complexity of the semantic information contained in even simple sentences.

Why use NLP in Healthcare?

The medical and healthcare domain creates, manages and uses a huge variety of unstructured and semi-structured text documents in day-to-day operations, including clinical notes, referral letters, pathology reports and electronic health records (EHRs). Automated access to this content is the need of the hour in order to improve standards of care, provide automated semantic solutions and evaluate treatment outcomes. The knowledge content of these documents is immense and is the basis for meeting these goals. NLP in healthcare systems enables automated extraction of structured information from unstructured clinical notes, comparison of the information contained in the texts, semantic annotation of unstructured text, advanced searching, and summarization of unstructured text into a meaningful document.

NLP is still very young, especially in the healthcare domain, yet in a short time it has made its presence felt, and its algorithms are maturing enough to deliver reliable NLP-based systems. It can now disambiguate word senses and perform event discovery, such as detection of signs, symptoms, tests, procedures, diseases, diagnoses and medications. It can detect negated and uncertain words or sentences with high precision, perform temporal linking (time expression discovery) of the concepts in unstructured text, resolve coreferences, and identify the associated location in the body (such as the location of a pain or of an abnormality like a kidney stone).
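As a rough illustration of negation and uncertainty detection, the sketch below looks for trigger words in a small window before a concept mention. The cue lists, the window size and the `classify_mention` function are assumptions made for this example; clinical systems use much richer trigger lists and also handle scope, cues that follow the concept, and sentence boundaries.

```python
import re

# Illustrative cue lists only; far from exhaustive.
NEGATION_CUES = ("no", "denies", "without", "not", "negative for")
UNCERTAINTY_CUES = ("possible", "probable", "suspected", "may have")

def classify_mention(sentence, concept, window=5):
    """Label a concept in one sentence as negated, uncertain, affirmed or absent."""
    text = sentence.lower()
    idx = text.find(concept.lower())
    if idx == -1:
        return "absent"
    # Keep only the last `window` words that precede the concept.
    preceding = re.findall(r"[a-z]+", text[:idx])[-window:]
    context = " " + " ".join(preceding) + " "
    if any(f" {cue} " in context for cue in NEGATION_CUES):
        return "negated"
    if any(f" {cue} " in context for cue in UNCERTAINTY_CUES):
        return "uncertain"
    return "affirmed"

print(classify_mention("Patient denies chest pain.", "chest pain"))
print(classify_mention("Possible pneumonia on chest X-ray.", "pneumonia"))
print(classify_mention("Patient reports severe headache.", "headache"))
```

The point of the sketch is that "chest pain" appearing in a note does not mean the patient has chest pain; the words around the mention decide how it should be recorded.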

Role of Natural Language Processing (NLP) in Healthcare

NLP has several potential applications in the healthcare domain. As we now know, NLP can improve the precision and completeness of electronic health records (EHRs) by translating unstructured free text into structured and standardized information.

Medical practitioners make heavy use of abbreviations and acronyms when writing clinical notes, which makes the notes highly ambiguous. NLP systems help identify such ambiguous data and enhance its usefulness. This remains a major challenge: NLP systems are still prone to errors in word sense disambiguation, although solutions exist to improve their precision. Acronyms can be taken care of with a dictionary lookup file, but the accuracy of this approach depends directly on how exhaustive and accurate the lookup file is.
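A minimal sketch of the dictionary lookup idea follows. The `ABBREVIATIONS` table and the sample note are hypothetical, and the sketch ignores the harder problem of abbreviations that have more than one meaning.

```python
import re

# Hypothetical lookup table; a real one would be far larger and clinician-curated.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "hr": "heart rate",
    "sob": "shortness of breath",
    "htn": "hypertension",
}

def expand_abbreviations(note, lookup=ABBREVIATIONS):
    """Replace whole-word abbreviations found in the lookup; leave other words untouched."""
    def replace(match):
        word = match.group(0)
        return lookup.get(word.lower(), word)
    return re.sub(r"\b[A-Za-z]+\b", replace, note)

print(expand_abbreviations("Pt has HTN and SOB; BP elevated, MS noted."))
```

In this example "Pt" and "MS" pass through unchanged because they are not in the table, which is exactly the coverage limitation described above: the result is only as good as the lookup file.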

In the recent past, the healthcare domain has seen an explosion of data and information, and there is a need to find the best ways to extract what is relevant to a given information need. NLP helps process such data using data mining and machine learning algorithms to extract relevant information and support knowledge discovery.

Use-cases of NLP in the Healthcare Domain

NLP has several potential use-cases built on identifying and extracting key details from large unstructured or semi-structured texts. Some of the use-cases described by the Open Health Natural Language Processing (OHNLP) Consortium are listed below:

  • Patient cohort identification– An NLP system uses data extraction and machine learning algorithms to identify a patient cohort on the basis of defined rules and a set of inclusion/exclusion criteria, by querying unstructured and semi-structured clinical notes and other clinical texts. This may help in better understanding patients’ conditions and can also be used for feasibility studies and recruitment of potential participants.
  • Clinical decision support– Automated systems that assist decision-making in clinical settings utilize clinical narratives or even clinical guidelines. One of the many use-cases is automated follow-up and care for breast cancer patients, where the manual process of providing care is to be replaced with automated decision support that assists medical practitioners. Automated decisions are made on the basis of the patient’s current condition (current test results), matched against the guidelines, the patient’s history and the family medical history. Automated systems, mostly semantic systems, then fire a rule engine to come up with a plausible explanation of the condition and to propose drugs and other remedies. At present trained human experts do this, a process that is error-prone, expensive and time-consuming. NLP can be applied to extract the information needed for decision support from laboratory test reports and to build patient profiles that become the basis for automated follow-up recommendations. Since a certain portion of the important information associated with follow-up recommendations requires semantic understanding to be extracted precisely, the use of semantic parsing on clinical reports will boost the quality of patient profiles and therefore facilitate better breast cancer examination, prevention and cure.
  • Health care quality research– Text data from physicians’ observations can be used in health care quality research to automatically assess quality of life with NLP-based systems, in compliance with national health guidelines.
  • Personalized medicine– Patients’ medication histories and their responses are of great importance to future medical treatment. In particular, the detection of medication side effects is an important issue for patient safety and pharmacovigilance. Tracking a patient’s medication intake along with the accompanying side effects plays a critical role in advancing personalized medicine and identifying genetic markers of undesired medication effects. A substantial amount of medication-related information resides in unstructured clinical narratives, and a fair amount of it requires contextual understanding to be retrieved correctly.
  • Bio-Surveillance– NLP has been used in bio-surveillance to detect emerging infectious diseases and acts of bioterrorism by collecting and analysing clinical data from health care organizations. These data include the chief complaint fields from outpatient encounters and emergency department visits. Keyword searches look for the occurrence of such word forms as “sore throat” but may miss such related notions as “pain upon swallowing” or “throat feels raw.” NLP techniques such as distributional semantics, which can automatically detect words and phrases that are semantically related to a list of provided keywords, would enable more accurate identification of cases.
  • Drug development– NLP can also accelerate the development of drugs. One application is to mine a large set of electronic medical record (EMR) data for drugs that can be repurposed. NLP can also facilitate the post-market surveillance of a drug or the discovery of adverse drug events.
  • Reverse Conversion – Converting data in the other direction, from a machine-readable format to natural language text, for research, reporting and educational purposes.
  • Text Summarization– There are two different use cases of text summarization. The first is to accelerate chart review: clinical information in multiple reports of the same patient is extracted and visualized so that clinicians can quickly grasp the patient’s major past medical history without performing lengthy chart reviews. The second is to use NLP techniques to summarize clinical information across a group of patients. The annotation process groups and summarizes annotated text hits at both patient and document level. Combined with the other annotation information produced, researchers or physicians can use this to identify medical terms that help them find patient cohorts based on term frequency and/or relevance.
  • Optical character recognition– OCR is used to convert images, PDF documents or scanned medical records and imaging reports into text files that can be parsed and analysed by an NLP system. For instance, doctors’ scanned referral notes can be converted into text files using OCR, and the text files can then be analysed with semantic annotators and an NLP system to translate the referral notes into case report forms, as demonstrated in my paper.
  • Speech Recognition – NLP systems can allow a medical practitioner to dictate notes, which are then converted into text.
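The bio-surveillance example above, where a keyword search for “sore throat” misses related complaints, can be sketched as follows. Note the substitution: the `RELATED_PHRASES` table is written by hand here and merely stands in for what a distributional-semantics model would suggest; the complaint strings and function names are also invented for illustration.

```python
# Hypothetical, hand-written table standing in for the output of a semantic model.
RELATED_PHRASES = {
    "sore throat": ["sore throat", "pain upon swallowing", "throat feels raw"],
}

def keyword_match(complaint, keyword):
    """Plain keyword search: true only if the exact keyword occurs."""
    return keyword in complaint.lower()

def expanded_match(complaint, keyword, related=RELATED_PHRASES):
    """Search for the keyword or any phrase listed as related to it."""
    text = complaint.lower()
    return any(phrase in text for phrase in related.get(keyword, [keyword]))

complaints = [
    "Sore throat for two days",
    "Reports pain upon swallowing since yesterday",
    "Throat feels raw, mild fever",
    "Twisted ankle while running",
]

plain = [c for c in complaints if keyword_match(c, "sore throat")]
expanded = [c for c in complaints if expanded_match(c, "sore throat")]
print(len(plain), "case(s) found by keyword search")
print(len(expanded), "case(s) found with related phrases")
```

With these four made-up complaints the plain search finds one case and the expanded search finds three, while the unrelated ankle injury is excluded by both.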

NLP can help healthcare providers make better decisions

There are numerous instances where NLP has helped in better decision-making in health care settings. Some of the common cases are early detection of diseases based on patients’ clinical parameters, and identification of clinical patterns and new clinical conditions of diseases. Below are some recent and interesting cases where NLP-driven technology has improved the decision-making capacity of healthcare providers.

  • In June 2017, Saeed Hassanpour et al. published an article in the Journal of Digital Imaging in which the authors developed a natural language processing method to automatically extract clinical findings from radiology reports. NLP helps categorise these reports by level of change and significance according to a radiology-specific information model. This can help clinicians quickly understand the key observations in radiology reports and can facilitate clinical decision support.
  • Glen Coppersmith et al., in 2017, presented a very interesting study on mental health analysis in the clinical whitespace (the time gap between visits to clinics) using social interaction data on behaviour, beliefs, mood and wellbeing. The authors used internal chats, emails, file sharing, etc., together with NLP-based emotion and sentiment classifiers, to show that a company exhibits an increase in ‘joy’ around major holidays and after major software releases, and an increase in negative sentiment near major deadlines.
  • Matthew Scotch et al., in 2016, utilized a combination of concept annotation, a biomedical ontology and a natural language processing pipeline to understand contraceptive use among female Veterans seeking care at Veterans Administration (VA) healthcare facilities. They used the contraceptive use information in the VA’s electronic health record (EHR), considering both free text and semi-structured data. They achieved high precision (0.83) and recall (0.84) in support of research on contraceptive use among women.
  • SV Ramanan et al., in 2016, used a clinical NLP annotator, Cocoa, for dense annotation of free-text critical care discharge summaries from an Indian hospital. The authors claim to be the first to carry out such a large annotation task in India and to have created the first annotated clinical corpus in India. They have shown that the Cocoa system can be used on larger data sets of clinical summaries to extract desired data from unstructured notes, and may thus prove useful for cohort analysis, given the absence of structured EHRs in the Indian clinical context.
  • In 2014, Alexandra M Roch et al. created an NLP-based automated tool that screens for pancreatic cysts for the early detection of pancreatic cancer, with mean sensitivity and specificity of 99.9% and 98.8%, respectively. As many as 3% of computed tomography (CT) scans detect pancreatic cysts. Because pancreatic cysts are incidental, ubiquitous and poorly understood, follow-up is often not performed. Pancreatic cysts may have significant malignant potential, and their identification represents a ‘window of opportunity’ for the early detection of pancreatic cancer. This highly accurate system can help capture patients ‘at-risk’ of pancreatic cancer in a registry.
  • Apart from these, there are many other instances of NLP use in the healthcare domain.
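The studies above report precision, recall, sensitivity and specificity. For readers new to these measures, here is a short sketch of how they are computed from the counts of correct and incorrect predictions. The counts used in the example are invented and do not come from any of the cited studies.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that were found (also called sensitivity)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Share of actual negatives that were correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0

# Invented counts for illustration: true/false positives and negatives.
tp, fp, fn, tn = 90, 10, 30, 870

print("precision:  ", round(precision(tp, fp), 3))
print("recall:     ", round(recall(tp, fn), 3))
print("specificity:", round(specificity(tn, fp), 3))
```

High recall (sensitivity) matters most for screening tools, where a missed case is costly; high precision matters when every flagged case triggers manual review.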

Future of Natural Language Processing in the healthcare domain

Well, 100% perfection in NLP is still awaited and may not even be possible, but NLP still provides valuable and beneficial information when used wisely. While NLP can offer advanced diagnostic ability, it relies heavily on the quality of documentation by health care providers. With time, the precision and accuracy of these systems have improved tremendously.

NLP has given a new dimension to healthcare institutions and is already a billion-dollar industry. Several clinical-domain startups are already harnessing the power of NLP and creating products to enhance the medication and decision-making capabilities of health care providers.

Natural language processing is the base on which cognitive computing, sentiment analysis, semantic text analytics, voice recognition and image retrieval are built. The more data an NLP system has available, the better its efficiency will be. Quality documentation that captures all the specifics helps NLP algorithms distinguish “wanted” from “unwanted” information. Cases that are hard for humans to distinguish will most likely be even harder for NLP systems to interpret.

We can expect that in future more clinicians will use voice commands to log data rather than filling out forms. NLP is useful for identifying clinical gaps and can help organizations reduce labour costs.

Students, researchers and other beginners (including startups) can boost their NLP proficiency by first deploying it in low-risk scenarios, building confidence in NLP tools and techniques while avoiding the risk of adverse events. Some of the NLP tools that I have used and would recommend to budding NLP researchers are Apache cTAKES, MetaMap, Apache OpenNLP and GATE. Apache cTAKES and MetaMap are specifically designed for the health care domain and map biomedical text to built-in libraries of UMLS concepts (UMLS Metathesaurus), the SNOMED-CT ontology and similar biomedical ontologies. Apache OpenNLP and GATE are machine learning-based generic toolkits for processing natural language text.

Smart machine learning algorithms for NLP, integrated into the healthcare provider’s routine workflow, will be key to any productivity gain from NLP technology. NLP is just an algorithm, and it depends intrinsically on the input and corrections it gets from its users to improve over time. Though it is not 100% perfect, it is a tool that will evolve and get better as you feed it more information and regularly correct its algorithms.


Thanks a ton for reading this article. It is my humble request that you share it with friends who may find it relevant.