Introduction
In the realm of medical data analysis, the ability to accurately parse and interpret documents is paramount. Medical documents, ranging from patient records to research reports, are rich in complex, structured, and unstructured data. Extracting this information accurately and efficiently is critical for patient care, medical research, and the development of healthcare policies.
This article introduces a powerful trio of libraries—Tesseract/PyPDF2, and spaCy serving a unique role in the document parsing process. Tesseract, coupled with PyPDF2, offers robust Optical Character Recognition (OCR) capabilities, essential for converting images and PDFs into machine-readable text. spaCy, a cutting-edge natural language processing (NLP) library, excels at analyzing text to extract meaningful information, such as medical entities and terms.
By combining these tools, we can develop a comprehensive pipeline based approach to parse medical documents effectively.
Section 1: Understanding the Libraries
In navigating the complexities of medical document parsing, we leverage three key libraries, each playing a vital role in extracting and analyzing data.
Tesseract/PyPDF2
Optical Character Recognition (OCR) technology, exemplified by Tesseract, converts images to text, critical for digitizing medical records. PyPDF2 complements this by extracting text and images from PDFs, crucial for preparing documents for OCR. Together, they form a powerful duo for accessing the wealth of information in scanned documents and PDFs.
spaCy
spaCy brings advanced Natural Language Processing (NLP) to the table, analyzing extracted text for meaningful insights. With capabilities like entity recognition and dependency parsing, spaCy excels at understanding complex medical terminology and extracting relevant information from the narrative text of patient records and research articles.
Section 2: Setting Up the Environment
Start by creating a new directory for your project. This directory will house all your project files, including Python scripts, data files, and the virtual environment. If you’re using the command line, you can follow these steps:
mkdir doc_parser
cd doc_parser
Initialize a virtual environment by running:
python3 -m venv venv
This command creates a new directory named venv within your project directory, where the virtual environment files are stored. Activate the virtual environment with the following command:
On Windows:
.\venv\Scripts\activate
On macOS and Linux:
source venv/bin/activate
With the virtual environment activated, install the necessary libraries using pip
pip install PyPDF2 spacy pdf2image
For spaCy’s language models (necessary for NLP tasks), install the English model with:
python -m spacy download en_core_web_sm
For Tesseract OCR, you might need to install the Tesseract engine separately as it’s not a Python package. Instructions can vary depending on your operating system, so refer to the official Tesseract GitHub page for detailed installation guides.
Now, your project environment is set up with a virtual environment containing all the necessary dependencies for document parsing. You’re ready to move on to extracting text, analyzing language, and processing tables from medical documents.
Section 3: Parsing with Tesseract and PyPDF2
The first step will be to extract the data from the PDF/Document image, to perform that we have 2 ways either we do it with tesseract or PyPDF2. While tesseract is good for perfroming OCR on images but sometimes OCR does not work well in that case we can extract data from PDF using PyPDF2.
So we will have 2 functions to extract data from the PDF
- If PDF contain images then we will perform OCR using tesseract
- If it contain plain text we will extract it using PyPDF2
Let’s see how to utilise tesseract for OCR
# extractUsingTesseract.py
import pytesseract
from pdf2image import convert_from_path
def convert_pdf_to_text(pdf_path):
images = convert_from_path(pdf_path)
extracted_texts = []
for i in range(len(images)):
# Convert each page to text
text = pytesseract.image_to_string(image=images[i], config=r"--psm 3")
extracted_texts.append(text)
# Concatenate all extracted texts
final_text = "\n".join(extracted_texts)
return final_text
for what psm
is you can refer to this great article, as it is a topic for another article in itself,
now for extraction using PyPDF2
from PyPDF2 import PdfReader
def load_pdf_text(pdf_path):
reader = PdfReader(pdf_path)
text = ""
for page in reader.pages:
text += "\n" + page.extract_text()
return text
this concludes our text extraction , we can feed the text output from our tesseract script to spacy for pattern recognization.
After running the above code on a dummy medical bill
we will get the following output
AIIMS
Invoice No: 2484
Hospital details:
nothing
Contact Details: 972764
Discharge Date:
01 Apr 2024AIIMS
#65, Defence Enclave boh road, Ambala Cantt.
Patient Information
Patient Name:
Shivam Sharma
Gaurdian Name:
Piyush Sharma
Insurance A vl:
Yes
Consultant:
Aryan Choudhary
MBBSPatient Issue:
Liver problem
Admit Date:
01 Apr 2024
Age:
23
Room Category:
SingleAddress:
#22, Railway colony , Ambala Cantt.
Mobile:
8859043839
Details Price Amount
Bed and Room || ₹20000 ₹20000
Oxygen Cylinder || ₹3000 ₹3000
Nursing and care || ₹5000 ₹5000
Food and Medicine || ₹1800 ₹1800
Pay By
Cash
Amount: ₹ 29800Tax: 30 %
CGST : 15 % - ₹ 6876.92
SGST : 15 % - ₹ 6876.92
Taxable Amount: ₹ 22923.08
Total Amount: ₹ 29800
Remark:
IN CASE OF EMERGENCY CONSUL T IMMEDIA TELY IF YOU GET
PAIN,P AINFUL MOVEMENTS, REDNESS,PUS OR BLEEDING
.FOLLOW UP AFTER 5 DA YS . MEET Aryan Choudhary , nothingAryan Choudhary
nothing
* This is computer generated invoice signature not required created at 01 Apr 2024 at 17:29
Section 4: Natural Language Processing with spaCy
We will be feeding the output from above process to a spacy NLP instance
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp(text)
now we will create a pattern to extract a patient’s name from the text, for help with creating patterns for spacy you can consult the documentation of spacy, it was very easy to get started with.
def get_patient_name(doc):
pattern = [
{"LOWER": "patient"},
{"LOWER": "name"},
{"IS_PUNCT": True},
{"IS_SPACE": True, "OP": "+"},
{"LOWER": {"REGEX": "^[a-zA-Z0-9]"}},
{"LOWER": {"REGEX": "^[a-zA-Z0-9]"}},
]
matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add("patient_name", pattern)
matches = matcher(doc)
if you want to visualize what matches look like you can temporarily add this block of code to your script
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id] # Get string representation
span = doc[start:end] # The matched span
print(match_id, string_id, start, end, span.text)
the output should be something similar to
9612498649678633594 patient_name 42 48 Patient Name: Shivam Sharma
this will fetch us the name of the patient, similarly you can create different patterns to parse diffent things from the document.
for extracting billing info you can have similar patterns, suppose you want to extract total amount the patient have to pay you can have the following pattern.
pattern = [
{"LOWER": "total"},
{"LOWER": "amount"},
{"IS_PUNCT": True},
{"IS_SPACE": True, "OP": "+"},
{"LOWER": "₹"},
{"IS_SPACE": True, "OP": "+"},
{"LOWER": {"REGEX": "^[a-zA-Z0-9]"}},
]
Conclusion
Our exploration of document parsing has shown that combining the OCR technology of Tesseract with PyPDF2, and the NLP power of spaCy enables thorough processing of medical documents. We’ve established an environment, parsed text, and structured data, which can significantly streamline medical data analysis. This integrated approach holds great promise for enhancing patient care and advancing medical research, showcasing the transformative impact of technology in healthcare data management.