A comprehensive guide to PDF document parsing: Leveraging Tesseract, PyPDF2 & spaCy

Introduction

In the realm of medical data analysis, the ability to accurately parse and interpret documents is paramount. Medical documents, ranging from patient records to research reports, are rich in complex, structured, and unstructured data. Extracting this information accurately and efficiently is critical for patient care, medical research, and the development of healthcare policies.

This article introduces a powerful trio of libraries, Tesseract, PyPDF2, and spaCy, each serving a distinct role in the document parsing process. Tesseract provides Optical Character Recognition (OCR), which converts page images into machine-readable text, while PyPDF2 extracts text directly from PDFs that already contain a text layer. spaCy, a natural language processing (NLP) library, analyzes the extracted text to pull out meaningful information, such as medical entities and terms.

By combining these tools, we can develop a comprehensive, pipeline-based approach to parsing medical documents effectively.

Section 1: Understanding the Libraries

In navigating the complexities of medical document parsing, we leverage three key libraries, each playing a vital role in extracting and analyzing data.

Tesseract/PyPDF2

Optical Character Recognition (OCR) technology, exemplified by Tesseract, converts images to text, critical for digitizing medical records. PyPDF2 complements this by extracting text and images from PDFs, crucial for preparing documents for OCR. Together, they form a powerful duo for accessing the wealth of information in scanned documents and PDFs.

spaCy

spaCy brings advanced Natural Language Processing (NLP) to the table, analyzing extracted text for meaningful insights. With capabilities like entity recognition and dependency parsing, spaCy excels at understanding complex medical terminology and extracting relevant information from the narrative text of patient records and research articles.

Section 2: Setting Up the Environment

Start by creating a new directory for your project. This directory will house all your project files, including Python scripts, data files, and the virtual environment. If you’re using the command line, you can follow these steps:

mkdir doc_parser
cd doc_parser

Initialize a virtual environment by running:

python3 -m venv venv

This command creates a new directory named venv within your project directory, where the virtual environment files are stored. Activate the virtual environment with the following command:

On Windows:

.\venv\Scripts\activate

On macOS and Linux:

source venv/bin/activate

With the virtual environment activated, install the necessary libraries using pip:

pip install PyPDF2 spacy pdf2image pytesseract

For spaCy’s language models (necessary for NLP tasks), install the English model with:

python -m spacy download en_core_web_sm

Tesseract itself is not a Python package (pytesseract is only a wrapper around it), so you need to install the Tesseract engine separately. pdf2image likewise relies on the Poppler utilities being installed on your system. Instructions vary depending on your operating system, so refer to the official Tesseract GitHub page for detailed installation guides.
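
Because Tesseract and Poppler are system programs rather than pip packages, a missing install only shows up when the script runs. Below is a minimal sketch of a pre-flight check, assuming the usual executable names tesseract and pdftoppm (one of the Poppler tools that pdf2image calls). missing_tools is a hypothetical helper, not part of any of the libraries:

```python
import shutil


def missing_tools(names=("tesseract", "pdftoppm")):
    """Return the executables from names that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler executables found.")
```

If a tool is reported missing even though it is installed, its install directory is probably not on PATH.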

Now your project environment is set up with a virtual environment containing all the necessary dependencies for document parsing. You’re ready to move on to extracting text from medical documents and analyzing it with spaCy.

Section 3: Parsing with Tesseract and PyPDF2

The first step is to extract the text from the PDF, and there are two ways to do it: Tesseract or PyPDF2. Tesseract performs OCR on page images, which is what scanned documents need, but OCR can misread characters. When the PDF already carries a text layer, PyPDF2 can read that text directly.

So we will have two functions to extract data from the PDF:

If the PDF contains scanned images, we perform OCR using Tesseract
If it contains plain text, we extract it using PyPDF2

Let’s see how to use Tesseract for OCR:

# extractUsingTesseract.py

import pytesseract
from pdf2image import convert_from_path


def convert_pdf_to_text(pdf_path):
    # Render each PDF page as a PIL image
    images = convert_from_path(pdf_path)

    extracted_texts = []
    for image in images:
        # Run OCR on each page image
        text = pytesseract.image_to_string(image, config="--psm 3")
        extracted_texts.append(text)

    # Concatenate all extracted texts
    final_text = "\n".join(extracted_texts)

    return final_text

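The --psm option sets Tesseract’s page segmentation mode; 3, the default, means fully automatic page segmentation without orientation and script detection. The modes are a topic for an article of their own, so see the Tesseract documentation for the full list. Now for extraction using PyPDF2: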

from PyPDF2 import PdfReader

def load_pdf_text(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += "\n" + page.extract_text()
    return text
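
One way to combine the two functions is to try the text layer first and fall back to OCR only when little or no text comes back. The sketch below is an assumption about how that choice could be automated, not part of the original scripts: extract_text and the min_chars threshold are hypothetical, and the two extractors are passed in as plain callables.

```python
def extract_text(pdf_path, text_extractor, ocr_extractor, min_chars=50):
    """Try direct text extraction first; fall back to OCR if it yields too little.

    text_extractor and ocr_extractor are callables that take a PDF path and
    return a string. min_chars is an arbitrary, hypothetical threshold.
    """
    text = text_extractor(pdf_path) or ""
    if len(text.strip()) >= min_chars:
        return text
    # Scanned PDFs usually have no text layer, so run OCR instead
    return ocr_extractor(pdf_path)
```

With the functions defined above, the call would look like extract_text("bill.pdf", load_pdf_text, convert_pdf_to_text), where bill.pdf is a placeholder file name. Trying PyPDF2 first keeps the slower OCR step for documents that actually need it.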

This concludes our text extraction. We can now feed the text output from our extraction script to spaCy for pattern recognition.

Running the above code on a dummy medical bill gives the following output:

AIIMS
Invoice No:  2484
Hospital details:
nothing  
Contact Details: 972764
Discharge Date:   
01 Apr 2024AIIMS
#65, Defence Enclave boh road, Ambala Cantt.
Patient Information
Patient Name:   
Shivam Sharma
Gaurdian Name:   
Piyush Sharma
Insurance A vl:  
Yes
Consultant:   
Aryan Choudhary  
MBBSPatient Issue:   
Liver problem
Admit Date:   
01 Apr 2024
Age:   
23
Room Category:   
SingleAddress:   
#22, Railway colony , Ambala Cantt.
Mobile:   
8859043839
Details Price Amount
Bed and Room ||  ₹20000 ₹20000
Oxygen Cylinder ||  ₹3000 ₹3000
Nursing and care ||  ₹5000 ₹5000
Food and Medicine ||  ₹1800 ₹1800
Pay By  
Cash  
Amount: ₹  29800Tax: 30 %
CGST : 15 % - ₹  6876.92
SGST : 15 % - ₹  6876.92
Taxable Amount: ₹  22923.08
Total Amount: ₹  29800
Remark:
IN CASE OF EMERGENCY CONSUL T IMMEDIA TELY IF YOU GET
PAIN,P AINFUL MOVEMENTS, REDNESS,PUS OR BLEEDING
.FOLLOW UP AFTER 5 DA YS . MEET  Aryan Choudhary , nothingAryan Choudhary  
nothing
* This is computer generated invoice signature not required created at 01 Apr 2024 at 17:29

Section 4: Natural Language Processing with spaCy

We will feed the output of the above process to a spaCy NLP instance:

import spacy

nlp = spacy.load("en_core_web_sm")  # the model installed in Section 2
doc = nlp(text)

Now we will create a pattern to extract the patient’s name from the text. For help with creating patterns, consult the spaCy documentation on rule-based matching; it is easy to get started with.

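from spacy.matcher import Matcher


def get_patient_name(doc):
    # Label, punctuation, whitespace token(s), then two tokens for the name
    pattern = [
        {"LOWER": "patient"},
        {"LOWER": "name"},
        {"IS_PUNCT": True},
        {"IS_SPACE": True, "OP": "+"},
        {"LOWER": {"REGEX": "^[a-zA-Z0-9]"}},
        {"LOWER": {"REGEX": "^[a-zA-Z0-9]"}},
    ]
    matcher = Matcher(nlp.vocab)
    # spaCy v3 expects a list of patterns as the second argument
    matcher.add("patient_name", [pattern])
    matches = matcher(doc)
    # Assumes a two-token name, as the pattern does; None if nothing matched
    for _, start, end in matches:
        return doc[end - 2:end].text
    return None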

If you want to see what the matches look like, you can temporarily add this block to your script, right after the matcher(doc) call:

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

The output should look similar to:

9612498649678633594 patient_name 42 48 Patient Name: Shivam Sharma

This fetches the name of the patient. Similarly, you can create different patterns to parse different things from the document.
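
Since most fields on the bill follow the same "Label: value" layout, such patterns can be generated instead of written by hand. Below is a minimal sketch, assuming every field looks like the patient name above (label words, a punctuation token, one or more whitespace tokens, then a fixed number of value tokens). build_label_pattern is a hypothetical helper, not a spaCy function:

```python
def build_label_pattern(label, value_tokens=1):
    """Build a spaCy Matcher pattern for a 'Label: value' field.

    label is the field label as it appears in the document, e.g. "Patient Name".
    value_tokens is the number of tokens expected in the value (an assumption
    the caller has to make per field).
    """
    # One case-insensitive token per word of the label
    pattern = [{"LOWER": word} for word in label.lower().split()]
    # The separator (":"), then the whitespace token(s) before the value
    pattern.append({"IS_PUNCT": True})
    pattern.append({"IS_SPACE": True, "OP": "+"})
    # The value tokens, each starting with a letter or digit
    pattern.extend({"LOWER": {"REGEX": "^[a-zA-Z0-9]"}} for _ in range(value_tokens))
    return pattern
```

Calling build_label_pattern("Patient Name", 2) reproduces the pattern used in get_patient_name. In spaCy v3 the result is registered with matcher.add("age", [build_label_pattern("Age", 1)]); the label has to match the spelling used in the document.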

You can extract billing info with similar patterns. Suppose you want the total amount the patient has to pay; you can use the following pattern:

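pattern = [
    {"LOWER": "total"},
    {"LOWER": "amount"},
    {"IS_PUNCT": True},
    # Whitespace tokens are optional here: spaCy does not emit a token
    # for a single space between two tokens
    {"IS_SPACE": True, "OP": "*"},
    {"LOWER": "₹"},
    {"IS_SPACE": True, "OP": "*"},
    {"LOWER": {"REGEX": "^[a-zA-Z0-9]"}},
]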
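
A match yields a span of text such as "Total Amount: ₹  29800", which still has to be turned into a number before it can be used in calculations. Below is a minimal sketch using only the standard library. parse_amount is a hypothetical helper, and it assumes the amount is the last number in the span:

```python
import re
from decimal import Decimal


def parse_amount(span_text):
    """Return the last number found in span_text as a Decimal, or None."""
    # Drop thousands separators, then collect integers and decimals
    numbers = re.findall(r"\d+(?:\.\d+)?", span_text.replace(",", ""))
    return Decimal(numbers[-1]) if numbers else None
```

Decimal is used rather than float so that currency values such as 22923.08 are kept exactly.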

Conclusion

Our exploration of document parsing has shown that combining the OCR technology of Tesseract, the text extraction of PyPDF2, and the NLP power of spaCy enables thorough processing of medical documents. We’ve set up an environment, extracted text from a PDF, and pulled structured fields out of it with spaCy patterns, which can significantly streamline medical data analysis. This integrated approach holds promise for enhancing patient care and advancing medical research.