Extract any named entities from PDF using custom Spacy pipeline

Knowledgator Engineering
6 min read · Dec 3, 2023

Portable Document Format (PDF) is a predominant file format for distributing content, particularly in academic, scientific, corporate, and legal realms. Its strength lies in preserving the original layout and structure of documents. However, when it comes to data retrieval, analysis, or manipulation, PDFs pose certain challenges. This is where text extraction and structuring become important. In this blog post, we delve into the process of constructing a pipeline for extracting text from PDFs, followed by processing with Natural Language Processing (NLP) techniques.

Crucially, this pipeline incorporates Named Entity Recognition (NER), a subfield of NLP focused on identifying and classifying key information elements in text, such as names of people, organizations, locations, expressions of time, quantities, monetary values, and more. NER plays a pivotal role in transforming unstructured text into structured data, enabling more effective information extraction and analysis. We will explore how this pipeline extracts and identifies named entities in real-world, complex scenarios, demonstrating its utility in practical applications.

Requirements

Before starting this project, ensure that you have Python 3.8 or above installed on your machine. In this work, we will require two Python libraries:

  • Pdfminer — a library for performing layout analysis and data parsing;
  • Spacy — a library for industrial Natural Language Processing (NLP) applications, including Named Entity Recognition (NER).

Before installing the required libraries, ensure that your pip, setuptools and wheel are up to date. For that, run the following command:

pip install -U pip setuptools wheel

To install the described libraries, run the following command:

pip install pdfminer.six spacy

Additionally, you need to download one of the Spacy models:

python -m spacy download en_core_web_sm

Also, we recommend checking our article where we use Large Language Models (LLMs) to extract custom structured tables from PDFs.

Text and entity extraction

First of all, we need to import all the necessary libraries for the project:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
from tqdm import tqdm
import re

We start with a top-level function, process_document, that takes a path to a PDF document and the numbers of the pages we are going to process to extract text.

def process_document(pdf_path, page_ids=None):
    extracted_pages = extract_pages(pdf_path, page_numbers=page_ids)

    page2content = {}

    # Process each extracted page
    for extracted_page in tqdm(extracted_pages):
        page_id = extracted_page.pageid
        content = process_page(extracted_page)
        page2content[page_id] = content

    return page2content

If you need to parse all pages, leave page_ids equal to None. The function returns a dictionary whose keys are page numbers and whose values are strings containing the page's text in the right order.

The process_page function parses a whole PDF page: it iterates through the page's high-level elements, sorted by position, and transforms them into string format.

def process_page(extracted_page):
    content = []

    # Get a sorted list of elements based on
    # their Y-coordinate in reverse order
    elements = [element for element in extracted_page._objs]
    elements.sort(key=lambda a: a.y1, reverse=True)

    for i, element in enumerate(elements):
        # Extract text if the element is a text container
        if isinstance(element, LTTextContainer):
            line_text = extract_text_and_normalize(element)
            content.append(line_text)

    # Combine and clean up the extracted content
    content = re.sub(r'\n+', '\n', ''.join(content))
    return content

To make text extracted from PDFs more readable for machines and humans, we need to handle line breaks properly. In particular, we must make sure no newline characters split unfinished sentences; otherwise, we can introduce bias into language models and degrade the final result.

def extract_text_and_normalize(element):
    # Extract text from the element and split it into lines
    line_texts = element.get_text().split('\n')
    norm_text = ''
    for line_text in line_texts:
        line_text = line_text.strip()
        # empty strings after stripping convert to a newline character
        if not line_text:
            line_text = '\n'
        else:
            line_text = re.sub(r'\s+', ' ', line_text)
            # if the last character is not a letter, digit, comma or hyphen,
            # the sentence is likely finished, so add a newline character
            if not re.search(r'[\w\d,\-]', line_text[-1]):
                line_text += '\n'
            else:
                line_text += ' '
        # concatenate into a single string
        norm_text += line_text
    return norm_text
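To sanity-check the normalization without a real PDF, you can feed the function a minimal stand-in object exposing get_text(). FakeTextElement and the sample text below are purely illustrative, and the function body is repeated so the snippet runs standalone:

```python
import re

def extract_text_and_normalize(element):
    # mirrors the normalization logic described above
    norm_text = ''
    for line_text in element.get_text().split('\n'):
        line_text = line_text.strip()
        if not line_text:
            line_text = '\n'  # blank line -> paragraph break
        else:
            line_text = re.sub(r'\s+', ' ', line_text)
            # unfinished sentences keep flowing; finished ones get a newline
            if not re.search(r'[\w\d,\-]', line_text[-1]):
                line_text += '\n'
            else:
                line_text += ' '
        norm_text += line_text
    return norm_text

class FakeTextElement:
    """Stand-in for pdfminer's LTTextContainer (illustrative only)."""
    def __init__(self, text):
        self._text = text
    def get_text(self):
        return self._text

element = FakeTextElement("Revenue grew in\n2022, driven by\nComirnaty.\n\nNext paragraph")
print(repr(extract_text_and_normalize(element)))
# 'Revenue grew in 2022, driven by Comirnaty.\n\nNext paragraph '
```

Note how the sentence split across the first two lines is rejoined with spaces, while the blank line is preserved as a paragraph break.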

Entities extraction

Named Entity Recognition (NER) has emerged as a fundamental task for many NLP applications. It involves detecting specific entities within a body of text and subsequently categorising them into predefined classes. NER is useful when you need to analyse large amounts of text and pick out the specific information you are most interested in. It is one of the first stages of information extraction pipelines that aim to build knowledge graphs from raw text. Moreover, NER can be used for filtering and classification of content. Unlike standard text classification, which categorises entire documents or chunks of text into broad themes, NER delves deeper, extracting and categorising individual entities such as persons, organisations, locations, and other predefined categories. So, let's implement the NER pipeline on real data using Spacy models.
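As a toy illustration of the idea, the sketch below tags entities with one hand-written regex pattern per class. This is not how spaCy works (statistical models learn these patterns from data); the pattern table and example sentence are made up for illustration:

```python
import re

# Hypothetical label -> pattern table; real NER models learn such patterns
PATTERNS = {
    "MONEY": r"\$\d+(?:\.\d+)?(?:\s(?:billion|million))?",
    "DATE": r"\b(?:19|20)\d{2}\b",
}

def toy_ner(text):
    # collect (text, label, start, end) for every pattern match
    entities = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label, match.start(), match.end()))
    return sorted(entities, key=lambda ent: ent[2])

print(toy_ner("Pfizer earned $100.3 billion in 2022"))
# [('$100.3 billion', 'MONEY', 14, 28), ('2022', 'DATE', 32, 36)]
```

The output format mirrors what any NER system produces: the surface text, the assigned class, and character offsets into the original string.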

In our case, we will deal with the financial report of one of the largest pharma companies — Pfizer, and our task is to extract financial information about the top products of the company. To download the report, run the following command:

!curl "https://s28.q4cdn.com/781576035/files/doc_financials/2022/ar/PFE-2022-Form-10K-FINAL-(without-Exhibits).pdf" > pfizer-report.pdf

To simplify the pipeline, we will specify the page (zero-indexed) where such information is located:

pdf_path = 'pfizer-report.pdf'
page2content = process_document(pdf_path, page_ids=[9])

First of all, we need to initialize the NLP object and exclude all unnecessary pipeline components; this will significantly accelerate processing:

import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

Let’s visualize the work of the Spacy model using displacy:

from spacy import displacy

text = page2content[1]
doc = nlp(text)
displacy.serve(doc, style="ent")

Here are the results of the Spacy pipeline:

Results of entity recognition by the default Spacy model

It's a pretty good result for such a tiny model; however, the model wrongly labels drug names as persons. Unfortunately, the current model does not support the recognition of drugs, and this is an issue for any supervised model trained on a fixed set of entity classes. In real-life cases, we deal with a large variety of entities, and it's essential to be able to specify them at run-time.

Let’s create a custom Spacy pipeline that uses our zero-shot NER API.

import requests
from spacy.language import Language
from spacy.tokens import Span

@Language.factory("knowledgator_ner")
class KnowledgatorNERComponent:
    def __init__(self, nlp, name, API_KEY, labels_to_add, labels_to_exclude):
        self.url = "https://zero-shot-ner.p.rapidapi.com/token_searcher/ner"
        self.headers = {
            "content-type": "application/json",
            "X-RapidAPI-Key": API_KEY,
            "X-RapidAPI-Host": "zero-shot-ner.p.rapidapi.com"
        }
        self.labels_to_add = labels_to_add
        self.labels_to_exclude = set(labels_to_exclude)

    def get_positions(self, doc):
        start2token_id = {}
        end2token_id = {}
        for tok in doc:
            start = tok.idx
            start2token_id[start] = tok.i
            end = len(tok.text) + tok.idx
            end2token_id[end] = tok.i + 1
        return start2token_id, end2token_id

    def filter_predefined_ents(self, custom_ents, predefined_ents):
        filtered_ents = []
        custom_ents_pos = [(ent.start, ent.end) for ent in custom_ents]

        for ent in predefined_ents:
            if ent.label_ in self.labels_to_exclude:
                continue
            overlap = False
            for cust_ent_start, cust_ent_end in custom_ents_pos:
                if ent.start < cust_ent_end and ent.end > cust_ent_start:
                    overlap = True
                    break
            if not overlap:
                filtered_ents.append(ent)

        return filtered_ents

    def __call__(self, doc):
        try:
            payload = {"text": doc.text, "labels": self.labels_to_add}
            entities = requests.post(self.url, json=payload, headers=self.headers).json()['entities']
        except Exception:
            return doc
        start2token_id, end2token_id = self.get_positions(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for ent in entities:
            # Generate a Span representing the entity & set its label;
            # skip entities that do not align with token boundaries
            ent_start = ent['start']
            ent_end = ent['end']
            if ent_start in start2token_id:
                start = start2token_id[ent_start]
            else:
                continue
            if ent_end in end2token_id:
                end = end2token_id[ent_end]
            else:
                continue
            entity = Span(doc, start, end, label=ent['entity'].upper())
            spans.append(entity)
        # Drop predefined entities that overlap with the new spans
        filtered_ents = self.filter_predefined_ents(spans, doc.ents)
        doc.ents = spans + filtered_ents
        return doc  # don't forget to return the Doc!
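The get_positions helper exists because spaCy Span objects are built from token indices, while the API returns character offsets. The same mapping idea can be shown with a plain whitespace tokenizer instead of spaCy (a sketch of the technique, not the component's code; the sample text is illustrative):

```python
import re

def char_to_token_maps(text):
    # char offset of a token's first char -> token index;
    # char offset just past its last char -> token index + 1
    start2tok, end2tok = {}, {}
    for i, match in enumerate(re.finditer(r'\S+', text)):
        start2tok[match.start()] = i
        end2tok[match.end()] = i + 1
    return start2tok, end2tok

start2tok, end2tok = char_to_token_maps("Pfizer sells Paxlovid")
# A returned entity covering chars 13..21 ("Paxlovid") maps to tokens [2, 3)
print(start2tok[13], end2tok[21])  # 2 3
```

If an entity's character boundaries do not land exactly on token boundaries, the lookup fails, which is why the component above skips such entities with continue.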

labels_to_add = ['drug', 'pharma company', 'disease']
labels_to_exclude = ['WORK_OF_ART', 'NORP', 'PERSON', 'ORG']
API_KEY = "your_api_key"
nlp.add_pipe("knowledgator_ner", config={"API_KEY": API_KEY, "labels_to_add": labels_to_add, "labels_to_exclude": labels_to_exclude})

You need to enter your API key, which you can get by registering on RapidAPI and subscribing to our API. You also set the list of custom labels you need to add; the more labels you add, the more time the system needs for processing. Additionally, you can specify which original labels you want to exclude. This can be helpful if the original labels interfere with the results of our zero-shot NER or if the Spacy model demonstrates poor performance on them.
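The merging logic keeps every zero-shot span and drops any built-in entity that either carries an excluded label or overlaps one of the new spans. On plain (start, end, label) tuples, the rule reduces to a standard interval-intersection check (a sketch of the same logic, not spaCy code; the example entities are made up):

```python
def merge_entities(custom, predefined, labels_to_exclude):
    # keep a predefined entity only if its label is allowed and its
    # [start, end) token interval overlaps no custom entity
    kept = []
    for start, end, label in predefined:
        if label in labels_to_exclude:
            continue
        overlaps = any(start < c_end and end > c_start
                       for c_start, c_end, _ in custom)
        if not overlaps:
            kept.append((start, end, label))
    return custom + kept

custom = [(0, 2, "DRUG")]  # e.g. a drug name spanning tokens 0-2
predefined = [(0, 1, "PERSON"), (5, 6, "DATE"), (8, 9, "ORG")]
print(merge_entities(custom, predefined, {"ORG"}))
# [(0, 2, 'DRUG'), (5, 6, 'DATE')]
```

Here the PERSON entity is dropped because it overlaps the DRUG span (the exact mislabeling we saw earlier), ORG is dropped by the exclusion list, and DATE survives untouched.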

Let’s visualize the results of the custom NER pipeline.

Results of entity recognition augmenting Spacy model with Knowledgator NER module

Hooray! Now we are able to recognize entities such as pharmaceutical drug names.


Open-Source ML research company focused on developing fundamental encoder-based models for information extraction. https://knowledgator.com/