Collect structured data from the Internet using Web Scraper and LLMs.

Knowledgator Engineering
Oct 1, 2023

Web articles are one of the primary sources of information for market research, competitor analysis, and customer feedback reviews. Progress in NLP, with emerging large language models, allows quick and accurate collection and structuring of information depending on your needs. The main question is how to realise it efficiently under different requirements. In this article, we will develop a simple pipeline that extracts text from web pages and transforms it into a custom table for future analysis.

Requirements

Before starting this project, make sure you have Python 3.8 or above installed on your machine. We will need the following specialized Python libraries:

  • BeautifulSoup — a library for parsing HTML and XML pages;
  • Selenium — Python bindings for Selenium WebDriver, used for automating web browser interactions;
  • LXML — a powerful XML processing library combining libxml2/libxslt with the ElementTree API;
  • Requests — a simple and elegant library that allows you to send HTTP/1.1 requests extremely easily;
  • Pandas — a library for manipulating data frames.

To install the described libraries, run the following command:

pip install selenium
pip install beautifulsoup4
pip install lxml
pip install requests
pip install pandas

Additionally, depending on your preferred browser, you should install a web driver. For Chrome, download the ChromeDriver build that matches your version of Chrome and your operating system. After downloading, just unzip the archive and put the driver into the folder with the code we will create for this project. Firefox users should download geckodriver instead. If you need to unpack a tar.gz file, run the following command:

tar -xf archive.tar.gz

where archive.tar.gz is the path to your downloaded archived web driver.
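On Linux or macOS, you may also need to mark the unpacked driver as executable (shown here for geckodriver; the same applies to an unzipped chromedriver):

chmod +x geckodriver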

If you run into any difficulties, don’t worry: we will propose an alternative approach below that offers even more options.

How to extract text from a Web page

The first step is to initialise the driver. The code at this stage differs slightly depending on the web driver you use. Below is the version for Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")

# Setting up WebDriver with Chrome options
# (Selenium 4 removed executable_path; the driver path is passed via Service)
driver_path = "chromedriver.exe"  # Update this path
driver = webdriver.Chrome(service=Service(driver_path), options=chrome_options)

In the case of Firefox, the code will look like this:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from bs4 import BeautifulSoup

# Set up Firefox options
firefox_options = Options()
firefox_options.add_argument("--headless")

# Setting up WebDriver with Firefox options
driver_path = "geckodriver"  # Update this path
driver = webdriver.Firefox(service=Service(driver_path), options=firefox_options)
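As a side note: if you use Selenium 4.6 or newer, the bundled Selenium Manager can usually download and locate a matching driver automatically, so a minimal setup without an explicit driver path may be enough:

# Selenium 4.6+ only: Selenium Manager resolves the driver automatically
driver = webdriver.Chrome(options=chrome_options)
# or: driver = webdriver.Firefox(options=firefox_options)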

To extract text from a web page, we developed the following function:

def get_text_from_webpage(url):
    driver.get(url)  # access and render the page

    # Getting page source and making soup
    soup = BeautifulSoup(driver.page_source, 'lxml')

    texts = []
    # Example: extracting all the text inside <p> tags
    for p_tag in soup.find_all('p'):
        texts.append(p_tag.text)

    return '\n'.join(texts)

First, we run the web driver to render the page and obtain its source, then parse it with BeautifulSoup, and finally find every “p” (paragraph) tag and extract its text.
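A minimal usage sketch (the URL here is just a placeholder):

# Placeholder URL; replace it with the page you want to scrape
text = get_text_from_webpage("https://example.com/article")
print(text[:500])  # preview the first 500 characters
driver.quit()  # close the browser when finished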

As one can assume, such an approach is not ideal: first of all, we may collect a lot of junk text that can bias future processing by LLMs. To solve that, we have developed a web parser that fixes this issue by intelligently cleaning the text. Moreover, it also classifies the web page by topic and by the core website where the article is located, extracts keywords, detects the language, the article’s authors, the last changes, and many other things. To try it, please visit the RapidAPI page. After registration and subscription, you will get an API key that you just need to enter in the code below:

import requests

SCRAPER_URL = "https://web2meaning.p.rapidapi.com/parse"
API_KEY = "YOUR_API_KEY"

SCRAPER_HEADERS = {
    "content-type": "application/json",
    "X-RapidAPI-Key": API_KEY,
    "X-RapidAPI-Host": "web2meaning.p.rapidapi.com"
}

def get_text(url):
    # Request a cleaned article body and its category from the service
    payload = {"url": url, "clean body": True, "article category": True}

    response = requests.post(SCRAPER_URL, json=payload, headers=SCRAPER_HEADERS)
    return response.json()["body"]
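If you want the pipeline to fail loudly on HTTP errors, a small defensive variant (our suggestion, not part of the original example) could look like this:

def get_text_checked(url):
    # Same request as get_text, but surface 4xx/5xx errors early
    payload = {"url": url, "clean body": True, "article category": True}
    response = requests.post(SCRAPER_URL, json=payload, headers=SCRAPER_HEADERS)
    response.raise_for_status()  # raise if the request failed
    return response.json()["body"]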

Intelligent table extraction from text

Super: right now we have components for extracting text from a web page, so let’s build something more useful. Imagine you need to find information about emerging startups in the news and structure it into a table, with fields such as the company name, its description, key technologies, and country. In most cases we can’t build a general rule-based approach to parse such information; for that, we definitely need LLMs. But don’t worry, our team has developed a solution that can extract any table from text given just the column names. You can find it on this RapidAPI page.

To get a table from textual data, we have created the following function:

URL = "https://text2table.p.rapidapi.com/text2table"
API_KEY = "YOUR_API_KEY"


HEADERS = {
"content-type": "application/json",
"X-RapidAPI-Key": API_KEY,
"X-RapidAPI-Host": "text2table.p.rapidapi.com"
}


def construct_table(text, columns):
payload = {
"text": text,
"columns": columns
}


response = requests.post(URL, json=payload, headers=HEADERS)
result = response.json()
return result
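The exact response schema is not shown here; judging by how the result is passed to pd.DataFrame below, it is presumably a mapping from column names to lists of extracted values. A purely hypothetical illustration:

# Hypothetical response shape; the real service output may differ
result = {
    "startup name": ["Acme AI"],
    "description": ["Builds NLP tools for analysts"],
    "key technologies": ["NLP, machine learning"],
    "country": ["Estonia"],
}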

Let’s run the whole pipeline on a real example.

import pandas as pd

url = 'https://startupwiseguys.com/news/9-b2b-saas-startups-to-watch-from-the-wise-guys-saas-tallinn-2022-batch/'
text = get_text(url)
columns = ['startup name', 'description', 'key technologies', 'country']

table = construct_table(text, columns)
df = pd.DataFrame(table)
df.to_csv('startups_info.csv')

Wonderful, we got the table we wanted. Such a pipeline can easily scale to the analysis of thousands or even millions of web pages. It saves significant time, so we can focus more on analysing and exploring the data.
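As a sketch of that scaling step (the URL list below is hypothetical), the same two functions can simply be mapped over many pages and the partial tables concatenated:

# Hypothetical list of article URLs to process
urls = [
    "https://example.com/news/article-1",
    "https://example.com/news/article-2",
]

frames = []
for page_url in urls:
    page_text = get_text(page_url)
    page_table = construct_table(page_text, columns)
    frames.append(pd.DataFrame(page_table))

# Combine the per-page tables into a single dataset
df_all = pd.concat(frames, ignore_index=True)
df_all.to_csv('all_startups_info.csv')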


Knowledgator is an open-source ML research company focused on developing fundamental encoder-based models for information extraction: https://knowledgator.com/