Text preprocessing for large language models.
We show which data large language models use and how to preprocess your own text data.
LLM dataset
To train a large language model, a very large dataset is needed. Most large language models have in common that they are trained largely on website data. We take a closer look at the dataset of what is currently the best-known language model, LLaMA by Meta/Facebook (LLaMA Paper).
The most important facts about the dataset used for LLaMA training are:
Different sources
The dataset consists of a mixture of data from different sources, including CommonCrawl (en), C4, GitHub, Wikipedia, Gutenberg and Books3, ArXiv, and Stack Exchange. The data is relatively diverse and includes not only regular website data but also books, code, and scientific texts.
Publicly available data
The data used to train LLaMA comes from publicly accessible datasets, which is important with regard to licensing terms and other legal considerations.
Different languages
Most of the data used is in English. In the CommonCrawl and C4 datasets, for example, non-English pages are removed. The scientific papers are also mainly in English, as are the comments in GitHub code and the posts on Stack Exchange. Only parts of Wikipedia cover 20 different languages, including German, Spanish, French, and Italian.
Data preprocessing
Complex preprocessing steps are performed for each dataset. These include deduplication (removing similar texts), filtering for English language (except Wikipedia), filtering out low-quality content, removing hyperlinks or comments, and filtering code by licenses.
In summary, the LLaMA training data combines several sources, each of which undergoes extensive preprocessing to produce a high-quality dataset. The model therefore has knowledge of a wide range of languages, albeit only to a very limited extent.
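These preprocessing steps are dataset-specific and fairly involved. As a rough illustration of two of them, here is a minimal sketch of hash-based deduplication and language filtering. The example documents are placeholders and the langdetect package is an assumption; real pipelines use far more sophisticated near-duplicate detection and quality heuristics.
import hashlib
from langdetect import detect  # assumption: the langdetect package is installed

# Toy example documents (placeholders)
documents = [
    "This is a short example document.",
    "This is a short example document.",      # exact duplicate
    "Dies ist ein kurzes Beispieldokument.",  # non-English
]

seen_hashes = set()
filtered = []
for doc in documents:
    # Deduplication: hash the normalized text and skip anything we have already seen
    h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
    if h in seen_hashes:
        continue
    seen_hashes.add(h)
    # Language filtering: keep only documents detected as English
    if detect(doc) != "en":
        continue
    filtered.append(doc)

print(filtered)  # only the first document remains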
Open-source datasets
The individual components of the dataset for LLaMA training are publicly available, but the dataset itself after preprocessing is not. If you want to compile your own dataset, the links to the individual components are listed here:
You will repeatedly encounter the term Huggingface here: Huggingface is a company specializing in the development of tools and libraries for Natural Language Processing (NLP). Its best-known library, Transformers, enables the implementation of state-of-the-art NLP models. In addition, it also provides a platform where pre-trained models and datasets can be shared by the community.
| Dataset | Links |
|---|---|
| CommonCrawl | CommonCrawl GitHub |
| C4 | GitHub - Huggingface - Washington Post analysis |
| GitHub | Google BigQuery - Info Google Codelabs |
| Wikipedia | Wikipedia Dumps - Huggingface |
| Books | Huggingface Pile Books3 - Gutenberg - PG19 |
| arXiv | arxiv |
| StackExchange | archive - Huggingface |
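Many of the components in the table above can also be loaded directly via Huggingface's datasets library. A minimal sketch, assuming the datasets package is installed and that the English portion of C4 is still available under the ID allenai/c4; streaming avoids downloading the full corpus:
from datasets import load_dataset  # assumption: the datasets package is installed

# Stream the English portion of C4 instead of downloading the full corpus;
# the dataset ID and config name are assumptions and may change over time
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Look at the first document
sample = next(iter(c4))
print(sample["text"][:200])
Streaming is useful here because the full web corpora are several terabytes in size and usually only need to be iterated over, not stored locally.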
There are various projects that have built a dataset comparable to that of the LLaMA network:
RedPajama dataset on Huggingface
The disadvantage here is that, again, mainly English texts were used; only the Wikipedia articles are in other languages.
In addition, a language model comparable to the LLaMA models was trained, called Falcon, and its dataset was also published: Huggingface
The advantage over the datasets presented above is that, for the large variant of the network, a further dataset was used (unfortunately not published) in which European languages account for 7%. Of this, 26% is German, which corresponds to a total German share of 1.82% and is of course still rather low.
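Both the RedPajama dataset and the published Falcon dataset (RefinedWeb) can be inspected in the same way as above with Huggingface's datasets library; the dataset IDs below are assumptions and may change over time:
from datasets import load_dataset  # assumption: the datasets package is installed

# Dataset IDs are assumptions; both datasets are published on Huggingface
redpajama = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train", streaming=True)
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# Inspect one record of each to see which fields (text, metadata, source, ...) they provide
print(next(iter(redpajama)).keys())
print(next(iter(refinedweb)).keys())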
Data generation
The web datasets are created through web scraping via crawling: large quantities of web pages are downloaded and the text is extracted. In Europe, there is also a project called OSCAR (Open Super-large Crawled Aggregated coRpus), which produces a large open-source dataset (OSCAR dataset).
In general, scraping must be approached carefully, and there are currently still legal concerns. In the case of Stable Diffusion, for example (an AI for image generation), an opt-out procedure was introduced only recently so that images are not used for training (MIT Technology Review). It is also becoming apparent that scraping is viewed critically by website operators and that access should no longer be free of charge (Heise article about StackExchange, t3n article about Reddit). If you want to build your own web scraping pipeline, this can be done using existing tools such as Scrapy, StormCrawler, or similar.
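As a starting point, a Scrapy spider could look like the following sketch; the start URL and the assumption that the relevant text sits in <p> tags are placeholders for your own use case:
import scrapy  # assumption: Scrapy is installed


class TextSpider(scrapy.Spider):
    """Minimal sketch: collect paragraph text from a start page."""

    name = "text_spider"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Assumption: the relevant text is contained in <p> tags
        paragraphs = response.css("p::text").getall()
        yield {"url": response.url, "text": " ".join(paragraphs)}
Saved as text_spider.py, this can be run with scrapy runspider text_spider.py -o texts.json; a real pipeline would additionally follow links, respect robots.txt, and deduplicate the results.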
Own data
How can we bring our own documents into a large language model? In general, there are two approaches: either the texts are trained into the language model, or a general language model is used together with text embeddings, in which case passages similar to the query are retrieved from the existing documents and passed to the language model. In the first step, we show how to train the data into the language model.
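To make the second, embedding-based approach concrete, here is a minimal sketch of retrieval by cosine similarity, assuming the sentence-transformers package; the model name and example texts are placeholders:
from sentence_transformers import SentenceTransformer, util  # assumption: sentence-transformers is installed

# Placeholder model name; any sentence embedding model can be used here
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

documents = [
    "Faust is a tragic play by Johann Wolfgang von Goethe.",
    "The play begins with a prologue in the theatre.",
]
query = "Who wrote Faust?"

doc_emb = model.encode(documents, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank the documents by cosine similarity and keep the best match
scores = util.cos_sim(query_emb, doc_emb)[0]
best = documents[int(scores.argmax())]
print(best)  # this passage would then be passed to the language model together with the query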
HTML
import requests
from bs4 import BeautifulSoup

URL = "https://www.gutenberg.org/files/2229/2229-h/2229-h.htm"
# Download the .html page
r = requests.get(URL)
soup = BeautifulSoup(r.content, "html.parser")
# Keep only the <p> tags, which removes the table of contents, the headings, ...
text = soup.find_all("p")
text = [p.text for p in text]
# Remove the introduction at the beginning
text = text[5:]
print(" ".join(text[:1]))
Output
DIREKTOR. Ihr beiden, die ihr mir so oft, In Not und Trübsal, beigestanden, Sagt, was ihr wohl in deutschen Landen Von unsrer Unternehmung hofft? [...]
Subsequent data preprocessing:
# Filter the text so that only the conversations, and no further introductions etc., are used:
final_text = []
for raw_text in text:
    split_txt = raw_text.lstrip().split("\r\n", 1)
    # Speaker names are written entirely in upper case (e.g. "DIREKTOR.")
    if split_txt[0].replace(".", "").isupper():
        final_text.append("\n".join(split_txt))
final_text[:2]
Output
['DIREKTOR.\nIhr beiden, die ihr mir so oft,\r\n [...]', 'DICHTER.\nIhr fühlet nicht, wie schlecht ein solches Handwerk sei!\r\n [...]']
EPUB
import requests
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

URL = "https://www.gutenberg.org/ebooks/2229.epub.noimages"
r = requests.get(URL)
with open('Faust.epub', 'wb') as f:
    f.write(r.content)

book = epub.read_epub("Faust.epub")
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
# From here on, comparable to html files
final_text = []
for item in items:
    soup = BeautifulSoup(item.get_body_content().decode('utf-8'), "html.parser")
    # Keep only the <p> tags, which removes the table of contents, the headings, etc.
    text = soup.find_all("p")
    text = [p.text for p in text]
    # Filter the text so that only the conversations, and no further introductions etc., are used:
    for raw_text in text:
        split_txt = raw_text.lstrip().split("\n", 1)
        if split_txt[0].isupper():
            final_text.append("\n".join(split_txt))
final_text[:2]
Output
['DIREKTOR.\nIhr beiden, die ihr mir so oft,\nIn Not und Trübsal, beigestanden,\n [...]', 'DICHTER.\nIhr fühlet nicht, wie schlecht ein solches Handwerk sei!\n [...]']
For the PDF, we use a PDF printout of the website used above. Here the text could be extracted relatively easily, and some sections were omitted for simplification. The effort required to process PDF files is in general significantly greater, especially when the text is not embedded; in that case, an OCR conversion would first have to be performed (a minimal OCR sketch follows after the PDF example below).
def get_sprecher(text_list: list) -> list:
    """Returns the individual speakers. The beginning (before the first speaker) is discarded!"""
    # Collect the indices of lines written entirely in upper case (the speaker names)
    sprecher_list = []
    for i, text in enumerate(text_list):
        if text.isupper():
            sprecher_list.append(i)
    # Merge each speaker line with the lines up to the next speaker
    merged_sprecher = []
    for sprecher, _ in enumerate(sprecher_list):
        try:
            merged_sprecher.append("\n".join(text_list[sprecher_list[sprecher]: sprecher_list[sprecher + 1]]))
        except IndexError:
            # Last speaker on the page: take everything up to the end
            merged_sprecher.append("\n".join(text_list[sprecher_list[sprecher]:]))
    return merged_sprecher

import fitz

doc = fitz.open("Faust.pdf")
alle_konversationen = []
for page in range(3, 130):
    text = doc[page].get_text()
    # Cut off everything up to the page marker ("<page>/139") and split the rest into lines
    text = text.split("/139")[1].lstrip().split("\n")
    alle_konversationen.extend(get_sprecher(text))
alle_konversationen[:2]
Output
['DIREKTOR.\nIhr beiden, die ihr mir so oft,\nIn Not und Trübsal, beigestanden,\n [...]', 'DICHTER.\nIhr fühlet nicht, wie schlecht ein solches Handwerk sei!\n [...]']
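If a PDF contains only scanned pages without embedded text, OCR is needed before any of the steps above. A minimal sketch, assuming the pdf2image and pytesseract packages (plus the poppler and tesseract system dependencies, including the German language data) are installed; the scanned file name is hypothetical:
from pdf2image import convert_from_path  # assumption: pdf2image and poppler are installed
import pytesseract                       # assumption: pytesseract and tesseract (with German data) are installed

# Render every page of the (hypothetical) scanned PDF to an image and run OCR on it
pages = convert_from_path("Faust_scanned.pdf")
ocr_text = [pytesseract.image_to_string(page, lang="deu") for page in pages]

print(ocr_text[0][:200])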
In our next blog post we show how the preprocessed data can be used to adapt a large language model to Faust.