
Why ULMFiT can still matter: training and inference times

In real-life applications, the computational efficiency of ML models is as important as evaluation metrics

Hubert Karbowy

In our previous post we briefly summarized the architecture of a ULMFiT language model and introduced our port of ULMFiT to TensorFlow. In this article we would like to focus on the model's efficiency at training and inference.

Resurrecting monolingual recurrent language models with ULMFiT for Tensorflow
Recurrent language models are still alive and kicking. We release pretrained Polish and English ULMFiT models for TensorFlow Hub.

Training efficiency

A common drawback of most transformer-based monolingual models is that they take a long time to pretrain. Training times of several days on TPUs are not uncommon if you want to obtain a reasonable-quality model from scratch (and this stretches into weeks on GPUs). You will also need a rather large corpus of source documents.

Resuming from a checkpoint of an off-the-shelf multilingual model and fine-tuning it into a monolingual one can shorten the training, but it will still probably take on the order of days on GPUs. Moreover, BERT base has 110 million parameters, and its computational complexity is quadratic in sequence length because of self-attention.
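To make the scaling difference concrete, here is a back-of-the-envelope sketch of how multiply-add counts grow with sequence length for a self-attention layer versus a recurrent layer. The hidden sizes are illustrative assumptions, not the exact configurations of the models discussed here:

```python
# Rough, illustrative operation counts: self-attention scales quadratically
# with sequence length, while a recurrent layer scales linearly.

def attention_ops(seq_len: int, hidden: int = 768) -> int:
    """Approximate multiply-adds for one self-attention layer: the QK^T
    product and the attention-weighted value sum each cost seq_len^2 * hidden."""
    return 2 * seq_len ** 2 * hidden

def rnn_ops(seq_len: int, hidden: int = 400) -> int:
    """Approximate multiply-adds for one recurrent layer: each of the
    seq_len steps pays a fixed hidden^2 cost for the recurrent matmul."""
    return seq_len * hidden ** 2

# Quadrupling the sequence length multiplies attention cost by 16,
# but recurrent cost only by 4.
for n in (128, 512):
    print(n, attention_ops(n), rnn_ops(n))
```

This ignores constants, feed-forward sublayers and hardware parallelism (attention parallelizes across time steps while an RNN does not), so it is only meant to illustrate the asymptotic gap, not to predict real wall-clock times.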

These practical limitations make it difficult to obtain not just monolingual models, but domain-specific or task-specific ones as well. For instance, at edrone we needed a model that performs well on the language typically seen in e-commerce. This is quite a challenge and requires several experiments on corpora with texts as diverse as product descriptions from hardware stores and advice columns on gardening. It is easy to see that constantly training and retraining transformers would be impractical.

In contrast, a recurrent neural network's computational cost is linear in sequence length: each time step performs a fixed amount of work determined by the number of weights. Our ULMFiT models with a vocabulary of 35,000 subwords have about 34 million parameters, less than a third of BERT base. A monolingual model can be pretrained on a general Wikipedia corpus overnight on a single GPU, once and for all.

After 20 epochs we arrived at a very decent perplexity score of about 70. Any further fine-tuning, for example to our e-commerce domain, takes less than an hour with a corpus about 10% the size of the original pretraining data. In our case, we saw perplexity scores of around 20 on various e-commerce corpora of product descriptions.
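For reference, perplexity is simply the exponential of the per-token cross-entropy, so scores like the ones above map directly back to training losses. The loss values in the comment below are illustrative, not taken from our training logs:

```python
import math

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the average per-token
    cross-entropy (measured in nats)."""
    return math.exp(cross_entropy_nats)

# A loss of ~4.25 nats/token corresponds to perplexity ~70,
# while ~3.0 nats/token corresponds to perplexity ~20.
print(round(perplexity(4.25), 1))  # -> 70.1
print(round(perplexity(3.0), 1))   # -> 20.1
```

Note that perplexities are only comparable between models that share the same vocabulary and tokenization, which is why we compare our fine-tuned model against our own pretrained baseline rather than against transformers with different subword vocabularies.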

Inference efficiency

The computational efficiency of a neural network is important not only for training, but also at inference. This text is from a Polish Wikipedia article on De bello Gallico:

"O wojnie galijskiej – pamiętniki Juliusza Cezara, opisujące 9 lat wojen galijskich (58-50 p.n.e.). Cezar napisał siedem ksiąg swojej relacji, z których każda obejmuje wydarzenia jednego roku. Wybuch wojny domowej przerwał pracę Cezara nad utworem. W późniejszym okresie Aulus Hircjusz dopisał ósmą księgę wojny galijskiej, mającą stanowić łącznik z innym dziełem Cezara, O wojnie domowej. Cezar wprowadził w swoim dziele zupełnie nowy styl pisania pamiętników. Pisał o sobie w trzeciej osobie, prostym i zwięzłym językiem, beznamiętnością zbliżoną do raportów wojskowych, nadając swoim relacjom formę corocznego sprawozdania, zamiast wychwalania własnych dokonań. Kunszt pisarski Cezara został doceniony nawet przez jego zaciekłych przeciwników politycznych, m.in. Cycerona. Oprócz opisu działań wojennych, Cezar przekazuje w swojej relacji liczne informacje o ludach zamieszkujących Galię, ich zwyczajach, kulturze i wierzeniach. Pierwsze zdanie księgi pierwszej pamiętników, brzmiące 'Gallia est omnis divisa in partes tres...' (wchodzące w skład wstępu, który ze względu na prostotę jest często stosowany w podręcznikach do łaciny jako czytanka) zostało szczególnie spopularyzowane dzięki zacytowaniu w refrenie utworu Jacka Kaczmarskiego Lekcja historii klasycznej. Fragmenty tekstu Bellum Gallicum przytacza Julian Tuwim w wierszu Nad Cezarem."

For a simple comparison, we ran this snippet through HerBERT base and Polish ULMFiT encoders. Both models tokenize the above text into approximately 290 tokens. Here are the inference times on a Quadro RTX 6000 GPU:

HerBERT base

In [37]: time herbert_base(bello)
CPU times: user 5.1 s, sys: 4.1 s, total: 9.19 s
Wall time: 392 ms

ULMFiT

In [51]: time ulmfit_rnn_encoder(repres)
CPU times: user 36.7 ms, sys: 7.35 ms, total: 44.1 ms
Wall time: 41.5 ms
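If you want to reproduce this kind of measurement, a small helper like the following averages wall-clock time over several runs after a warm-up, which excludes one-off costs such as graph tracing on the first GPU call. The encoder and input names here are placeholders, not identifiers from our codebase:

```python
import time

def time_forward_pass(encoder_fn, inputs, warmup: int = 3, runs: int = 10) -> float:
    """Average wall-clock seconds for a single forward pass.

    Warm-up calls are run first so that one-off costs (graph tracing,
    kernel compilation, memory allocation) do not skew the average.
    """
    for _ in range(warmup):
        encoder_fn(inputs)
    start = time.perf_counter()
    for _ in range(runs):
        encoder_fn(inputs)
    return (time.perf_counter() - start) / runs

# Usage sketch with a dummy stand-in for an encoder:
avg = time_forward_pass(lambda x: sum(x), list(range(1000)))
print(f"{avg * 1000:.3f} ms per call")
```

A single `%time` call, as in the listings above, is a reasonable first look, but averaging over repeated runs gives a more stable figure, especially on a GPU where the first call is often dominated by setup costs.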

To sum up, thanks to its lighter network architecture, ULMFiT offers very good value in terms of efficiency: in the comparison above, its encoder runs almost ten times faster than HerBERT base on the same text.


Hubert Karbowy

NLP engineer. Degree in computer science and linguistics. Focused on language modelling and dialog systems. Formerly an NLP engineer at Samsung, where he contributed to Bixby's development.