March 12, 2024

Potential Polish version of GPT: successful AI collaboration between Gdańsk University of Technology and OPI


Gdańsk Tech and OPI developed a Polish generative model called Qra, trained on a data corpus containing only Polish text. The corpus initially comprised about 2 TB of raw text, but cleaning and deduplication reduced it by almost half, keeping only high-quality, unique content. This is the first generative model trained on such a large Polish text resource using significant computing power. By comparison, the Llama, Mistral and GPT models are trained mainly on English data, with Polish making up only a small part of their training corpora.
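To illustrate what deduplicating a text corpus can involve, here is a minimal sketch of exact-duplicate removal. It is not the pipeline used for Qra, whose details the announcement does not describe. The helper names `normalize` and `deduplicate` and the sample sentences are hypothetical, and the sketch assumes that two documents count as duplicates when they match after lowercasing and whitespace normalization.

```python
import hashlib
import unicodedata


def normalize(text):
    """Unify Unicode form, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the normalized text so the set stores fixed-size keys.
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical sample data: the second entry differs only in case and spacing.
docs = ["Ala ma kota.", "ala  ma kota.", "Gdańsk leży nad Bałtykiem."]
unique_docs = deduplicate(docs)
```

Real corpus pipelines usually go further and also remove near-duplicates, which exact hashing like this cannot detect.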

The most complex version of the model trained at STOS in about a month

At the STOS IT Competence Center of Gdańsk University of Technology, one of the most modern IT centers in the region and home to the Kraken supercomputer, a computing environment dedicated to building artificial intelligence models was created. The process used a cluster of 21 NVIDIA A100 80GB graphics cards. The team spent about six months preparing the environment, creating tools and models, training the models (on content from fields such as law, technology, social sciences, biomedicine, religion, and sports), and testing them. Thanks to the extensive infrastructure available at STOS, the actual training of the most complex models was shortened from several years to about one month.

Qra has a good command of Polish

As a result of the collaboration between Gdańsk University of Technology and OPI, the research team created three models of different complexity (Qra 1B, Qra 7B, and Qra 13B). Qra 7B and Qra 13B achieve significantly better perplexity results than the original Llama-2-7b-hf (Meta) and Mistral-7B-v0.1 (Mistral-AI) models, meaning they are better at modeling the Polish language in terms of comprehension, the lexical layer, and grammar.

The perplexity tests were performed on the first 10,000 sentences of the PolEval-2018 test set, as well as on a further set of 5,000 long and demanding documents created in 2024.
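For readers unfamiliar with the metric, perplexity is commonly defined as the exponential of the average negative log-probability a model assigns to each token of a text; lower values mean the model finds the text less surprising. The sketch below shows only that arithmetic. The function name `perplexity` and the sample probabilities are hypothetical, and it is not the evaluation code used for Qra.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical example: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among 4 options.
uncertain = [math.log(0.25)] * 4
# A model that gives every token probability 0.9 is far less "perplexed".
confident = [math.log(0.9)] * 4

ppl_uncertain = perplexity(uncertain)
ppl_confident = perplexity(confident)
```

In practice the log-probabilities come from running a language model over the test text, and scores are only directly comparable between models that are evaluated on the same text in the same way.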

Solutions that require better language understanding

The Qra model is a basis for IT solutions that handle tasks and processes requiring a deeper understanding of the Polish language.

At the moment, Qra is a base language model that can generate grammatically and stylistically correct answers in Polish. The content it produces is of very high quality, as reflected in particular by its perplexity scores. The team is currently working on fine-tuning the model and plans to validate its capabilities in text classification, summarization and question answering.

The developed model has been made publicly available in a dedicated OPI-Gdańsk Tech repository on the Hugging Face platform, where anyone can download it and adapt it to their own domain, problem or task, such as question answering.

