PodMine
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis•December 6, 2025

Sovereign AI in Poland: Language Adaptation, Local Control & Cost Advantages with Marek Kozlowski

In this episode, Marek Kozlowski discusses Poland's sovereign AI strategy with Project PLLuM, focusing on creating small, locally-adapted language models that preserve Polish cultural nuances, offer cost advantages, and provide on-premise solutions for businesses and government sectors.
AI & Machine Learning
Tech Policy & Ethics
Developer Culture
Nathan Labenz
Marek Kozlowski
Google DeepMind
National Information Processing Institute of Poland
PKO Bank Polski

Summary Sections

  • Podcast Summary
  • Speakers
  • Key Takeaways
  • Statistics & Facts
  • Compelling Stories (Premium)
  • Thought-Provoking Quotes (Premium)
  • Strategies & Frameworks (Premium)
  • Similar Strategies (Plus)
  • Additional Context (Premium)
  • Key Takeaways Table (Plus)
  • Critical Analysis (Plus)
  • Books & Articles Mentioned (Plus)
  • Products, Tools & Software Mentioned (Plus)

Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.


Podcast Summary

In this episode, Nathan explores Polish AI sovereignty with Marek Kozlowski, head of the AI Lab at Poland's National Information Processing Institute. Marek discusses Project PLLuM (Polish Large Language Models), which aims to create smaller, localized AI models that compete with frontier models by focusing on Polish language, culture, and values. (03:17) The conversation delves into how 90% of training data for major models is English and Chinese, leaving Polish with only about 1% representation. (08:24) Marek explains their strategy of language adaptation on base models like LLaMA and Mistral, combined with organic human-curated instruction data, to achieve competitive performance for Polish use cases while maintaining transparency, sovereignty, and cost advantages.

• Main themes: AI sovereignty through localized models, regulatory challenges in the EU, the technical approach of language adaptation, and the strategic importance of maintaining national AI capabilities

Speakers

Marek Kozlowski

Marek Kozlowski serves as head of the AI Lab at Poland's National Information Processing Institute, where he leads Project PLLuM (Polish Large Language Models). He spearheads Poland's national AI sovereignty initiative, focusing on developing transparent, locally-controlled AI models adapted for Polish language and culture. His work involves coordinating a consortium of six to eight institutes and universities funded by Poland's Ministry of Digital Affairs to create competitive alternatives to global AI models.

Key Takeaways

Focus on Localized Models Over General Purpose Giants

Rather than competing directly with frontier models on general capabilities, Poland's strategy focuses on creating smaller, domain-specific models that excel in Polish language and cultural contexts. (05:11) Marek argues that for specific business and government use cases, a well-trained 8B parameter model can match the performance of much larger cloud-based models when fine-tuned for particular tasks. This approach offers better cost control, data privacy, and regulatory compliance while serving actual user needs more effectively than general-purpose models.
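
To make this concrete, here is a minimal sketch of what task-specific fine-tuning of a small open model could look like, assuming a Hugging Face transformers plus peft (LoRA) stack. The base checkpoint, data file, and hyperparameters are illustrative assumptions, not details given in the episode.

```python
# Minimal sketch: LoRA fine-tuning of a small open model for one narrow task.
# Base checkpoint, data file, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # assumed ~8B base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small adapter matrices, keeping on-premise GPU requirements modest.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Hypothetical JSONL file with one {"text": "<instruction>\n<answer>"} record per line.
data = load_dataset("json", data_files="single_task_examples.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out-task", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

One practical consequence of the adapter approach is that several such narrowly fine-tuned task adapters can share a single base model at inference time, which fits the constrained on-premise deployments described later in the episode.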

Language Adaptation Preserves Capabilities While Adding Local Knowledge

The PLLuM project uses "language adaptation": continued pre-training of base models such as LLaMA on Polish text corpora. (59:40) This technique largely preserves the model's existing capabilities in other languages and domains while significantly improving its Polish language understanding and cultural knowledge. Though some forgetting occurs, the models retain competency in English and other areas while gaining native-level fluency in Polish idioms, cultural references, and domain-specific knowledge.
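
The episode does not walk through the training code, but under the assumption of a standard Hugging Face causal-LM setup, language adaptation amounts to resuming next-token-prediction training on a Polish corpus. The checkpoint name, corpus file, and settings in this rough sketch are assumptions, not PLLuM's actual recipe.

```python
# Rough sketch of language adaptation as continued pre-training on a Polish corpus.
# Checkpoint name, corpus file, and settings are assumptions, not PLLuM's actual recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Plain-text Polish corpus, one document per line, already deduplicated and filtered.
corpus = load_dataset("text", data_files={"train": "polish_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                    remove_columns=["text"])

# Same next-token objective as the original pre-training, just on new data;
# a conservative learning rate limits (but does not eliminate) forgetting.
Trainer(
    model=model,
    args=TrainingArguments(output_dir="out-adapted", per_device_train_batch_size=2,
                           gradient_accumulation_steps=32, learning_rate=1e-5,
                           num_train_epochs=1, bf16=True),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```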

Organic Human-Curated Data Beats Synthetic Instructions

Unlike many AI projects that rely heavily on synthetic data generation, PLLuM emphasizes "organic" instruction and preference data created and validated by humans. (19:17) Marek's team employs hundreds of annotators to create manual instructions and preferences, believing this approach produces higher linguistic quality than synthetic alternatives. This human-centric data curation is resource-intensive but crucial for achieving native-level language generation and avoiding the degradation that comes from low-quality synthetic training data.
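
PLLuM's actual data schema is not described in the episode; purely as an illustration, human-written instruction and preference records are often stored as simple JSON-style objects like the ones below. The field names and Polish examples are assumptions.

```python
# Illustrative only: one plausible shape for human-written ("organic") training records.
# Field names and example content are assumptions; the episode does not give PLLuM's schema.
from datasets import Dataset

# Supervised instruction record: a prompt and a reference answer written by an annotator.
instruction_records = [
    {"prompt": "Wyjaśnij, czym jest umowa o dzieło.",       # "Explain what a specific-task contract is."
     "response": "Umowa o dzieło to umowa cywilnoprawna, w której ..."},
]

# Preference record: two candidate answers ranked by an annotator,
# usable for DPO- or RLHF-style preference tuning.
preference_records = [
    {"prompt": "Napisz oficjalny e-mail do urzędu gminy.",  # "Write a formal e-mail to the municipal office."
     "chosen": "Szanowni Państwo, zwracam się z uprzejmą prośbą o ...",
     "rejected": "Hej, mam sprawę do załatwienia ..."},
]

sft_dataset = Dataset.from_list(instruction_records)
preference_dataset = Dataset.from_list(preference_records)
print(sft_dataset)
print(preference_dataset)
```

The point Marek emphasizes is not the format but the provenance: these records are written and validated by human annotators rather than generated synthetically.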

AI Sovereignty Requires Transparency and Local Control

Beyond just having national AI models, true sovereignty demands transparency in training processes, data sources, and model architectures. (16:02) PLLuM publishes detailed "cookbook" documentation of their training process and open-sources samples of their datasets, going beyond just releasing model weights. This transparency enables other countries to replicate their approach and ensures that Poland maintains genuine control over its AI infrastructure rather than depending on black-box systems from global providers.

Business Use Cases Don't Require Frontier-Scale Models

Most business and government applications need models that excel at 10-20 specific tasks rather than general-purpose capabilities across thousands of tasks. (44:15) For on-premise deployments where organizations face GPU and energy constraints, smaller fine-tuned models often outperform few-shot approaches with large cloud models. This insight suggests that the future of enterprise AI may favor specialized local models over massive general-purpose systems, especially in regulated industries and government applications.

Statistics & Facts

  1. 90% of training data for major language models consists of English and Chinese text, with Polish representing only about 1% of typical training corpora. (08:24) This massive imbalance means that Polish speakers receive significantly degraded performance compared to English and Chinese users.
  2. Poland has compiled several hundred billion tokens of Polish text data after deduplication and filtering, but falls short of the trillion tokens typically needed for training large models from scratch. (39:54) This data constraint forces them to use language adaptation techniques rather than full pre-training.
  3. For effective domain adaptation, organizations need at least 10 billion tokens of clean, domain-specific data, which typically requires 30-40 billion tokens before deduplication and filtering, as the sketch after this list illustrates. (65:27) This threshold means only very large companies have sufficient data for meaningful domain adaptation projects.
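
To show why raw corpora shrink this much, here is a toy sketch of exact deduplication plus simple quality filters. The heuristics and thresholds are generic assumptions, not PLLuM's actual data pipeline.

```python
# Toy sketch: exact deduplication plus simple quality filters over a raw corpus.
# Heuristics and thresholds are generic assumptions, not PLLuM's actual pipeline.
import hashlib

def clean_corpus(documents):
    seen, kept = set(), []
    raw_tokens = clean_tokens = 0
    for doc in documents:
        tokens = doc.split()  # crude whitespace "token" count
        raw_tokens += len(tokens)
        # Exact-match deduplication on whitespace- and case-normalized text.
        key = hashlib.sha1(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        # Simple quality filters: minimum length and mostly alphabetic content.
        alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in doc) / max(len(doc), 1)
        if len(tokens) < 50 or alpha_ratio < 0.8:
            continue
        kept.append(doc)
        clean_tokens += len(tokens)
    return kept, raw_tokens, clean_tokens

# Tiny demo; at real scale, 30-40 billion raw tokens might reduce to roughly 10 billion.
docs = ["To jest przykładowy dokument o modelach językowych. " * 20] * 2 + ["zbyt krótki tekst"]
kept, raw, clean = clean_corpus(docs)
print(f"raw tokens: {raw}, clean tokens: {clean}")
```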

Compelling Stories

Available with a Premium subscription

Thought-Provoking Quotes

Available with a Premium subscription

Strategies & Frameworks

Available with a Premium subscription

Similar Strategies

Available with a Plus subscription

Additional Context

Available with a Premium subscription

Key Takeaways Table

Available with a Plus subscription

Critical Analysis

Available with a Plus subscription

Books & Articles Mentioned

Available with a Plus subscription

Products, Tools & Software Mentioned

Available with a Plus subscription

More episodes like this

  • Tetragrammaton with Rick Rubin: "Joseph Nguyen" (January 14, 2026)
  • Finding Mastery with Dr. Michael Gervais: "How To Stay Calm Under Stress | Dan Harris" (January 14, 2026)
  • The James Altucher Show: "From the Archive: Sara Blakely on Fear, Failure, and the First Big Win" (January 14, 2026)
  • In Good Company with Nicolai Tangen: "Figma CEO: From Idea to IPO, Design at Scale and AI’s Impact on Creativity" (January 14, 2026)