Kimu
October 08, 2025

A new chatbot in Basque that can be installed on in-house servers: Kimu

  • Orai has created a lightweight model that works well in Basque and can easily be adapted to the needs of companies and institutions.
  • It can be used for a wide range of tasks: to answer queries about documents, create content, do summaries and translations, correct texts, etc.
  • It facilitates privacy and is more sustainable than large models. Kimu also performs well in Spanish and English.
  • Access to the website to try out this lightweight chatbot can be obtained by invitation.

Orai has developed Kimu, a chatbot for Basque designed to help companies and organizations in their daily work. The model is lightweight, so it can be installed on servers and computers within companies and organizations, enabling data privacy and confidentiality to be preserved. The model is capable of understanding and executing a variety of tasks requested by the user in natural-language Basque. “It can be used in a variety of tasks at work, such as translating and summarizing, answering queries about documents, extracting information, and correcting and adapting texts,” explained Xabier Saralegi, Head of NLP Technologies at Orai. Depending on the needs of companies and organizations, the model can also be adapted to specific use cases to further improve the quality of the results. What is more, although it was created for Basque, Kimu also performs well in several other languages, such as Spanish, English, and Italian.

One major advantage of the Kimu model is its small size: with 9 billion parameters, it falls at the small end of the LLM spectrum, in the category of Small Language Models (SLMs). Open-source SLMs perform competitively in widely spoken languages (Spanish, English, etc.), but not in low-resource languages such as Basque, and those languages lack the data needed to create models like this from scratch. That is why Orai researchers are using cross-lingual transfer learning techniques, among other things, to incorporate Basque capabilities into these models.

Although SLMs are smaller than ChatGPT, DeepSeek, Claude and other large LLMs, they offer competitive quality, especially when adapted to specific needs, and on the whole provide several noteworthy advantages: they are more lightweight and faster, and require fewer resources and less energy. “The cost of the hardware needed to serve the model is significantly reduced, so these models can be installed on more economical equipment. Larger open models require much more expensive equipment, and in many tasks the improvement they offer in result quality is not that high. So, given the balance between result quality and consumption, this Basque model is unrivalled,” explained Saralegi. In addition, lightweight models of this type can be more easily customized to specific domains, and are more environmentally sustainable.

Orai has created a Beta website (https://kimu.orai.eus) to demonstrate the capacity and potential of the Kimu model. That is where users will have the chance to test the Kimu model; for the time being, access will be available by invitation.

Teaching Basque to a foundational model and combining it with a model capable of performing tasks 

Huge amounts of text data are needed to produce large language models, and for less-resourced languages it is very difficult to gather them. Taking open-weight models that already perform well in other languages as a baseline, Orai researchers are exploring various strategies to find viable solutions for Basque.

Kimu is an example of that: “We have combined a foundational model that we adapted to Basque with a post-trained model that is not adapted to Basque,” said Ander Corral, an Orai researcher. Foundational models are the kind used as the foundation for generative artificial intelligence, while instructed models are capable of understanding and executing tasks. In this way, they have managed to create a model able to follow instructions in Basque.
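The article does not detail the arithmetic behind this combination. One published way to merge an instruction-tuned model with a language-adapted base is parameter-space "task vector" (or "chat vector") arithmetic: the difference between the instructed model and the original base is added to the language-adapted base. A minimal sketch, with toy weights standing in for real model tensors:

```python
# Sketch of combining models in parameter space ("chat vector" arithmetic).
# This is NOT necessarily Orai's exact method, just one common approach:
#   adapted_instruct = adapted_base + (instruct - base)
# applied element-wise to every parameter. Weights here are toy numbers.

def merge_chat_vector(base, instruct, adapted_base):
    """Add the instruction-tuning delta (instruct - base) to the
    language-adapted base model, tensor by tensor."""
    merged = {}
    for name in base:
        merged[name] = [
            a + (i - b)
            for a, i, b in zip(adapted_base[name], instruct[name], base[name])
        ]
    return merged

# Toy "models": one flattened weight tensor each.
base         = {"w": [0.10, 0.20, 0.30]}   # original foundational model
instruct     = {"w": [0.15, 0.25, 0.35]}   # post-trained (instructed) model
adapted_base = {"w": [0.40, 0.10, 0.30]}   # foundational model adapted to Basque

merged = merge_chat_vector(base, instruct, adapted_base)
print(merged["w"])  # approximately [0.45, 0.15, 0.35]
```

The intuition is that instruction-following ability and language knowledge live in roughly separable directions of parameter space, so the instruction-tuning delta can be transplanted onto the Basque-adapted weights.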

The method requires only a single collection of texts for language adaptation: a corpus is used to teach Basque to a foundational model that does not know the language well. “During the experimentation phase we used the Zelai Haundi corpus created by Orai, a corpus of 500 million words containing only freely licensed content,” explained the Orai researchers. The experiments were conducted using Google's Gemma and Meta's Llama models. Models of this type are designed for widely spoken languages and do not perform well with low-resource languages.

Orai researchers conducted experiments not only with Basque, but also with Swahili and Welsh “to check that our method could be applied to other low-resource languages. In all languages, a significant improvement in the results of the baseline systems was achieved by using our method”, they added.

The LLM in the hands of technology companies and research centres

All the models created for Basque and the other languages have been made available on HuggingFace, a platform for sharing and using open AI models and resources. This enables technology companies and research centres to use them in development and research projects that involve understanding and generating Basque (retrieval-augmented generation (RAG), conversational agents, etc.). The research article has been accepted at EMNLP (Empirical Methods in Natural Language Processing), one of the most prestigious international conferences in the field of NLP, and the work is due to be presented in November.
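The RAGs mentioned above retrieve relevant documents and pass them to the model as context before answering. As a rough, self-contained illustration of the retrieval step only (the documents and query below are made-up examples, and real systems score with embeddings rather than word overlap):

```python
# Sketch of the retrieval step in a RAG pipeline: rank candidate
# documents by word overlap with the query and return the best match,
# which would then be given to the language model as context.
# Hypothetical toy data; production systems use embedding similarity.

def retrieve(query, documents):
    """Return the document sharing the most (lowercased) words with the query."""
    q = set(query.lower().split())
    return max(documents, key=lambda d: len(q & set(d.lower().split())))

docs = [
    "Kimu is a lightweight chatbot for Basque.",
    "The Zelai Haundi corpus contains 500 million words.",
    "Orai develops NLP technologies.",
]
best = retrieve("How many words does the Zelai Haundi corpus contain?", docs)
print(best)  # the corpus document scores highest on word overlap
```

A model like Kimu would then answer the query in Basque using the retrieved passage as grounding.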
