
Research


Research lines

The emergence of neural language models has brought about a paradigm shift in natural language processing. Neural language models are trained on massive collections of text, from which they acquire generic knowledge of language. This generic knowledge can then be reused: a pre-trained model can be fine-tuned to perform specific language-processing tasks, needing far less task-specific training data while still achieving very good results. In addition, multilingual neural language models can be fine-tuned using examples from a single language, and the resulting model is capable of handling the same task in other languages.
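
The sketch below illustrates this pre-train-then-fine-tune recipe with the open-source Hugging Face transformers library; the model and dataset names are illustrative assumptions, not the resources used in our research.

```python
# Minimal transfer-learning sketch: reuse a pre-trained multilingual encoder
# and fine-tune it on a small labelled dataset for a specific task.
# Model and dataset choices are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # multilingual encoder pre-trained on ~100 languages
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A small task-specific dataset often suffices, because the generic
# linguistic knowledge is already in the pre-trained weights.
train = load_dataset("imdb", split="train[:1000]")
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
```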

The main research lines we have in progress are: 

  • Evaluation of neural language models.
  • Transfer learning so that neural language models can learn specific tasks.
  • Transfer learning between languages.
  • Neural language models for low-resource languages.


In the digital age, being able to extract structured information from sources encoded in human language is hugely important. Extracting that knowledge from today's massive volumes of information (big data) opens up new possibilities: conducting macro analyses, offering innovative ways of consuming information and facilitating decision-making processes. Our research focuses on NLU (Natural Language Understanding) tasks such as text classification, entity extraction, opinion mining and question answering. In recent years, neural approaches have been applied very successfully to NLU tasks, and they are the techniques we use on a day-to-day basis.
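
As a small illustration of one such task, the sketch below runs entity extraction with a pre-trained neural model through the Hugging Face pipeline API; the example sentence is invented and the default pipeline model is an assumption, not one of our own systems.

```python
# Hypothetical sketch: named-entity extraction with a pre-trained neural model.
from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner = pipeline("ner", aggregation_strategy="simple")
for entity in ner("Orai develops language technology in the Basque Country."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))
```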

The main research lines we have in progress are: 

  • Multilingual search systems.
  • Question-answering systems.
  • Emotion analysis.
  • Semantic metadata extraction.
  • Big data surveillance systems.


In this multilingual, global context, machine translation systems are going from strength to strength. The growth of neural networks over recent years has produced an unprecedented qualitative leap in translation quality, opening up opportunities to develop more intelligent systems capable of detecting shades of meaning.

That is why our research aims to develop state-of-the-art machine translation systems. To do this, we use the latest neural paradigms to build both monolingual and multilingual systems. These neural paradigms require large quantities of data for training, so data extraction, filtering and cleaning are essential to obtaining quality corpora. We are aware of how important it is to personalise systems so that they can be adapted to users' needs; that is why domain specialisation and specialised terminology are among our priorities. Finally, most of today's systems translate each sentence in isolation, without taking into account the wider context in which it appears, which is why we also work on whole-document translation.
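
For illustration, sentence-level translation with a public pre-trained neural checkpoint takes only a few lines; the Helsinki-NLP/opus-mt-en-es model below is an illustrative public model, not one of our production systems.

```python
# Hypothetical sketch: sentence-level neural machine translation
# with a public pre-trained checkpoint.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
result = translator("Neural networks have transformed machine translation.")
print(result[0]["translation_text"])
```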

The main research lines we have in progress are:

  • Gender bias analysis
  • Whole document translation
  • Integration of specialised terminology
  • Data filtering and cleaning
  • Domain specialisation
  • Multilingual translation


Dialogue assistants are of two types: those that aim to hold as natural a conversation as possible and those that aim to carry out commands and operations. The former tend to be used for leisure purposes. The latter, by contrast, help people perform specific tasks; for example, completing administrative formalities, making purchases or answering questions. Companies and public administrations increasingly offer this second type of dialogue assistant in order to provide their customers or the general public with a better service.

Dialogue systems are built around a number of components: user intention detection, dialogue context tracking, and language understanding and generation. Today, neural architectures are being successfully used to implement these components.
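
The sketch below shows one way to detect user intention when little or no labelled training data is available: zero-shot classification with a pre-trained entailment model. The model name, the utterance and the intent labels are illustrative assumptions.

```python
# Hypothetical sketch: zero-shot user-intent detection, useful when
# labelled training data is scarce.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
utterance = "I would like to renew my identity card."
intents = ["administrative procedure", "purchase", "general question"]
result = classifier(utterance, candidate_labels=intents)
print(result["labels"][0], result["scores"][0])  # top-scoring intent
```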

The main research lines we have in progress are: 

  • User intention detection.
  • Strategies based on limited training data.
  • Transfer learning between languages.


Speech processing is about making computers capable of handling speech, and one of its core tasks is ASR (Automatic Speech Recognition).

In speech recognition we explore automatic transcription and subtitling systems that go beyond those that only perform well under favourable conditions. We therefore work, in two languages, on methods to develop ASR systems designed to transcribe audio in local varieties of speech and in informal registers, as well as systems that work in noisy environments (for example, for interacting with Industry 4.0 machines via speech).

We are also working on personalisation, so that transcribers supplied with local words, place names and proper names can transcribe them correctly, and on live transcription and subtitling, which are tremendously useful in all kinds of sessions, video calls and courses. Another aim is to enable people with mobility disabilities to use ASR as a dictation tool, geared in particular towards education and children. Finally, we are working on speaker identification, so that the person who made each utterance can be automatically tagged in the subtitles or transcription.
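
A minimal transcription sketch with a public pre-trained model is shown below; the openai/whisper-small checkpoint and the audio file name are illustrative assumptions (decoding audio from disk also requires ffmpeg to be installed).

```python
# Hypothetical sketch: transcribing an audio file with a public ASR checkpoint.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
transcription = asr("meeting_recording.wav")  # path to a local audio file
print(transcription["text"])
```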

The main research lines we have in progress are:

  • Personalised speech recognition
  • Speech recognition in local forms of speech 
  • Speech recognition in non-formal registers
  • Speech recognition in noisy, industrial environments
  • Infant speech recognition
  • Systems geared towards dictation (for accessibility)
  • Live transcription and subtitling
  • Identification of speakers


Speech processing is about making computers capable of processing speech. One way of doing this involves speech synthesis, that is, generating speech from text (TTS, Text-to-Speech).

We have various lines of research on speech synthesis up and running. One of our aims is to achieve voice cloning with less and less material by using multi-speaker network systems; one of the main challenges right now is to achieve high-quality synthesis of a speaker's voice from a single sentence uttered by that speaker. We are also exploring cross-lingual techniques that make it possible to change the language of any voice: given a few sentences spoken by a voice in one language, we aim to synthesise that voice speaking a different language. To tackle the gender bias of virtual assistants, we have created a prototype voice of ambiguous gender, and one of our challenges is to improve its quality. Finally, we aim to incorporate emotion into synthesis systems: most current systems use a neutral style, which limits their usefulness for dubbing, and we want to prevent the loss of style in dubbing by transmitting emotion and expressiveness.
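
As an illustration of few-shot voice cloning and cross-lingual synthesis, the sketch below uses the open-source Coqui TTS library and its public XTTS model; the reference recording and the model choice are assumptions, not our own prototype.

```python
# Hypothetical sketch: cloning a voice from a short reference recording and
# synthesising speech with it, using the public Coqui XTTS model.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This voice was cloned from a few seconds of reference audio.",
    speaker_wav="reference_speaker.wav",  # short sample of the target voice
    language="en",                        # can differ from the reference language
    file_path="cloned_voice.wav",
)
```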

The main research lines we have in progress are:

  • Personalised speech synthesis
  • Neutral speech synthesis
  • Speech synthesis with emotion
  • Voice imitation using small samples


The process of producing text has been changing significantly in recent years, and computer tools that assist with writing are increasingly used. Automatic checkers are among these tools: they detect errors in a text and present possible corrections to the user. Checkers operate on various levels: spelling, lexis, grammar or style. They are very effective tools in the text-production process, in particular for ensuring high-quality text.

The main research lines we have in progress are: 

  • Neural grammar checking based on synthetic data (see the sketch below).
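
Training data for this line can be manufactured by corrupting correct sentences. The sketch below generates such synthetic error-correction pairs; the corruption operations and the example sentence are simple illustrative assumptions, far cruder than the error models used in practice.

```python
# Hypothetical sketch: generating synthetic training pairs for a neural
# grammar checker by injecting artificial errors into correct sentences.
import random

def corrupt(sentence: str) -> str:
    """Introduce a simple synthetic error: drop a word or swap two neighbours."""
    words = sentence.split()
    if len(words) < 3:
        return sentence
    i = random.randrange(len(words) - 1)
    if random.random() < 0.5:
        del words[i]                                     # deletion error
    else:
        words[i], words[i + 1] = words[i + 1], words[i]  # word-order error
    return " ".join(words)

correct = "The committee has approved the new proposal."
pair = (corrupt(correct), correct)  # (erroneous input, gold correction)
print(pair)
```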


