Research

Research lines

The emergence of neural language models has changed the paradigm of natural language processing. These models are trained on enormous volumes of text and acquire general knowledge of languages. That general knowledge can then be reused successfully to teach the models specific language processing tasks. As a result, they need little training data to learn a specific task, and they deliver very good results. Moreover, a multilingual language model can be trained on examples in a single language, and the resulting model will be able to handle other languages as well.

Main current research lines:

Evaluation of neural language models.
Transfer learning as a tool for teaching neural language models specific tasks.
Cross-lingual transfer learning.
Neural language models for low-resource languages.

Solutions:

Search engines and knowledge management

Data monitoring and analytics

Intelligent assistants

In the age of digitalization, the ability to extract structured information from sources encoded in human language is of the utmost importance. Extracting this knowledge from today's enormous volumes of information (big data) opens up new opportunities to carry out large-scale analyses, offer novel ways of consuming information, and support decision-making. We research NLU (Natural Language Understanding) tasks such as text classification, entity and opinion extraction, and question answering. In recent years, neural approaches have been applied to NLU tasks with great success, and these are precisely the techniques we use in our day-to-day work.

Main current research lines:

Multilingual search systems.
Question answering systems.
Sentiment analysis.
Semantic metadata extraction.
Big data monitoring systems.

Solutions:

Search engines and knowledge management

Data monitoring and analytics

Intelligent assistants

In this global, multilingual context, machine translation systems are becoming increasingly important. The rapid growth of neural networks in recent years has brought an unprecedented qualitative leap in translation quality, opening up the possibility of developing more intelligent systems that capture more of the nuances of each language.

In machine translation, our research therefore aims to develop cutting-edge systems. To that end, we use the latest neural paradigms to build monolingual and multilingual systems. These paradigms need large amounts of data in the training phase, so extracting, filtering, and cleaning data is essential for working with high-quality data. We are aware that customizing systems matters a great deal when adapting to users' needs, so domain specialization and specialized terminology are among our priorities. Most current systems translate each sentence separately, without taking into account the wider context in which it appears, so we also work on document-level translation.
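As a rough illustration of what filtering and cleaning parallel data can involve, the sketch below applies a few simple heuristics to sentence pairs. It is not a description of our pipeline: the function name `filter_parallel_pairs`, its thresholds, and the sample sentences are all assumptions made for the example.

```python
def filter_parallel_pairs(pairs, max_ratio=2.5, min_tokens=1, max_tokens=200):
    """Keep (source, target) pairs that pass simple sanity checks.

    All thresholds are illustrative defaults, not tuned values.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop empty or overly long segments.
        if not (min_tokens <= n_src <= max_tokens):
            continue
        if not (min_tokens <= n_tgt <= max_tokens):
            continue
        # Drop pairs whose lengths differ too much (often misaligned).
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop segments that were copied rather than translated.
        if src.lower() == tgt.lower():
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


pairs = [
    ("Kaixo mundua", "Hello world"),
    ("Kaixo mundua", "Hello world"),  # exact duplicate
    ("Egun on", "Good morning to absolutely everyone reading this page"),
    ("", "Empty source"),
    ("OK", "OK"),  # untranslated copy
    ("Eskerrik asko", "Thank you very much"),
]
clean = filter_parallel_pairs(pairs)
for src, tgt in clean:
    print(src, "|", tgt)
```

Real systems typically add further checks, such as language identification or model-based scoring; the heuristics here only show the general shape of the step.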

Main current research lines:

Gender bias analysis
Document-level translation
Integration of specialized terminology
Data filtering and cleaning
Domain specialization
Multilingual translation

Solutions:

Machine translation

There are two types of conversational systems: those that aim to offer as natural a conversation as possible, and those that aim to carry out commands or operations. The former are used for leisure. The latter are used to help people with specific tasks such as administrative procedures, shopping, or question answering. Companies and public administrations increasingly offer systems of the second type to customers and the general public in order to provide a better quality of service.

Conversational systems take several aspects into account: the user's intent, the context of the conversation, and language understanding and generation. Neural architectures are currently being used successfully to implement these components.

Main current research lines:

User intent detection.
Strategies based on little training data.
Cross-lingual transfer learning.

Solutions:

Intelligent assistants

Speech processing is a computer's ability to handle speech, and one such form of processing is speech recognition (ASR, Automatic Speech Recognition).

In speech recognition, we research automatic transcription and subtitling systems, going beyond systems that only perform well under good conditions. We are working on methods for building ASR systems that can transcribe audio in local language varieties and informal registers, and on systems that work in noisy environments (for example, for spoken interaction with Industry 4.0 machinery).

We are also working on customization, supplying the transcriber with local terms, place names, and proper names so that it can transcribe them correctly. We also work on live transcription and subtitling, which is very useful in meetings of various kinds, video calls, and courses. Another of our goals is for people with motor disabilities to be able to use ASR as a dictation tool, mainly in education and among children. Finally, we also work on speaker identification, so that segments in subtitles and transcriptions can be labeled automatically.

Main current research lines:

Speech recognition customization
Speech recognition in local varieties
Speech recognition in informal registers
Speech recognition in noisy and industrial environments
Recognition of children's voices
Dictation-oriented systems (accessibility)
Live transcription and subtitling
Speaker identification

Solutions:

Automatic transcription and subtitling

Intelligent assistants

Speech processing is a computer's ability to handle speech. One such form of processing is speech synthesis or generation (TTS, Text-to-Speech).

We have several research lines in speech synthesis. One of our goals is voice cloning with less and less material, using multi-speaker neural network systems. One of the main current challenges is achieving high-quality speech synthesis from a single sentence spoken by a person. We are also researching cross-lingual techniques, which let us carry any voice over into another language: we aim to synthesize a voice in one language from just a few sentences in another. In addition, to tackle gender bias, we have created a prototype gender-ambiguous voice, and one of our challenges is to improve its quality. Finally, we also aim to bring emotion into synthesis systems. Most current synthesis systems work in a neutral style, which limits their use for dubbing. We aim to avoid loss of style when conveying emotion and expressiveness.

Main current research lines:

Personalized speech synthesis
Neutral speech synthesis
Speech synthesis with emotions
Voice imitation from small samples

Solutions:

Personalized voices

Intelligent assistants

In recent years, the way texts are produced has been changing noticeably, and the use of software tools for writing is increasingly widespread. These tools include automatic correctors, which detect errors in texts and suggest corrections to the user. Corrections may concern spelling, vocabulary, grammar, or style. These tools are highly effective in the writing process, above all for ensuring texts of the highest quality.

Main current research lines:

Neural grammatical error correction based on synthetic data.
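To give an idea of what synthetic data for error correction can look like, the sketch below injects one artificial error into each correct sentence and pairs the result with the original. It is a toy illustration under our own assumptions: the function names `corrupt` and `make_training_pairs`, the three error types, and the sample sentences are hypothetical, and realistic error generation is considerably more language-aware.

```python
import random


def corrupt(sentence, rng):
    """Return a copy of `sentence` with at most one artificial error.

    The error types are illustrative; the output may equal the input
    (for example, when two identical characters are swapped).
    """
    words = sentence.split()
    if not words:
        return sentence
    op = rng.choice(["swap_chars", "drop_word", "swap_words"])
    if op == "swap_chars":
        # Transpose two adjacent characters inside a random word.
        i = rng.randrange(len(words))
        w = words[i]
        if len(w) >= 2:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    elif op == "drop_word" and len(words) > 1:
        # Delete a random word.
        del words[rng.randrange(len(words))]
    elif op == "swap_words" and len(words) > 1:
        # Exchange two neighboring words.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def make_training_pairs(sentences, seed=0):
    """Build (noisy, clean) pairs; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [(corrupt(s, rng), s) for s in sentences]


sentences = ["Gaur eguraldi ona dago", "Liburu hau oso interesgarria da"]
for noisy, clean in make_training_pairs(sentences, seed=1):
    print(noisy, "->", clean)
```

A correction model would then be trained to map the noisy side of each pair back to the clean side.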

Solutions:

Search engines and knowledge management

Machine translation


Ongoing projects

TeLMar

Sarebide

LINGUATEC-IA

ADITEK

Mycroft.eus

DomEus

ALAI 4.0

UdalBOT

Multihub

Bikohitz

Hiekadi - Heraldabide

Tanper

Tando

Neurolagun

Cogile

DeepText

Resources

Doctoral theses

Baliabide urriko hizkuntzetarako hizkuntza-eredu neuronalak

Gorka Urbizu Garmendia

PhD thesis. Faculty of Informatics, EHU. Donostia. 2025.

Predicate Matrix: an Interoperable Lexical Knowledge Base for Predicates

Lopez de Lacalle, M.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2023.

Application of Singing Synthesis Techniques to Bertsolaritza

Sarasola, X.

PhD thesis. Bilbao School of Engineering, UPV/EHU. Bilbao. 2020.

Multilingual Sentiment Analysis in Social Media

San Vicente, I.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2019.

Bertsobot: gizaki-robot arteko komunikazio eta elkarrekintzarako portaerak

Pagoaga, A.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2017.

Integrazioa hizkuntzaren prozesamenduan. Anotazio-eskemak eta elkarreragingarritasuna. Testuen prozesatze masiboa, datu handien teknikak erabiliz

Beloki, Z.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2017.

CLIR Teknikak baliabide urriko hizkuntzetarako

Saralegi, X.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2017.

Idiomatikotasunaren karakterizazio automatikoa: izena+aditza konbinazioak

Gurrutxaga, A.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2014.

The Web as a Corpus of Basque

Leturia, I.

PhD thesis. Faculty of Informatics, UPV/EHU. Donostia. 2014.

Euskararen ezagutza-base lexikala: Euskal WordNet

Pociello, E.

PhD thesis. Department of Basque Philology, UPV/EHU. Leioa. 2008.

Neural language models

Are Social Biases in LLMs Consistent across Generative Tasks? A Case Study for Basque

Muitze Zulaika, Xabier Saralegi, Julia Shershneva, Lia Gonzalez, Arkaitz Fullaondo

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) DOI: 10.63317/52zk8uyjrw5k

Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque

Urbizu, G., Corral, A., Saralegi, X. and San Vicente, I.

In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 519–530 November 8-9, 2025

DIPLomA: Efficient Adaptation of Instructed LLMs to Low-Resource Languages via Post-Training Delta Merging

Sarasua, I., Corral, A. and Saralegi, X.

In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24898–24912 November 4-9, 2025

Personality Assessment on Spanish and Basque Texts using In-Context Learning Techniques

Saizar, A., Lopez de Lacalle, M. and Saralegi, X.

Procesamiento del Lenguaje Natural, no. 75, September 2025

Assessing Small Language Models for Translating Spanish Instructions into Behavior Trees

Saizar, A., Corral, A., Lopez de Lacalle, M., Urbizu, G. and Saralegi, X.

Procesamiento del Lenguaje Natural, no. 75, September 2025

Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque

Ander Corral, Ixak Sarasua Antero, Xabier Saralegi

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Generating Multiple-Choice Questions in Spanish and Basque using LLMs: A Comparative Manual Evaluation

Maddalen López de Lacalle, Xabier Saralegi, Aitzol Saizar

Procesamiento del Lenguaje Natural, no. 74, March 2025

BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language

Muitze Zulaika and Xabier Saralegi

In Proceedings of the 31st International Conference on Computational Linguistics, pages 4753–4767 January 19–24, 2025.

How Well Can BERT Learn the Grammar of an Agglutinative and Flexible-Order Language? The Case of Basque

Urbizu, G., Zulaika, M., Saralegi, X., and Corral, A.

In LREC-COLING 2024, pages 8334–8348 20-25 May, 2024

Scaling Laws for BERT in Low-Resource Settings

Urbizu, G., San Vicente, I., Saralegi, X., Agerri, R. and Soroa, A.

In Findings of the Association for Computational Linguistics: ACL 2023, pages 7771–7789 July 9-14, 2023

Not Enough Data to Pre-train Your Language Model? MT to the Rescue!

Urbizu, G., San Vicente, I., Saralegi, X., and Corral, A.

In Findings of the Association for Computational Linguistics: ACL 2023, pages 3826–3836 July 9-14, 2023

BasqueGLUE: A Natural Language Understanding Benchmark for Basque

G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa.

In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). June, 2022. Marseille, France

Grammatical Error Correction for Basque through a seq2seq neural architecture and synthetic examples

Beloki, Z., Saralegi, X., Ceberio, K., & Corral, A.

In Procesamiento del Lenguaje Natural, 65, 13-20. 2020.

Give your Text Representation Models some Love: the Case for Basque

Agerri, R., San Vicente, I., Campos, J.A., Barrena, A., Saralegi, X., Soroa, A. and Agirre, E.

In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). pp. 4781‑4788. 2020.

Speech technologies

Genero aldetik anbiguoa den hizketaren sintesia euskaraz hizlari-bektoreen manipulazioaren bidez

Xabier Sarasola, Ander Corral, Igor Leturia, Iñigo Morcillo

Ekaia: Euskal Herriko Unibertsitateko zientzi eta teknologi aldizkaria, ISSN 0214-9001, special issue 47, 2025 (issue devoted to: Adimen artifiziala), pp. 113-124

Hizlari-bektore manipulazioaren bidezko genero-anbiguoko hizketaren sintesia euskaraz

Sarasola, X., Corral, A., Leturia, I. and Morcillo, I.

EKAIA, science and technology journal of the EHU. 2024.

Automatic Speech Recognition for Gascon and Languedocian Variants of Occitan

Iñigo Morcillo, Igor Leturia, Ander Corral, Xabier Sarasola, Michaël Barret, Aure Séguier, Benaset Dazéas

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1969–1978, Torino, Italia. ELRA and ICCL.

Basque-speaking Smart Speaker based on Mycroft AI

Igor Leturia, Ander Corral, Xabier Sarasola, Beñat Jimenez, Silvia Portela, Arkaitz Anza, and Jaione Martinez

In: Rehm, G. (eds) European Language Grid. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-17258-8_15. 2023.

Mycroft.eus: The open source Basque-speaking smart speaker

Igor Leturia (2021)

META-Forum 2021. Berlin (virtual), November 15-17

ELG Pilot Project: Basque-speaking smart speaker based on Mycroft AI

Igor Leturia (2020)

META-Forum 2020. Berlin (virtual), December 1-3

Neural Text-to-Speech Synthesis for an Under-Resourced Language in a Diglossic Environment: the Case of Gascon Occitan

Ander Corral, Igor Leturia, Aure Séguier, Michaël Barret, Benaset Dazéas, Philippe Boula de Mareüil, and Nicolas Quint. 2020.

In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 53–60, Marseille, France. European Language Resources Association.

Massively multilingual accessible audioguides via cell phones.

Cortes, I., Leturia, I., Alegria, I., Astigarraga, A., Sarasola, K., & Garaio, M. (2018).

In Proceedings of The 21st Annual Conference of the European Association for Machine Translation (EAMT2018), Alacant, Spain.

Hirikia: Language Technology projects in the frame of the European Capital of Culture 2016

Rodrigo Agerri, Aitzol Astigarraga, Iñaki Alegria, Itziar Cortes, Arantza Diaz de Ilarraza, Igor Leturia, Kepa Sarasola (2017)

META-Forum 2017. Brussels, November 13/14

Strategic projects

MULTILINGTOOL, Development of an Automatic Multilingual Subtitling and Dubbing System

Ander Corral, Xabier Sarasola, Iker Manterola, Josu Murua, Itziar Cortes, Igor Leturia, Xabier Saralegi

In EAMT 2024 - The 25th Annual Conference of The European Association for Machine Translation

LINGUATEC: Desarrollo de recursos lingüísticos para avanzar en la digitalización de las lenguas de los Pirineos

Aldabe, I, Aztiria, J., Beltrán, F., Bras, M., Ceberio K., Cortes, I., Coyos J.D., Dazeas B., Esher, L., Labaka, G., Leturia, I., Sarasola, K., Séguier A. and Sibille J.

In Procesamiento del Lenguaje Natural (SEPLN), ISSN 1989-7553. 2019.

Fostering digital representation of EU regional and minority languages: the Digital Language Diversity Project

Soria, C., Russo, I., Quochi, V., Hicks, D., Gurrutxaga, A., Sarhimaa, A. and Tuomisto, M.

In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23-28. 2016.

Euskara Aro Digitalean

Hernáez I. (UPV/EHU), Navas E. (UPV/EHU), Odriozola I. (UPV/EHU), Sarasola K. (UPV/EHU), Diaz de Ilarraza A. (UPV/EHU), Leturia I. (Elhuyar Fundazioa), Diaz de Lezana A. (Eusko Jaurlaritza), Oihartzabal B. (UMR 5478 IKER), Salaberria J. (UMR 5478 IKER)

White Paper Series. META-NET. 2012.

Web Communication Protocols for Coordinating the Modules of AnHitz, a Basque-Speaking Virtual 3D Expert on Science and Technology

Leturia I., del Pozo A., Oyarzun D., Iturraspe U., Arregi X., Sarasola K., Diaz de Ilarraza A., Navas E., Odriozola I., Sainz O.

In Web Services and Processing Pipelines in HLT workshop (WSPP2010). Valletta, Malta, 2010

Development and Evaluation of AnHitz, a Prototype of a Basque-Speaking Virtual 3D Expert on Science and Technology

Leturia I., del Pozo A., Arrieta K., Iturraspe U., Sarasola K., Diaz de Ilarraza A., Navas E., Odriozola I.

Computational Linguistics-Applications workshop. Mrągowo (Poland). 2009.

AnHitz, development and integration of language, speech and visual technologies for Basque

Arrieta K., Leturia I., Iturraspe U., Diaz de Ilarraza A., Sarasola K., Hernáez, I., Navas, E.

In Universal Communication, 2008. ISUC'08. Second International Symposium on (pp. 338-343). IEEE. Osaka (Japan). 2008.

Machine translation

Morphology Aware Source Term Masking for Terminology-Constrained NMT

Ander Corral and Xabier Saralegi

In Findings of the Association for Computational Linguistics: EACL 2024

Gender Bias Mitigation for NMT Involving Genderless Languages

Ander Corral and Xabier Saralegi

In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 165–176 Abu Dhabi, December 7–8, 2022

Elhuyar submission to the Biomedical Translation Task 2020 on terminology and abstracts translation

Ander Corral and Xabier Saralegi. 2020.

In Proceedings of the Fifth Conference on Machine Translation, pages 813–819, Online. Association for Computational Linguistics.

QUALES: Estimación Automática de Calidad de Traducción Mediante Aprendizaje Automático Supervisado y No-Supervisado

Etchegoyhen, T., Garcia, E.M., Azpeitia, A., Alegria, I., Labaka, G., Otegi, A., Sarasola, K., Cortes, I., Jauregi, A., Ellakuria, I. and Calonge, E.

Procesamiento del Lenguaje Natural. Vol 61. pp.143-146. 2018.

TADEEP: Traducción automática en profundidad

Alegria, I., Aranberri, N., Artetxe, M., Etxeberria, I., Gurrutxaga, A.,Iñurrieta, U., Labaka, G., Lersundi, M., Leturia, M., Màrquez, L., Mayor, A., Oronoz, M., Sarasola, K. and Urizar, R.

JORNADAS DE SEGUIMIENTO 2018. Subdivisión de Programas Temáticos Científico Técnicos. Área de Ciencias y TIC. Ministerio de Energia, Industria y Competitividad. Madrid. 2018-06-20.

Neural Machine Translation of Basque

Etchegoyhen, T., Garcia, E.M., Azpeitia, A., Labaka, G., Alegria, I., Cortes, I., Jauregi, A., Santos, I.E., Martin, M. and Calonge, E.

In Proceedings of the 21st Annual Conference of the European Association for Machine Translation (EAMT 2018). pp. 139. Alicante. 2018.

Massively multilingual accessible audioguides via cell phones

Cortes, I., Leturia, I., Alegria, I., Astigarraga, A., Sarasola, K. and Garaio, M.

In Proceedings of the 21st Annual Conference of the European Association for Machine Translation (EAMT 2018). ISBN: 978-84-09-01901-4. 2018.

Improving access to educational courses via automatic machine translation - new developments in post-editing

Pietrzak J., Jáuregui, A., Van de Walle, J. and Eriksson, A.

In Proceedings of 7th International Technology, Education and Development Conference (INTED2013). 4-7 March, Valencia, Spain. 2013.

Morphological information management for the creation of a new pair of languages with different dialects in an open-source machine translation system

Aranbarri, G. and Cortes, I.

Procesamiento del Lenguaje Natural (SEPLN). 47. pp. 321-322. 2011.

OpenMT: Open Source Machine Translation Using Hybrid Methods

Alegria I., K. Sarasola, N. Castell, L. Màrquez, N. Areta, X. Saralegi

Jornada de Seguimiento de Proyectos. Programa Nacional de Tecnologías Informáticas. 2009

Opentrad: bringing to the market opensource based Machine Translators

Aizpurua I., Ramirez G., Pichel J., Waliño J.

Langtech 2008. Rome. 2008.

Mixing Approaches to MT for Basque: Selecting the best output from RBMT, EBMT and SMT

Alegria I., Díaz de Ilarraza A., Igartua J., Labaka G., Laskurain B., Lersundi M., Mayor A., Sarasola K., Casillas, A. and Saralegi, X.

In Proceedings of MATMT2008 workshop: Mixing Approaches to Machine Translation. 2008.

OpenTrad: Traducción automática de código abierto para las lenguas del Estado español

Alegria I., Arantzabal I., Forcada M.L., Gomez X., Padró L., Pichel, J.R. and Waliño, J.

Procesamiento del Lenguaje Natural, 27, pp.357-360. 2006.

Conversational assistants

Strategies for bilingual intent classification for small datasets scenarios

López de Lacalle, M., Saralegi, X., Saizar, A., Urbizu, G. and Corral, A.

Procesamiento del Lenguaje Natural, no. 71, September 2023, pp. 137-147

Reducing annotation effort for Cross-lingual Transfer Learning: The case of NLU for Basque

López de Lacalle, M., Saralegi, X., López, I.

In Proceedings of the Workshop on Mixed-Initiative ConveRsatiOnal Systems 2021 (MICROS) @ ECIR2021, 2021. Lucca, Tuscany, 2021.

Building a Task-oriented Dialog System for languages with no training data: the Case for Basque

López de Lacalle, M., Saralegi, X., San Vicente, I.

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2796–2802, Marseille, 11–16 May 2020.

Information retrieval and extraction (IR-IE)

Measuring Presence of Women and Men as Information Sources in News

Zulaika, M., Saralegi, X., San Vicente, I.

In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Information retrieval and question answering: A case study on COVID-19 scientific literature.

Arantxa Otegi, Iñaki San Vicente, Xabier Saralegi, Anselmo Peñas, Borja Lozano, Eneko Agirre. 2022.

Knowledge-Based Systems, Volume 240, 2022, 108072, ISSN 0950-7051

GEPSA, a tool for monitoring social challenges in digital press

San Vicente, I., Saralegi, X., & Zubia, N.

In Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion (pp. 46–50). EACL2021, Kiev, Ukraine. Association for Computational Linguistics. 2021.

Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels.

Saralegi, X., San Vicente, I.

In: Hiemstra, D., Moens, MF., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12657. Springer, Cham. 2021.

Evaluating translation quality and clir performance of query sessions

Saralegi, X., Agirre, E. and Alegria, I.

In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23-28. 2016.

TweetNorm: a benchmark for lexical normalization of Spanish tweets

Alegria, I., Aranberri, N., Comas, P.R., Fresno, V., Gamallo, P., Padró, L., San Vicente, I., Turmo, J. and Zubiaga, A.

Language Resources and Evaluation, volume 49, issue 4, pp. 883-905. 2015.

TweetLID: a benchmark for tweet language identification

Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R., Alegria, I., Aranberri, N., Ezeiza A., Fresno, V.

Language Resources and Evaluation (2015). DOI: 10.1007/s10579-015-9317-4

Overview of tweetlid: Tweet language identification at sepln 2014

Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R., Alegria, I., Aranberri, N., Ezeiza A., Fresno, V.

In Proceedings of the TweetLID Workshop at SEPLN2014. Girona. pp. 1-11. 2014.

TweetNorm_es: an Annotated Corpus for Spanish Microtext Normalization

Alegria, I., Aranberri, N., Comas, P.R., Fresno, V., Gamallo, P., Padró, L., San Vicente, I., Turmo, J. and Zubiaga, A.

In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland. 2014.

Elhuyar at Tweet-Norm 2013

Saralegi, X. and San Vicente, I.

In proceedings of “XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural”. Tweet Normalization Workshop at SEPLN (Tweet-Norm 2013). Madrid. ISBN: 978-84-695-8349-4. 2013.

Introducción a la Tarea Compartida Tweet-Norm 2013: Normalización Léxica de Tuits en Español

Alegria, I., Aranberri, N., Fresno, V., Gamallo, P., Padró, L., San Vicente, I., Turmo, J. and Zubiaga, A.

In proceedings of “XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural”. Tweet Normalization Workshop at SEPLN (Tweet-Norm 2013). Madrid. ISBN: 978-84-695-8349-4

Extracción automática de fichas de recursos turísticos de la web

Manterola, I., Saralegi X. and Bilbao S.

In Turitec 2012: IX Congreso Nacional Turismo y Tecnologías de la Información y las Comunicaciones (pp. 31-42). Universidad de Málaga (UMA).

Dictionary and Monolingual Corpus-based Query Translation for Basque-English CLIR

Saralegi, X. and Lopez de Lacalle, M.

In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). Malta. 2010.

Estimating Translation Probabilities from the Web for Structured Queries on CLIR

Saralegi, X. and Lopez de Lacalle, M.

In European Conference on Information Retrieval (ECIR 2010). Milton Keynes. pp. 586-589. 2010.

Elhuyar-IXA: semantic relatedness and crosslingual passage retrieval

Agirre E., Ansa O., Arregi X., Lopez de Lacalle M., Otegi A., Saralegi X. and Zaragoza H.

In Proceedings of Workshop of the Cross-Language Evaluation Forum for European Languages (CLEF 2009). pp. 273-280. 2009.

Comparing different approaches to treat Translation Ambiguity in CLIR: Structured Queries vs. Target Co-occurrence-Based Selection

Saralegi, X. and Lopez de Lacalle, M.

In Proceedings of 7th International Workshop on Text-Based Information Retrieval (TIR 2009), 20th International Workshop on Database and Expert Systems Application, 2009. DEXA'09. (pp. 398-404). IEEE. Linz. 2009.

Analysis and performance of morphological query expansion and language-filtering words on Basque web searching

Leturia I., Gurrutxaga A., Areta N., Pociello E.

In Proceedings of the 6th International Conference on Language Resources and Evaluations (LREC’08). Marrakech, Morocco. 2008.

Similitud entre documentos multilingües de carácter técnico en un entorno Web

Saralegi, X. and Alegria, I.

Procesamiento del Lenguaje Natural, nº39 (SEPLN 2007), pp. 71-78. 2007.

EusBila, a search service designed for the agglutinative nature of Basque

Leturia I., Gurrutxaga A., Areta N., Alegria I., Ezeiza A.

In Proceedings of Improving non-English web searching (SIGIR 2007 - iNEWS’07) workshop. pp. 47-54. Amsterdam. 2007.

Lexicon and terminology extraction

Baliabide lexikoen sarea: Baldintza filologiko eta teknikoak eta aplikazioak

Lindemann, D., and San Vicente, I.

In Hitzak sarean: Pello Salabururi esker onez. Laka, Itziar (ed.). Bilbao: EHU Argitalpen Zerbitzua, ISBN: 978-84-1319-111-9. 107 pp. 2019.

Verbal Multiword Expressions in Basque Corpora

Inurrieta, U., Aduriz, I., Estarrona, A., Gonzalez-Dios, I., Gurrutxaga, A., Urizar, R., and Alegria, I.

In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018) 86-95. 2018.

Lexikoaren Behatokia: leiho bat XXI. mendeko hedabideetako euskarari

Artola, X., Ezeiza, N., Gurrutxaga, A., Sagarna, A. and Urkia, M.

In SENEZ journal, no. 48, 201-209, EIZIE. 2017.

Bilingual Dictionary Drafting: Connecting Basque Word Senses to Multilingual Equivalents

Lindemann, D. and San Vicente, I.

In Proceedings of EURALEX 2016, 898–905. Tbilisi: Tbilisi State University, 2016.

Building Corpus-based Frequency Lemma Lists

Lindemann, D. and San Vicente, I.

Procedia – Social and Behavioral Sciences, vol. 198, pp. 266–277, Jul. 2015.

Idiomatikotasunaren karakterizazio automatikoa: izena+ aditza konbinazioak

Gurrutxaga, A., Alegria, I. and Artola, X.

In EKAIA Euskal Herriko Unibertsitateko Zientzi eta Teknologi Aldizkaria, special issue: Euskal Tesien 10 pasarte, 47-68. 2015.

Euskarazko maiztasun lemategia gaurko teknologien ikuspuntutik

Lindemann, D. and San Vicente, I.

In Ibon Sarasola, Gorazarre. Homenatge, Homenaje, 441–456. Bilbao: UPV-EHU, 2015.

Corpusetan oinarritutako hiztegi elebidun berria sortzen

Lindemann, D., and I. San Vicente.

In Proceedings of IkerGazte: Nazioarteko ikerketa euskaraz. Durango, Basque Country, 2015/05.

Bilingual Dictionary Drafting. The Example of German-Basque, a Medium-density Language Pair

Lindemann, D., Manterola, I., Nazar, R., San Vicente, I. and Saralegi, X.

In Proceedings of the XVI EURALEX Conference. Bolzano/Bozen, pp. 563–576. 2014.

Combining different features of idiomaticity for the automatic classification of noun+ verb expressions in Basque

Gurrutxaga, A. and Alegria, I.

In Proceedings of the 9th Workshop on Multiword Expressions (MWE9)- NAACL HLT 2013. pp. 116-125. Atlanta, Georgia, USA. 2013.

GARATERM: euskararen erregistro akademikoen garapenaren ikerketarako lan-ingurunea

Zabala I., Lersundi M., Leturia I., Manterola I., Santander G.

Xabier Alberdi and Pello Salaburu (eds.) Terminologia naturala eta terminologia planifikatua euskararen normalizazioari begira. UPV/EHUko Argitalpen Zerbitzua: 98-114 ISBN: 978-84-9860-809-0. 2013

Building a Basque-Chinese Dictionary by using English as a Pivot

Saralegi, X., Manterola, I. and San Vicente, I.

In Proceedings of the 8th international conference on Language Resources and Evaluation, LREC’12. pp. 1443-1447. 23-25 May, Istanbul, Turkey. 2012.

Measuring the compositionality of NV expressions in Basque by means of distributional similarity techniques

Gurrutxaga, A. and Alegria, I.

In Proceedings of the Eighth International Conference on Language Resources and Evaluation LREC’12. pp. 2389-2394. 23-25 May, Istanbul, Turkey. 2012.

Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

Saralegi, X., Manterola, I. and San Vicente, I.

In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011). pp. 846-856. Edinburgh. July, 2011.

Automatic extraction of NV expressions in Basque: basic issues on cooccurrence techniques

Gurrutxaga, A. and Alegria, I.

In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011). pp. 2-7. Association for Computational Linguistics. ACL/HLT conference. Portland. 2011.

Mining Term Translations from Domain Restricted Comparable Corpora

Saralegi, X., San Vicente, I. and López de Lacalle, M.

In Procesamiento del Lenguaje Natural, 41, pp. 273-280. 2008.

Automatic Extraction of Bilingual Terms from Comparable Corpora in a Popular Science Domain

Saralegi, X., San Vicente, I. and Gurrutxaga, A.

In Proceedings of Building and using Comparable Corpora workshop (BUCC) - LREC 2008. pp. 27-32. Marrakech. 2008

Elexbi, a basic tool for bilingual term extraction from Spanish-Basque parallel corpora

Gurrutxaga, A., Saralegi, X., Ugartetxea, S. and Alegria, I.

In Proceedings of the 12th EURALEX International Congress of Lexicography. pp. 159-165. Torino. 2006.

Erauzterm: euskarazko terminoak erauzteko tresna erdiautomatikoa

Gurrutxaga, A., Saralegi, X., Ugartetxea, S. and Alegria, I.

Mendebalde Kultur Alkartea, IX. Jardunaldiak: Euskera zientifiko-teknikoa. Bilbao. 2005.

Euskara-gaztelania terminologia Elebidunaren Erauzle Automatikoa

Gurrutxaga, A., Pagoaga, A., Saralegi, X., Ugartetxea, S. and Alegria, I.

EHU/UPV. Bilbao. 2005.

A XML-Based Term Extraction Tool for Basque

Alegría, I., Gurrutxaga, A., Lizaso, P., Saralegi, X., Ugartetxea, S. and Urizar, R.

In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004). Lisbon. 2004.

Linguistic and Statistical Approaches to Basque Term Extraction

Alegria, I., Gurrutxaga, A., Lizaso, P., Saralegi, X., Ugartetxea, S. and Urizar, R.

GLAT 2004: The production of specialized texts. Barcelona. 2004.

Semantics and ontologies

Predicate Matrix. Automatically extending the semantic interoperability between predicate resources

López de Lacalle M., Laparra E., Aldabe I. and Rigau G.

Language Resources and Evaluation. June 2016, Volume 50, Issue 2, pp 263–289. 2016.

A multilingual predicate matrix

Lopez de Lacalle, M., Laparra, E., Aldabe, I. and Rigau, G.

In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23-28. 2016.

Predicate Matrix: extending SemLink through WordNet mappings

López de Lacalle M., E. Laparra and G. Rigau

In Proceedings of the 9th international conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland. 2014.

First Steps Towards a Predicate Matrix

Lopez de Lacalle, M., Laparra, E. and Rigau, G.

In Proceedings of the 7th International Global Wordnet Conference. GWC 2014. Tartu, Estonia. 2014

Analyzing the Sense Distribution of Concordances Obtained by Web As Corpus Approach

Saralegi, X. and Gamallo, P.

Lecture Notes in Computer Science (LNCS) no. 7816, Alexander Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing. Springer. 13th International Conference, CICLing 2013. Samos, Greece. 2013.

Methodology and construction of the Basque WordNet

Pociello, E., Agirre, E. and Aldezabal, I.

In Language Resources and Evaluation. Volume 45, Issue 2, pp 121–142. Springer. ISSN 1574-020X. May 2011.

WNTERM: Combining the Basque WordNet and a Terminological Dictionary

Pociello E., Gurrutxaga A., Agirre E., Aldezabal I. and Rigau G.

In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). Marrakech. 2008.

Opinion Mining - Sentiment Analysis

Polarity lexicon building: to what extent is the manual effort worth?

San Vicente, I., and X. Saralegi.

In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23-28. 2016.

EliXa: A modular and flexible ABSA platform

San Vicente, I., Saralegi, X. and Agerri, R.

In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, 2015/06/04, pp. 748–752. 2015.

Sentimenduen analisirako lexikoen sorkuntza

San Vicente, I. and Saralegi, X.

In Proceedings of IkerGazte: Nazioarteko ikerketa euskaraz. Durango, Basque Country, 2015/05. 2015.

Looking for Features for Supervised Tweet Polarity Classification

San Vicente, I. and Saralegi, X.

In Proceedings of the TASS Workshop at SEPLN2014. Girona. 2014.

Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages

San Vicente, I., Agerri, R. and Rigau, G.

In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014). Gothenburg, Sweden. 2014.

Elhuyar at TASS2013

Saralegi, X. and San Vicente, I.

In Proceedings of “XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural”. Workshop on Sentiment Analysis at SEPLN (TASS2013). Madrid. ISBN: 978-84-695-8349-4. 2013.

Polarity Classification of Tourism Reviews in Spanish

San Vicente, I. and Saralegi, X.

In Proceedings of “XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural”. Madrid. ISBN: 978-84-695-8349-4. 2013.

Cross-Lingual Projections vs. Corpora Extracted Subjectivity Lexicons for Less-Resourced Languages

Saralegi, X., San Vicente, I. and Ugarteburu, I.

Lecture Notes in Computer Science (LNCS) no. 7817, Alexander Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing. Springer. 13th International Conference, CICLing 2013. Samos, Greece. 2013.

TASS: Detecting Sentiments in Spanish Tweets

Saralegi X., San Vicente I.

In Proceedings of the First Workshop on Sentiment Analysis at SEPLN (TASS 2012). 7 September, Castelló de la Plana, Spain. 2012.

Corpora

TweetMT: A parallel microblog corpus

San Vicente, I., I. Alegria, C. España-Bonet, P. Gamallo, H. G. Oliveira, E.M. Garcia, A. Toral, A. Zubiaga, and N. Aranberri.

In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 23-28. 2016.

Recursos en euskera para la herramienta NLTK para enseñanza de procesamiento del lenguaje natural

Manterola, I., de Ilarraza, A.D., Gojenola, K. and Sarasola, K.

In Procesamiento del Lenguaje Natural, 45 (SEPLN 2010). pp. 305-306. 2010.

Begiratu bat corpus-baliabideei

Areta N., Gurrutxaga A., Leturia I.

BAT Soziolinguistika aldizkaria, issue 62. 2008.

ZT Corpus: Annotation and tools for Basque corpora

Areta N., Gurrutxaga A., Leturia I., Alegria I., Artola X., Díaz de Ilarraza A., Ezeiza N., Sologaistoa A.

In Proceedings of Corpus Linguistics 2007. Birmingham. 2007.

Structure, Annotation and Tools in the Basque ZT Corpus

Areta, N., Gurrutxaga, A., Leturia, I., Polin, Z., Saiz, R., Alegria, I., Artola, X., de Ilarraza, A.D., Ezeiza, N., Sologaistoa, A. and Soroa, A.

In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), pp. 1406-1411. Genoa. 2006.

Zientzia eta teknologiaren corpusa

Alegria I., Artola X., Díaz de Ilarraza A., Ezeiza N., Sologaistoa A., Soroa A., Valverde A., N. Areta, A. Gurrutxaga, I. Leturia, R. Saiz.

In Euskera zientifiko-teknikoa: Normalizaziotik homologazinora. Mendebalde Kultura Alkartea. Bilbao. 2005.

Zientzia eta teknologiaren corpusa. Diseinua eta metodologia

Areta, N., Gurrutxaga, A., Leturia, I., Polin, Z., Saiz, R., Alegria, I., Artola, X., Diaz de Ilarraza, A., Ezeiza, N., Sologaistoa, A., Soroa, A. and Valverde, A.

EHU/UPV. Bilbao. 2005.