Compiling Effective Training Data for Machine Translation - project completed

Press release from the project manager

5 July 2024

Reliable parallel corpora are the key to training machine translation (MT) systems that can produce accurate translations that read fluently in the target language. Errors in the training data, arising from misaligned sentences or inadequate filtering during compilation of a parallel corpus, can degrade the quality of an MT system trained on the data. A parallel corpus that is too small can also prevent the MT system from capturing the grammar or other nuances of the source and target languages, so that it produces inaccurate translations. However, ensuring the highest quality of training data when processing parallel texts can be complex and difficult, especially for languages with few speakers or when complex inflection and productive word formation add to the difficulty of working with sparse data. It is therefore very important to develop accurate methods for compiling parallel corpora that aim to make the best possible use of the data available.

The project addressed this problem by exploring various methods for processing data for compiling parallel corpora, with the aim of maximizing the usefulness of the data for machine translation (MT). First, several types of classifiers and scoring methods were tested for filtering parallel corpora. We examined how effective they are at removing sentence pairs that can degrade MT quality if they are part of the training data, and how likely the methods are to retain the sentence pairs that can be expected to be best suited to improving the MT systems. The results indicated that filtering separately for each translation direction can improve the translation quality of the systems trained on the data. Second, different approaches to sentence alignment were examined and their effectiveness compared, and it was shown that having multiple different methods work together can increase alignment accuracy. Third, methods used to extract parallel data from comparable corpora were applied to extract useful data from sentences and sentence pairs that had been rejected at earlier stages of compiling the training data. Several experiments showed that this data, which is usually overlooked, can be used to enlarge parallel training corpora and improve MT systems trained on them. Finally, translators were asked to evaluate translations generated by MT systems trained on data processed with our methods. That evaluation confirms the usefulness of the methods presented and applied in the project. The results underline the importance of careful analysis and data processing when compiling parallel corpora used to train MT systems, especially when a limited amount of data is available, and thus contribute to the development of more accurate and reliable MT systems.

English:

For machine translation (MT) systems to produce accurate and fluent translations, reliable parallel corpora are key. Errors due to misalignment or inadequate filtering during compilation of a parallel corpus can have detrimental effects on the performance of an MT system trained on the data. Moreover, when the corpus is too small, the MT system may fail to capture the complexities of the source and target languages and therefore produce inaccurate translations. However, obtaining high-quality parallel data is often challenging, even more so for languages with few speakers or with rich morphology, which exacerbates the data sparsity problem. It is thus imperative to develop accurate methods for processing parallel corpora that help make the most of what is available. The project addressed this challenge by exploring various methods for processing parallel corpora to maximize their usefulness for MT. First, a variety of classifiers and scoring mechanisms were explored for filtering parallel corpora, looking into how effective they are at removing data detrimental to MT training while retaining useful data. The results indicate that filtering separately for different translation directions can yield better translations in downstream MT tasks. Second, different approaches to sentence alignment were explored, showing that combining multiple methods can improve alignment accuracy. Third, comparable corpora mining methods were applied to extract further useful data from sentences that had previously been discarded, showing that this often overlooked data is a potential source of useful training data. Finally, translations generated by MT systems trained on our processed datasets were manually evaluated, confirming the advantages of the applied methods. The findings highlight the importance of careful processing and curation of parallel corpora for MT. Approaches for maximizing the utility of available parallel data are proposed, particularly for scenarios where resources are scarce, contributing to the development of more accurate and reliable MT systems.
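
The idea of combining several alignment methods can be illustrated with a minimal sketch. It assumes a simple voting scheme in which a sentence link is kept only if enough aligners propose it. This is an illustration only, not the combination method used in the project; the function name `combine_alignments`, the `min_votes` parameter and the toy aligner outputs are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A link pairs a source sentence index with a target sentence index.
Link = Tuple[int, int]


def combine_alignments(outputs: Iterable[Set[Link]], min_votes: int) -> Set[Link]:
    """Keep only the links proposed by at least `min_votes` aligners."""
    votes = Counter(link for links in outputs for link in set(links))
    return {link for link, count in votes.items() if count >= min_votes}


# Hypothetical outputs from three different aligners on the same document pair.
aligner_a = {(0, 0), (1, 1), (2, 3)}
aligner_b = {(0, 0), (1, 1), (2, 2)}
aligner_c = {(0, 0), (1, 2), (2, 2)}

combined = combine_alignments([aligner_a, aligner_b, aligner_c], min_votes=2)
print(sorted(combined))
```

Raising `min_votes` trades recall for precision: fewer links survive, but each one is supported by more aligners.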

Information on how the results will be applied:
The results have been applied in building a new version of the English-Icelandic parallel corpus. They have also been tested in building an Estonian-Lithuanian training set for MT and will be applied in generating better training data for English-Icelandic.

A list of the project’s outputs:
Publications from the project are listed below. Additionally, we have published a number of datasets and software packages. Datasets and software in bold were developed with support from the grant.

1. Steinþór Steingrímsson, Örvar Kárason and Hrafn Loftsson. 2019. Augmenting a BiLSTM Tagger with a Morphological Lexicon and a Lexical Category Identification Step. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1161–1168, Varna, Bulgaria.

2. Steinþór Steingrímsson, Hrafn Loftsson and Andy Way. 2020. Effectively Aligning and Filtering Corpora under Sparse Data Conditions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 182–190, Online.

3. Haukur Páll Jónsson, Haukur Barri Símonarson, Vésteinn Snæbjarnarson, Steinþór Steingrímsson and Hrafn Loftsson. 2020. Experimenting with Different Machine Translation Models in Medium-Resource Settings. In Text, Speech, and Dialogue: 23rd International Conference, TSD 2020, Proceedings, pages 95–103, Brno, Czech Republic.

4. Steinþór Steingrímsson, Hrafn Loftsson and Andy Way. 2021. CombAlign: a Tool for Obtaining High-Quality Word Alignments. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 64–73, Reykjavik, Iceland (Online).

5. Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson and Andy Way. 2021. Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 8–17, Online.

6. Steinþór Steingrímsson, Hrafn Loftsson and Andy Way. 2021. PivotAlign: Leveraging High-Precision Word Alignments for Bilingual Dictionary Inference. In Proceedings of the Workshops and Tutorials held at LDK 2021, pages 190–199, Zaragoza, Spain.

7. Steinþór Steingrímsson, Luke O’Brien, Finnur Ingimundarson, Hrafn Loftsson and Andy Way. 2022. Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches. In Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference, pages 32–41, Marseille, France.

8. Steinþór Steingrímsson, Hrafn Loftsson and Andy Way. 2023. Filtering Matters: Experiments in Filtering Training Sets for Machine Translation. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), Tórshavn, Faroe Islands.

9. Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson and Andy Way. 2023. Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation. In Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation. Macau, China.

10. Steinþór Steingrímsson, Hrafn Loftsson and Andy Way. 2023. SentAlign: Accurate and Scalable Sentence Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Singapore.

PhD Thesis:
Steinþór Steingrímsson. 2023. Effectively compiling parallel corpora for machine translation in resource-scarce conditions.

Datasets:
ParIce: English-Icelandic parallel corpus (21.10)
- A realigned and refiltered version of the ParIce corpus, with additional material.
ParIce Dev/Test Sets 21.10
- This package contains test/dev sets with parallel segments in English and Icelandic. The sets are in the same domains as some of the ParIce data.
Gold Alignments for English-Icelandic Word Alignments (21.04)
- Manually curated word alignments, containing 604 aligned sentences in the English-Icelandic language pair.
Icelandic-English test set for sentence alignment 21.10
- This package contains 10 documents in Icelandic and English, with text segments aligned manually on sentence level, in order to be used for testing and comparing alignment methods.
Icelandic-English Parallel Sentence Extraction Dataset 21.10
- A dataset for testing the accuracy of parallel sentence extraction from comparable corpora.
English-Icelandic/Icelandic-English glossary 21.09
- The glossary contains 232,950 Icelandic-English pairs. Automatic methods were used to build a candidate list, which was manually checked by human annotators or compared to available manually curated Icelandic-English/English-Icelandic dictionaries and word lists.
Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering (21.10)
- A training set for a classifier that uses one or more of the following scores to select good-quality parallel segments in comparable corpora or to filter parallel corpora: LASER, LaBSE and WAScore.
Icelandic coherence classification set
- Contains 10,000 Icelandic sentences, each annotated as either an acceptable or an unacceptable Icelandic sentence, used to train a classifier to distinguish between the two classes.
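
As a rough illustration of how scores such as those named for the classification training set above could drive filtering, the sketch below keeps a sentence pair only if every score meets a threshold, with a separate threshold set per translation direction. It swaps in a plain threshold rule for a trained classifier and is not the project's implementation. The score values, thresholds and names (`ScoredPair`, `keep_pair`, `filter_corpus`, `THRESHOLDS`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredPair:
    source: str
    target: str
    scores: Dict[str, float]  # hypothetical precomputed scores, e.g. "labse", "wascore"


def keep_pair(pair: ScoredPair, thresholds: Dict[str, float]) -> bool:
    """Accept a pair only if every required score meets its threshold."""
    return all(
        pair.scores.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


def filter_corpus(pairs: List[ScoredPair], thresholds: Dict[str, float]) -> List[ScoredPair]:
    return [pair for pair in pairs if keep_pair(pair, thresholds)]


# Assumed, made-up thresholds: one set per translation direction.
THRESHOLDS = {
    "en-is": {"labse": 0.80, "wascore": 0.30},
    "is-en": {"labse": 0.70, "wascore": 0.20},
}

pairs = [
    ScoredPair("The dog barks.", "Hundurinn geltir.", {"labse": 0.92, "wascore": 0.45}),
    ScoredPair("The cat sleeps.", "Kötturinn sefur.", {"labse": 0.75, "wascore": 0.25}),
    ScoredPair("Click here to log in.", "Hundurinn geltir.", {"labse": 0.30, "wascore": 0.05}),
]

kept_en_is = filter_corpus(pairs, THRESHOLDS["en-is"])
kept_is_en = filter_corpus(pairs, THRESHOLDS["is-en"])
print(len(kept_en_is), len(kept_is_en))
```

Keeping the thresholds in a per-direction table makes it possible to tune each direction independently against its own development set.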

Software:
ABLTagger
- A part-of-speech tagger for Icelandic. When released it outperformed all previously published taggers by a substantial margin.
- https://github.com/steinst/ABLTagger

CombAlign
- A tool for more accurate word alignment.
- https://github.com/steinst/CombAlign

AlignMan
- A tool for manually carrying out word alignments and thereby generating evaluation data sets for word alignment.
- https://github.com/steinst/AlignMan

PivotAlign
- A tool for accurately generating candidates for bilingual lexicon induction.
- https://github.com/steinst/PivotAlign

SentAlign
- An accurate and easy-to-use sentence aligner. It outperforms previous sentence aligners on commonly used evaluation suites.
- https://github.com/steinst/SentAlign

Project title: Smíði skilvirkra þjálfunargagna fyrir vélþýðingar / Compiling Effective Training Data for Low-Resource Machine Translation
Project manager: Steinþór Steingrímsson, Reykjavik University
Grant type: Doctoral student grant
Grant period: 2022
Grant amount: ISK 7,791,000
Rannís reference number: 228654








