Sámi - Estonian language technology cooperation
Similar languages, same technologies
The project brings together language technology research in Norway and Estonia around a common challenge: how to build robust models for languages with complex morphology, such as the Saami languages and Estonian. The two cooperating research groups already use the same approaches, but the Tromsø group has experience in building models that can be integrated into different practical applications. In this project we share a common infrastructure and open source tools, and put the morphological models to use in machine translation and advanced iCALL systems.
1. Duration and expected total cost of the project
The project is expected to run from September 2, 2013 to April 30, 2017, at an estimated total cost of 229,400 euros (Tartu 197,860 and Tromsø 31,541).
Principal investigator: Heiki-Jaan Kaalep
Jaak Pruulmann-Vengerfeldt – development of Estonian morphology tools: linguistics
Sulev Iva – development of Võru morphology tools: linguistics
Tiina Puolakainen – development of Estonian Constraint Grammar parser: computational formalisms
Kadri Muischnek – development of Estonian Constraint Grammar parser: linguistics
Heli Uibo – development of FST tools for Estonian and Võru: computational formalisms
Dage Särg; Eleri Aedmaa (PhD student); Käbi Suvi (bachelor student); Keit Mõisavald (bachelor student)
Tarmo Vaino – programmer
2. Work in progress: state of affairs in January 2015
Results: http://testing.oahpa.no/eesti/. Currently operational: Leksa, Morfa-S (nouns and verbs), Morfa-C (nouns) and Vasta-S. They are based on the lexicon of the textbook "E nagu Eesti" (ca 1000 words with translations into English, Russian, Finnish, Swedish and German).
Results: http://testing.oahpa.no/voro/. Currently operational: Numra (all the games), Leksa, Morfa-S (nouns and verbs) and Morfa-C (nouns and verbs). The Leksa lexicon contains ca 1200 words with translations into Estonian, Finnish, English, North Sami and Norwegian, and includes semantic classes.
- Estonian finite-state morphology, including Constraint Grammar.
- Estonian – Finnish statistical machine translation demo.
- Estonian – Finnish rule-based machine translation resources and documentation.
- Finnish – Saami rule-based machine translation.
3. Related efforts, stemming (partially) from the project
- Rule-based MT course Nov 2-13, Tartu (http://wiki.apertium.org/wiki/Tartu_Apertium_Course)
- ACL SIG for Computational Linguistics for Uralic Languages (http://gtweb.uit.no/sigur/)
- Heiki-Jaan Kaalep. (2015). Eesti verbi vormistik. Keel ja Kirjandus, 1, 1 - 15.
- Uibo, Heli; Pruulmann-Vengerfeldt, Jaak; Rueter, Jack; Iva, Sulev (2015). Oahpa! Õpi! Opiq! Developing free online programs for learning Estonian and Võro. In: Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015, Vilnius, Lithuania, May 11-13, 2015. Ed. Elena Volodina, Lars Borin, Ildikó Pilán. Linköping: Linköping University Electronic Press, 51–64. (NEALT Proceedings Series 26 / Linköping Electronic Conference Proceedings 114).
- Heiki-Jaan Kaalep. (2016). Kas Google on ühe- või kahesilbiline sõna? (Is Google a mono- or a disyllabic word?). Keel ja Kirjandus, 1, 1 - 15.
- Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen (2016) Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies (LREC 2016, accepted). Final version to be published in the Proceedings of LREC 2016.
- Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen and Krista Liin (2016). Parsing Estonian: Tools and Resources. In: Second International Workshop on Computational Linguistics for Uralic Languages. To be published open-access in the repository of the University of Szeged.