SAMEST
Sámi - Estonian language technology cooperation
Similar languages, same technologies
Abstract
The project aims to bring together language technology research in Norway and Estonia around a common challenge: how to build robust models for complex morphologies such as those of the Saami and Estonian languages. The two cooperating research groups already use the same approaches, but the Tromsø group has experience in building a model that can be integrated into different practical applications. In this project we share a common infrastructure and open source tools, and put the morphological models to use in machine translation and advanced iCALL systems.
1. Duration and expected total cost of the project
The duration is estimated at September 2, 2013 - April 30, 2017, and the total cost at 229,400 euros (Tartu 197,860 and Tromsø 31,541).
Principal investigator: Heiki-Jaan Kaalep
Research staff:
Jaak Pruulmann-Vengerfeldt – development of Estonian morphology tools: linguistics
Sulev Iva – development of Võru morphology tools: linguistics
Kadri Muischnek – development of Estonian Constraint Grammar parser: linguistics
Heli Uibo – development of FST tools for Estonian and Võru: computational formalisms
Other staff:
Maarja-Liisa Pilvik (PhD student); Keit Mõisavald (master's student)
Tarmo Vaino, programmer
2. Work in progress: state of affairs in January 2017
- Estonian Oahpa (language learning)
Results: http://oahpa.no/eesti/. Currently operational are: Leksa, Morfa-S (nouns and verbs), Morfa-C (nouns), Vasta-S. Based on the lexicon from the textbook "E nagu Eesti" (ca 1000 words with translations into English, Russian, Finnish, Swedish and German).
- Võru Oahpa (language learning)
Results: http://oahpa.no/voro/. Currently operational are: Numra (all the games), Leksa, Morfa-S (nouns, adjectives and verbs) and Morfa-C (nouns and verbs). The Leksa lexicon contains ca 1200 words with translations into Estonian, Finnish, English, North Sami and Norwegian, and includes semantic classes.
- Estonian finite-state morphology, including Constraint Grammar.
1. https://victorio.uit.no/langtech/trunk/langs/est/,
2. https://victorio.uit.no/langtech/trunk/experiment-langs/est/
- Estonian – Finnish statistical machine translation demo
- Estonian – Finnish rule-based machine translation resources and documentation.
- Finnish – Saami rule-based machine translation.
- Finnish – Estonian rule-based machine translation.
3. Related efforts, stemming (partially) from the project
- Rule-based MT course, Nov 2-13, 2015, Tartu (http://wiki.apertium.org/wiki/Tartu_Apertium_Course)
- ACL SIG for Computational Linguistics for Uralic Languages (https://acl-sigur.github.io/)
4. Publications
- Heiki-Jaan Kaalep. (2015). Eesti verbi vormistik. Keel ja Kirjandus, 1, 1 - 15.
- Uibo, Heli; Pruulmann-Vengerfeldt, Jaak; Rueter, Jack; Iva, Sulev (2015). Oahpa! Õpi! Opiq! Developing free online programs for learning Estonian and Võro. In: Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015: 4th workshop on NLP for computer-assisted language learning, Vilnius, Lithuania, May 11-13, 2015. Ed. Elena Volodina, Lars Borin, Ildikó Pilán. Linköping: Linköping University Electronic Press, 51−64. (NEALT Proceedings Series 26 / Linköping Electronic Conference Proceedings 114).
- Heiki-Jaan Kaalep. (2016). Kas Google on ühe- või kahesilbiline sõna? (Is Google a mono- or a disyllabic word?). Keel ja Kirjandus, 1, 1 - 15.
- Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen (2016) Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies. (http://www.lrec-conf.org/proceedings/lrec2016/pdf/411_Paper.pdf)
- Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen and Krista Liin (2016). Parsing Estonian: Tools and Resources. In: Second International Workshop on Computational Linguistics for Uralic Languages.