New Language Resource Created by UT Library and Estonian National Library
Collaboration between the University of Tartu Library and the Estonian National Library has resulted in a new language resource: frequency lists of n-grams compiled from recent Estonian works of fiction. The resource includes both lists of n-grams as they appear in the texts (token forms) and lists of lemma n-grams.
On 1 January 2017, an amendment to the Copyright Act took effect that permits the use of files of copyright-protected publications for data mining, provided the use is non-commercial (Copyright Act § 19 (31)). This considerably broadens access to newer and more varied research data.
As a first attempt, we created a free language resource that is accessible to everybody. As additional research and training data, it should be particularly useful for linguists and for specialists in language technology and machine learning.
For this resource, we were able to use the texts of Estonian works of fiction published mainly in 2017. This material is not sufficient for more exhaustive research, but we hope to continue our work and add new data regularly.
We created lists of both token-form n-grams and lemma n-grams. They are available in DataDOI, the University of Tartu data repository: http://datadoi.ut.ee/handle/33/41
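For readers unfamiliar with the two kinds of lists, the difference between token-form and lemma n-grams can be illustrated with a minimal Python sketch. This is not the procedure used to build the resource, and it does not reflect the format of the published files. The sample sentence, the function names and the hand-written `TOY_LEMMAS` table are assumptions made purely for illustration; a real pipeline would use a proper Estonian lemmatizer.

```python
from collections import Counter


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_frequencies(items, n):
    """Count how often each n-gram occurs in the sequence."""
    return Counter(ngrams(items, n))


# Hypothetical, hand-written lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {
    "luges": "lugema",
    "raamatut": "raamat",
    "ajalehte": "ajaleht",
}

# Invented sample text, split on whitespace for simplicity.
tokens = "ta luges raamatut ja ta luges ajalehte".split()

# Words missing from the toy table are kept unchanged.
lemmas = [TOY_LEMMAS.get(token, token) for token in tokens]

token_bigrams = ngram_frequencies(tokens, 2)  # n-grams as they appear in the text
lemma_bigrams = ngram_frequencies(lemmas, 2)  # n-grams over dictionary forms

# Print both frequency lists, most frequent first.
for label, counts in (("token", token_bigrams), ("lemma", lemma_bigrams)):
    for gram, freq in counts.most_common():
        print(label, " ".join(gram), freq)
```

Counting over lemmas merges inflected forms of the same word, which matters for a richly inflected language such as Estonian, while the token-form lists preserve the wording exactly as it occurs in the texts.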