Digital Corpus & Syntactic Annotation through Universal Dependencies

October 2
Digital Corpus and Syntactic Annotation through Universal Dependencies: UD Treebanks for Coptic, Classical Chinese, Old Japanese, and Ainu

Oct 2 9:50-10:00 Japan
Oct 2 10:00-11:00 Japan
UD Treebanking for Coptic DH: Low Resource NLP Technologies for NER, Lexicography and Linked Open Data

Amir Zeldes (Georgetown University)


The Universal Dependencies project, which provides morphosyntactically analyzed data in over 100 languages, offers homogeneous annotation schemes and workflows for both Big Data languages such as English and Low Resource languages often at the heart of Digital Humanities work. In this talk I will present work on a language from the latter group: Coptic, the language of first-millennium Egypt. Thanks to progress in NLP technologies and the development of UD-annotated data, our project, Coptic Scriptorium, has been able to create fully automatic tools for analyzing Coptic data, including morphological analysis, part-of-speech tagging, lemmatization, parsing and entity recognition. These analyses feed a suite of tools enabling Named Entity Linking to open data such as Wikipedia, as well as automatic generation of lexicographic examples and entity-type-based Word Sense Disambiguation in an online dictionary. This work shows that a variety of technologies often assumed to be relevant mainly for Big Data languages, such as Deep Learning and Transformers (BERT), can work well when even modest amounts of richly annotated UD data are available for bootstrapping.

Oct 2 11:00-12:00 Japan
UD for lzh (Classical Chinese), ojp (Old Japanese), and ain (Ainu)

Koichi Yasuoka (Kyoto University, Institute for Research in Humanities)

Discussion and Concluding Remarks