Contextual Reasoning and Adaptation for Natural Language Processing

Language is implicit: it omits information. Filling this information gap requires contextual inference, background and commonsense knowledge, and reasoning over the situational context. Language also evolves: it specializes and changes over time. For example, many different languages and domains exist, new ones arise, and both evolve constantly. Language understanding therefore also requires continuous and efficient adaptation to new languages and domains, as well as transfer to, and between, both. Current language understanding methods, however, focus on high-resource languages and domains, use little to no context, and assume static data, task, and target distributions.

The research in Cora4NLP aims to address these challenges. It builds on the expertise and results of the predecessor project DEEPLEE and is carried out jointly by DFKI’s language technology research departments in Berlin and Saarbrücken. Specifically, our goal is to develop natural language understanding methods that enable:

  • reasoning over broader co-texts and contexts
  • efficient adaptation to novel and/or low-resource contexts
  • continual adaptation to, and generalization over, evolving contexts

Cora4NLP is funded by the German Federal Ministry of Education and Research (BMBF) under funding code 01IW20010.

Recent Publications

Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum

We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora, designed for low- and extremely low-resource scenarios and based on combining noisy semantic signals from joint bilingual embedding spaces with orthographic cues modelling sound change. We apply our method to the North Indian dialect continuum, which contains several dozen dialects and languages spoken by more than 100 million people. Many of these languages are zero-resource, and natural language processing for them is therefore non-existent. We first collect monolingual data for 26 Indic languages, 16 of which were previously zero-resource, and perform exploratory character-, lexical- and subword-level cross-lingual alignment experiments for the first time at this scale on this dialect continuum. We create bilingual evaluation lexicons against Hindi for 20 of the languages. We then apply our cognate identification method to the data and show that it outperforms both traditional orthographic baselines and EM-style learnt edit-distance matrices. To the best of our knowledge, this is the first work to combine traditional orthographic cues with noisy bilingual embeddings to tackle unsupervised cognate detection in a (truly) low-resource setup, showing that even noisy bilingual embeddings can act as good guides for this task. We release our multilingual dialect corpus, called HinDialect, as well as our scripts for evaluation data collection and cognate induction.
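
To make the general idea concrete, the following is a minimal Python sketch of scoring a candidate word pair by interpolating a semantic signal (cosine similarity in a shared bilingual embedding space) with an orthographic signal. All function names, the normalization, and the equal weighting are illustrative assumptions, not the paper's formulation; in particular, the plain Levenshtein distance used here is a stand-in for the paper's orthographic cues that model sound change, and corresponds more closely to the baselines the method is compared against.

    # Illustrative sketch only: combines a semantic signal (cosine similarity
    # of bilingual embeddings) with an orthographic signal (normalized edit
    # distance). Names, weighting, and normalization are assumptions; the
    # paper's actual orthographic model captures sound change.
    import numpy as np

    def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
        """Semantic signal: cosine similarity of two embedding vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def levenshtein(a: str, b: str) -> int:
        """Standard edit distance via dynamic programming (two-row variant)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def orthographic_similarity(a: str, b: str) -> float:
        """Orthographic signal: edit distance rescaled to [0, 1]."""
        if not a and not b:
            return 1.0
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))

    def cognate_score(w1: str, w2: str,
                      emb1: np.ndarray, emb2: np.ndarray,
                      alpha: float = 0.5) -> float:
        """Interpolate the semantic and orthographic signals; alpha is a
        hypothetical mixing weight, not a value from the paper."""
        return (alpha * cosine_similarity(emb1, emb2)
                + (1 - alpha) * orthographic_similarity(w1, w2))

Given word embeddings projected into a joint bilingual space, candidate pairs would then be ranked by cognate_score, with high-scoring pairs (semantically close and orthographically similar) proposed as cognates or borrowings.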