Language is implicit: it omits information. Filling this information gap requires contextual inference, background and commonsense knowledge, and reasoning over situational context. Language also evolves, i.e., it specializes and changes over time. For example, many different languages and domains exist, new domains arise, and both evolve constantly. Thus, language understanding also requires continuous and efficient adaptation to new languages and domains, and transfer to, and between, both. Current language understanding methods, however, focus on high-resource languages and domains, use little to no context, and assume static data, task, and target distributions.
The research in Cora4NLP aims to address these challenges. It builds on the expertise and results of the predecessor project DEEPLEE and is carried out jointly between DFKI’s language technology research departments in Berlin and Saarbrücken. Specifically, our goal is to develop natural language understanding methods that enable:
Cora4NLP is funded by the German Federal Ministry of Education and Research (BMBF) under funding code 01IW20010.
We are happy to announce that two papers from Cora4NLP members have been accepted for publication at the 17th Conference of the European Chapter of the Association for Computational Linguistics. The conference will take place from May 2nd to May 6th, 2023.
One paper from Cora4NLP authors has been accepted for publication at CoNLL 2022, the SIGNLL Conference on Computational Natural Language Learning. The conference is co-located with EMNLP 2022 and planned to be a hybrid meeting.
Two papers from Cora4NLP authors have been accepted for publication at EMNLP 2022, the 2022 Conference on Empirical Methods in Natural Language Processing. The conference is planned to be a hybrid meeting and will take place in Abu Dhabi, from Dec 7th to Dec 11th, 2022.
One paper from Cora4NLP authors has been accepted for publication at the Workshop on Information Extraction from Scientific Publications (WIESP). The workshop will be held at AACL-IJCNLP 2022, the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, which will take place as an online-only event from Nov.
One paper from Cora4NLP researchers has been accepted for oral presentation at COLING 2022, the 29th International Conference on Computational Linguistics. The conference is planned to be a hybrid meeting and will take place in Gyeongju, Korea, from Oct 12th to Oct 17th, 2022.
We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora, designed for low- and extremely low-resource scenarios, based on combining noisy semantic signals from joint bilingual spaces with orthographic cues modelling sound change. We apply our method to the North Indian dialect continuum, comprising several dozen dialects and languages spoken by more than 100 million people. Many of these languages are zero-resource, and natural language processing for them is therefore non-existent. We first collect monolingual data for 26 Indic languages, 16 of which were previously zero-resource, and perform exploratory character, lexical and subword cross-lingual alignment experiments for the first time at this scale on this dialect continuum. We create bilingual evaluation lexicons against Hindi for 20 of the languages. We then apply our cognate identification method on the data, and show that our method outperforms both traditional orthography baselines as well as EM-style learnt edit distance matrices. To the best of our knowledge, this is the first work to combine traditional orthographic cues with noisy bilingual embeddings to tackle unsupervised cognate detection in a (truly) low-resource setup, showing that even noisy bilingual embeddings can act as good guides for this task. We release our multilingual dialect corpus, called HinDialect, as well as our scripts for evaluation data collection and cognate induction.
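The core idea of combining the two signals can be sketched as an interpolation between an orthographic similarity and the cosine similarity of word embeddings in a joint bilingual space. This is a minimal illustrative sketch, not the paper's exact method: the `orthographic_similarity` function, the `alpha` mixing weight, and the use of `difflib` as a stand-in for learnt edit-distance models are all assumptions.

```python
import difflib
import math

def orthographic_similarity(a: str, b: str) -> float:
    """Character-overlap similarity in [0, 1] -- a simple stand-in for
    the orthographic cues (e.g., learnt edit-distance matrices) used in
    the paper."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cognate_score(word_a, word_b, emb_a, emb_b, alpha=0.5) -> float:
    """Interpolate the noisy semantic signal (bilingual embeddings)
    with the orthographic cue; `alpha` is a hypothetical mixing weight."""
    semantic = cosine(emb_a, emb_b)
    orthographic = orthographic_similarity(word_a, word_b)
    return alpha * semantic + (1 - alpha) * orthographic
```

A candidate pair that is both semantically close in the bilingual space and orthographically similar would thus score higher than a pair that matches on only one signal.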
Prompting pre-trained language models has achieved impressive performance on various NLP tasks, especially in low data regimes. Despite the success of prompting in monolingual settings, applying prompt-based methods in multilingual scenarios has been limited to a narrow set of tasks, due to the high cost of handcrafting multilingual prompts. In this paper, we present the first work on prompt-based multilingual relation classification (RC), by introducing an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels. We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages, prompt variants, and English-task training in cross-lingual settings. We find that in both fully supervised and few-shot scenarios, our prompt method beats competitive baselines: fine-tuning XLM-R-EM and null prompts. It also outperforms the random baseline by a large margin in zero-shot experiments. Our method requires little in-language knowledge and can be used as a strong baseline for similar multilingual classification tasks.
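The prompt-construction step described above can be illustrated with a small sketch. Everything here is a hypothetical example, not the paper's actual templates: the cloze template, the `LABEL_VERBALIZERS` table, and the function names are illustrative. The only per-language effort is the minimal translation of the class-label verbalizations, mirroring the abstract's claim.

```python
# Minimal per-language translation of relation class labels; in the
# paper's setting this is the only handcrafted multilingual component.
LABEL_VERBALIZERS = {
    "en": {"birthplace": "born in", "employer": "works for"},
    "de": {"birthplace": "geboren in", "employer": "arbeitet für"},
}

def build_prompt(head: str, tail: str, mask: str = "<mask>") -> str:
    """Turn an entity pair from a relation triple into a cloze-style
    prompt whose mask position the language model fills."""
    return f"{head} {mask} {tail}."

def candidate_fillers(lang: str):
    """Verbalized class labels the model scores at the mask position;
    the highest-scoring filler yields the predicted relation."""
    return list(LABEL_VERBALIZERS[lang].values())
```

Classification then amounts to scoring each candidate filler at the mask position with a pre-trained masked language model and picking the argmax.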
Scholarly Argumentation Mining (SAM) has recently gained attention due to its potential to help scholars cope with the rapid growth of published scientific literature. It comprises two subtasks: argumentative discourse unit recognition (ADUR) and argumentative relation extraction (ARE), both of which are challenging since they require, e.g., the integration of domain knowledge, the detection of implicit statements, and the disambiguation of argument structure. While previous work focused on dataset construction and baseline methods for specific document sections, such as abstracts or results, full-text scholarly argumentation mining has seen little progress. In this work, we introduce a sequential pipeline model combining ADUR and ARE for full-text SAM, and provide a first analysis of the performance of pretrained language models (PLMs) on both subtasks. We establish a new SotA for ADUR on the Sci-Arg corpus, outperforming the previous best reported result by a large margin (+7% F1). We also present the first results for ARE, and thus for the full AM pipeline, on this benchmark dataset. Our detailed error analysis reveals that non-contiguous ADUs as well as the interpretation of discourse connectors pose major challenges, and that data annotation needs to be more consistent.
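The sequential pipeline structure can be sketched as follows. This is an assumed skeleton, not the paper's implementation: the `ADU` dataclass and the `adur_model`/`are_model` callables are hypothetical stand-ins for the fine-tuned PLMs evaluated on each subtask.

```python
from dataclasses import dataclass

@dataclass
class ADU:
    """An argumentative discourse unit: a text span with a type label."""
    start: int
    end: int
    label: str  # e.g., "claim" or "data"

def sam_pipeline(text, adur_model, are_model):
    """Sequential SAM pipeline: first recognize argumentative discourse
    units (ADUR), then extract argumentative relations between each
    ordered pair of recognized units (ARE)."""
    adus = adur_model(text)  # list[ADU]
    relations = []
    for i, src in enumerate(adus):
        for j, tgt in enumerate(adus):
            if i != j:
                rel = are_model(text, src, tgt)  # e.g., "supports" or None
                if rel:
                    relations.append((i, j, rel))
    return adus, relations
```

Because ARE consumes ADUR's output, errors in unit recognition propagate to relation extraction, which is one reason a full-pipeline evaluation, as reported in the paper, is informative.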