An expert-in-the-loop method for domain-specific document categorization based on small training data

Journal of the Association for Information Science and Technology (JASIST) · 2022
Cited by 9
ABS 3

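Summary

For document categorization problems where annotated data are scarce and labeling requires in-depth reading by domain experts, the paper proposes a method that combines expert knowledge with computational models, achieves automated categorization using 93 annotated documents, and finds that expert knowledge can improve categorization performance.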

Abstract

Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.

Text classification · Machine learning · Information retrieval · Environmental science · Data science