Boosting Drug Named Entity Recognition using an Aggregate Classifier

Korkontzelos, Ioannis, Piliouras, Dimitrios, Dowsey, Andrew W. and Ananiadou, Sophia (2015) Boosting Drug Named Entity Recognition using an Aggregate Classifier. Artificial Intelligence in Medicine (AIIM), 65 (2). pp. 145-153. ISSN 0933-3657 DOI

PDF (Post-print)
Manuscript-AIMedicine-KorkontzelosEtAl.pdf - Accepted Version
Available under License Creative Commons Attribution.

Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high-quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations.

Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. We develop a voting system that combines a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employ genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and use them as an additional means of recognition.

Materials: Our approach uses a dictionary of drug names (DrugBank), a small manually annotated corpus (the pharmacokinetic corpus) and a part of the UKPMC database as raw biomedical text. Gold-standard and silver-annotated data are used to train maximum entropy and multinomial logistic regression classifiers.

Results: Aggregating drug NER methods based on gold-standard annotations, dictionary knowledge and patterns improved performance over models trained on gold-standard annotations only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns is shown to achieve performance comparable to that of models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names.

Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models built on dictionary knowledge can achieve classification performance similar to that of the best-performing model trained on gold-standard annotations.
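
As a rough illustration of the aggregation idea described in the abstract, the following minimal Python sketch combines a dictionary lookup and a suffix-pattern matcher through a simple vote threshold. Everything in it is an assumption made for illustration: the abstract does not specify the voting rule, the suffix patterns shown are hypothetical and are not the 11 patterns evolved in the paper, and the three-word dictionary merely stands in for DrugBank. A trained statistical classifier could be added as a further voter with the same interface.

```python
import re

# Hypothetical suffix pattern for illustration only; these are NOT the
# 11 genetic-programming-evolved patterns reported in the paper.
SUFFIX_PATTERNS = [
    re.compile(r"\w+(?:azole|cillin|mycin|statin)$", re.IGNORECASE),
]

# Tiny stand-in dictionary; the paper uses DrugBank (assumption: lowercase lookup).
DICTIONARY = {"aspirin", "ketoconazole", "warfarin"}


def dictionary_tagger(tokens):
    """Flag tokens that appear in the drug-name dictionary."""
    return [tok.lower() in DICTIONARY for tok in tokens]


def pattern_tagger(tokens):
    """Flag tokens that match at least one suffix pattern."""
    return [any(p.match(tok) for p in SUFFIX_PATTERNS) for tok in tokens]


def aggregate(tokens, taggers, min_votes=1):
    """Mark a token as a drug name if at least `min_votes` taggers flag it.

    `min_votes` is a hypothetical parameter: 1 favours recall (union of
    the voters), higher values favour precision.
    """
    votes = [tagger(tokens) for tagger in taggers]
    return [sum(column) >= min_votes for column in zip(*votes)]


tokens = ["Patients", "received", "ketoconazole", "and", "erythromycin"]
taggers = [dictionary_tagger, pattern_tagger]
recall_oriented = aggregate(tokens, taggers, min_votes=1)
precision_oriented = aggregate(tokens, taggers, min_votes=2)
```

With `min_votes=1`, the pattern voter recovers "erythromycin" even though it is absent from the stand-in dictionary, which mirrors the recall motivation given for the suffix patterns; requiring both voters to agree keeps only "ketoconazole".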

Item Type: Article
Uncontrolled Keywords: Named entity annotation sparsity, Gold-standard vs. silver-standard annotations, Named entity recogniser aggregation, Genetic-programming-evolved string-similarity patterns, Drug named entity recognition
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing and Information Systems
Date Deposited: 08 Feb 2016 12:30
