Classification of Spanish medical patient discharge summaries
team size
2-3 people
Technologies
Data Science
Deep Learning
Natural Language Processing
Maturity Level
Proof-of-Concept
CLASSIFICATION OF SPANISH MEDICAL PATIENT DISCHARGE SUMMARIES
As part of the “Artificial Intelligence for Europe” initiative we developed a system for the automatic detection and classification of medical diseases from patient discharge summaries in Spanish and English Language.
AI4EU is the European Union’s landmark Artificial Intelligence project, which seeks to develop a European AI ecosystem, bringing together knowledge, algorithms and tools to transform them into compelling solutions for users. AI4EU involves more than 70 partners, covering 21 countries and was funded with €20m by the Horizon 2020 framework program.
Our contribution to the AI4EU platform was published and can be accessed via the following link:
ABOUT THE CHALLENGE
With 1.8 M new cases diagnosed in 2018 and 48% mortality rate, Colorectal cancer (CRC) is the 2nd leading cause of cancer-related deaths worldwide. CRC has a 5-year survival rate of 92% if diagnosed in Stage I and only 11% if diagnosed in Stage IV. Screening programs are the most powerful tool to diagnose CRC early.
Our partner has developed the product “ColoFast”, an innovative blood-based test for the early detection of colorectal cancer based on a combination of biomarkers for patients between 50 to 85 years old.
The long-term objective of our partner relies on the development of a diagnostic algorithm. Based on risk factors identified from patients’ discharge summaries the objective is to identify patients with high risk of developing cancer or premalignant lesions.
Our Natural Language Processing (NLP) algorithm automatically predicts ICD-10 codes from Spanish and English free-text medical notes. It thereby automates the coding task that usually involves manual labour from a professional medical coder.
The predicted ICD-10 codes provide useful categorical features for further classification tasks for recognition of different types of cancer. These categorical features combined with other features extracted from electronic health records and blood-based features are expected to increase the accuracy of medical screening.
THE RESULTS
By utilizing the rich data annotations that were offered in the challenge dataset, the constructed model is able to precisely predict ICD-10 codes, obtaining performances better than most alternative scores. Our approach allows us to offer predictions that can be regarded as SOTA in the field of Spanish medical texts.
THE IMPACT
Our solution is able to distinguish between 2.552 different ICD-10 classes. We processed documents from over 1.000 patients.