Responsible and valid use of free text notes in electronic health records to improve medical prediction research
Short project summary
The availability of electronic health records (EHRs), containing valuable information about clinical practice, is growing. With recent progress in artificial intelligence, the number of medical studies using natural language processing (NLP) and large language models (LLMs) to automatically collect study variables and outcomes from free text is rapidly increasing. Examples include the prediction of text-extracted fall events in the elderly and the prediction of text-extracted side effects in patients using antipsychotic medication. However, the potential impact of errors made by NLP/LLM methods on subsequent medical study results has received insufficient attention, and preconditions for the responsible use of free text in such studies are lacking (e.g., minimum text-mining quality and reporting standards, as well as guidance on interpretation pitfalls, such as the fact that in textual notes, absence of information is generally not evidence of absence). This project aims to systematically study how erroneous NLP/LLM output may induce bias in subsequent medical prediction studies, and to arrive at a set of preconditions and recommendations for the responsible conduct, reporting, and interpretation of prediction research using variables automatically collected from free text.
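The kind of bias the project targets can be illustrated with a minimal simulation. The sketch below assumes purely hypothetical numbers: a binary risk factor that doubles the odds of a fall, and an NLP extractor for the fall outcome with sensitivity 0.80 and specificity 0.95. Because the extraction errors are non-differential with respect to the risk factor, the odds ratio estimated from the text-extracted outcome is attenuated toward 1 relative to the true odds ratio.

```python
import random

random.seed(0)

# Illustrative assumptions (not from the project): true odds ratio = 2,
# baseline fall risk 10%, NLP sensitivity 0.80 and specificity 0.95.
N = 100_000
SENS, SPEC = 0.80, 0.95

def risk(exposed):
    # True fall probability: 10% if unexposed; exposed odds are doubled.
    base_odds = 0.10 / 0.90
    return 0.10 if not exposed else (2 * base_odds) / (1 + 2 * base_odds)

def odds_ratio(rows):
    # rows: list of (exposed, outcome) pairs; 2x2 table odds ratio.
    a = sum(1 for e, y in rows if e and y)          # exposed, outcome
    b = sum(1 for e, y in rows if e and not y)      # exposed, no outcome
    c = sum(1 for e, y in rows if not e and y)      # unexposed, outcome
    d = sum(1 for e, y in rows if not e and not y)  # unexposed, no outcome
    return (a * d) / (b * c)

true_rows, nlp_rows = [], []
for _ in range(N):
    exposed = random.random() < 0.5
    fall = random.random() < risk(exposed)
    # NLP-extracted outcome: misses some true falls (imperfect sensitivity)
    # and flags some non-falls (imperfect specificity).
    extracted = (random.random() < SENS) if fall else (random.random() >= SPEC)
    true_rows.append((exposed, fall))
    nlp_rows.append((exposed, extracted))

print(f"odds ratio, true outcome:          {odds_ratio(true_rows):.2f}")
print(f"odds ratio, NLP-extracted outcome: {odds_ratio(nlp_rows):.2f}")
```

Under these assumed error rates the text-extracted odds ratio falls noticeably below the true value of about 2, showing how even a seemingly accurate extractor can distort downstream study results.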