Accessing primary care Big Data: the development of a software algorithm to explore the rich content of consultation records

BMJ Open. 2015 Aug 21;5(8):e008160. doi: 10.1136/bmjopen-2015-008160.

Abstract

Objective: To develop a natural language processing software inference algorithm to classify the content of primary care consultations using electronic health record Big Data and subsequently test the algorithm's ability to estimate the prevalence and burden of childhood respiratory illness in primary care.

Design: Algorithm development and validation study. To classify consultations, the algorithm is designed to interrogate clinical narrative entered as free text, diagnostic (Read) codes created and medications prescribed on the day of the consultation.
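As an illustration only, the idea of drawing on three same-day evidence sources (free text, Read codes, prescriptions) can be sketched as a simple rule-based check. This is a deliberately naive stand-in, not the study's natural language processing inference algorithm: the term list, the Read code prefix, the medication list and all names below (`Consultation`, `is_respiratory`) are hypothetical assumptions chosen for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical vocabularies, for illustration only (not taken from the study).
RESPIRATORY_TERMS = {"cough", "wheeze", "bronchiolitis", "asthma", "sore throat"}
RESPIRATORY_READ_PREFIXES = ("H",)  # assumption: respiratory chapter prefix
RESPIRATORY_MEDICATIONS = {"salbutamol", "fluticasone"}


@dataclass
class Consultation:
    """One consultation: narrative, codes and prescriptions from the same day."""
    free_text: str = ""
    read_codes: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)


def is_respiratory(consultation: Consultation) -> bool:
    """Flag a consultation as respiratory if any evidence source matches."""
    text = consultation.free_text.lower()
    # Whole-word (or whole-phrase) match against the clinical narrative.
    text_hit = any(
        re.search(r"\b" + re.escape(term) + r"\b", text)
        for term in RESPIRATORY_TERMS
    )
    # Prefix match against diagnostic codes recorded that day.
    code_hit = any(
        code.upper().startswith(RESPIRATORY_READ_PREFIXES)
        for code in consultation.read_codes
    )
    # Exact (case-insensitive) match against medications prescribed that day.
    med_hit = any(
        med.lower() in RESPIRATORY_MEDICATIONS
        for med in consultation.medications
    )
    return text_hit or code_hit or med_hit
```

A rule like this ignores negation ("no cough"), misspellings and abbreviations in the narrative, which is one reason a purpose-built inference algorithm validated against clinician judgement is needed.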

Setting: Thirty-six consenting primary care practices from a mixed urban and semirural region of New Zealand. Three independent sets of 1200 child consultation records were randomly extracted from a data set of all general practitioner consultations in participating practices between 1 January 2008 and 31 December 2013 for children under 18 years of age (n=754,242). Each consultation record within these sets was independently classified by two expert clinicians as respiratory or non-respiratory, and subclassified according to respiratory diagnostic categories to create three 'gold standard' sets of classified records. These three gold standard record sets were used to train, test and validate the algorithm.
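The random extraction of three record sets could be sketched as follows. This is a minimal sketch under assumptions: it treats "independent" as meaning non-overlapping, and the function name `draw_disjoint_sets`, the integer record identifiers and the fixed seed are hypothetical choices for the example rather than details reported by the study.

```python
import random
from typing import List, Sequence


def draw_disjoint_sets(
    record_ids: Sequence[int],
    n_sets: int = 3,
    set_size: int = 1200,
    seed: int = 0,
) -> List[List[int]]:
    """Draw n_sets non-overlapping random samples of set_size record IDs."""
    needed = n_sets * set_size
    if needed > len(record_ids):
        raise ValueError("not enough records to draw the requested sets")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    # Sampling without replacement guarantees no record appears twice.
    sampled = rng.sample(list(record_ids), needed)
    # Slice the single sample into consecutive, equally sized sets.
    return [sampled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]
```

For example, `train_ids, test_ids, validation_ids = draw_disjoint_sets(range(754242))` would yield three sets of 1200 identifiers, one for each of the train, test and validate roles.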

Outcome measures: Sensitivity, specificity, positive predictive value and F-measure were calculated to illustrate the algorithm's ability to replicate judgements of expert clinicians within the 1200 record gold standard validation set.
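These outcome measures can be computed from paired clinician and algorithm labels as sketched below. The sketch assumes binary labels (respiratory or not) and assumes that "F-measure" means the balanced F1 score, the harmonic mean of positive predictive value and sensitivity; the names `Metrics` and `classification_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    f_measure: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Metrics:
    """Compare algorithm labels with gold standard labels (True = respiratory)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives

    sensitivity = _ratio(tp, tp + fn)  # share of true cases detected
    specificity = _ratio(tn, tn + fp)  # share of non-cases correctly excluded
    ppv = _ratio(tp, tp + fp)          # share of positive calls that are right
    # Assumed F1: 2 * PPV * sensitivity / (PPV + sensitivity) = 2TP / (2TP + FP + FN)
    f_measure = _ratio(2 * tp, 2 * tp + fp + fn)
    return Metrics(sensitivity, specificity, ppv, f_measure)
```

Here `gold` would hold the expert clinicians' classifications for the validation set and `predicted` the algorithm's classifications for the same records, in the same order.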

Results: The algorithm was able to identify respiratory consultations in the 1200 record validation set with a sensitivity of 0.72 (95% CI 0.67 to 0.78) and a specificity of 0.95 (95% CI 0.93 to 0.98). The positive predictive value of algorithm respiratory classification was 0.93 (95% CI 0.89 to 0.97). The positive predictive value of the algorithm classifying consultations as being related to specific respiratory diagnostic categories ranged from 0.68 (95% CI 0.40 to 1.00; other respiratory conditions) to 0.91 (95% CI 0.79 to 1.00; throat infections).

Conclusions: A software inference algorithm that uses primary care Big Data can accurately classify the content of clinical consultations. This algorithm will enable accurate estimation of the prevalence of childhood respiratory illness in primary care and resultant service utilisation. The methodology can also be applied to other areas of clinical care.

Keywords: PRIMARY CARE.

Publication types

  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Adolescent
  • Algorithms*
  • Child
  • Child, Preschool
  • Electronic Health Records / standards*
  • Female
  • Humans
  • Infant
  • Infant, Newborn
  • Male
  • Natural Language Processing
  • New Zealand / epidemiology
  • Outcome Assessment, Health Care
  • Primary Health Care / statistics & numerical data*
  • Referral and Consultation / classification*
  • Referral and Consultation / standards
  • Respiratory Tract Diseases / epidemiology*
  • Sensitivity and Specificity
  • Software*