A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history

Marc P. Maurits^*, Ilya Korsunsky, Soumya Raychaudhuri, Shawn N. Murphy, Jordan W. Smoller, Scott T. Weiss, Thomas W.J. Huizinga, Marcel J.T. Reinders, Erik B. Van Den Akker, More Authors

^*Corresponding author for this work

doi:10.1093/jamia/ocac008

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

3 Citations (Scopus)

31 Downloads (Pure)

Abstract

Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects. Material and Methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features. Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles. Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data. Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.

Original language	English
Pages (from-to)	761-769
Number of pages	9
Journal	Journal of the American Medical Informatics Association
Volume	29
Issue number	5
DOIs	https://doi.org/10.1093/jamia/ocac008
Publication status	Published - 2022

Keywords

clustering
electronic health records
electronic medical records
eMERGE
ICD
PhenoGraph

Access to Document

10.1093/jamia/ocac008

ocac008: Final published version, 804 KB. Licence: CC BY-NC

Cite this

Maurits, M. P., Korsunsky, I., Raychaudhuri, S., Murphy, S. N., Smoller, J. W., Weiss, S. T., Huizinga, T. W. J., Reinders, M. J. T., Van Den Akker, E. B., & More Authors (2022). A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history. Journal of the American Medical Informatics Association, 29(5), 761-769. https://doi.org/10.1093/jamia/ocac008

@article{3ab2510bbbe14bee895bfdd49b2f5a00,

title = "A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history",

abstract = "Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects. Material and Methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features. Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 {"}other headache{"} clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles. Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data. Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.",

keywords = "clustering, electronic health records, electronic medical records, eMERGE, ICD, PhenoGraph",

author = "Maurits, {Marc P.} and Korsunsky, Ilya and Raychaudhuri, Soumya and Murphy, {Shawn N.} and Smoller, {Jordan W.} and Weiss, {Scott T.} and Huizinga, {Thomas W.J.} and Reinders, {Marcel J.T.} and {Van Den Akker}, {Erik B.} and others",

year = "2022",

doi = "10.1093/jamia/ocac008",

language = "English",

volume = "29",

pages = "761--769",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "5",

}

Maurits, MP, Korsunsky, I, Raychaudhuri, S, Murphy, SN, Smoller, JW, Weiss, ST, Huizinga, TWJ, Reinders, MJT, Van Den Akker, EB & More Authors 2022, 'A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history', Journal of the American Medical Informatics Association, vol. 29, no. 5, pp. 761-769. https://doi.org/10.1093/jamia/ocac008

A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history. / Maurits, Marc P.; Korsunsky, Ilya; Raychaudhuri, Soumya et al.
In: Journal of the American Medical Informatics Association, Vol. 29, No. 5, 2022, p. 761-769.

TY - JOUR

T1 - A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history

AU - Maurits, Marc P.

AU - Korsunsky, Ilya

AU - Raychaudhuri, Soumya

AU - Murphy, Shawn N.

AU - Smoller, Jordan W.

AU - Weiss, Scott T.

AU - Huizinga, Thomas W.J.

AU - Reinders, Marcel J.T.

AU - Van Den Akker, Erik B.

AU - More Authors

PY - 2022

Y1 - 2022

N2 - Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects. Material and Methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features. Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles. Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data. Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.

AB - Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects. Material and Methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features. Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles. Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data. Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.

KW - clustering

KW - electronic health records

KW - electronic medical records

KW - eMERGE

KW - ICD

KW - PhenoGraph

UR - http://www.scopus.com/inward/record.url?scp=85128489105&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocac008

DO - 10.1093/jamia/ocac008

M3 - Article

C2 - 35139533

AN - SCOPUS:85128489105

SN - 1067-5027

VL - 29

SP - 761

EP - 769

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 5

ER -
