Developing a Long COVID Phenotype for Postacute COVID-19 in a National Primary Care Sentinel Cohort: Observational Retrospective Database Analysis

Background: Following COVID-19, up to 40% of people have ongoing health problems, referred to as postacute COVID-19 or long COVID (LC). LC varies from a single persisting symptom to a complex multisystem disease. Research has flagged that this condition is underrecorded in primary care records, and seeks to better define its clinical characteristics and management. Results: The long-COVID phenotype differentiated people hospitalized with LC from those who were not hospitalized and from those for whom no index infection was identified. The PCSC (N=7.4 million) includes 428,479 patients with acute COVID-19 diagnosis confirmed by a laboratory test and 10,772 patients with clinically diagnosed COVID-19. A total of 7471 (1.74%, 95% CI 1.70-1.78) people were coded as having LC, 1009 (13.5%, 95% CI 12.7-14.3) had a hospital admission related to acute COVID-19, and 6462 (86.5%, 95% CI 85.7-87.3) were not hospitalized, of whom 2728 (42.2%) had no COVID-19 index date recorded. In addition, 1009 (13.5%, 95% CI 12.73-14.28) people with LC were hospitalized compared to 17,993 (4.5%, 95% CI 4.48-4.61; P<.001) with uncomplicated COVID-19. Conclusions: Our LC phenotype enables the identification of individuals with the condition in routine data sets, facilitating their comparison with unaffected people through retrospective research. This phenotype and the study protocol to explore its face validity contribute to a better understanding of LC.


Background
Postacute COVID-19 syndrome, otherwise known as long COVID (LC), is a complex, multisystem disease that follows SARS-CoV-2 infection and often follows a relapsing and remitting course [1]. The postacute sequelae of LC could manifest with mild symptoms or asymptomatically. Although a distinct clinical phenotype remains to be defined, current evidence suggests that fatigue with postexertional symptom exacerbation is the most prominent, followed by shortness of breath, muscle aches, and cognitive impairment (brain fog) [2][3][4]. Risk factors are not well understood, and it appears that the characteristics that increase the risk of developing a severe COVID-19 infection (older age, male sex, non-White ethnicity, and certain pre-existing comorbidities) do not translate into an increased risk of developing LC [5]. Current research indicates that the prevalence of LC is greater amongst females, those aged 20-70 years, and those with prepandemic mental health conditions and asthma [6]. As the symptom pattern varies widely between individuals and risk factors have not been defined [7], it is difficult to establish an evidence-based framework for the recognition, assessment, and management of this condition.
In the United Kingdom, the Office for National Statistics (ONS) has estimated that 1.3 million people continue to have ongoing health issues after COVID-19 infection, with over 800,000 people reporting at least some limitation to their daily lives [2], although cases remain underrecorded in primary care electronic health records (EHRs) [8]. In December 2020 (updated in December 2021), the United Kingdom's National Institute for Health and Care Excellence (NICE) recognized the lack of a clinical definition and released a rapid guideline [9]. NICE defines acute COVID-19 (symptoms lasting <4 weeks), ongoing symptomatic COVID-19 (symptoms lasting 4-12 weeks), and postacute COVID-19 syndrome (symptoms lasting >12 weeks), with the latter 2 considered as LC [3]. However, there remain limited treatment options or evidence-based rehabilitation guidance available for this condition, although research projects, such as the Long Covid Multidisciplinary Consortium: Optimising Treatments and Services across the National Health Service (NHS; LOCOMOTION), have been set up to address this [10].
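The NICE duration bands described above can be sketched as a small Python function. The function names, the labels, and the treatment of the exact 4-week and 12-week boundaries are illustrative assumptions and are not taken from the guideline text.

```python
def nice_category(symptom_weeks: float) -> str:
    """Map symptom duration (weeks since onset) to a NICE duration band.

    Assumption: exactly 4 and exactly 12 weeks both fall in the
    "ongoing symptomatic" band.
    """
    if symptom_weeks < 0:
        raise ValueError("symptom duration cannot be negative")
    if symptom_weeks < 4:
        return "acute COVID-19"
    if symptom_weeks <= 12:
        return "ongoing symptomatic COVID-19"
    return "postacute COVID-19 syndrome"


def is_long_covid(symptom_weeks: float) -> bool:
    """LC covers the two later bands, ie, symptoms lasting 4 weeks or more."""
    return nice_category(symptom_weeks) != "acute COVID-19"
```

For example, `nice_category(8)` returns the ongoing symptomatic band, and `is_long_covid(8)` is true because that band is counted as LC.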
Research on LC is confusing due to heterogeneous study methods with minimal phenotypic information, and patient-reported symptoms often remain uncaptured [7]. Phenotypes are a standardized method for case definition and identification from routine data and are usually machine-processable. Computable phenotypes have become increasingly important in EHRs as they allow identification of patient characteristics using data that are generated during routine patient interactions [11]. An EHR-based phenotype definition is constructed by characterizing the disease in terms of its demographic profile, symptomatology, laboratory tests, and other clinically relevant data, such as referrals to specialist services [12]. This information can be displayed in the form of clinical codes or abstractly represented in the form of a logical data flow diagram [13]. In the United Kingdom, we use a national information standard, the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT), alongside Read version 2 codes. The phenotype definition can then be written into a computational algorithm, which can be applied to EHRs to identify a specific cohort of patients. However, such a phenotype has to work within the constraints of data quality and the clinical terminology used.

Aims
The aim of this study is to develop a phenotype for LC using pseudonymized individual-level EHR data from English general practice that will enable the monitoring and evaluation of interventions for this condition. The specific objectives are:
• To compare the symptoms reported by people with LC identified by the phenotype in the year prior to the pandemic with those they experienced during the pandemic
• To compare the symptoms of people with LC identified by the phenotype to those with acute COVID-19
• To compare people with LC identified by the phenotype who were hospitalized with those who were managed in the community

Data Source
The LC phenotype was piloted in an observational retrospective database analysis of the English Primary Care Sentinel Cohort (PCSC), which used data from the Oxford Royal College of General Practitioners (RCGP) Research and Surveillance Centre (RSC) sentinel network. This database is derived from pseudonymized patient data from EHRs and is recruited to be representative of the English population in terms of both demographic and clinical factors [14].

Comparisons
This protocol piloted an LC phenotype in the PCSC and described the baseline characteristics and outcomes of those with LC. All people registered within the PCSC were eligible for inclusion in the study. The developed phenotype was used as a detailed reference for the inclusion and exclusion criteria. The study described further aspects of the epidemiology through 3 comparisons:
• Before-and-after symptom comparison in people with LC: We compared the presence of symptoms listed by the ONS between 1 and 6 months after index infection. We matched the period with the equivalent months for the previous year. The list of 21 symptoms developed by the ONS is broad and includes central nervous system symptoms, such as fatigue; respiratory symptoms; cardiovascular symptoms; general symptoms; gastrointestinal symptoms; and mental health symptoms (Figure 1). We defined an index date of COVID-19 hierarchically using our application ontology, which prioritized virologically proven cases (definite COVID-19) over clinical terms for a COVID-19-specific disease (probable COVID-19) over less definite clinical diagnoses (possible COVID-19) [15].
• Comparison of people with LC with those with acute COVID-19 uncomplicated by LC: We compared sociodemographic features, a range of comorbidities, vaccination status, and mortality between those who had LC and those who had a COVID-19 infection. Sociodemographic features included age; gender; ethnicity using 5 categories (Asian, Black, White, mixed, and others) [16]; socioeconomic status (SES), measured using the Index of Multiple Deprivation (IMD) [17]; population density divided into rural, town, city, and conurbation; the English Health Region; obesity, categorized by the BMI or the diagnostic clinical term into underweight, normal weight, overweight, obese, or severely obese; and, finally, smoking status, categorized into current smoker, ex-smoker, and nonsmoker. We conducted a literature review and identified a range of chronic diseases associated with the risk of COVID-19 complications (Figure 2) and an extended list differentiating long COVID and COVID-19 [1,5,8,18,19]. We reported the vaccination status stratified by the Cambridge Multimorbidity Score (CMMS) as an overall measure of multimorbidity [20]. The CMMS uses 37 conditions to predict primary care consultations, unplanned hospital admissions, and death as primary outcomes; it is useful to identify people who are at higher risk of specific outcomes based on their comorbidity profiles, as recorded in primary care EHR data.
• Comparison of those with LC who were hospitalized with those who were not: We used the same variables to compare people who were hospitalized and subsequently had LC with those who were not hospitalized but had LC diagnosed in the community. We conducted a sensitivity analysis in which we subdivided the community cases into 2 groups: people who had an index COVID-19 infection, either virologically confirmed or, in some cases, clinically diagnosed, and those identified only by an LC diagnosis, a referral to an LC service, or an LC disability rating score compatible with an LC diagnosis (eg, Yorkshire LC score) [21].
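The hierarchical index date described in the first comparison can be sketched as follows. This is a minimal illustration, not the application ontology itself: the `CovidRecord` type, the `index_record` function, and the tie-breaking rule (earliest date within the most certain category) are assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional

# Lower rank means higher diagnostic certainty (labels follow the text above).
CERTAINTY_RANK = {"definite": 0, "probable": 1, "possible": 2}


@dataclass(frozen=True)
class CovidRecord:
    recorded_on: date
    certainty: str  # "definite", "probable", or "possible"


def index_record(records: Iterable[CovidRecord]) -> Optional[CovidRecord]:
    """Return the record that defines the index date, or None if there is none.

    Assumption: the most certain category wins, and ties within that
    category are broken by the earliest recording date.
    """
    candidates = [r for r in records if r.certainty in CERTAINTY_RANK]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (CERTAINTY_RANK[r.certainty], r.recorded_on),
    )
```

Under this reading, a later virologically confirmed record takes precedence over an earlier "possible" clinical term; a study that instead anchors on the earliest record of any certainty would need a different sort key.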

Phenotype Development
We used a 3-step ontological approach to create our phenotype [12], considering ontological, coding, and logical layers.

Ontological Layer
The key concept identified in our ontological layer was an index date for COVID-19, noting that not all cases had virological confirmation (especially in the early part of the pandemic up to July 2020). Hence, some LC cases might only have been flagged on referral or later presentation. We also wanted to include whether cases were hospitalized, as hospitalization can be associated with poor outcomes [22]. Additionally, we included vaccination status to explore whether vaccination was protective.

Coding Layer
We applied our existing ontology to identify COVID-19 cases. We included key outcomes related to hospital admissions. These were any hospitalization, admission to intensive care, or death in the hospital. To define a case of LC, we included disease codes, primarily recorded with SNOMED-CT or the World Health Organization's (WHO) International Classification of Diseases (ICD). The clinical term could be a diagnosis, a referral (eg, referral to post-COVID-19 assessment clinic), or completion of a rating scale that implied LC (eg, the Yorkshire Rehabilitation Scale, which records symptom severity, functional disability, and health status) [23].

Logical Data Extraction Model
We planned our data extraction using pseudonymized primary care data. We supplemented these data with national data sets. The national data sets used were the Second Generation Surveillance System (SGSS) to capture any missing test data, the National Immunisation Management System (NIMS) to capture any missing vaccine recording, and Hospital Episode Statistics (HES) to add hospital outcome data. The ONS also provided death data. We pseudonymized all data as close to the source as possible using an NHS Digital-approved method. We used the same pseudonymization method to link primary care data to other data sources.
Our phenotype definition is presented as a structured multistep model (Figure 3) and as a logic model (Figure 4). This omitted the reporting of vaccine exposure by group.

Formal Ontologies: BioPortal and PhenoFlow
From the logic model, we created 2 formal ontologies. We used Protégé, an open source ontology editor, to construct a domain ontology, which we placed online via BioPortal. Protégé supports the Web Ontology Language (OWL) version 2 and the Resource Description Framework (RDF) from the World Wide Web Consortium (W3C) [24]. BioPortal is part of the National Center for Biomedical Ontology (in the United States), which supports the creation of interoperable ontologies.
We also created a version within the PhenoFlow library [13]. The PhenoFlow library imports and standardizes abstract definitions under a workflow-based multilayer model, which is later used as the basis for autonomously generating a computable form of the definition. This can then be downloaded and executed locally to identify a patient cohort. Standardizing a definition under the PhenoFlow model also assists with manual phenotype translation as it supplements the use of clinical terminology and simplifies the representation of logical structures, thus increasing intelligibility (Figure 3). The model allows greater flexibility in updating phenotypes and also increases portability.
The model consisted of 3 layers and included the type or classification of the step's logic, with detailed information regarding inputs and outputs at each relevant step. This information was combined with 1 or more implemented units (eg, a piece of Python code) in order to realize a computable phenotype.

Statistical Analysis
This study is a secondary analysis of existing pseudonymized data within the PCSC of the RSC. Although we noted that 58 (7.8%) of 743 practices had not recorded any LC cases in their EHR system, they were included in the analysis as it is likely that recording would improve during the course of the study, with increased interest in this condition [25].
The distribution of baseline characteristics among the study groups was summarized through descriptive statistics (eg, mean, median, and proportion) with measures of dispersion (eg, SD and IQRs). Univariate analyses included the calculation of odds ratios (ORs) with 95% CIs for categorical risk factors versus outcome levels by using the logit link function, log(p/(1 - p)), where p is the probability of the outcome. Modeling the log odds of the outcome allowed a nonlinear association with the outcome probability to be expressed in a linear manner. P values were obtained from a chi-square test for categorical variables and one-way ANOVA for continuous variables. Data that were not documented in our database were reported as missing.
The primary outcome measure was the association with LC using our phenotype. Multivariate logistic regression modeling was used to identify factors associated with LC as a binary outcome within the study population. Relevant risk factors identified in the literature underwent univariate analysis and were included in multivariate logistic regression using a 3-step backward elimination procedure with α threshold levels of 0.20, 0.10, and 0.05. A 2-sided α value of 0.05 was considered statistically significant. Missing data were presented as a separate category in univariate statistics and compared to the reference category. Missing data categories were imputed to the reference category if they did not differ significantly from the reference category. Missing data categories were otherwise included in multivariate regression as a separate category, under the assumption that they may not be missing at random.
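One possible reading of the 3-step backward elimination is sketched below in Python. The text does not specify the removal rule within each step, so the rule used here (repeatedly drop the covariate with the largest P value until all remaining covariates meet the current threshold) is an assumption. `fit_pvalues` is a hypothetical callback standing in for refitting the logistic model; it is not a function from any particular statistics library.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical callback: refit the model on the given covariates and
# return one P value per covariate name.
PValueFitter = Callable[[List[str]], Dict[str, float]]


def backward_eliminate(
    covariates: Sequence[str],
    fit_pvalues: PValueFitter,
    alphas: Sequence[float] = (0.20, 0.10, 0.05),
) -> List[str]:
    """Backward elimination over successively stricter alpha thresholds."""
    kept = list(covariates)
    for alpha in alphas:
        while kept:
            pvalues = fit_pvalues(kept)
            worst = max(kept, key=lambda name: pvalues[name])
            if pvalues[worst] <= alpha:
                break  # every remaining covariate passes this threshold
            kept.remove(worst)  # drop the weakest covariate, then refit
    return kept
```

In practice the callback would wrap a logistic regression fit, and categorical covariates with several levels would need a single summary P value (eg, from a likelihood ratio test) before being passed to this loop.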
The following 3 comparisons were made, reporting frequencies between groups with P values obtained from the chi-square test:
• Symptoms reported by people with LC in the year prior to the pandemic versus those they experienced during the pandemic. The study period included COVID-19 cases from March 1, 2020, to April 1, 2021, with a follow-up period of a further 6 months up to September 30, 2021, at the latest. The in-pandemic period was between 1 and 6 months after the index COVID-19 date. The historical comparator period was month-matched; for example, if a patient had an acute COVID-19 code entered on February 1, 2021, their follow-up period was March 1-July 31, 2021, and the historical comparator period was March 1-July 31, 2019. This allowed the rates of relevant symptoms before acute COVID-19 to be compared with the rates after it. For those without a COVID-19 index date, we compared the 5 months prior to their LC recording with a matched period in the previous year.
• Symptoms of people with LC versus those with acute COVID-19. Although we accepted that LC is underrecorded, we considered this analysis of importance as the phenotype of those recorded was likely to be similar to those unrecorded, although potentially with more prominent or debilitating symptomatology.
• Those hospitalized with LC versus those managed in the community. A final comparison was then made among people requiring hospital admission for acute COVID-19, those who were managed in the community, and people who had no documented evidence of acute COVID-19. We also included a comparison between community cases with and without an index infection.
These comparisons enabled us to explore how the clinical phenotype varies. We also reported the vaccination uptake between people with and without LC diagnosis.
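The month-matched windows in the first comparison can be illustrated with a short Python sketch. The helper names are hypothetical. The number of years to shift back is left as a parameter because the text gives only one worked example (a 2021 follow-up period matched to 2019); how windows for other index dates map to the prepandemic year is an assumption the reader would need to settle.

```python
import calendar
from datetime import date, timedelta
from typing import Tuple

Window = Tuple[date, date]


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def follow_up_window(index_date: date) -> Window:
    """From 1 month after the index date to the day before 6 months after."""
    start = add_months(index_date, 1)
    end = add_months(index_date, 6) - timedelta(days=1)
    return start, end


def comparator_window(window: Window, years_back: int) -> Window:
    """Same calendar months, shifted back by a whole number of years."""
    start, end = window
    return add_months(start, -12 * years_back), add_months(end, -12 * years_back)
```

With an index date of February 1, 2021, `follow_up_window` gives March 1-July 31, 2021, and `comparator_window(..., years_back=2)` gives March 1-July 31, 2019, matching the worked example in the text.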

Ethical Considerations
This study used existing data, and no subjects were recruited. RSC data used to create this phenotype were pseudonymized as close to the source as possible and sent in an encrypted format to the Oxford Royal College of General Practitioners Clinical Informatics Digital Hub (ORCHID) [15], which is recognized as a trusted research environment.
This study was part of the RECAP (Predicting Risk of Hospital Admission in Patients with Suspected COVID-19 in a Community Setting) study sponsored by Imperial College London [26]. Although primarily a study to develop a risk prediction tool, it also included the creation of an LC phenotype.

Phenotype: Logic Model
The logic model for the phenotype is shown in Figure 4. It depicts the hierarchical structure for identifying LC cases from the ontological layer of EHR data. The ontology logic runs hierarchically, first screening the population for COVID-19 cases (ie, firm diagnosis of acute COVID-19). Those with an index COVID-19 case were then screened for COVID-19-related hospital admissions. When no index COVID-19 cases were documented, the model still allowed for LC cases to be included as long as they had an entry within their EHRs, implying they had LC (ie, clinically defined LC).
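The hierarchy described above can be sketched as a small classification function. This is an illustration under stated assumptions, not the published PhenoFlow implementation: the `PatientFlags` fields and the `LCGroup` labels are hypothetical names, and each flag is assumed to have already been derived from the coding layer.

```python
from dataclasses import dataclass
from enum import Enum


class LCGroup(Enum):
    NOT_LC = "no LC record"
    LC_HOSPITALIZED = "LC, hospitalized with acute COVID-19"
    LC_COMMUNITY_INDEX = "LC, community managed, index infection recorded"
    LC_COMMUNITY_NO_INDEX = "LC, community managed, no index infection recorded"


@dataclass(frozen=True)
class PatientFlags:
    has_index_covid: bool  # acute COVID-19 case identified
    covid_admission: bool  # COVID-19-related hospital admission
    has_lc_record: bool    # LC diagnosis, referral, or rating scale entry


def classify(p: PatientFlags) -> LCGroup:
    """Apply the logic model: index case, then admission, then LC record."""
    if not p.has_lc_record:
        return LCGroup.NOT_LC
    if p.has_index_covid and p.covid_admission:
        return LCGroup.LC_HOSPITALIZED
    if p.has_index_covid:
        return LCGroup.LC_COMMUNITY_INDEX
    # No index infection documented, but the EHR entry implies LC.
    return LCGroup.LC_COMMUNITY_NO_INDEX
```

Admission is only evaluated for people with an index case, mirroring the order of screening in the logic model; people with an LC entry but no index case fall through to the clinically defined group.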

Phenotype: BioPortal and PhenoFlow
The LC phenotype definition was built in Protégé, which is an open source ontology editor that supports the latest OWL. This phenotype was then uploaded to BioPortal. The LC phenotype definition (Figure 5) can be accessed online [27] and provides a framework for researchers wanting to develop their own executable script to apply to databases.
Within BioPortal, the ontological layer of the structured phenotype is described within a class and subclass structure, while the coding layer is represented by individuals within each class and subclass. BioPortal ontologies can be readily updated.
The PhenoFlow library was used to transform the LC phenotype into a computable form (Figure 6). The LC phenotype can be accessed online with authorization [28] and downloaded. Unlike the BioPortal version, it is ready for researchers to apply to EHR databases.

Primary Care Sentinel Cohort
The PCSC of the RSC has a registered population of over 7 million (N=7,382,775). At the time of our data extraction, 428,479 (5.8%) of this population had an acute episode of COVID-19 recorded. Of this group, 42,321 (9.9%) were lost to follow-up; 40% (n=16,993) of this loss to follow-up was due to deaths, with just under half of these deaths (7531/16,993, 44.3%) being COVID-19 related. A total of 403,151 (94.1%) cases were included in the analysis, of whom 19,002 (4.7%) were hospitalized and 384,149 (95.3%) were not.

People With LC
We identified 7471 (1.7%) of 428,479 people recorded as having LC within this included group (Figure 7). A greater proportion of the LC group were hospitalized compared to the overall hospitalization rate (1009/7471, 13.5%, P<.001). Within this group, there were a small number of deaths (23/7471, 0.3%, P<.001).
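The abstract reports 95% CIs for these proportions without naming the interval method. As a minimal sketch, assuming a normal-approximation (Wald) interval, the following Python function reproduces the reported limits for 7471/428,479 (1.70-1.78) and 1009/7471 (12.73-14.28). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def proportion_ci(events: int, total: int, level: float = 0.95) -> Tuple[float, float, float]:
    """Return (proportion, lower, upper) using a Wald interval.

    Assumption: the normal approximation is adequate, which is reasonable
    for the large counts in this cohort but not for small samples.
    """
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("need 0 <= events <= total and total > 0")
    p = events / total
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts taken from the text: people coded with LC, and those hospitalized.
for events, total in [(7471, 428_479), (1009, 7471)]:
    p, lower, upper = proportion_ci(events, total)
    print(f"{events}/{total}: {p:.2%} (95% CI {lower:.2%}-{upper:.2%})")
```

For proportions close to 0 or 1, or for small denominators, a Wilson or exact interval would be a safer choice than the Wald form assumed here.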

Comparison of People With Acute COVID-19 and LC
We paired data for people with COVID-19 (n=395,680, 98.1%) and LC (n=7471, 1.9%), expecting to perform comparisons of baseline characteristics between both groups. Among the main preliminary findings, the mean age was 44.6 (SD 21.75) years for the COVID-19 group and 47.7 (SD 14.8) years for the LC group. A significantly higher proportion of those with LC were female (4836/7471, 64.7%), and male gender was associated with lower odds of LC (OR 0.68, 95% CI 0.65-0.72). The proportion of those with a record of intensive care unit (ICU) admission was 0.6% (2523/395,680) in people with COVID-19 and 3.5% (261/7471) in people with LC, and a record of ICU admission was associated with higher odds of LC (OR 5.64, 95% CI 4.96-6.42). Sociodemographic characteristics associated with higher odds of LC in the univariate analysis included living in a conurbation (OR 1.49, 95% CI 1.42-1.57) and obesity (OR 1.40, 95% CI 1.34-1.48). Comorbidities associated with higher odds of LC included depression, anxiety, asthma, and hypertension. In contrast, chronic lung disease, chronic obstructive pulmonary disease (COPD), chronic kidney disease (CKD), ischemic heart disease, atrial fibrillation, and congestive heart failure were associated with lower odds of LC.
The baseline characteristics of the population are shown in Tables 1-4.
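The ICU odds ratio reported in this section can be checked directly from the counts given in the text. The Python sketch below assumes a Wald interval on the log odds scale; for a single binary covariate, this cross-product OR equals the exponentiated coefficient from a univariate logistic regression. The function and parameter names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist
from typing import Tuple


def odds_ratio_ci(
    exposed_cases: int,
    exposed_noncases: int,
    unexposed_cases: int,
    unexposed_noncases: int,
    level: float = 0.95,
) -> Tuple[float, float, float]:
    """Return (OR, lower, upper) from a 2x2 table using a Wald interval."""
    cells = (exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases
    )
    se_log_or = sqrt(sum(1 / c for c in cells))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Exposure: ICU admission. Outcome: LC. Counts derived from the text.
icu_lc = 261                 # ICU admission and LC
icu_no_lc = 2523             # ICU admission, COVID-19 without LC
no_icu_lc = 7471 - 261       # LC without ICU admission
no_icu_no_lc = 395_680 - 2523

or_, lower, upper = odds_ratio_ci(icu_lc, icu_no_lc, no_icu_lc, no_icu_no_lc)
print(f"OR {or_:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Under these assumptions the output agrees with the reported OR of 5.64 (95% CI 4.96-6.42). The same function cannot be applied to the other ORs in this section without the corresponding cell counts from Tables 1-4.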

Comparison of Those Hospitalized and Those Not Hospitalized
For the group of people with LC, we paired data for people with a record of hospitalization (n=1009, 13.5%) and without hospitalization (n=6462, 86.5%). The mean age was 54.6 (SD 13.69) years for the hospitalized group and 46.7 (SD 14.7) years for the nonhospitalized group, while the proportion of females was 66.5% (4297/6462) in the nonhospitalized group and 53.4% (539/1009) in the hospitalized group. Factors associated with greater odds of hospitalization were the male gender (OR 1.73, 95% CI 1.51-1.98) and type 2 diabetes (OR 3.8, 95% CI 3.15-4.59).

Principal Findings
We created a phenotype for LC and made it publicly available with the aim of facilitating research in this area. Our phenotype is straightforward but relies on a postacute COVID-19 syndrome code being present in the EHR. The definition allows comparison of hospitalized and nonhospitalized groups and the inclusion of people with no baseline COVID-19 test data. Our phenotype's logical model can also allow vaccine exposure to be compared between groups.
Based on our network data, LC recording within primary care appears to be low and we noted interpractice variability, with some practices (8%) having no recorded cases. It was not possible to generate a symptom-related definition that might help close the gap between the level of recording in primary care and that identified through the ONS surveys [2].
Many different conditions have been associated with LC, and we made pragmatic, literature-based choices regarding which groups to contrast in our LC comparisons. We consider that the before-and-after analysis, the comparison of acute COVID-19 with LC, and the comparison of hospitalized with nonhospitalized people with LC will provide an assessment of our phenotype's performance and face validity.
Digitization of health systems worldwide has led to the emergence of EHR repositories for the study of both established and emerging diseases and trends. Phenotyping algorithms allow identification of patients within EHRs who share characteristics, and therefore play an important role in medical cohort studies.
High-quality phenotypes must be portable, accessible, and reproducible. A number of phenotype libraries have been developed or are undergoing development [29] in order to collect and store validated phenotype definitions. Our LC phenotype is available to download from BioPortal, where researchers can use it to produce their own executable script. By additionally applying the phenotype using the PhenoFlow model with "functional" and "computational" layers, our phenotype goes 1 step further with the capability for immediate execution in EHRs. As the characteristics of LC change with more data, vaccines, and treatments becoming available, the flexibility of the PhenoFlow model allows the phenotype to be readily updated and reapplied.

Comparison With Prior Work
Applying the phenotype within the RSC, we identified 7471 patients with LC. The LC group was older overall, more likely to be female, obese, and suffering from anxiety, depression, or asthma. These findings are in keeping with studies using patient-reported data and EHRs [5,6]. In the acute COVID-19 group, 17,993 (4.5%) of 395,680 patients were hospitalized. The proportion of patients hospitalized with COVID-19 in the LC group was much higher (1009/7471, 13.5%). Furthermore, patients with LC were more likely to have had an ICU admission: 261/7471 (3.5%) versus 2523/395,680 (0.6%). Similar findings were reported by O'Connor et al [23] in an observational study of 187 patients, of whom 15% were hospitalized and 5.4% were admitted to the ICU. The Zoe Symptom Study app [5] reports even higher rates of patients attending the hospital (up to 44% of those experiencing symptoms for more than 56 days) but does not clarify whether these patients were admitted to the hospital. Survey studies such as these may also suffer from selection bias and are not necessarily representative of the wider population. Nevertheless, hospital attendance during the acute infection appears to be a risk factor for LC, and further work is required to address this.

Strengths and Limitations
Our study used the PCSC of the RSC, 1 of Europe's oldest sentinel systems and one widely involved in pandemic research [14,15,30]. Data quality is good, and linkage to national registries ensured reliable data, including mortality [31]. Additionally, UK primary care is universal and a registration-based system. Nearly all emergency care is provided by the NHS, and national systems enable capture of COVID-19 tests and vaccination data.
The complexity of LC and its multiple symptoms and associations made this analysis challenging. We were selective, based on the available literature, about the conditions we compared. The statistical analysis was limited to establishing associations between known covariates and outcomes, testing the face validity of our LC phenotype against other reports in the United Kingdom. Further research should explore causality of the reported findings under appropriate study designs.
We likely underestimated the frequency of ongoing symptoms following acute infection, because many people do not seek medical care for these. As in all studies using routine data, there were also some issues with data quality. For example, clinicians may have "coded" (used clinical terms) based on symptoms (eg, fatigue) rather than using a "long COVID-19" clinical term to "code" this illness. It is also possible that key data were not coded at all but were included in the free-text narrative within EHRs. Our study aimed to compare LC in the hospitalized and nonhospitalized groups. It is possible that these represent 2 separate populations with different symptom clusters. Those hospitalized with acute COVID-19 are more likely to suffer from respiratory and other organ damage, whereas those managed in the community may suffer from a potentially different range of LC symptoms with a lower risk of end-organ damage and mortality. The lack of fine-detailed symptom categorization in EHRs may have limited this comparison. Symptom coding was also impacted by clinicians' cognitive biases, a known limitation of epidemiological research using routinely recorded data [32].
Finally, LC clinical terms were only added to SNOMED in January 2021 and thus would not have become available in EHRs until around February 2021, almost a year after the onset of the pandemic. The United Kingdom also has its own version of SNOMED-CT, and there are a range of different clinical terms available internationally.
Further research is required to explore symptom clusters and assess key differences in those hospitalized compared to those managed in the community.

Conclusion
Developing and validating an LC phenotype will enable the identification of individuals with the condition and facilitate comparison between affected and unaffected people. However, LC is a complex condition with a wide variety of symptoms that will require further research to understand. This phenotype and study protocol to explore its face validity should contribute to a better understanding of LC.