This is an openaccess article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
COVID19 has been one of the most serious global health crises in world history. During the pandemic, health care systems require accurate forecasts for key resources to guide preparation for patient surges. Forecasting the COVID19 hospital census is among the most important planning decisions to ensure adequate staffing, number of beds, intensive care units, and vital equipment.
The goal of this study was to explore the potential utility of local COVID19 infection incidence data in developing a forecasting model for the COVID19 hospital census.
The study data comprised aggregated daily COVID19 hospital census data across 11 Atrium Health hospitals plus a virtual hospital in the greater Charlotte metropolitan area of North Carolina, as well as the total daily infection incidence across the same region during the May 15 to December 5, 2020, period. Crosscorrelations between hospital census and local infection incidence lagging up to 21 days were computed. A multivariate timeseries framework, called the vector error correction model (VECM), was used to simultaneously incorporate both time series and account for their possible longrun relationship. Hypothesis tests and model diagnostics were performed to test for the longrun relationship and examine model goodness of fit. The 7daysahead forecast performance was measured by mean absolute percentage error (MAPE), with timeseries crossvalidation. The forecast performance was also compared with an autoregressive integrated moving average (ARIMA) model in the same crossvalidation time frame. Based on different scenarios of the pandemic, the fitted model was leveraged to produce 60daysahead forecasts.
The crosscorrelations were uniformly high, falling between 0.7 and 0.8. There was sufficient evidence that the two time series have a stable longrun relationship at the .01 significance level. The model had very good fit to the data. The outofsample MAPE had a median of 5.9% and a 95th percentile of 13.4%. In comparison, the MAPE of the ARIMA had a median of 6.6% and a 95th percentile of 14.3%. Scenariobased 60daysahead forecasts exhibited concave trajectories with peaks lagging 2 to 3 weeks later than the peak infection incidence. In the worstcase scenario, the COVID19 hospital census can reach a peak over 3 times greater than the peak observed during the second wave.
When used in the VECM framework, the local COVID19 infection incidence can be an effective leading indicator to predict the COVID19 hospital census. The VECM model had a very good 7daysahead forecast performance and outperformed the traditional ARIMA model. Leveraging the relationship between the two time series, the model can produce realistic 60daysahead scenariobased projections, which can inform health care systems about the peak timing and volume of the hospital census for longterm planning purposes.
SARSCoV2 is a novel member of the coronavirus family, and infections in humans can result in the disease COVID19. The virus is transmitted primarily through droplets from coughing and sneezing and is highly infectious. Its basic reproduction rate is estimated to be in the low to mid 2s based on different models [
Our work is motivated by the need of hospital leaders to have timely and accurate forecasts to guide planning for surges in hospital demands due to the pandemic. Adequate preparation can help prevent or mitigate strains on hospital resources that result when hospitals exceed their historical capacity. On the contrary, being caught offguard under a pandemic can devastate the population and health care systems. For example, previous models in India suggested falsely that it had reached herd immunity, encouraging complacency and insufficient preparation; however, on May 4, 2021, there was still a reported rolling average of 378,000 cases a day, which overwhelmed hospitals and health workers and resulted in a national health crisis [
Prior research has demonstrated the utility of forecasting hospital demands (eg, hospital admissions, intensive care unit census, and hospital overall census) using univariate timeseries models such as the autoregressive integrated moving average (ARIMA), the seasonal autoregressive integrated moving average (SARIMA), and exponential smoothing [
Recently, VECM has been used to forecast the demand for intensive care units during the COVID19 pandemic by including hospital admission as a leading indicator [
During the COVID19 pandemic, many papers have been devoted to developing predictive models for the volume of new cases (ie, infection incidence) using various methods from timeseries analyses [
Atrium Health is a large, integrated health care system operating in North Carolina, South Carolina, and Georgia. In this paper, the COVID19 hospital census (census) refers to the daily aggregate number of beds occupied by patients with COVID19 at midnight across the subset of 11 Atrium Health hospitals in the greater Charlotte metropolitan area of North Carolina, plus a virtual hospital (Atrium Health Hospital at Home). The virtual hospital uses telemedicine to treat patients who require only a minimal level of care. The local COVID19 infection incidence (incidence) is the aggregate daily count of new COVID19–positive cases from 11 local counties belonging to the Cities Readiness Initiative (CRI) region, as designated by the North Carolina Department of Health and Human Services. The CRI region roughly approximates the market catchment area of these hospitals.
Using STL (seasonal and trend decomposition using Loess) timeseries decomposition [
The forecast model described in the following sections was developed for these transformed time series.
Scaled time series for COVID19 hospital census and local COVID19 infection incidence in the Cities Readiness Initiative region for the period from May 15 to December 5, 2020. Transformed census (blue) and incidence (red) are linearly standardized to the 0100 scale.
A VECM is a vector autoregressive model used for nonstationary multivariate time series and accounts for stable longrun relationships, that is, cointegration, between the time series. A
Following Pfaff [
for time
The VECM specification can be formulated as an algebraic rearrangement of the VAR representation as:
where
The model has the following assumptions:
Assumption 1: The components of
Assumption 2: 0≤
Assumption 3:
We now discuss the implications of the assumptions. For assumption 2, if
The VECM was specified and fitted with the steps below.
First, to choose the order
Second, we determined the number of cointegration relationships (
Third, we needed to decide where to place the constant
We made our decision about whether to restrict
Fourth, we used maximum likelihood estimation to fit the model, reported parameter estimates, the corresponding
Finally, we computed the 7daysahead forecasts and the 80% forecast intervals. Once the forecasts of the transformed census were made with the VECM, they were backtransformed to the original scale of census. We created 80% forecast intervals for the transformed census using a bootstrap procedure [
The model was fitted to the data between May 15 and December 5, 2020. All the data analysis was done using R statistical software, version 4.0.3 (R Core Team). The implementation of the VECM was done with the
We examined the omnibus
We used mean absolute percentage error (MAPE) to evaluate the 7daysahead forecasts of census:
where
In order to approximate the sampling distribution of MAPE, we performed timeseries crossvalidation. From June 16 to November 28, 2020, for each day, we iteratively fitted the model, made 7daysahead forecasts, and computed the MAPE. Eventually, we obtained 166 values of MAPE, plotted the distribution, and computed the median as well as the 95th percentile. We will consider a median MAPE below 10% to be satisfactory, based on the practical effect of a peak surge on bed capacity at our health care system.
Leading up to and at the peak of infection prevalence, there can be high anxiety and uncertainty about how much more incidence and, in particular, census may increase. Furthermore, traditional univariate timeseries models may give linear forecasts for census that do not accurately represent pandemic behavior. However, cointegration allows for census forecasts that leverage subtle, but critical, changes in incidence (eg, concavity). This suggests, if not necessitates, the forecasting of census under different pandemic scenarios. For resource planning, hospital leaders will want to understand the implications associated with a worstcase scenario.
For our health care system, besides routine 7daysahead census forecasts, we also deployed our model for 60daysahead census forecasts, considering 3 different scenarios of what could happen with incidence (ie, best case, base case, and worst case). On January 9, 2021, we expected the winter surge to reach peak infection prevalence around February 5, 2021, based on an extension of an epidemiological model called the susceptibleinfectedremoved model [
Using our model refitted on January 9, 2021, with an increased capacity of 1250 patients, we generated forecasts iteratively forward for 60 days using the past census forecasts together with projected incidence under each scenario. To account for uncertainty in future census and incidence, we also simulated 1000 conditional sample paths of the two time series under each scenario using the bootstrap procedure mentioned earlier and computed the 10th and 90th percentile at each horizon to obtain the 80% forecast intervals.
The 60day projected local COVID19 infection incidence in the Cities Readiness Initiative region on the log scale, as of January 9, 2021. Past values (black), worstcase scenario (red), basecase scenario (orange), bestcase scenario (blue) are shown.
Our research protocol was submitted to the Atrium Health Institutional Review Board (IRB) prior to execution, and the study was deemed exempt from IRB oversight. In compliance with HIPAA (Health Insurance Portability and Accountability Act) regulations, individual patient information was not disclosed, and all data have been deidentified and reported as aggregates. The procedures set out in this protocol, pertaining to the conduct, evaluation, and documentation of this study, were designed to ensure that the investigators abide by Good Clinical Practice guidelines and under the guiding principles detailed in the Declaration of Helsinki.
Our model was specified as a VECM with 7 lags in its VAR representation (
The output from the maximum likelihood estimation showed that the cointegration relationship, that is, the error correction term, had a significant negative effect on census change (
where
From
Autocorrelation functions and crosscorrelation functions of the residuals: (A) census residuals, (B) lagged census residuals and incidence residuals, (C) census residuals and lagged incidence residuals, and (D) incidence residuals.
Parameter estimates and
Predictor  ∆ 
∆ 


Estimate  Estimate  

–0.1265  –5.6993  <.001  –0.1216  –1.1323  .26  
∆ 
–0.0489  –0.7143  .48  0.5487  1.6555  .10  
∆ 
–0.0665  –2.8222  .005  –0.9808  –8.6067  <.001  
∆ 
–0.2220  –3.2277  .002  –0.0614  –0.1844  .85  
∆ 
–0.0532  –2.0881  .04  –0.6955  –5.6431  <.001  
∆ 
–0.0700  –0.9949  .32  0.0643  0.1890  .85  
∆ 
–0.0472  –1.9094  .06  –0.6428  –5.3755  <.001  
∆ 
–0.0785  –1.1224  .26  0.9769  2.8871  .004  
∆ 
–0.0567  –2.4165  .02  –0.5564  –4.8999  <.001  
∆ 
–0.0499  –0.7140  .48  –0.0792  –0.2341  .82  
∆ 
–0.0465  –2.1907  .03  –0.4589  –4.4634  <.001  
∆ 
0.0077  0.1107  .91  0.4533  1.3404  .18  
∆ 
–0.0373  –2.4015  .02  –0.2384  –3.1739  .002 
Parameter estimates and
Predictor  ∆ 
∆ 


Estimate  Estimate  
Friday  –0.0213  –1.1120  .27  0.0095  0.1024  .92  
Saturday  0.0083  0.3980  .69  –0.1528  –1.5176  .13  
Sunday  0.0030  0.1330  .89  –0.3340  –3.0744  .002  
Monday  0.0585  2.6205  .01  –0.1939  –1.7950  .07  
Tuesday  0.0291  1.3896  .17  –0.1284  –1.2655  .21  
Wednesday  –0.0037  –0.1895  .85  0.0343  0.3672  .71 
The omnibus
The Portmanteau test did not show sufficient evidence that the errors were autocorrelated (
The Augmented DickeyFuller test for stationarity of the error correction term rejected the unit root null hypothesis at the 10% significance level but failed to reject the null hypothesis at the 5% significance level (based on tabulated critical values). The KPSS test failed to reject the stationarity null hypothesis (
The companion matrix of the VAR representation had a maximum eigenvalue modulus of 0.97, strictly less than 1. Although this value was close to 1, the trace plot showed that this value had been slowly declining and below 1 across time when the model was fitted repeatedly in a daily rolling basis from June 16 to November 28 (
Trace plot of the maximum eigenvalue modulus for the period from June 16 to November 28, 2020.
We obtained the approximate sampling distribution of the outofsample MAPE from the timeseries crossvalidation (
Distribution of the 7daysahead mean absolute percentage error from the timeseries crossvalidation for the period from June 16 to November 28, 2020. Median (blue) and 95th percentile (red) are shown.
Onestepahead insample and 7daysahead outofsample predictions for COVID19 hospital census in the Cities Readiness Initiative region. True values (black), insample and outofsample predictions (red line), 95% prediction intervals (blue band), 80% forecast intervals (red band) are shown. The model is fitted on data from May 15 to December 5, 2020.
In all scenarios, due to cointegration, census followed corresponding concave trajectories with peaks occurring approximately 2 to 3 weeks later than incidence depending on the scenario. In the worstcase scenario, census was projected to peak on February 16, 2021 (11 days later than incidence), with approximately 850 patients at the 80% forecast interval upper bound (
Worstcasescenario, 60day forecasts for COVID19 hospital census in the Cities Readiness Initiative region, as of January 9, 2021. Past values (black), forecasts (red line), and 80% forecast intervals (red band) are shown.
Our VECM provides a very good fit to the data and outperforms models with no or other leading indicators. Significant omnibus
The longrun relationship plays a crucial role in the model. Our model results show how future census responds to perturbations in the longrun cointegration relationship in the direction that would preserve the stability of the relationship. For instance, if incidence increases significantly and drives the error correction term below 0, the nextday census will tend to increase so that the error correction term will move back toward 0. Compared to shortrun relationships between census change and past changes in incidence and census, the longrun relationship effect is also strongly significant and is a major driver in the model.
We observed that local infection incidence led the hospital census by about 2 weeks. The crosscorrelations between incidence and census were uniformly high, between 0.7 and 0.8 at different lags, but the highest correlation was at lag 14. Clinically, we know that after someone is diagnosed with SARSCoV2, it can take several days before they become sick enough to be hospitalized. During the summer 2020 wave of the pandemic, incidence peaked 18 days earlier, on July 10, than when census peaked, on July 28. In the model, we also saw that past incidence changes at multiple lags have statistically significant effects on census. While previous studies have focused on other types of leading indicators [
Applying the model to scenariobased forecasting in a health care system is an important method for longterm forecasting when approaching an infection prevalence peak and helps determine the potential for resource capacity to be exceeded under a worstcase scenario. There are several advantages to our approach. With a scenariobased and epidemiologically informed approach, the VECM produces realistic, nonlinear, longrange trajectories of census. In contrast, an ARIMA model can have an upward linear trajectory even as we approach and arrive at the infection prevalence peak because it is agnostic to incidence. Hence, the VECM fit with scenariobased incidence will provide better accuracy since it is more reflective of pandemic behavior. Additionally, when the concern is a specific scenario, our approach is particularly useful at minimizing longrange forecast uncertainty, since the bootstrapped sample paths are constrained to fluctuate around the marginalized scenariobased census projection. Without such a constraint, 60day forecasts can typically have wide forecast intervals that are of no practical utility.
Our study has mathematically ascertained the stable longrun relationship, that is, cointegration, between the COVID19 hospital census and the local infection incidence, and we have developed a statistical incidencebased model to forecast the COVID19 hospital census. In comparison, prior COVID19 hospital capacity planning models that make use of infection incidence data rely on simplified assumptions about the incidencecensus relationship. For example, in the COVID19 Hospital Impact Model for Epidemics (CHIME) at the University of Pennsylvania [
Although our model has been thoroughly developed, it is not free of limitations. First, it is possible that we may lose the stable longrun relationship at some point in the future, either because it has run its course or due to structural changes in the time series. For instance, in the latter case, inadequate communitybased testing might suddenly underestimate the actual local infection incidence, and there may be a level shift in the relationship that would have to be accounted for by a modified VECM [
The construct presented here provides a framework in the context of a health care system for incorporating other leading indicators that may yield further increases in forecasting performance. For instance, the VECM that uses internetbased leading indicators [
We have shown that infection incidence can be successfully tethered with hospital census in a multivariate timeseries model to achieve accurate forecasting of COVID19 hospital census. When coupled with scenariobased forecasting, the model helped our leaders evaluate resource capacity against different possible peak resource demands. In hindsight, our analyses correctly assured our leaders of our capability to handle a worstcase scenario, alleviated uncertainty, and effectively guided longterm planning of adequate staffing, bed capacity, and equipment supplies through the pandemic.
augmented DickeyFuller
Akaike information criterion
autoregressive integrated moving average
Cities Readiness Initiative
Health Insurance Portability and Accountability Act
Institutional Review Board
KwiatkowskiPhillipsSchmidtShin
mean absolute percentage error
seasonal autoregressive integrated moving average
severe acute respiratory syndrome
seasonal and trend decomposition using Loess
vector autoregressive
vector error correction model
HMN prepared the original draft. HMN and PJT were involved in study conceptualization, statistical analysis, and review and editing of the manuscript. ADM supervised the study and contributed to the review and editing of the manuscript.
ADM is an administrative member of iEnroll LLC.