Automated collection of pathogen-specific diagnostic data for real-time syndromic epidemiological studies

Health-care and public health professionals rely on accurate, real-time monitoring of infectious diseases for outbreak preparedness and response. Early detection of outbreaks is improved by systems that are pathogen-specific. We describe a system, FilmArray® Trend, for rapid disease reporting that is syndrome-based but pathogen-specific. Results from a multiplex molecular diagnostic test are sent directly to a cloud database. www.syndromictrends.com presents these data in near real-time. Trend preserves patient privacy by removing or obfuscating patient identifiers. We summarize the respiratory pathogen results for 20 organisms from 344,000 patient samples acquired as standard of care testing over the last four years from 20 clinical laboratories in the United States. The majority of pathogens show influenza-like seasonality; rhinovirus has fall and spring peaks; and adenovirus and the bacterial pathogens show constant detection over the year. Interestingly, the rate of pathogen co-detections, on average 7.7%, matches predictions based on the relative abundance of organisms present.


Introduction
The availability of real-time surveillance data that can monitor the spread of infectious diseases benefits public health [1][2][3]. At present, tracking of respiratory or foodborne outbreaks relies on a variety of methods ranging from automated real-time electronic reporting to manual web entry of test results. Systems such as the Centers for Disease Control and Prevention's (CDC) FluView [4,5], National Respiratory and Enteric Virus Surveillance Systems (NREVSS, [6]), National Electronic Disease Surveillance System (NEDSS [7,8]), Global Emerging Infections Surveillance (GEIS) [9], and others, although web-based, still require manual entry of data from laboratories, resulting in data that are often incomplete or not current. 4 coronavirus (MERS CoV), avian influenza, enterovirus D68, and Ebola virus, real-time surveillance programs are critical [1,29,30].
It is not always possible to accurately diagnose the causative agents of most infectious diseases from symptoms alone due to overlapping clinical presentation. Thus, to achieve maximal utility, infectious disease surveillance systems should move beyond syndrome-based reporting and be pathogen-specific and comprehensive, reporting on as many of the common pathogens for a particular syndrome as possible. Sensitive and specific automated molecular diagnostic systems that detect up to four different pathogens in a single sample have been available from in vitro diagnostic (IVD) manufacturers for some time [31][32][33]. However, adoption of IVD platforms with broad multiplexing capability has become widespread only in the last few years. Commercially available systems that can detect most of the known etiological agents for respiratory, gastrointestinal and other multi-pathogen syndromes [34][35][36] include the BioFire (Salt Lake City, UT) FilmArray® System [37]; the GenMark (Carlsbad, CA) eSensor XT-8® [38] and ePlex® [39]; and the Luminex (Austin, TX) xTAG® [40], nxTag® [41] and Verigene® systems [42]. Multi-analyte diagnostic tests provide the raw data needed for real-time pathogen-specific surveillance, but there remain a number of obstacles to sharing these results (reviewed in [43]).
The obstacles largely center on information privacy and network security. A real-time surveillance system using diagnostic test results requires safeguards for protected health information (PHI). Medical records and devices have become attractive targets for cyber attackers in recent years [44], which has made hospitals and clinics reluctant to connect their Local Area Networks (LANs) to the Internet. Releasing patient test results requires the removal of PHI or authorization from the patient. Studies have shown that de-identification of patient data is not as simple as removing all specific identifiers because, in the age of big data, under the right circumstances it is possible to re-associate patients and their data using publicly available information [45][46][47][48].
We describe here the implementation of a real-time pathogen-specific surveillance system that overcomes the PHI concerns noted above. FilmArray Trend de-identifies, aggregates and exports test results from FilmArray Instruments in use in US clinical laboratories. Although data from all commercially available FilmArray Panels (Methods and [49]) is exported to the Trend database, we focus here on the Respiratory Panel (RP), which can detect 17 pathogens [37][50][51][52][53][54][55][56].
With more than 344,000 patient results for the FilmArray RP test alone, the Trend database has many of the properties associated with "big data" as it applies to infectious disease [57]. After describing how the dataset can be cleaned of non-patient tests, we make some observations on the seasonality of the different respiratory pathogens and we apply the ecological concept of "species diversity" [58] to observe a correlation between the abundance of each pathogen and the rate at which co-detections (more than one positive result per test) occur.

Results
Sending FilmArray data directly to the cloud
The FilmArray Trend public website is an outgrowth of the Trend clinical website, which was developed to provide clinical laboratories using the FilmArray System with up-to-date information on the respiratory, gastrointestinal and meningitis/encephalitis pathogens circulating at their institutions. The most general and efficient way to export both the clinical and public Trend information is to follow a "Bottom-Out" approach to data export (Figure 1). In this scheme, the FilmArray Instrument sends data via the Internet directly to a single cloud database where it can be viewed by health care providers at the originating institution. This data export pathway contrasts with a "Top-Out" approach (Figure 1) in which diagnostic test results are pushed from the instrument up to the LIS, then to the HIS and, finally, a subset of this information is forwarded to cloud-based databases.

The detection counts and pathogen detection rates derived from the Trend data set for each organism in the FilmArray RP are shown in Figure 2. Other views of these data, including

The pathogens' seasonal variability measured by percent detection can be classified into at least three groups. Group 1: The majority of organisms follow the classical "respiratory" season (October through March) and increase by more than 10-fold above their baseline detection rate (Figure 2C). These include the CoVs, Flu A, Flu B, hMPV, the PIVs, and RSV (PIV3 is a slight exception to this rule in that it peaks in the summer months, and has a winter peak that is only detected regionally (data not shown)). Within this group, all but five viruses demonstrate significant biennial fluctuations; Flu B, hMPV, OC43, PIV3, and RSV experience relatively consistent annual peaks. Group 2: HRV/EV is in a class by itself in that it is detected in a high percentage of tests over time (minimum of 10% in winter) and experiences moderate peaks of two- to three-fold outside the respiratory season baseline, in the early fall and spring (Figure 2D).

Trend data have high temporal, spatial, and organism-specific resolution. These three properties allow for a novel evaluation of co-detections. The observed rates of co-detections should be influenced by the number of circulating pathogens detected by the FilmArray RP test at a particular site.
Figure 5A shows the average number of unique organisms detected at each site in a given week (see Methods: Calculation of co-detection rates). This number fluctuates from a summer low of four to a winter high of 11 pathogens. Figure 5B (grey bars) shows that the total rate of organism co-detections in the Trend dataset fluctuates annually with peak rates occurring in the winter months. The average rates have been as high as 12% in the winter of 2016 and as low as 2% in the summer of 2014. From the Trend data, a Measure of Interspecific Encounter (MIE) can be calculated as the probability of a co-detection, weighted by the prevalence of each circulating pathogen at a site.
Although the value of the MIE metric is higher than the actual co-detection rate it correlates well
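The weekly quantities described above (number of unique organisms per site and the co-detection rate) can be sketched as follows. This is a minimal illustration, not the calculation used in the Methods: the record layout `(site, week, detected_organisms)` and the function name `weekly_site_summary` are hypothetical, and the co-detection rate is assumed here to be the fraction of all tests with more than one positive result.

```python
from collections import defaultdict

def weekly_site_summary(tests):
    """Summarize tests per (site, week).

    tests: iterable of (site, week, detected_organisms) tuples.
    This layout is a hypothetical stand-in for de-identified records.
    """
    organisms = defaultdict(set)     # unique organisms seen per site-week
    totals = defaultdict(int)        # number of tests per site-week
    codetections = defaultdict(int)  # tests with more than one positive
    for site, week, detected in tests:
        key = (site, week)
        detected = set(detected)
        organisms[key] |= detected
        totals[key] += 1
        if len(detected) > 1:
            codetections[key] += 1
    return {
        key: {
            "circulating": len(organisms[key]),
            # Assumption: denominator is all tests, positive or negative.
            "codetection_rate": codetections[key] / totals[key],
        }
        for key in totals
    }

example = [
    ("site_a", "2016-W02", ["Flu A", "HRV/EV"]),  # a co-detection
    ("site_a", "2016-W02", ["RSV"]),
    ("site_a", "2016-W02", []),                   # negative test
    ("site_a", "2016-W02", ["HRV/EV"]),
]
summary = weekly_site_summary(example)
print(summary[("site_a", "2016-W02")])
```

In this toy example three distinct organisms circulate at the site during the week, and one of the four tests is a co-detection.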

Discussion
This article describes "FilmArray Trend", a new system for real-time reporting of widespread pathogen-specific syndromic data. This system already has many of the important properties of big data. We consider Trend in terms of the "V"s that are often used to describe big data: volume (amount), velocity (speed of acquisition), veracity (accuracy), variety (diversity of information) and value (utility) [57,62].
The Trend RP dataset is growing at an average rate of >200,000 pathogen test results per month.
Connecting the first 20 clinical sites has provided insight into the principal concerns that will be raised by the legal, IT and administrative departments of the healthcare providers that house FilmArray instruments. It should be possible, therefore, to expand the Trend installed base by 10 to 20-fold over the next few years. Similarly, the existence of Trend should enable other IVD manufacturers to build their own Trend-like systems with greater acceptance on the part of their customers, thereby allowing a more global and comprehensive surveillance perspective.
The data in Figure 2 are similar to previous demonstrations of the seasonality associated with different respiratory viruses [63][64][65][66]. What is novel is that this data is generated automatically, on site, and in close to real-time compared to other surveillance systems. Greater than 98% of the test results are exported to the Trend database within 24 hours of being generated. As part of the de-identification protocol, sequential FilmArray RP tests of the same type are put into the same time bin. This has the effect that test results are exported faster during periods of peak use such as during the peak of the respiratory season or during an outbreak. Trend should be instrumental at a local level to determine the start of respiratory season; many hospitals make significant changes to their operations based on this event; however, at present, data collection to track the respiratory season is often slow and manual, or semi-automated at best.

The key to implementing Trend clinical sites was to demonstrate that FilmArray test results can be exported without the risk of breaching PHI confidentiality either directly or through some combination of the data that was exported. Trend successfully used the Expert Determination process as prescribed by the HIPAA guidelines (see Methods), which greatly simplified the data sharing agreement between BioFire Diagnostics and the clinical site and allowed health care providers to use Trend without risk of inadvertently disclosing PHI.
The software architecture underlying the Trend system is both simple and secure: 1) no changes to the institutional firewall or LAN are needed; 2) the Trend database cannot reach back and query the FilmArray computer due to the institutional firewall, which is set to outbound data only; 3) Trend software can only submit data to the cloud database and cannot query the database. Yet, despite this security, authorized users of the Trend database can mine the de-identified data to look for novel patterns in respiratory pathogen epidemiology.
The goal of an epidemiological surveillance network is to infer which infectious diseases are circulating in the general population based on testing a sample of patients [67]. Different surveillance systems have different biases in their data, biases that perturb the ability to predict true population prevalence.

While the removal of all PHI has great benefits in terms of implementation, it also has several shortcomings that complicate interpretation of the data. FluView Influenza is striking (Figure 3), supporting the validity and utility of the Trend data.

A second source of concern in the Trend data set is a consequence of removal of sample identification such that we cannot directly determine whether the sample was from a patient or was a non-clinical sample (verification test, QC or PT) and should be removed from further epidemiological analysis. We estimate that non-patient testing makes up approximately 1.8% of the total FilmArray RP tests. Automated detection algorithms remove 3.5% of the total RP tests, including approximately half of the non-clinical samples. With the exception of the four positive tests, the clinical samples removed by filtering should be a random sampling of all patient tests.
The remaining 1% fraction of non-patient tests have essentially no impact on the Trend evaluation of pathogen prevalence but they do make it more difficult to perform high resolution analysis of pathogen co-detections. This is especially true for co-detections of low prevalence organisms where QC positives are likely to be more common than real positives. Future updates to the FilmArray software will simplify the process by which the instrument operator can tag tests of non-patient samples, thereby largely eliminating the need to filter such test results from the Trend database before analysis.
Trend Variety:
The total positivity rate of the FilmArray RP test varies from a low of 38% in the summer months to a high of 75% in December and January, with a yearly average of approximately 60%. Figure 5A shows that the average number of different circulating pathogens at a single institution can vary from eight up to 11 during the winter months. Even during the peak periods of ILI, many respiratory infections are due to other viruses (Figure 2C) that can present clinically in a similar fashion [68,69]. Therefore, the presumption of an influenza infection based on reported influenza percent positivity, without diagnostic testing for the virus, can lead to the inappropriate use of anti-viral agents [70]. Conversely, without comprehensive testing, a negative influenza or RSV test can lead to prescription of an unnecessary antibiotic. FilmArray Trend data can be a valuable aid for antimicrobial stewardship programs because it provides real-time information regarding the causes of respiratory infections and highlights the prevalence of viral infections.
As previously observed [66], the viruses that share the winter seasonality of influenza demonstrate annual or biennial behavior. It is possible that the viruses that share an influenza-

Detection of multiple respiratory viruses in the same patient has been reported before. In the Trend dataset the rate of dual and triple co-detections was approximately 7.7%, with HRV/EV as the organism most commonly observed in a co-detection. Some viruses, such as ADV and CoVs, are detected in a mixed infection more than 50% of the time (Figure 4). In principle a FilmArray RP positive result may represent detection of residual pathogen nucleic acid from a previous infection that has resolved. However, several studies suggest that coinfections are associated with more severe disease ([71][72][73], see also discussion in [74]). In such cases, information about multiple detections can provide infection control practitioners with data that can assist in bed management and in the assessment of risk for nosocomial infections in a patient population that has been segregated by the occurrence of a common pathogen. Such information can prevent the introduction of a new pathogen associated with cohorting patients during busy respiratory seasons [75][76][77].
The question of whether different respiratory pathogens interfere with, or facilitate, growth in a human host is of some interest and not well understood. With the right data it can be studied at the population [78], individual [79], and cellular levels [74]. Because the Trend data still includes some non-patient tests, we have chosen not to analyze every possible dual or triple infection individually. Rather, we have taken a global approach and compared the overall rate of observed co-detections with MIE, which is a measure of the diversity of viruses circulating in a specific region and time period. MIE is similar, but not identical, to PIE (Probability of Interspecific Encounter [80]), also referred to as the Gini-Simpson index (1-D, where D is Simpson's index [81,82]), which is used in ecology as a measure of the species diversity of a region.
Similarly, the circulating pathogen number of Figure 5A is identical to the Species Richness measure of ecology. We calculate MIE using frequencies (P_i) of pathogen positivity per FilmArray test and note that the sum of all pathogen frequencies can add up to more than 100% because of co-detections or be less than 100% because of the presence of negative tests. In this regard, MIE differs from PIE because it is not a probability measure.

infections at a rate that can be predicted by their observed abundance. The data, however, may be biased by the patient population tested and the type of respiratory disease. The data also does not rule out that there are particular respiratory pathogens that occur more or less often in mixed infections than predicted by their individual percent positivity rates [74,83]. As we improve our ability to remove non-patient test results from the Trend dataset we will be able to characterize specific virus co-detection rates and their significance [65,66,78,79,[84][85][86].
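For readers unfamiliar with the ecological measures referred to above, the following sketch computes Species Richness and the Gini-Simpson index (1-D) from detection counts. It illustrates the standard ecological definitions only; it does not reproduce the MIE calculation, which uses per-test positivity frequencies rather than normalized proportions. The function names and the example counts are hypothetical.

```python
def species_richness(counts):
    """Number of distinct organisms with at least one detection."""
    return sum(1 for c in counts.values() if c > 0)

def gini_simpson(counts):
    """Gini-Simpson index 1 - D, where D = sum(p_i ** 2) is Simpson's index
    and p_i is each organism's share of all detections."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no detections")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Hypothetical detection counts for one site and week.
detections = {"HRV/EV": 50, "RSV": 30, "Flu A": 20, "PIV3": 0}
richness = species_richness(detections)
diversity = gini_simpson(detections)
print(richness, round(diversity, 2))
```

With these made-up counts, three organisms are circulating and the Gini-Simpson index is 1 - (0.25 + 0.09 + 0.04) = 0.62.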
Trend Value
As with weather forecasting, there is both a theoretical and a practical interest in predicting the next few weeks or months of the respiratory season [87][88][89][90]. FilmArray Trend contributes to infectious disease forecasting efforts because the data is timely and comprehensive. As the number of sites participating in Trend increases it will be possible to localize the reported infections to smaller geographical regions. At a high enough density of Trend sites, patterns of movement of respiratory pathogens across the US will become visible in a way that has not been easily observed before now.

identification and certifies that the protocol allows only a small risk that PHI may be disclosed to an anticipated recipient.
The goal of the Trend project is to provide a near real-time view of the changes in pathogen prevalence; therefore, it is important to be able to retrieve the date when a FilmArray test is performed. However, retrieving the date conflicts with the Safe Harbor approach because the date of a test is PHI (the year of the test is not PHI but working with just the year defeats the purpose of tracking prevalence through the season). For this reason we followed the Expert Determination approach to manage data export.

The study took into consideration data that are available on participating clinical laboratory FilmArray Instruments, BioFire's own customer database, the proposed Trend database and publicly available data sources. We analyzed how combinations of this information could be used by an adversary to identify an individual in the dataset thereby disclosing PHI [98]. The results of this study (summarized in Table 2) provided recommendations for development and site enrollment criteria for Trend, and for BioFire operating procedures.

Test Location | High (likely continually associated with a patient) | Medium (may be self-disclosed in public) | Limited to sites drawing from > 20,000 pop.
Instrument Name | Low (not associated with patient) | Low (only available to institution) | No Action

In accord with the recommended actions, information in the fields that may be used to distinguish a patient is obfuscated (through truncation or binning) to ensure that a combination of these fields cannot be used to identify a specific patient [97]. For example, the time and date of the test are dynamically binned so that a minimum number of tests of one panel type are included in each bin prior to export. This ensures that a sufficient quantity of test results are uploaded to the database from one site at one time so that there is very low risk that patient identity can be inferred from knowledge of the start time of the test.
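The dynamic binning idea described above can be illustrated with a short sketch. This is an assumption-laden illustration, not the Trend client's algorithm: the function name `bin_tests`, the `min_per_bin` parameter and its value are hypothetical, and the source does not state the actual minimum bin size. Tests of one panel type are accumulated in time order, and a bin is released for export only once it holds the minimum number of tests, so an exported record carries a time range rather than an exact start time.

```python
from datetime import datetime, timedelta

def bin_tests(timestamps, min_per_bin):
    """Group sequential test times into bins of at least min_per_bin tests.

    Returns (exported, pending): exported bins carry only a start/end range
    and a count; pending tests are held back until their bin fills.
    min_per_bin is a hypothetical parameter, not a documented Trend setting.
    """
    exported, pending = [], []
    for ts in sorted(timestamps):
        pending.append(ts)
        if len(pending) >= min_per_bin:
            exported.append(
                {"start": pending[0], "end": pending[-1], "count": len(pending)}
            )
            pending = []
    return exported, pending

start = datetime(2016, 1, 4, 8, 0)
times = [start + timedelta(hours=h) for h in (0, 1, 5, 9, 30)]
bins, held_back = bin_tests(times, min_per_bin=2)
print(len(bins), len(held_back))
```

Because a bin closes as soon as it fills, bins close sooner when tests arrive in quick succession, which is consistent with the observation that results are exported faster during periods of peak use.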
If an adversary were to infiltrate the safeguards of the database, and wished to know specific patient test results from a specific location on a given day, no unique records would exist. The combination of deleting the sample identification (ID) field, binning the test date range, and truncating the FilmArray pouch serial number ensures that the remaining information is never unique, which indicates that there is a low risk of misuse of data.

FilmArray Trend Client Software and Database
The Trend client software resides on the computer associated with the FilmArray Instrument(s).
The computer makes a secure, HTTPS, connection to services hosted by Amazon Web Services. Authenticated data submissions are stored in a database hosted by Amazon Relational Database Service. Both services have been configured to be HIPAA Security Rule compliant [99]. Trend client software requires that the computer has Internet access to make secure outbound web requests. The HTTPS protocol is industry standard technology used for secure banking and web

Figure 5 - Figure Supplement 1: Linear regression of MIE and observed co-detections
The time series data in Figure 5B shown as a scatter plot. The equation of the linear regression is: MIE = 4.05153 * Co-Detection rate - 0.0206, with an R² value of 0.9003.
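As a worked example of the regression reported above, the sketch below applies the fitted line in both directions. It assumes that the co-detection rate is expressed as a fraction (for example 0.077 for 7.7%), which the caption does not state explicitly; the function names are hypothetical.

```python
# Coefficients as reported in the figure supplement caption.
SLOPE = 4.05153
INTERCEPT = -0.0206

def mie_from_codetection(rate):
    """MIE predicted by the fitted line for a co-detection rate.
    Assumption: rate is a fraction (0.077 means 7.7%)."""
    return SLOPE * rate + INTERCEPT

def codetection_from_mie(mie):
    """Invert the fitted line to estimate the co-detection rate from MIE."""
    return (mie - INTERCEPT) / SLOPE

mie = mie_from_codetection(0.077)
print(round(mie, 4))
print(round(codetection_from_mie(mie), 3))
```

Under that assumption, the average co-detection rate of 7.7% corresponds to an MIE of roughly 0.29 on the fitted line, which is consistent with the statement that MIE is higher than the observed co-detection rate.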