This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
The time lag in detecting disease outbreaks remains a threat to global health security. The advancement of technology has made health-related data and other indicator activities easily accessible for syndromic surveillance of various datasets. At the heart of disease surveillance lies the clustering algorithm, which groups data with similar characteristics (spatial, temporal, or both) to uncover significant disease outbreak. Despite these developments, there is a lack of updated reviews of trends and modelling options in cluster detection algorithms.
Our purpose was to systematically review practically implemented disease surveillance clustering algorithms relating to temporal, spatial, and spatiotemporal clustering mechanisms for their usage and performance efficacies, and to develop an efficient cluster detection mechanism framework.
We conducted a systematic review exploring Google Scholar, ScienceDirect, PubMed, IEEE Xplore, ACM Digital Library, and Scopus. Between January and March 2018, we conducted the literature search for articles published to date in English in peer-reviewed journals. The main eligibility criteria were studies that (1) examined a practically implemented syndromic surveillance system with cluster detection mechanisms, including over-the-counter medication, school and work absenteeism, and disease surveillance relating to the presymptomatic stage; and (2) focused on surveillance of infectious diseases. We identified relevant articles using the title, keywords, and abstracts as a preliminary filter with the inclusion criteria, and then conducted a full-text review of the relevant articles. We then developed a framework for cluster detection mechanisms for various syndromic surveillance systems based on the review.
The search identified a total of 5936 articles. Removal of duplicates resulted in 5839 articles. After an initial review of the titles, we excluded 4165 articles, with 1674 remaining. Reading of abstracts and keywords eliminated 1549 further records. An in-depth assessment of the remaining 125 articles resulted in a total of 27 articles for inclusion in the review. The result indicated that various clustering and aberration detection algorithms have been empirically implemented or assessed with real data and tested. Based on the findings of the review, we subsequently developed a framework to include data processing, clustering and aberration detection, visualization, and alerts and alarms.
The review identified various algorithms that have been practically implemented and tested. These results might foster the development of effective and efficient cluster detection mechanisms in empirical syndromic surveillance systems relating to a broad spectrum of space, time, or space-time.
Late detection of disease outbreaks has long been a threat to global health security, costing the world many lives, resources, fear, and panic. Case-fatality rates of pandemic diseases are still rising, the most recent being Ebola virus disease in Liberia, West Africa, the Democratic Republic of the Congo, and Uganda [
Traditional surveillance systems are mostly passive and rely on laboratory confirmations to detect disease outbreaks. These have been enhanced by syndromic surveillance systems [
Generally, outbreaks of infectious or communicable diseases are more likely to present in cluster form either in space, time, or both [
Clustering approaches can be roughly categorized as temporal, spatial, and spatiotemporal. Spatial clustering uses multidimensional vectors with longitudinal and latitudinal coordinates. There are variety of related algorithms, such as density-based spatial clustering of applications with noise (DBSCAN) [
There have been notable efforts to bridge the gap between a disease outbreak and its late detection. Research in syndromic surveillance is aimed at detecting disease outbreaks at the presymptomatic stage [
We developed the inclusion and exclusion criteria based on the objective of the study and through rigorous discussions among the authors. For an article to be included in the review, the study required the following criteria: (1) a study of a practically implemented syndromic surveillance system with cluster detection mechanisms or that was thoroughly assessed with real data (such studies also contributed to the understanding of how privacy and security-preserving methods could be adopted in related studies), (2) a focus on surveillance of infectious diseases such as influenza, cholera, severe acute respiratory syndrome, and Ebola virus disease, (3) a focus on humans, (4) reported in English, (5) journal articles, conference papers, or presentations.
All searches were done without restriction on time boundaries. We excluded any article outside the above-stated scope.
We conducted a literature search between January and March 2018 in Google Scholar, ScienceDirect, PubMed, IEEE Xplore, ACM Digital Library, and Scopus. We used keywords such as “spatiotemporal clustering,” “syndromic surveillance,” “real time,” “cell phone,” “mobile phone,” “smart phone,” “trajectory,” “aberration detection,” and “clustering.” To improve the search strategy, we combined keywords using the Boolean operators AND, OR, and NOT. We considered peer-reviewed journals and articles.
Guided by the inclusion and exclusion criteria, we conducted a basic filtering by skimming the titles, abstracts, and keywords to retrieve records that seemed relevant. We removed duplicates and fully read and judged articles that seemed relevant based on the inclusion and exclusion criteria. We retrieved other relevant articles from the reference lists of the accepted articles. We recorded the article selection and screening in a Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram [
We developed our data collection and categorization methods based on the objective and through literature reviews and discussions among the authors. We defined the categories exclusively to assess, analyzed, and evaluate study (
Data categories and their definitions.
Category | Definition |
Clustering and aberration detection algorithm | The kind of clustering and aberration detection algorithm used and implemented in the study. |
Type of clustering algorithm | The type of algorithm used (spatial, temporal, or spatiotemporal algorithm). |
Threshold | The type of threshold used to generate alarms and alerts in the study. |
Design method | The design method used in implementing the system, such as prototype, participatory or joint application development, or agile or waterfall model. |
Evaluation criteria | The criteria used to evaluate the algorithms. |
Performance metrics | The performance metrics used to evaluate the algorithms, such as sensitivity, specificity, and positive predictive value. |
Type of location | Locations used in clustering, including geolocation, postal codes, and counties; specifies the exact type of location used in the system. |
Source of location | Where the type of location information was obtained. |
Nature of location | State of the location as static or dynamic. |
Visualization tool | The type of tool used to implement the visualization aspect of the system. |
Display report | The type of visual displays (eg, graphs, maps, time series) implemented by the various systems in the study. |
Design layout | The stages and processes used in the architectural design of the syndromic surveillance system (eg, a layout may consist of data acquisition, clustering and aberration detection, and visualization [ |
We assessed, analyzed, and evaluated eligible articles based on the above-defined categories. We analyzed each of the categories listed in
We used state-of-the-art methods from the review as input to develop a cluster detection mechanism framework for disease surveillance systems, including those relating to emergency department records, school and work absenteeism, over-the-counter drugs, and medication sales.
Our search of the various online databases found a total of 5936 records. Removal of 97 duplicates resulted in 5839 records. An initial reading of titles excluded 4165 articles. We excluded a total of 1549 through skimming of abstracts and keywords. An in-depth full-text analysis of the resulting 125 articles, guided by the inclusion and exclusion criteria, excluded 98 articles. Thus, we included a total of 27 articles in the qualitative synthesis (
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the literature review process.
We assessed, analyzed, and evaluated the 27 articles based on the above-defined categories. The following sections describe the findings.
Among the 3 types, namely spatial, temporal, and spatiotemporal, of clustering algorithms, the spatiotemporal algorithm (19/50, 38%) was the most preferred approach, followed by spatial (16/50, 32%) and temporal algorithms (15/50, 30%).
A variety of clustering and aberration detection algorithms were implemented in the reviewed articles. Space-time permutation scan statistic (STPSS) and CUSUM algorithms were most widely used, followed by space-time scan statistic and space scan statistic (
Summary of articles reviewed.
Reference (first author, year) | Target disease | Place | Period | Input source |
Gesteland, 2003 [ |
Bioterrorism | 2002, Olympics | 2002 | Chief complaints from emergency departments |
Yan, 2013 [ |
Infectious diseases | Rural China | 2012 | Symptoms of patients from health facilities, medication sales from pharmacies, and primary school absenteeism |
Maciejewski, 2009 [ |
Detection of public health emergencies | Indiana State Department of Health | 2009-2010 | Symptoms in emergency departments |
Thapen, 2016 [ |
Generalized disease nowcasting | United Kingdom | 2014 | Twitter and |
Thapen, 2016 [ |
Infectious diseases, eg, hay fever and flu | England and Wales | 2014 | |
Gomide, 2011 [ |
Dengue | Observatório da Dengue website (www.observatorio.inweb.org.br/dengue/) | N/Aa | |
Qi, 2013 [ |
Influenza infection | University campus | Spring 2011 | Movement trajectory |
Mathes, 2017 [ |
Infectious diseases | New York City | Since 2001 | Emergency department visits with infectious diseases such as cough, sore throat, and fever for influenzalike illness |
Yih, 2010 [ |
Acute illness for bioterrorism event | Greater Boston area, Greater Twin Cities area, Austin and Travis County, San Mateo County | 2007-2008 | Ambulatory care encounters |
Kleinman, 2005 [ |
Lower respiratory tract infection | Boston area | N/A | Ambulatory care encounters |
Dafni, 2004 [ |
Emergency department data | Athens, 2004 Olympic Games | 2002-2003 | Symptoms in emergency department |
Wagner, 2004 [ |
Infectious disease | Utah, Atlantic City | 1999 | Chief-complaint data |
Weng, 2015 [ |
Enterovirus and influenza | Taipei | 2010/2011 | School-based syndromes |
Maciejewski, 2010 [ |
Respiratory illness | State of Indiana | 2007 | Infectious disease |
Higgs, 2007 [ |
Comprehensive tuberculosis data | San Francisco homeless | 1991-2002 | Tuberculosis |
Ali, 2016 [ |
Infectious disease | Pakistan | 2011-2015 | Chief complaints from emergency departments |
Groeneveld, 2017 [ |
Infectious disease | Netherlands | 2014/2015 | Respiratory tract infection, hepatitis, and encephalitis/meningitis |
Kajita, 2017 [ |
Emergency department data | Los Angeles County Department of Public Health, 2015 Special Olympic Games | 2015 | Monitor health impact |
Choi, 2010 [ |
Infectious disease | Hong Kong | 2005 | Febrile patients |
Heffernan, 2004 [ |
Emergency department chief complaint | New York City Department of Health and Mental Hygiene | 2001-2002 | Infectious disease, eg, respiratory, fever, diarrhea, and vomiting |
Takahashi, 2008 [ |
Infectious disease | Massachusetts | 2005 | Daily syndromic surveillance data |
Besculides, 2005 [ |
Infectious disease | New York City | 2001-2002 | School absenteeism data |
Blake, 2016 [ |
Poliomyelitis outbreaks | N/A | 2003-2012 | Reporting of acute flaccid paralysis cases and laboratory confirmation |
Greene, 2012 [ |
Gastrointestinal disease outbreak detection | Kaiser Permanente Northern California | 2009 | Data streams from electronic medical records |
Vilain, 2016 [ |
Infectious disease | French Institute for Public Health Surveillance, Reunion Island | 2013-2014 | Emergency department visits |
Sharip, 2006 [ |
Infectious disease | Los Angeles County | 2003-2004 | Emergency department syndromic data |
Duangchaemkarn, 2017 [ |
Infectious disease | N/A | 2016-2017 | Chief complaint symptoms |
aN/A: not available.
Frequency of clustering and aberration detection algorithms (n=66).
Algorithm | Usage, n (%) |
Cumulative summation | 10 (15) |
Space-time permutation scan statistic | 10 (15) |
Space-time scan statistic | 5 (8) |
Space scan statistic | 4 (6) |
Kernel density | 3 (5) |
Moving average | 3 (5) |
Log-linear regression | 2 (3) |
Density-based spatial clustering of applications with noise | 2 (3) |
Recursive least square | 2 (3) |
Statistical process control | 2 (3) |
Autoregressive integrated moving average | 2 (3) |
Risk-adjusted support vector clustering | 1 (2) |
Bayesian spatial scan statistic | 1 (2) |
Exponentially weighted moving average | 1 (2) |
Flexible space-time scan statistic | 1 (2) |
k-means clustering | 1 (2) |
K-nearest neighbor with Haversine distance | 1 (2) |
Shewhart chart | 1 (2) |
Pulsar method | 1 (2) |
Risk-adjusted nearest neighbor hierarchical clustering | 1 (2) |
Small area regression and testing | 1 (2) |
Spatiotemporal density-based spatial clustering of applications with noise | 1 (2) |
What is strange about recent event | 1 (2) |
Bayesian space-time regression | 1 (2) |
Generalized linear mixed model | 1 (2) |
Generalized linear model | 1 (2) |
Holt-Winters exponential smoother | 1 (2) |
Temporal scan statistic | 1 (2) |
Modified Early Aberration Reporting System C2 | 1 (2) |
Temporal aberration detection | 1 (2) |
An aberration is detected mainly using thresholding mechanisms and, in this regard, various types of approaches were implemented in the reviewed articles. Recurrence interval (10/17, 37%) and
The most widely used performance metrics were sensitivity (11/25, 44%) and specificity (9/25, 36%), followed by timeliness (2/25, 8%), and consistency, correlation, and positive predictive value (each 1/25, 4%). The reviewed studies used various evaluation strategies, among which simulation with historical data (12/15, 80%) was the most widely used approach, followed by comparison with known outbreak (2/15, 13%) and power of cluster detection test (1/15, 7%).
At specificities and sensitivities ranging from 82% to 99.5%, spatial and spatiotemporal algorithms detected on average more cases (
Sensitivity and specificity of the evaluated algorithms.
Evaluation metrics of some algorithms.
Algorithms | Specificity (%) | Sensitivity (%) | Detected cases (n) |
Space-time permutation scan statistic | 82 | 83 | 26 |
Pulsar method | 97 | 85 | 223 |
Cumulative summation | 95 | 92 | 212 |
Space scan statistic | 95 | 89 | 790 |
Space-time scan statistic | 99 | 92 | 3 |
Flexible space-time scan statistic | 82 | 99.5 | 4 |
The studies used a variety of location type, nature, and source. The majority of studies used static location (22/26, 79%) and the rest used a dynamic location (6/26, 21%). The studies used various address: geocode (14/37, 50%), zip code (13/37, 46%), and county (1/37, 4%). Various sources of locations were used: patient health record (18/27, 64%), mobile device (4/27, 14%), Transport Control Protocol/Internet Protocol (3/27, 11%), county (1/27, 4%), and school address (1/27, 4%).
Clustering and aberration detection mechanisms in disease outbreaks need to be supported by excellent visualization tools and display to facilitate a quick response from the concerned bodies on the exact timing and place. In this regard, the reviewed articles used various kinds of tools: ArcGIS (3/9, 24%), Google Maps (2/9, 22%), Twilio (2/9, 22%), OpenStreetMap (1/9, 11%), and JFreeChart (1/9, 11%) were the most widely used. For displaying mechanisms, a map (14/30, 47%) was the most widely used, followed by time series (7/30, 27%), graphs (8/30, 23%), and color indicators (1/30, 3%).
We developed a conceptualized framework on cluster detection mechanisms (
Design layouts and their frequencies (n=22).
Design layout | Description | Usage, n (%) |
Data clustering and aberration detection, alarms and alerts (DCADAA) | This layout consists of obtaining data first. Then clustering and aberration detection are done, followed by generating alarms to create alerts of aberrations [ |
12 (55) |
Data clustering and aberration detection, visualization, alarms and alerts (DCAVAA) | A visualizing module is built in addition to processes defined in DCADAA [ |
1 (5) |
Data cleaning and transformation, clustering and aberration detection visualization, alarms and alerts | In addition to the DCAVAA layer, this layer has data cleaning and transformation features. | 3 (14) |
Data clustering, filtering or categorizing, aberration detection, alarms and alerts | In addition to DCADAA, this layout filters data or categorizes the data into some defined groups, either manually or by employing machine learning techniques. | 2 (9) |
Data clustering and aberration detection, privacy-preserving mechanism (DPVCAAA) | In addition to DCAVAA, this layout has privacy-preserving mechanisms, such as anonymization and pseudonymization [ |
2 (9) |
Real time, privacy-preserving mechanism, data clustering and aberration detection, alerts and alarms | On top of the DPVCAAA layout, there is an additional module for real-time data processing [ |
1 (5) |
User tracking, data clustering, aberration detection, visualization, alarms and alerts | In addition to DCAVAA, this layout tracks the user’s movement to obtain data. This is followed by validating the data before clustering and aberration detection [ |
1 (5) |
Cluster detection mechanism framework.
Generally, syndromic surveillance systems require input data varying from structured to semistructured data such as comma-separated values, xml, or JavaScript Object Notation (JSON) formats (
The preprocessing phase is to ensure that the input data is in the right format for the cluster and aberration detection phase to use. Therefore, the framework provides for data conversion. For instance, online data in xml format can be converted to JSON format. Missing data would also be handled in various ways. In most instances, missing data were excluded from the analysis [
Another provision is to ensure that privacy-preserving mechanisms are in place. This framework has a provision in the data preprocessing section to ensure that the input data are devoid of personal data. This would be done by following layout standards and regulations such as the General Data Protection Regulation established by the European Union [
The heart and brain of this framework is the cluster and aberration detection phase. In this layout, clusters and aberrations would be detected by considering the clustering and aberration detection algorithms found in the review. STPSS is very outstanding, since it does not require population-at-risk data to draw the expected baseline value. Rather, it uses the detected cases to determine the expected count [
The main output of the framework is timely alerts through alarms and visualizations of detected aberrations. In the studies, various visualization tools and output displays were used. Guided by the results and discussion sections of this review, ArcGIS or Google Maps can be used to implement the visualization module. This visual display would mainly be a map with other displays such as a time series and graph. The maps would indicate where and when clustering and aberrations occur. Also, alerts would be triggered through alarms and messaging.
The general objective of this study was to systematically review practically implemented disease surveillance algorithms for their usage and performance efficacies and to develop an efficient cluster detection mechanism framework. The results were targeted at individuals and organizations who want to implement efficient syndromic surveillance systems for applications such as over-the-counter medication, school and work absenteeism, and disease surveillance relating to presymptomatic stages, among others. The scope was to review the practically implemented state-of-the-art algorithms relating to temporal, spatial, and spatiotemporal clustering mechanisms. We proposed a framework based on the results of the review and considered various challenges, such as user mobility, privacy and confidentiality, and geographical location estimation. In exploring suitable algorithms, we included in the review studies that assessed syndromic surveillance systems with real data. In addition to thoroughly assessing these algorithms, such studies also contributed to the understanding of how privacy- and security-preserving methods could be adopted in related studies. This is also very important in this field, since personal data need to be handled properly in related studies to preserve security and privacy. For instance, in a related study [
Summary of the most used categories.
Category | Most used |
Clustering algorithm | Space-time permutation scan statistic |
Type of clustering | Spatiotemporal type |
Threshold | Recurrence interval |
Design method | Participatory design |
Evaluation method | Simulation with historical data |
Performance metric | Sensitivity |
Type of location | Geocode |
Source of location | Patient health record |
Nature of location source | Static |
Visualization tool used | ArcGIS |
Displayed output | Maps |
Layout | Data clustering and aberration detection, alarms and alerts |
The review identified various spatiotemporal algorithms used for disease surveillance systems, including STPSS, space-time scan statistic, generalized linear mixed model, Bayesian space-time regression, and flexible space-time scan statistic. Spatiotemporal methods generally aimed at detecting disease outbreaks in both spatial and temporal patterns.
STPSS, which was used in many of the studies, was developed to detect hot spots of space-time interaction within space and time pattern occurrences of diseases [
On the other hand, STPSS is more accurate when used for outbreaks that start locally [
The spatial methods we identified in this review were space scan statistic, kernel density, Bayesian spatial scan statistic, k-means clustering, DBSCAN, and K-nearest neighbor (K-NN). Unlike spatiotemporal algorithms, spatial algorithms basically concentrate on where aberrations would occur. This makes planning difficult for health management, since it is difficult to know when to implement health interventions, if potential outbreak areas are known. Thus, spatial algorithms are suggested to be implemented together with temporal algorithms [
As
CUSUM is a statistical control method that has traditionally been used for industrial process control. It has been predominantly used in tracking changes in average production process levels since the 1950s [
The most used threshold for aberration detection in spatiotemporal algorithms was the recurrence interval, possibly as a result of the combination of recurrence interval and Monte Carlo replication, which helps to easily determine and set the specificity of the system [
In the reviewed studies, CUSUM is a temporal algorithm that was mostly used together with special algorithms to form spatiotemporal algorithms [
Participatory design was mostly used at the design stage, while simulation with historical data was mostly used to evaluate the clusters in most of the algorithms. Historical data were mostly used perhaps because those records were known to have aberrations, making it possible and easy to determine the performance of the system. Sensitivity and specificity were the most used performance metrics in the evaluation. This could be because users wanted a system with reduced false-alarm rates.
Some of the algorithms were compared based on their performance metrics of sensitivity, specificity, timeliness, and positive predictive value (
An evaluation that was performed through injection of spikes of a known outbreak revealed low detection in the space and spaciotemporal algorithms [
In terms of location, geocodes of census tracking or hospitals and zip codes were mostly used as location points for the clustering algorithms. These data were mostly retrieved from patient health records. The dynamic nature of the sources of location caused a low count, which could have been because they have not been comparatively assessed and due to difficulties associated with acquiring and processing the dynamic nature of location source data for syndromic surveillance. Privacy-preserving polices and a high computational time requirement prohibited the use of exact location of persons for syndromic surveillance. Exact locations such as house numbers and tracking of individuals were mostly used for group data at the zip code or county level. Information on the exact place of infection is also vital for early prevention and control of morbidity and mortality. But these limitations often hamper the accuracy of information on place of infection, since the information collected often relates to the place of notification, which is usually far from the place of infection [
ArcGIS was mostly used to display graphs in the studies in this review. It is possible that maps were the most common display type because they can be used to represent both spatial and spatiotemporal data. This could have accounted for their high usage of 34% and 47% in their respective categories. In the system design layout category, most of the systems obtained data from various sources first. Clustering and aberration detection were done, followed by generating alarms to create alerts of aberrations. Tracking for data, acquiring data in real time, privacy-preserving mechanisms, filtering, and data cleaning were some of the layout processes employed in a few of the systems studied. The low rate of tracking persons for data sources could be due to legal, privacy, and ethical reasons [
Despite the numerous availabilities of disease surveillance algorithms, their lack of efficacy in detecting disease outbreaks remains a threat to global health security. To overcome this problem, the main objective of this study was to systematically review practically implemented disease surveillance algorithms for their usage and performance efficacies, and to develop an efficient framework. The results were targeted at individuals and organizations who wish to implement efficient syndromic surveillance systems in applications such as over-the-counter medication, school and work absenteeism, and disease surveillance relating to presymptomatic stage, among others. The scope was to review the practically implemented state-of-the-art algorithms relating to temporal, spatial, and spatiotemporal clustering mechanisms. We considered various challenges such as user mobility, privacy and confidentiality, and geographical location estimation.
The study revealed that STPSS and CUSUM were the most frequently implemented algorithms. These algorithms can be used in syndromic surveillance systems that are aimed at implementing state-of-the-art cluster detection mechanisms, although STPSS was shown to be efficient only in a surveillance system with a high rate of infections. Temporal and spatial algorithms such as CUSUM and K-NN can also be combined in an empirical study to achieve efficient results. This study provided wide data categorization, ranging from design of the system to the display of reports which we used in the development of the framework. These results might foster the development of effective and efficient cluster detection mechanisms in empirical syndromic surveillance systems relating to a broad spectrum of space, time, or space-time.
cumulative summation
density-based spatial clustering of applications with noise
JavaScript Object Notation
K-nearest neighbor
space-time permutation scan statistic
None declared.