Abstract
Background: Disease outbreak investigation is a key aspect of public health. Whole-genome sequencing of bacterial pathogens based on new-generation high-throughput sequencing technologies has recently facilitated outbreak investigations. Whilst the approach has become more affordable and accessible to research and clinical laboratories, a system for adequate and efficient analysis of genome data in the context of bacterial outbreak investigations is missing. Methods: We performed a literature review of timely genomic investigations that were carried out during the course of bacterial outbreaks and were based on new-generation sequencing technologies. Currently available bioinformatics tools for genomic analyses are also reviewed here. Results: Genomic investigations in the early stages of bacterial outbreaks have been shown to provide timely information on the evolutionary origin, transmission route, pathogenic potential, and resistance profile of the outbreak strains and to allow development of strain-specific typing methods. A systematic genomic analytical workflow is proposed here for the first time to facilitate efficient extraction of epidemiologically useful information from genome data of bacterial pathogens in future bacterial outbreak investigations. Conclusion: With the continuous reduction of genome sequencing cost and development of user-friendly analytical tools, it is expected that high-throughput genome sequencing will be applied routinely for timely genomic analysis in bacterial outbreaks in the near future.
Introduction
Disease outbreaks represent one of the major public health threats worldwide. In 2010, there were 857 reported foodborne outbreaks in the US (http://wwwn.cdc.gov/foodborneoutbreaks). Timely identification and characterization of the pathogen at the onset of a deadly outbreak, including its pathogenic potential, antimicrobial resistance and route of transmission, could save lives. Pulsed-field gel electrophoresis (PFGE) is the current gold standard for molecular typing of bacterial pathogens. However, inter-laboratory pattern comparisons can be difficult and subjective [1]. More importantly, the approach lacks resolution for closely related isolates and is not well suited to the detection of novel strains. In addition, conventional strain characterization methods such as virulence assays and antimicrobial susceptibility tests are time-consuming and labor-intensive [2].
Whole-genome sequencing (WGS) takes into account all the genetic information within a genome and is able to provide the ultimate resolution possible and allows the discovery of the ‘unknown unknowns’ [3]. The recent advent of next- and third-generation high-throughput sequencing technologies has greatly improved the speed of WGS; draft genome of a bacterium can now be generated within days [4]. In addition, the introduction of affordable benchtop new-generation sequencers has made WGS of bacterial pathogens feasible for small and medium-sized research and clinical laboratories [5]. The state-of-the-art sequencing technologies have recently been employed to study bacterial outbreaks, both in a retrospective manner [e.g. [6,7]] and during the course of outbreaks [3,8,9,10,11,12,13].
However, while sequencing a bacterial genome is no longer a limiting step, how to translate the complex genome data effectively and efficiently into information that could benefit population health has been one of the main topics of public health genomics. Unlike basic research studies, in which a genome can be analyzed for months with various approaches before final conclusions are drawn, deadly outbreak investigations have immediate concerns, for instance understanding the route of transmission and the antibiotic resistance profile of the bacterial pathogen. Also, although crowd-sourcing efforts could generate a massive amount of analytical output within short periods of time [3], redundant analyses may occur, and results that are not directly comparable may make interpretation of the data more difficult. It is believed that a system for the analysis of genome data has to be established for the public health system [14].
In this review, we revisit real life examples of genomic investigations based on WGS and the use of new-generation sequencers during early stages of bacterial outbreaks that provided timely information on evolutionary relatedness, pathogenic potential and antimicrobial resistance of the bacterial pathogens. We then propose a systematic genomic analytical workflow that aims to facilitate efficient extraction of epidemiologically useful information from genome data of bacterial pathogens in the context of outbreak investigations.
Timely Genomic Investigations during Bacterial Outbreaks
2008 Canadian Listeriosis Outbreak
Listeriosis is an infection caused by the Gram-positive bacterium Listeria monocytogenes. In the summer of 2008, an outbreak of listeriosis associated with ready-to-eat meat products occurred in Canada. The outbreak was caused by a L. monocytogenes strain of serotype 1/2a and resulted in more than 20 deaths and over 50 illnesses (http://www.phac-aspc.gc.ca/alert-alerte/listeria 200808-eng.php). Using the next-generation 454 pyrosequencer [15], draft genome sequences of 2 L. monocytogenes isolates were generated within 3 days [10] (table 1). Comparative genomic analysis revealed a novel plasmid in the primary outbreak isolate, which harbors cadmium resistance genes, cadA and cadC, that are associated with resistance to sanitizers used in food-processing facilities [16]. A novel Listeria phage and a genetic island which is unique among all other sequenced L. monocytogenes isolates and encodes putative translocation, resistance, and regulatory factors were also found in the primary outbreak isolate. Virulence factor investigations revealed the presence of an intact internalin-encoding inlA locus, which plays a role in the promotion of mammalian host cell invasion [17], and some additional internalin-like loci which may partly account for the pathogenicity of the outbreak strain. Phylogenetic analysis based on whole-genome alignment indicated that the outbreak isolates belong to clonal complex 8 within lineage II and are most closely related to strain EGDe isolated in 1924 [18]. The analysis demonstrated that lineage II strains of L. monocytogenes can also cause a large outbreak of severe invasive disease, despite the fact that listeriosis outbreaks are usually caused by strains belonging to serotype 4b in lineage I [19]. The study represents one of the first attempts to use next-generation DNA sequencing technology in an ongoing bacterial outbreak investigation and provides a proof-of-concept that the approach could offer real-time responses to bacterial outbreaks.
Multidrug-Resistant Acinetobacter baumannii
Military patients from Iraq and Afghanistan are often colonized with multidrug-resistant Acinetobacter baumannii (MDRAB) strains, which can subsequently cause nosocomial infections in civilian patients and healthcare workers [20]. In 2008, a hospital outbreak of MDRAB occurred in the UK, in which isolates of the pathogen were recovered from 2 civilian patients following admission of 4 military patients colonized with MDRAB in the same unit [11]. PFGE and variable number tandem repeat analyses grouped the outbreak isolates into European clone 1 [21] but generated indistinguishable profiles among them. Identical antimicrobial resistance profiles were also obtained for all 6 outbreak isolates. The transmission events thus remained unclear. Draft genomes of the outbreak isolates were generated by 454 pyrosequencing and mapped to the complete genome sequence of a reference strain that also belongs to European clone 1, which identified 3 well-validated single nucleotide polymorphisms (SNPs). One of the military isolates was found to bear the ancestral genotype at all 3 SNP loci, suggesting a transmission route from the wound of that military patient to the respiratory tract of the civilian patient in the adjacent bed. The study highlights the potential of using genome sequencing to examine transmission events in bacterial outbreaks.
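The reasoning behind the SNP-based inference can be illustrated with a minimal Python sketch. The isolate names and alleles below are hypothetical and are not the data from the study [11]; the sketch also assumes, as a simplification, that the reference strain carries the ancestral allele at each locus.

```python
from typing import Dict, List

def find_ancestral_isolates(genotypes: Dict[str, str], reference_alleles: str) -> List[str]:
    """Return isolates that match the reference (assumed ancestral) allele at every SNP locus."""
    return [name for name, alleles in genotypes.items() if alleles == reference_alleles]

def count_derived_alleles(genotypes: Dict[str, str], reference_alleles: str) -> Dict[str, int]:
    """Count, per isolate, the loci at which the allele differs from the reference."""
    return {
        name: sum(a != r for a, r in zip(alleles, reference_alleles))
        for name, alleles in genotypes.items()
    }

# Hypothetical alleles at 3 SNP loci (one character per locus).
reference_alleles = "GCT"
genotypes = {
    "military_1": "GCT",  # ancestral at all 3 loci
    "military_2": "ACT",
    "civilian_1": "ACT",
    "civilian_2": "ATT",
}

ancestral = find_ancestral_isolates(genotypes, reference_alleles)
derived_counts = count_derived_alleles(genotypes, reference_alleles)
```

An isolate with no derived alleles is a candidate source, and isolates that share derived alleles are candidates for onward transmission. Such a result only suggests the most parsimonious route and does not establish the time or mode of transmission.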
2010 Haitian Cholera Outbreak
Cholera, an acutely dehydrating diarrheal disease that could be deadly, is caused by the Gram-negative bacterium Vibrio cholerae [22]. In late 2010, a large outbreak of cholera started in Haiti, causing over 6,600 deaths and 0.47 million cases (http://new.paho.org/disasters/index.php?option=com content&task=view&id=1423&Itemid=1). Cholera had not been epidemic in Haiti for at least 100 years, and the origin of the outbreak was controversial [23]. Using the third-generation PacBio RS sequencing system [24], genome sequences of 2 clinical outbreak isolates of V. cholerae and 3 historical isolates from other regions of the world were determined [9]. Phylogenetic analyses based on core SNPs placed the outbreak isolates in group V of the seventh-pandemic group [25] and revealed a close relationship to the South Asian isolates from Bangladesh. Investigations on 20 previously described hyper-recombinant chromosomal elements [26] in the 5 V. cholerae genomes revealed structural variations in 3 regions: superintegron, VSP-2 and SXT, which in turn suggested a closer relationship of the 2 outbreak isolates to the Bangladesh strain CIRS101, isolated in 2002, than to the other Bangladesh strain M4, isolated in 2008. Detailed comparative genomic analysis of the 2 outbreak isolates with 3 additional outbreak isolates from the Centers for Disease Control and Prevention (CDC) indicated that the outbreak is clonal. In addition, the distant phylogenetic relationship between the Haitian outbreak isolates and those circulating in Latin America and the US Gulf Coast showed that the cholera epidemic in Haiti is not associated with climatic events, unlike some other cases of cholera epidemic [27]. Instead, the close relationship of the Haitian outbreak isolates with historical South Asian isolates from Bangladesh suggested that the Haitian epidemic is probably due to human activity that brought the V. cholerae strain from a distant geographic source to Haiti. 
The study represents the first application of third-generation sequencing in an ongoing bacterial outbreak and provides policy implications for public health officials on consideration of measures for controlling cholera [28].
2011 German Escherichia coli O104:H4 Outbreak
In mid 2011, a large outbreak of diarrhea with associated hemolytic-uremic syndrome (HUS) started in Germany, causing nearly 4,000 reported cases and over 40 deaths (http://www.ecdc.europa.eu/en/healthtopics/escherichiacoli/Pages/index.aspx). Diarrhea associated with HUS is usually caused by enterohemorrhagic E. coli (EHEC) of serotype O157:H7 [29]. However, the outbreak strain was serotyped to be O104:H4, a rare serotype of Shiga toxin-producing E. coli that had only been linked to sporadic cases of HUS [30]. The outbreak was also characterized by a higher incidence in adults, a higher incidence of HUS and a predominance of female patients among HUS cases, which are all unusual [31]. Using various next- and third-generation sequencers, draft genomes of 10 outbreak isolates, as well as some related and historical isolates, were made available within days [3,8,12,13] (table 1). A crowd-sourcing effort, in which analyses of the publicly released genome data were outsourced to bioinformaticians worldwide, was also set in motion in the early stage of the outbreak to gather analytical outputs rapidly (https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki).
Genomic comparisons of the outbreak strain with all previously sequenced complete genomes of E. coli revealed the enteroaggregative E. coli (EAEC) strain 55989, isolated in the late 1990s [32], to be the closest relative of the outbreak strain [3]. The result was confirmed by multi-locus sequence analysis (MLSA) [8] and whole-genome phylogenetic analysis [12]. Genes encoding virulence factors that are typical of EAEC were also found in the genomes of the outbreak strain [3]. However, a Shiga toxin-encoding phage, highly similar to a phage from EHEC O157:H7, was identified in the outbreak strain [8,13], although the locus of enterocyte effacement pathogenicity island (PAI) which is typical in EHEC [29] was found missing [3,8,13]. Comparison of genome sequences from the outbreak isolates derived from different patients suggested a stable genome of the outbreak strain during its infection in different hosts [8] as well as a clonal nature of the outbreak [13]. Two large plasmids were revealed in the outbreak strain; the larger plasmid is highly similar to the pEC Bactec plasmid that harbors extended-spectrum beta-lactamase genes of the TEM-1 and CTX-M-15 classes, and the smaller one is similar to the pAA plasmid found in EAEC 55989 but contains a rare type of aggregative adherence fimbria, AAF/I, instead of the more common AAF/III type [3,8,13]. Based on the characteristic presence of the AAF/I gene cluster, strain-specific diagnostic kits were designed and released for outbreak isolate identification 5 days after the release of the genome sequence data [3]. Genes involved in mercury resistance, tellurium resistance and antimicrobial resistance were also identified in the outbreak strain [8]. The scenario represents the largest sequencing effort on a bacterial pathogen using different high-throughput platforms in an outbreak investigation at the moment.
Genomic Analytical Workflow for Bacterial Outbreaks
As illustrated in the case examples reviewed above, timely genomic investigations in early stages of bacterial outbreaks could rapidly provide information on the evolutionary position, transmission route, pathogenic potential, and resistance information of the outbreak strains and allow development of quick strain-specific typing methods. In order to facilitate future investigations of bacterial outbreaks, here we propose a systematic genomic analytical workflow (fig. 1), with suggestions of some ready-to-use tools and pipelines (table 2), that is specifically designed to facilitate efficient extraction of epidemiologically useful information from genome data of bacterial pathogens in the context of outbreak investigations. It should be noted that since the analysis of new-generation sequencing data is a fast-evolving field in which analytical tools are constantly being improved and developed [33], tools and pipelines listed here are not aimed to be exhaustive.
Genome Sequencing, Assembly and Annotation
When a bacterial outbreak occurs, the causative pathogen is first isolated in pure culture and its genomic DNA is extracted. High-throughput sequencing of the whole bacterial genome is then feasible using any of the state-of-the-art new-generation sequencing platforms. Currently available next-generation sequencing platforms include 454 [15], Illumina [34] and SOLiD [35], while third-generation sequencing platforms include Ion Torrent [36] and PacBio [24]. The platforms employ different sequencing chemistries, have different sequencing throughput, generate sequence reads of different lengths, and are subject to different intrinsic errors. These platforms have been compared in a recent review [4].
Raw sequencing reads have to be assembled after genome sequencing. The purpose of genome assembly is to group the fragments of a DNA sequence into contigs, and then contigs into scaffolds, to reconstruct the original DNA sequence. Genomes can be assembled using either the de novo or the mapping approach [37]. The de novo approach is more mathematically complex and computationally demanding and is usually employed to reconstruct genomes that have never been sequenced before, while the mapping approach allows quicker assembly but is only feasible when a closely related reference sequence is available. Many genome assemblers are currently available. Examples for de novo assembly include Velvet [38], MIRA 3 [39] and Allora, and those for mapping assembly include BWA [40], SOAP2 [41] and Bowtie [42] (table 2). Bao et al. [43] recently compared the performance of various genome assemblers and suggested guidelines for tool selection under varying conditions. Finishing a genome assembly often requires a time-consuming gap-closure process. However, it has been shown that unfinished draft genomes of pathogens are informative enough in the context of emerging bacterial outbreaks [44]. The use of draft genomes for rapid outbreak investigations is therefore generally recommended.
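To make the idea of grouping fragments into contigs concrete, the following is a toy Python sketch of greedy overlap merging. It is an illustration only and not the algorithm of any assembler named above (Velvet, for example, is based on de Bruijn graphs); the reads and the minimum overlap length are hypothetical, and sequencing errors, reverse complements and repeats are ignored.

```python
from itertools import permutations
from typing import List

def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap until none remains."""
    contigs = list(dict.fromkeys(reads))  # remove duplicate reads, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, j in permutations(range(len(contigs)), 2):
            k = overlap(contigs[i], contigs[j], min_len)
            if k > best_len:
                best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

# Hypothetical reads sampled from the sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
contigs = greedy_assemble(reads, min_len=3)
```

When reads fail to overlap, for instance across an unsequenced gap, the function returns several contigs rather than one sequence, which is the situation a draft genome represents.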
After the genome is assembled, whether in draft or completed form, genome annotation follows. Genome annotation is a process of adding biological interpretations to DNA sequences and involves gene prediction and functional annotation. In gene prediction, a gene finder is applied to the genome sequence, producing a set of predicted protein-coding genes. Subsequent functional annotation attaches biological information to the set of predictions via sequence similarity searches against available databases. Various tools and pipelines have been developed for automatic genome annotation, including RAST [45], Gent [46] and DIYA [47]. However, none of them is capable of generating a functional annotation without any error and thus manual curation, in which experts are deployed to re-examine the prediction set, is always required. This could be assisted with GenePRIMP, a web-based post-processing pipeline that identifies erroneously predicted genes and which has been used by the US Department of Energy Joint Genome Institute on over 300 genomes [48].
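The gene prediction step can be illustrated with a toy open reading frame (ORF) scan in Python. This is a deliberately simplified sketch and not how the pipelines named above work: it scans only the forward strand, accepts only ATG as a start codon, and uses a hypothetical minimum length, whereas real gene finders rely on statistical models of coding sequence.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs.

    Coordinates are 0-based, end is exclusive, and the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # close the ORF and keep scanning this frame
    return sorted(orfs)

# Hypothetical sequence with one short ORF in the third reading frame.
orfs = find_orfs("CCATGAAATTTTAGGG", min_codons=3)
```

Each predicted ORF would then be translated and compared against sequence databases to attach a putative function, which is the functional annotation step described above.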
Analyses of Pathogenicity and Antimicrobial Resistance
Rapid identification and characterization of pathogenicity-related and antimicrobial resistance genes is crucial in order to quickly get information on what the emerging pathogen is capable of and to assure susceptibility to drugs of choice. These kinds of genes are usually harbored on plasmids, prophages and genomic islands (GIs), which are acquired via horizontal gene transfer.
Plasmids are self-replicating pieces of extrachromosomal DNA that usually carry virulence-related and antimicrobial resistance genes. For instance, acquisition of the virulence plasmid pINV makes enteroinvasive E. coli invasive [49], and the presence of plasmid-encoded Qnr protein confers quinolone resistance in various bacterial genera [50]. Prophages are bacteriophages that have physically integrated into genomes of their preferred bacterial host [51]. The presence of prophage sequences may allow some bacteria to become pathogenic or to acquire antimicrobial resistance. SPC-P1, for example, is a pathogenicity-associated prophage of Salmonella enterica serovar Paratyphi C [52]. Available tools that allow prophage identification include PHAST [53], Prophage Finder [54] and Prophinder [55]. Performance of these tools has been compared in a recent review [53].
GIs refer to horizontally transferred gene clusters that are typically 10–200 kb in size [56]. Several classes of GIs are recognized according to their gene content, including PAIs, resistance islands, secretion islands, and metabolic islands [57]. PAIs carry genes coding for virulence factors such as toxins and adhesins that confer pathogenicity to bacteria and resistance islands harbor genes related to antimicrobial resistance and metal resistance [58]. For instance, virulence-related genes are found within the Francisella PAI of F. tularensis LVS [59], and the presence of Salmonella GI confers multidrug resistance to S. Typhimurium DT104 [60]. Ready-to-use tools for GI prediction include IslandViewer [61], MobilomeFinder [62] and Alien Hunter [63].
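Because GIs are acquired horizontally, one signal that composition-based predictors commonly exploit is a region whose sequence composition departs from the rest of the genome. The Python sketch below flags windows with atypical GC content. It is a simplified illustration under stated assumptions and not the method of any specific tool listed above; the window size, step and threshold are hypothetical and far smaller than the kilobase-scale windows that would be used on a real genome.

```python
from typing import List, Tuple

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def atypical_windows(genome: str, window: int = 10, step: int = 5,
                     threshold: float = 0.25) -> List[Tuple[int, int, float]]:
    """Return (start, end, gc) for windows whose GC content deviates from the genome average."""
    genome = genome.upper()
    baseline = gc_content(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - baseline) >= threshold:
            hits.append((start, start + window, round(gc, 2)))
    return hits

# Hypothetical AT-rich sequence with a GC-rich insert at positions 30-39.
genome = "ATATATATAT" * 3 + "GCGCGCGCGC" + "ATATATATAT" * 3
hits = atypical_windows(genome)
```

Flagged windows are only candidates: composition can be atypical for reasons other than horizontal transfer, so predictions of this kind are usually combined with comparative evidence.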
Besides acquisition of virulence- and resistance-related gene elements via horizontal gene transfer, point mutations and DNA rearrangements might also contribute to pathogenicity and antimicrobial resistance of pathogens. An example includes SNPs in gyrA that could confer bacterial resistance against quinolones and fluoroquinolones [64]. Commonly used tools for point mutation detection include Samtools [65], GATK [66] and SOAPsnp [41]. On the other hand, conservation of synteny among genomes could be identified and analyzed using whole-genome aligners such as Mauve [67], MUMmer [68] and ACT [69] or circular genome viewers such as BRIG [70], DNAPlotter [71] and CGView [72].
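Point mutation detection from mapped reads can be sketched as a simple pileup: the bases covering each reference position are counted, and a position is reported when a non-reference base dominates. The Python sketch below is a toy version under stated assumptions and does not reproduce the models used by Samtools, GATK or SOAPsnp; the reads are hypothetical, are assumed to be already aligned without gaps, and the depth and fraction cut-offs are arbitrary.

```python
from collections import Counter
from typing import Dict, List, Tuple

def call_snps(reference: str, aligned_reads: List[Tuple[int, str]],
              min_depth: int = 3, min_fraction: float = 0.8) -> Dict[int, Tuple[str, str]]:
    """Return {position: (reference_base, variant_base)} for well-supported mismatches.

    aligned_reads holds (0-based start position on the reference, read sequence).
    """
    pileup = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1
    snps = {}
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            snps[pos] = (reference[pos], base)
    return snps

# Hypothetical reference and reads; every read carries a T at position 4.
reference = "ACGTACGTAC"
aligned_reads = [(0, "ACGTTCGT"), (1, "CGTTCGTA"), (2, "GTTCGTAC"), (4, "TCGTAC")]
snps = call_snps(reference, aligned_reads)
```

A candidate found this way, for example in gyrA, would still need to be checked against known resistance-associated positions before any conclusion about drug susceptibility is drawn.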
Several online databases contain collections of pathogenicity-related and antimicrobial resistance genes that could also facilitate rapid identification of such elements in the genomes of bacterial pathogens. For instance, the virulence factor database contains sequences of 418 experimentally demonstrated virulence factors and 2,353 virulence-factor-related genes from 24 genera of medically important bacterial pathogens [73], and the antibiotic resistance genes database contains sequences of 380 types of antimicrobial resistance genes that encode resistance to 249 antibiotics [74].
Elucidation of Phylogenetic Relatedness
Apart from knowing the pathogenic potential and antimicrobial resistance profile of outbreak strains, timely information on their phylogenetic relatedness to other strains is equally important to facilitate source tracking and to understand their evolutionary positions and routes of transmission. As illustrated in the German E. coli scenario, various approaches could be employed to address the issue. These include the average nucleotide identity method [3], core SNPs (http://bacpathgenomics.wordpress.com/2011/06/15/snp-base-phylogeny-confirms-similarity-of-e-coli-outbreak-to-eaec-ec55989/), MLSA [8], core genome open reading frames [12], core genome alignment [13], and an alignment-free approach [75]. Each has limitations: MLSA is based on information from only 7 housekeeping genes, the accuracy of alignment-based methods relies heavily on the sequence alignment, and alignment-free methods are sometimes criticized for their lack of biological background [76]. Nevertheless, among the approaches, those based on the entire core genome and concatenated SNP sets seem to be more common and well developed [e.g. [77,78]]. Core genome and core SNP data could be extracted from the input genomes using the online tool Panseq [79], which could also automatically create input files for phylogenetic tree building programs such as MEGA [80], RAxML [81] and PhyML [82]. The approach is thus more accessible, especially to non-bioinformaticians.
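The core SNP idea can be sketched in a few lines of Python: from an alignment of the input genomes, keep only the columns in which every genome has an unambiguous base and at least two different bases occur, then count pairwise differences over those columns. This is a minimal sketch under stated assumptions, not the procedure implemented in Panseq; the strain names and sequences are hypothetical, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def core_snp_columns(alignment: Dict[str, str]) -> List[int]:
    """Return alignment columns that are present in all genomes (A/C/G/T only) and variable."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    columns = []
    for i in range(length):
        bases = {s[i] for s in seqs}
        if bases <= set("ACGT") and len(bases) > 1:
            columns.append(i)
    return columns

def snp_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise number of differing core SNP positions, keyed by sorted name pairs."""
    columns = core_snp_columns(alignment)
    return {
        (a, b): sum(alignment[a][i] != alignment[b][i] for i in columns)
        for a, b in combinations(sorted(alignment), 2)
    }

# Hypothetical alignment; the last column has a gap and an N, so it is not core.
alignment = {
    "outbreak_1": "ACGTACGTA-",
    "outbreak_2": "ACGTACGTAC",
    "relative": "ACGAACGTAC",
    "distant": "TCGAACCTAN",
}
columns = core_snp_columns(alignment)
distances = snp_distances(alignment)
```

The concatenated core SNP columns, or a distance matrix such as the one computed here, would then serve as input to a tree building program, where an explicit substitution model replaces the raw difference counts.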
Conclusion and Perspectives
The genome of a bacterium contains more information than one can readily extract and make sense of, and this information can serve various purposes [83]. In the context of disease outbreak investigations, understanding the pathogenic potential and drug susceptibility of the pathogen, developing rapid strain-specific typing methods, knowing the route of transmission, and tracking the source of the outbreak should be the top priorities for outbreak control. In order to facilitate efficient and targeted extraction of such epidemiologically useful information from genome data of bacterial pathogens in future bacterial outbreak investigations, a genomic analytical workflow is developed here. In the case of a bacterial outbreak, studying elements such as plasmids, prophages and GIs, on which pathogenicity-related and antimicrobial resistance genes are usually located, allows the investigator to quickly learn what the emerging bacterial pathogen is capable of. This in turn allows the design of preventive measures and assists antibiotic treatment decisions so that susceptibility to the drugs of choice can be assured. DNA sequences unique to the outbreak strain may also be found in studying these elements, allowing the establishment of rapid strain-specific typing methods that could in turn help control and prevent further spread of the disease. On the other hand, by studying core SNPs or core genome alignments, the phylogenetic relatedness of the outbreak strain could be uncovered. Revealing the phylogenetic identity of an outbreak strain could affect treatment decisions and the implementation of preventive measures: when the outbreak strain is identified as a known one, effective antibiotics and preventive measures previously employed could simply be re-adopted. In addition, an evolutionary profile of the outbreak strain together with other related strains would allow source tracking and an understanding of the transmission route of the outbreak.
Although WGS with new-generation high-throughput sequencers has revolutionized genomic investigations during bacterial outbreaks, there is no simple path from genome sequence to an understanding of virulence or transmissibility [84]. For instance, although the most parsimonious transmission route was suggested in the case of the MDRAB outbreak, the exact time and mode of transmission remained undetermined [11]. For the 2010 Haitian cholera outbreak, additional epidemiologic investigations are required to understand exactly how the South Asian V. cholerae strain was introduced to Haiti [9]. It also remains unclear why the incidence of HUS in adults and in females was unusually high in the 2011 German E. coli outbreak. Also, as the current WGS approach relies on isolation of pure cultures, it cannot be applied directly to clinical samples in which a mixture of pathogen(s) and the normal microbiota is present [85]. Possible culture-independent approaches that may be used to tackle the problem include metagenomic sequencing [86] and single cell genome sequencing [87]. However, these have to be tested in the clinical setting before being put into routine use.
The size and composition of the genomic database of bacterial strains is also crucial for subsequent biological interpretations, and a lack of representative members could result in biased or even wrong interpretation of data. Taking the 2011 German E. coli case as an example, EAEC 55989 had been identified as the closest relative of the outbreak strain before the genome sequence of the 2001 isolate from the HUS-associated E. coli collection [88] was made available. However, later phylogenetic analysis revealed a closer relationship of the outbreak strain to the 2001 isolate than to EAEC 55989 [75]. This example demonstrated how the availability of the E. coli collection facilitated phylogenetic grouping and also revealed the need of additional genome sequences from related strains. Indeed, apart from the benefits to ongoing outbreaks, an expanded genomic database of clinical bacterial isolates might also allow detection of outbreaks in advance [5].
Although the reduction in the cost of WGS has made genomic analysis of bacterial pathogens more affordable to small and medium-sized clinical laboratories, many of these laboratories are still using conventional methods such as PFGE and lack instrumentation for genome sequencing. Also, although our genomic analytical workflow provides a clear and more focused direction of data analysis that is specifically designed for the purpose of outbreak investigations, a certain level of knowledge of bioinformatics is still required with the present set of available tools. As it is currently unrealistic to have dedicated bioinformatics specialists in every diagnostic laboratory, we expect an initial introduction of the WGS approach into country-level or regional core sequencing centers in which specialized technical expertise is present. Once public health officials are better informed about genomics and more user-friendly bioinformatics tools become available, decentralization is expected, with genome sequencers placed first in local public health laboratories and then in diagnostic laboratories in hospitals across countries.
For WGS to be used in routine disease outbreak investigations, the CDC and the European Centre for Disease Prevention and Control (ECDC) have to play important leading roles. For instance, they should coordinate regional and national meetings that bring together scientists, public health practitioners and academia for discussions. It is also important for the CDC and ECDC to coordinate and provide adequate training activities for public health professionals to enrich their knowledge of genomics so that correct decisions can be made from the genomic information obtained. The CDC and ECDC should also establish and maintain an international database for central storage and sharing of genome data of bacterial pathogens for the purpose of disease outbreak investigations.
We are now in a new era of high-throughput, genome-based epidemiology. WGS would soon provide a cost-effective alternative to the conventional methods [10] or even replace them [11,89]. It is therefore inevitable for public health laboratories to prepare for appropriate analysis and interpretation of genome data in the context of molecular epidemiology [10]. We move one step forward here by proposing a genomic analytical workflow that provides a clear and focused direction for efficient extraction of epidemiologically useful information from the complex genome data of bacterial pathogens for the purpose of outbreak investigations, and which should be able to facilitate and accelerate the revolution of outbreak genomics. In the near future, perhaps with the help of a further reduction of genome sequencing cost and development of user-friendly and powerful analytical pipelines and tools, we can perform routine genomic investigations to fight against bacterial outbreaks.
Acknowledgements
This work is supported by RFCID CHP-PH-06 from Food and Health Bureau of Hong Kong SAR, China.