Deep Sequencing in Pre- and Clinical Vaccine Research
Prachi P. · Donati C. · Masciopinto F. · Rappuoli R. · Bagnoli F.
Novartis Vaccines, Research Center, Siena, Italy
Corresponding Author
Via Fiorentina 1
IT-53100 Siena (Italy)
Vaccine research has experienced a quantum leap since the beginning of the genomics era. High-throughput sequencing techniques, ever-growing computing resources, and new bioinformatic algorithms are now changing the way we perform genomic studies. Whole genome sequencing will soon become the gold standard for phylogenetic and epidemiological studies and is already shedding new light on the dynamics of bacterial evolution. We believe that deep sequencing projects, together with structural studies on vaccine candidates, will make it possible to target conserved epitopes and avoid vaccine failure due to antigenic variability. Systems biology, which is expected to revolutionize vaccine research and clinical studies, relies heavily on high-throughput technologies such as RNA-seq. Furthermore, genomics is key to developing safer vaccines, and the accuracy of deep sequencing will make it possible to monitor the coverage of vaccines after their introduction on the market.
© 2013 S. Karger AG, Basel
First generation sequencing technology, based on the Sanger method (chain termination), has been the workhorse of molecular biology for more than 25 years. The first human genome sequence, obtained with first generation technologies, took roughly 10 years and 3 billion dollars [1,2]. In the last 7 years, the field of genomics has advanced dramatically, entering the so-called era of next generation sequencing (NGS, also referred to as 2nd generation sequencing). Compared with first generation technology, NGS is much cheaper and faster, and it has made possible whole genome studies that could never be afforded before; these are also referred to as massively parallel sequencing or deep sequencing projects (table 1). NGS technologies are based either on sequencing by synthesis on isolated groups of clonally amplified templates or on sequencing by ligation, with the reaction controlled by a polymerase or a ligase, respectively. Today, many NGS platforms are commercially available. Although these platforms vary in their engineering configuration and sequencing chemistry, they share the common technological feature of massively parallel sequencing, either of clonally amplified single molecules (Roche/454, Illumina/Solexa, Life/APG, Applied Biosystems/SOLiD, and Dover Systems/The Polonator) or of single DNA molecules in real time, where DNA synthesis is followed by its analysis without interruption (Pacific Biosciences/PacBio RS and VisiGen Biotechnologies/VisiGen).
Given the overwhelming amount of data generated by NGS projects, new algorithms have recently been developed for aligning the huge number of short reads produced by NGS sequencers, for identifying operons and recombination events, and for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs) [4,5].
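SNP-based phylogenies typically start from a matrix of pairwise SNP distances between isolates. The following is a minimal Python sketch of that first step only, assuming the reads have already been aligned and reduced to equal-length consensus sequences. The isolate names and sequences are invented for illustration, and the function names are our own, not those of any published tool.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where two sequences carry different bases.

    Positions with a gap ('-') or an ambiguous base ('N') in either
    sequence are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in skip and b not in skip
    )


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], int]:
    """Return the SNP distance for every unordered pair of isolates."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (not real data).
alignment = {
    "isolate_A": "ATGCCGTA",
    "isolate_B": "ATGCTGTA",
    "isolate_C": "ATGTTGNA",
}
distances = distance_matrix(alignment)
for pair, d in distances.items():
    print(pair, d)
```

A distance matrix of this kind can then be passed to a tree-building method such as neighbor joining; that step is not shown here.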
The efficiency of the new sequencers, coupled with random PCR amplification, allows sequencing of virtually any nucleic acid present in a sample. This approach has been named metagenomics and can be used to identify microorganisms in a sample without prior knowledge of their presence or their genome sequence [6,7]. Because it can detect non-cultivable species, metagenomics is making an unprecedented contribution to basic research by identifying new species in any ecological niche. Furthermore, as we will discuss in this review, deep sequencing is also becoming instrumental in vaccine research, in phylogenetic and epidemiological studies, and in predicting and monitoring vaccine efficacy and safety.
The first example of large-scale use of genomic information to identify potential vaccine targets was the attempt to develop a vaccine against serogroup B Neisseria meningitidis through the so-called reverse vaccinology approach. The idea behind the method is to mine the pathogen's genome with bioinformatic algorithms to identify coding sequences predicted to encode proteins that are exposed on the surface of the pathogen or secreted into the extracellular milieu. The rationale for this selection is the assumption that surface and secreted factors are exposed to the host's immune system and are therefore potential vaccine targets. Since then, the same approach has been applied to most human bacterial pathogens, leading to an explosion in vaccine candidate identification. In an attempt to measure the importance of the availability of bacterial genomes for vaccine discovery, we retrieved all patent applications on bacterial vaccines containing genomic information filed since the 1990s, when the first microbial genome was sequenced. The number of inventions increased enormously in the first years of the genomics era and has steadily decreased since 2002 (fig. 1). We think this pattern has the following explanation. Initially, the availability of genomes allowed the identification of many new vaccine targets, as demonstrated by reverse vaccinology. This is reflected in the sudden increase in patent filings after the publication of the first genomes in the 1990s (fig. 1). Later, the legal requirements for granting patent applications became more stringent, and in silico data needed to be corroborated by empirical evidence. Therefore, fewer patent applications, each containing more data, have been filed lately. We expect that the rate of vaccine discovery will soon be boosted by the advent of deep sequencing studies.
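The selection step of reverse vaccinology can be pictured as a filter over predicted proteins. The Python sketch below assumes that a localization prediction is already available for each coding sequence. The `PredictedProtein` record, the localization labels and the locus tags are all hypothetical and stand in for the output of whatever prediction tool is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedProtein:
    locus_tag: str
    localization: str  # hypothetical label produced by a localization predictor


# Localizations assumed to be accessible to the host immune system.
EXPOSED = {"surface", "secreted"}


def select_candidates(proteins: list[PredictedProtein]) -> list[str]:
    """Keep the locus tags of proteins predicted to be surface-exposed or secreted."""
    return [p.locus_tag for p in proteins if p.localization in EXPOSED]


# Hypothetical annotation of four coding sequences.
proteins = [
    PredictedProtein("locus_0001", "cytoplasmic"),
    PredictedProtein("locus_0002", "surface"),
    PredictedProtein("locus_0003", "secreted"),
    PredictedProtein("locus_0004", "inner_membrane"),
]
candidates = select_candidates(proteins)
print(candidates)
```

In practice the candidates selected in silico are only a starting list, since they still need to be expressed and tested experimentally.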
Indeed, deep sequencing, together with other high-throughput technologies, has the potential to provide important information on candidate vaccine targets beyond the nucleotide sequence. For example, reverse vaccinology was initially performed on a single genome, so its power to predict antigen coverage was low. With deep sequencing technologies, a representative collection of epidemiologically relevant strains is sequenced, and antigen selection takes into account the level of conservation of the antigens. Another example is the use of RNA-seq to determine the expression level of antigens under different conditions, including in infected tissues [13,14]. Increased gene expression during infection, if paralleled by a role in pathogenesis, is generally considered an important indication that a protein is a potential vaccine candidate. Indeed, vaccines targeting virulence factors may have 2 protective mechanisms: an immune response against the target pathogen and inhibition of its virulence mechanisms. This vaccine discovery approach was proposed earlier, with the transcription profile of antigens measured by DNA microarrays [15,16]. The technical advantages of RNA-seq over DNA microarray analysis, including more reliable and accurate quantitation of gene expression, cost-effectiveness and speed, are providing new fuel to this area of vaccine research. Furthermore, as discussed below, deep sequencing can be applied to the design of novel vaccines against variable pathogens and cancers.
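Taking antigen conservation into account can be as simple as asking, for each candidate, in what fraction of the sequenced strains the gene is present and how similar its product is to a reference. The Python sketch below assumes that the protein sequences from each strain are already aligned to the reference and have the same length, and that `None` marks a strain lacking the gene. The strain names and sequences are invented.

```python
from typing import Optional


def percent_identity(reference: str, other: str) -> float:
    """Percentage of identical residues between two aligned, equal-length sequences."""
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(reference, other))
    return 100.0 * matches / len(reference)


def conservation_summary(
    reference: str, panel: dict[str, Optional[str]]
) -> dict[str, float]:
    """Summarize presence and sequence identity of one antigen across a strain panel."""
    present = [seq for seq in panel.values() if seq is not None]
    identities = [percent_identity(reference, seq) for seq in present]
    return {
        "presence_fraction": len(present) / len(panel),
        "min_identity": min(identities) if identities else 0.0,
    }


# Hypothetical antigen in a panel of four strains (strain_3 lacks the gene).
reference = "MKTLLAVGAL"
panel = {
    "strain_1": "MKTLLAVGAL",
    "strain_2": "MKTLLSVGAL",
    "strain_3": None,
    "strain_4": "MKTLIAVGAL",
}
summary = conservation_summary(reference, panel)
print(summary)
```

Thresholds on presence and identity for retaining a candidate are a matter of judgment and are deliberately left out of the sketch.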
Most available vaccines are against pathogens whose antigens are relatively stable. Microbes with rapid and extensive antigenic variability remain a major challenge for vaccine researchers. The most striking example is the human immunodeficiency virus (HIV). Subunit vaccines derived from the HIV envelope were developed, tested in phase I and phase II clinical studies, and in the mid-1990s were ready to enter phase III efficacy studies. However, in vitro studies demonstrated that the antibodies induced by the vaccines neutralized only the virus strain used to make the vaccine and did not neutralize divergent viruses or primary viruses isolated from patients, owing to the extremely high rate at which the virus mutates its dominant antigens. A different approach, able to induce broadly neutralizing antibodies and CD8 T-cell responses against conserved epitopes, will probably be needed to develop an efficacious vaccine. Similar issues are also hampering progress towards broadly protective vaccines against several other viruses (e.g. rhinovirus and influenza). A combination of new technologies could help solve this so far insurmountable problem. Deep sequencing can be used to find variable and constant regions of the pathogen's genome in hundreds of isolates recovered from infected patients. Once conserved epitopes have been identified, structural studies on the antigens can be performed to understand the degree of surface exposure of the epitopes and to design peptides optimized to generate neutralizing antibodies.
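One common way to separate variable from constant regions in a set of aligned sequences is to compute the Shannon entropy of each alignment column and look for stretches of low entropy. The Python sketch below illustrates this on four invented, pre-aligned protein fragments. The window size and the entropy cutoff are arbitrary assumptions, and this is not presented as the method used in the studies cited above.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residues in one alignment column."""
    total = len(column)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(column).values()
    )


def conserved_windows(
    sequences: list[str], window: int, max_entropy: float = 0.0
) -> list[int]:
    """Return start positions of windows in which every column has low entropy."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    entropies = [
        column_entropy("".join(s[i] for s in sequences)) for i in range(length)
    ]
    return [
        start
        for start in range(length - window + 1)
        if all(e <= max_entropy for e in entropies[start:start + window])
    ]


# Hypothetical aligned fragments from four isolates (positions 3 and 6 vary).
sequences = ["MKTAYIAK", "MKTGYIAK", "MKTSYIAK", "MKTAYIVK"]
starts = conserved_windows(sequences, window=2)
print(starts)
```

With real data, the window would be matched to the expected epitope length and a small non-zero entropy cutoff would tolerate sequencing errors.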
A similar but reversed approach could be used to design anti-cancer vaccines. Indeed, during tumorigenesis, cancer cells accumulate mutations that generate antigenic variability. High-throughput sequencing has recently been used to identify mutations in murine melanoma cells (‘mutanome’), and the peptides containing the mutations were assessed for prophylactic and therapeutic vaccination in tumor transplant models. The response triggered by the treatment was shown to be specific for the mutated antigens and to control tumor growth. This approach could be used to identify cancer-specific epitopes from biopsies and to generate personalized vaccines.
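The core of a mutanome analysis is the comparison of tumor and normal sequences to extract peptides that span each mutation. The Python sketch below handles only single-residue substitutions in two equal-length protein sequences. The sequences, the flank length and the function name are illustrative assumptions; insertions, deletions and variant calling from reads are outside its scope.

```python
def mutated_peptides(
    normal: str, tumor: str, flank: int = 4
) -> list[tuple[str, str]]:
    """Return (mutation label, tumor peptide) for each amino acid substitution.

    The peptide contains the mutated residue plus up to `flank` residues
    on each side, truncated at the ends of the protein.
    """
    if len(normal) != len(tumor):
        raise ValueError("this sketch only handles substitutions, not indels")
    peptides = []
    for i, (n, t) in enumerate(zip(normal, tumor)):
        if n != t:
            start = max(0, i - flank)
            end = min(len(tumor), i + flank + 1)
            # Label uses 1-based positions, e.g. R10L.
            peptides.append((f"{n}{i + 1}{t}", tumor[start:end]))
    return peptides


# Hypothetical protein with one substitution in the tumor sample.
normal = "MKTAYIAKQRQISFVK"
tumor = "MKTAYIAKQLQISFVK"
result = mutated_peptides(normal, tumor)
print(result)
```

Candidate peptides obtained this way would still have to be tested for immunogenicity, as was done in the tumor transplant models described above.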
Comparative genomics has shown that genetic variability within bacterial species is much larger than expected, leading to the definition of the species pan-genome. From a practical point of view, this unexpected finding led to the conclusion that a solid understanding of the population genetics of a bacterial species is fundamental for the formulation of vaccines with broad coverage. To date, phylogenetics and population genetics have been based mostly on molecular typing methods, the most successful of which is multilocus sequence typing (MLST). However, it is now becoming evident that the level of resolution achievable with MLST is limited and that it should be replaced by whole-genome-based typing. Indeed, several recent studies have shown that strains belonging to the same clonal complex (CC), as defined by MLST, can differ substantially. One of the most striking examples is a recent analysis of Staphylococcus aureus CC30. High-throughput sequencing of a collection of historical and contemporary MRSA clones of S. aureus has shown that contemporary CC30 strains share a common ancestor with phage type 80/81, which was responsible for an epidemic wave in Australia, Great Britain, Canada, and the United States in the 1950s. Phylogenetic analysis using genome-wide SNPs has shown that the contemporary CC30 strains harbor SNPs that correlate with substantial differences in the pathogenicity of the strains. These observations could not have been predicted on the basis of MLST.
Genome-wide screening of SNPs is also instrumental in tracking the evolution, demographic expansion and geographic dispersal of a species, as was recently done for European strains of the S. aureus ST225 lineage; in identifying the dominant haplotypes and pathotypes of the monomorphic species Salmonella enterica serovar Typhi within a single urban district [26,27]; and in investigating historical pandemics of Yersinia pestis. Genome studies are also demonstrating that MLST-based epidemiology can fail to identify macroscopic differences among strains. For example, S. aureus strains of sequence type ST239 (clonal complex CC8, as determined by eBURST) display evidence of a large recombination event involving a region of approximately 557 kb that spans the origin of replication and appears to have been donated by a CC30 strain.
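A large imported region of this kind often shows up as a stretch of the genome with an unusually high density of SNPs relative to a closely related reference. The Python sketch below flags such stretches with fixed, non-overlapping windows. It is a simple heuristic given for illustration, not the method of the cited ST239 study, and the SNP positions, genome length and thresholds are invented.

```python
def dense_windows(
    positions: list[int], genome_length: int, window: int, min_snps: int
) -> list[tuple[int, int, int]]:
    """Return (start, end, count) for each window holding at least `min_snps` SNPs.

    Windows are non-overlapping, half-open intervals [start, end).
    """
    flagged = []
    for start in range(0, genome_length, window):
        end = min(start + window, genome_length)
        count = sum(start <= p < end for p in positions)
        if count >= min_snps:
            flagged.append((start, end, count))
    return flagged


# Hypothetical SNP positions on a 500-bp toy genome.
positions = [5, 120, 205, 210, 222, 231, 248, 260, 275, 290, 410]
flagged = dense_windows(positions, genome_length=500, window=100, min_snps=4)
print(flagged)
```

A circular chromosome and a region spanning the origin of replication would require wrapping the coordinates, which this sketch does not attempt.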
Another important and rapidly emerging application of deep sequencing is the post-marketing monitoring of vaccine coverage. The launch of the PCV7 pneumococcal vaccine in 2000 provided an unprecedented opportunity to measure vaccine-induced selective pressure. Since 2003, pneumococcal strains that have undergone a capsular switch from serotype 4 (contained in the vaccine) to serotype 19A (not in the vaccine) have been identified. Genomic information was shown to be critical for understanding this serological replacement event, revealing that vaccine strains can switch to non-vaccine capsule types by homologous recombination of the entire capsular biosynthesis locus. Since genome-wide studies of large collections of strains of the same lineage have shown that these exchange events occurred frequently in the past, even in the absence of selective pressure against a specific serotype, it is conceivable that the efficacy of existing vaccines could last less than expected. On the basis of these observations, several authors have proposed serotype-independent vaccines based on combinations of conserved proteins that would presumably cover all circulating strains and avoid the capsular switch phenomenon observed with PCV7.
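In surveillance data, a possible capsular switch appears as a single genetic lineage that carries both a vaccine and a non-vaccine serotype. The Python sketch below groups isolates by lineage and reports such cases. The seven PCV7 serotypes are listed from general knowledge rather than from this text, and the lineage labels and isolates are hypothetical.

```python
from collections import defaultdict

# Serotypes included in the 7-valent pneumococcal conjugate vaccine.
PCV7_SEROTYPES = {"4", "6B", "9V", "14", "18C", "19F", "23F"}


def candidate_switch_lineages(
    isolates: list[tuple[str, str]]
) -> dict[str, tuple[list[str], list[str]]]:
    """Map each lineage carrying both vaccine and non-vaccine serotypes
    to (vaccine serotypes, non-vaccine serotypes)."""
    by_lineage = defaultdict(set)
    for lineage, serotype in isolates:
        by_lineage[lineage].add(serotype)
    result = {}
    for lineage, serotypes in by_lineage.items():
        vaccine = serotypes & PCV7_SEROTYPES
        non_vaccine = serotypes - PCV7_SEROTYPES
        if vaccine and non_vaccine:
            result[lineage] = (sorted(vaccine), sorted(non_vaccine))
    return result


# Hypothetical (lineage, serotype) pairs from a surveillance collection.
isolates = [
    ("lineage_X", "4"),
    ("lineage_X", "19A"),
    ("lineage_Y", "14"),
    ("lineage_Z", "19A"),
]
switches = candidate_switch_lineages(isolates)
print(switches)
```

A lineage flagged in this way is only a candidate: confirming a switch requires comparing the capsular biosynthesis locus and its flanking regions at the sequence level.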
Recently, holistic approaches to identify and characterize immune responses to vaccines in humans have been proposed [17,33,34,35,36]. The driving idea behind these approaches is to integrate different sets of biological data from as many hierarchical levels as possible in order to visualize ‘emergent properties’ that are not displayed by the individual parts and cannot be predicted from the parts alone. The data sets considered so far differ depending on the approach, but they overlap to some degree. Vaccinomics focuses on unraveling associations between the host genetic background and its responses to vaccination [34,37]. Systems biology studies, on the other hand, have primarily analyzed the relationship between gene expression profiles and the immune responses triggered by vaccination in the host [17,38]. Recently, a systems biology approach was successfully used to predict the immune responses induced by the live-attenuated yellow fever vaccine YF-17D. Gene expression profiles induced in the blood of vaccinees were used to identify genes that regulate innate sensing of the virus and type I interferon production. In addition, computational analyses identified a gene signature that predicted CD8 T-cell and neutralizing antibody responses to YF-17D [39,40]. High-throughput technologies such as deep sequencing are vital to these kinds of approaches, because they facilitate the discovery of rare polymorphisms and alternative splice variants and make it possible to measure how vaccines affect host gene expression (RNA-seq) [35,38,41].
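Measuring how a vaccine affects host gene expression comes down, in its simplest form, to comparing normalized read counts before and after vaccination. The Python sketch below scales raw counts to counts per million and reports a log2 fold change per gene. The gene names and counts are invented, the pseudocount is an arbitrary choice, and a real analysis would use replicates and a dedicated statistical package rather than this bare calculation.

```python
import math


def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts by library size to counts per million (CPM)."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}


def log2_fold_changes(
    before: dict[str, int], after: dict[str, int], pseudocount: float = 1.0
) -> dict[str, float]:
    """log2(after / before) on CPM values for genes present in both samples.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    cpm_before = counts_per_million(before)
    cpm_after = counts_per_million(after)
    return {
        gene: math.log2(
            (cpm_after[gene] + pseudocount) / (cpm_before[gene] + pseudocount)
        )
        for gene in before
        if gene in after
    }


# Hypothetical read counts from blood before and after vaccination.
before = {"gene_A": 100, "gene_B": 400, "gene_C": 500}
after = {"gene_A": 400, "gene_B": 400, "gene_C": 200}
fold_changes = log2_fold_changes(before, after)
for gene, lfc in fold_changes.items():
    print(gene, round(lfc, 2))
```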
The availability of genomic sequences allows safer vaccines to be designed and developed. Indeed, by comparing the sequences of vaccine antigens with the human genome, it is possible to reveal homologies that could potentially elicit autoimmunity. Another recent application of genomics to vaccine safety is the identification of contaminant nucleic acids in licensed human vaccines through deep sequencing. In contrast to PCR-based strategies, which are used to identify specific target sequences, deep sequencing can reveal the presence of virtually every DNA or RNA molecule in a sample without prior knowledge of its sequence. Indeed, it has been shown that its application to live-attenuated vaccines can detect adventitious viruses, sequence changes in the attenuated virus and minority variants [42,43,44]. For example, deep sequencing unexpectedly revealed that the rotavirus vaccine Rotarix was contaminated with porcine circovirus-1 DNA. Furthermore, this technology can also be applied to the screening of cells and all the reagents used for the production of viral or subunit vaccines. Several viruses have been identified in mammalian as well as insect cells used in vaccine manufacturing. Therefore, high-throughput sequencing technologies will soon become a key factor in vaccine lot release as well as in the characterization and improvement of reagents used in vaccine production.
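A first-pass screen for homology between a vaccine antigen and human proteins can be sketched as a search for short peptides shared by both. The Python example below uses exact k-mer matching, which is a crude stand-in for a proper similarity search. The peptide length k, the protein names and all sequences are assumptions made for illustration.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """All overlapping peptides of length k in a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_peptides(
    antigen: str, human_proteins: dict[str, str], k: int = 8
) -> dict[str, list[str]]:
    """Map each human protein to the k-mers it shares exactly with the antigen."""
    antigen_kmers = kmers(antigen, k)
    hits = {}
    for name, sequence in human_proteins.items():
        common = antigen_kmers & kmers(sequence, k)
        if common:
            hits[name] = sorted(common)
    return hits


# Hypothetical antigen and two hypothetical human protein fragments.
antigen = "MKTAYIAKQRQISFVKSHFSRQ"
human_proteins = {
    "human_protein_1": "GGSAYIAKQRQIPLL",
    "human_protein_2": "MSDNELLQKAVE",
}
hits = shared_peptides(antigen, human_proteins, k=6)
print(hits)
```

Exact matching misses similar but non-identical stretches, so a hit list like this would normally be complemented by an alignment-based search.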
The importance of genomics for vaccine research and development is well established. Deep sequencing is now bringing new fuel to virtually every area of vaccinology. Since the year 2000, when the first reverse vaccinology genome study was published, the availability of NGS technology has contributed significantly to the improvement of the approach (table 2). The successful vaccines developed so far are those against relatively slowly evolving pathogens. To design vaccines that can cope with antigenic variability, we need deep sequencing to identify epitopes conserved among virtually all circulating strains. Using this technique, a medium-sized laboratory can sequence hundreds of isolates of the same species and, on the basis of this information, identify vaccine candidates as well as perform much more accurate phylogenetic studies than with traditional methods.
The possibility of using these techniques to monitor the host response to vaccination and disease in a large collection of individuals, and to correlate it with the genetic background of the host, will greatly facilitate vaccinomics and systems biology studies aimed at predicting and optimizing vaccine outcomes (i.e. maximizing the appropriate immune responses and minimizing vaccine failure and adverse events) and at discovering signatures of protection in humans.
In conclusion, the level of analytical resolution and the quality of vaccines that we can now achieve with the aid of deep sequencing are unprecedented. We believe that this is what the public needs to hear to be reassured about the safety of immunoprophylactic campaigns.
Open Access License: This is an Open Access article licensed under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC) (www.karger.com/OA-license), applicable to the online version of the article only. Distribution permitted for non-commercial purposes only.