Vol. 72, No. 2, 2011
Issue release date: October 2011
Section title: Original Paper
Free Access
Hum Hered 2011;72:85–97
(DOI:10.1159/000330579)

Power of Data Mining Methods to Detect Genetic Associations and Interactions

Molinaro A.M. (a) · Carriero N. (b) · Bjornson R. (b) · Hartge P. (c) · Rothman N. (c) · Chatterjee N. (c)
(a) Division of Biostatistics, School of Public Health, and (b) Department of Computer Science, Yale University, New Haven, Conn., and (c) Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Md., USA

Abstract

Background: Genetic association studies have thus far focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need to model epistasis, or gene-gene interactions, to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. Understanding the power of such methods to discover genetic associations in the presence of complex interactions is therefore of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR).

Methods: We use the algorithm-specific variable importance measures (VIMs) as test statistics and employ permutation-based resampling to generate the null distribution and the associated p values. The power of the three methods is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma.

Results: The power of RF is the highest in all simulation models, that of MCLR is similar to that of RF in half of the models, and that of MDR is consistently the lowest.

Conclusions: Our study indicates that the power of RF VIMs is the most reliable. However, in addition to the tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
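The general procedure described under Methods (an importance statistic, a permutation-based null distribution, and power estimated over simulated data sets) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `importance` function is a simple stand-in statistic rather than an RF, MCLR, or MDR VIM, and the generative model, the minor allele frequency of 0.3, the sample sizes, and all function names are assumptions made for the example.

```python
import random


def importance(genotypes, labels):
    """Stand-in importance statistic (an assumption, NOT an RF/MCLR/MDR VIM):
    absolute difference in mean minor-allele count between cases and controls."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(genotypes, labels, n_perm=200, rng=None):
    """Permute the case-control labels to build the null distribution
    of the importance statistic and return the permutation p value."""
    rng = rng or random.Random(0)
    observed = importance(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if importance(genotypes, shuffled) >= observed:
            exceed += 1
    # The add-one correction keeps the estimated p value strictly positive.
    return (exceed + 1) / (n_perm + 1)


def simulate_dataset(n, effect, rng):
    """Hypothetical generative model: one SNP coded 0/1/2 with minor allele
    frequency 0.3; the case probability rises linearly with allele count."""
    genotypes, labels = [], []
    for _ in range(n):
        g = sum(rng.random() < 0.3 for _ in range(2))
        p_case = min(1.0, 0.3 + effect * g)
        genotypes.append(g)
        labels.append(1 if rng.random() < p_case else 0)
    return genotypes, labels


def estimate_power(effect, n=200, n_sim=50, alpha=0.05, seed=1):
    """Fraction of simulated data sets in which the permutation p value
    falls at or below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        genotypes, labels = simulate_dataset(n, effect, rng)
        if permutation_p_value(genotypes, labels, rng=rng) <= alpha:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    print("estimated power, effect 0.3:", estimate_power(0.3))
    print("estimated rejection rate under the null:", estimate_power(0.0))
```

With `effect=0.0` the rejection rate approximates the type I error rate rather than power, which gives a quick check that the permutation null is calibrated. Substituting an algorithm-specific VIM for `importance` would leave the rest of the sketch unchanged.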

© 2011 S. Karger AG, Basel


  

Key Words

  • Genetic associations
  • Power
  • Random forests
  • SNP
  • Variable importance measure

  

Author Contacts

Annette M. Molinaro
Division of Biostatistics
School of Public Health, Yale University
New Haven, CT 06519 (USA)
E-Mail annette.molinaro@yale.edu

  

Article Information

Received: January 6, 2011
Accepted: July 4, 2011
Published online: September 17, 2011
Number of Print Pages : 13
Number of Figures : 5, Number of Tables : 2, Number of References : 34
Additional supplementary material is available online - Number of Parts : 1

  

Publication Details

Human Heredity (International Journal of Human and Medical Genetics)

Vol. 72, No. 2, Year 2011 (Cover Date: October 2011)

Journal Editor: Devoto M. (Philadelphia, Pa./Rome)
ISSN: 0001-5652 (Print), eISSN: 1423-0062 (Online)

For additional information: http://www.karger.com/HHE


Copyright / Drug Dosage / Disclaimer

Copyright: All rights reserved. No part of this publication may be translated into other languages, reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, microcopying, or by any information storage and retrieval system, without permission in writing from the publisher or, in the case of photocopying, direct payment of a specified fee to the Copyright Clearance Center.
Drug Dosage: The authors and the publisher have exerted every effort to ensure that drug selection and dosage set forth in this text are in accord with current recommendations and practice at the time of publication. However, in view of ongoing research, changes in government regulations, and the constant flow of information relating to drug therapy and drug reactions, the reader is urged to check the package insert for each drug for any changes in indications and dosage and for added warnings and precautions. This is particularly important when the recommended agent is a new and/or infrequently employed drug.
Disclaimer: The statements, opinions and data contained in this publication are solely those of the individual authors and contributors and not of the publishers and the editor(s). The appearance of advertisements or/and product references in the publication is not a warranty, endorsement, or approval of the products or services advertised or of their effectiveness, quality or safety. The publisher and the editor(s) disclaim responsibility for any injury to persons or property resulting from any ideas, methods, instructions or products referred to in the content or advertisements.



