Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy

The Earth BioGenome Project aims to produce reference genomes for all ~1.8 million known eukaryotic species over the next decade (refs. 1–4). Achieving this goal will require the current pace of reference genome production to increase by at least two orders of magnitude (ref. 1). Automation of the assembly process with a pipeline that is widely accessible to any research group will be required to achieve this speed-up. Enabling this goal requires sustained effort in three major areas: genome assembly optimization and best-practice development, computational infrastructure provisioning, and dissemination and training.

Data availability

The workflows, their description and instructions on how to use them can be found at https://galaxyproject.org/projects/vgp/workflows/. The requisite tools are installed on usegalaxy.org and usegalaxy.eu, and are in the process of being installed on usegalaxy.org.au. These genomes were supported by collaborators of the VGP and ERGA, and the QC analyses reported here to test the VGP Galaxy pipeline do not release those that are under specific embargo policies for genome-wide analyses (e.g., https://genome10k.ucsc.edu/data-use-policies/). New genome assemblies are available in the GenomeArk repository: https://www.genomeark.org/. After manual curation, the assemblies are submitted to the US National Center for Biotechnology Information (NCBI) under the BioProject Vertebrate Genome Project (ref. 17): https://www.ncbi.nlm.nih.gov/bioproject/489243.

References

  1. Hotaling, S., Kelley, J. L. & Frandsen, P. B. Proc. Natl Acad. Sci. USA 118, e2109019118 (2021).

  2. Formenti, G. et al. Trends Ecol. Evol. 37, 197–202 (2022).

  3. Theissinger, K. et al. Trends Genet. 39, 545–559 (2023).

  4. Lewin, H. A. et al. Proc. Natl Acad. Sci. USA 119, e2115635118 (2022).

  5. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Genome Biol. 21, 245 (2020).

  6. Galaxy Community. Nucleic Acids Res. 50, W345–W351 (2022).

  7. Lander, E. S. & Waterman, M. S. Genomics 2, 231–239 (1988).

  8. Bray, S. & Maier, W. Automating Galaxy workflows using the command line. Galaxy Training Network (2023).

  9. Galaxy Community. Galaxy Server administration. Galaxy Training Network https://github.com/galaxyproject/training-material (2019).

  10. Formenti, G. et al. Genome Biol. 22, 120 (2021).

  11. Uliano-Silva, M. et al. BMC Bioinform. 24, 288 (2023).

  12. Wenger, A. M. et al. Nat. Biotechnol. 37, 1155–1162 (2019).

  13. Batut, B. et al. Cell Syst. 6, 752–758.e1 (2018).

  14. Lariviere, D., Ostrovsky, A., Gallardo, C., Pickett, B. & Abueg, L. VGP assembly pipeline – short version. Galaxy Training Network (2023); https://gxy.io/GTN:T00040

  15. Rautiainen, M. et al. Nat. Biotechnol. 41, 1474–1482 (2023).

  16. Cheng, H., Asri, M., Lucas, J., Koren, S. & Li, H. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.03399 (2023).

  17. BioProject Vertebrate Genome Project. NCBI BioProject PRJNA489243 (accessed 18 January 2024); https://www.ncbi.nlm.nih.gov/bioproject/489243

Acknowledgements

We thank Yagoub Adam, Tyler Alioto, Jun Aruga, Diego De Panis, Sagane Dind, Diego Fuentes, Shilpa Garg and Jèssica Gómez for contributing to the initial implementation during ELIXIR Biohackathon 2021. We also thank Nate Jue for help testing and developing the pipeline tutorials and Andrea Guarracino for their useful comments on the manuscript. This work was supported in part by the Intramural Research Program of the US National Human Genome Research Institute (NHGRI), the US National Institutes of Health (NIH) and the Howard Hughes Medical Institute (HHMI). The authors are grateful to the broader Galaxy community for their support and software development efforts. This work is funded by NIH grants U41 HG006620, U24 HG010263, U24 CA231877 and U01 CA253481, along with US National Science Foundation grants 1661497, 1758800 and 2216612. The work was also supported in part by the Human Frontier Science Program (HFSP) RGP0025/2021, the Swiss National Science Foundation (SNSF) grants 202669 and 198691, the Swiss State Secretariat for Education, Research and Innovation (SERI) grant 22.00173 and Horizon Europe under the Biodiversity, Circular Economy and Environment program (REA.B.3, BGE 101059492). Usegalaxy.eu is supported by German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to B.G. Computational resources are provided by the Advanced Cyberinfrastructure Coordination Ecosystem (ACCESS-CI), Texas Advanced Computing Center, and the JetStream2 scientific cloud.

Author information

Author notes

  1. These authors contributed equally: Delphine Larivière, Linelle Abueg, Nadolina Brajuka

Authors and Affiliations

  1. Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA

    Delphine Larivière, Marius van den Beek & Anton Nekrutenko

  2. Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA

    Linelle Abueg, Nadolina Brajuka, Jennifer R. Balacco, Melanie Couture, Olivier Fedrigo, Grenville MacDonald Gooder, Kathleen Horan, Nivesh Jain, Cassidy Johnson, Brian O’Toole, Tatiana Tilley, Erich D. Jarvis & Giulio Formenti

  3. Bioinformatics Group, Department of Computer Science, Albert-Ludwigs University Freiburg, Freiburg, Germany

    Cristóbal Gallardo-Alba & Bjorn Grüning

  4. Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea

    Byung June Ko & Heebal Kim

  5. Departments of Biology and Computer Science, Johns Hopkins University, Baltimore, MD, USA

    Alex Ostrovsky & Michael C. Schatz

  6. Department of Medicine and Life Sciences (MELIS), Institut de Biologia Evolutiva, Universitat Pompeu Fabra-CSIC, Barcelona, Spain

    Marc Palmada-Flores & Tomas Marques-Bonet

  7. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA

    Brandon D. Pickett, Arang Rhie & Adam M. Phillippy

  8. Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA

    Keon Rabbani & Mark J. P. Chaisson

  9. CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Porto, Portugal

    Agostinho Antunes

  10. Department of Biology, Faculty of Sciences, University of Porto, Porto, Portugal

    Agostinho Antunes

  11. Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA

    Haoyu Cheng

  12. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

    Haoyu Cheng

  13. Wellcome Sanger Institute, Cambridge, UK

    Joanna Collins

  14. Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia

    Alexandra Denisova

  15. Department of Biosciences, University of Milan, Milan, Italy

    Guido Roberto Gallo

  16. BMRI, Weill Cornell Medical College, New York, NY, USA

    Alice Maria Giani

  17. eGnome, Inc., Seoul, Republic of Korea

    Heebal Kim

  18. Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea

    Heebal Kim & Chul Lee

  19. Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA

    Chul Lee

  20. Catalan Institution of Research and Advanced Studies (ICREA), Barcelona, Spain

    Tomas Marques-Bonet

  21. CNAG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain

    Tomas Marques-Bonet

  22. Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain

    Tomas Marques-Bonet

  23. Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus

    Simona Secomandi

  24. University of Florence, Department of Biology, Florence, Italy

    Marcella Sozzoni

  25. Tree of Life, Wellcome Sanger Institute, Cambridge, UK

    Marcela Uliano-Silva

  26. Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA

    Robert W. Williams

  27. Department of Ecology & Evolution and Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland

    Robert M. Waterhouse

Contributions

D.L. built the assembly pipeline with support from G.F., L.A., C.G.-A., B.G., A.O., H.C., M.U.-S., B.D.P., A.R., M.v.d.B. and the VGP assembly working group. L.A., A.D., G.R.G., A.M.G., G.M.G., N.J., C.J., B.O’T., S.S., M.S. and T.T. generated one or several assemblies used in the analyses. B.J.K., K.R. and M.J.P.C. validated the zebra finch assemblies. J.C. performed the manual curation on the zebra finch assembly. L.A. assembled and evaluated the mitochondrial genomes. N.B. established the decontamination pipeline and performed the contamination analyses. N.B. and M.P.-F. compared the scaffolding strategies. A.N. performed the analyses on XBP1. C.G.-A. and B.D.P. developed the training material with support from the user community. K.H. and M.C. sourced and arranged for sample procurement for species in this study. J.R.B., N.J., T.T., B.O’T., O.F., C.L., H.K., T.M.-B. and R.M.W. generated the PacBio and Hi-C data. G.F., M.C.S., A.N., A.M.P. and E.D.J. conceived the study and drafted the manuscript. All authors, including A.A. and R.W.W., contributed to writing and editing the manuscript and approved it.

Corresponding authors

Correspondence to Erich D. Jarvis, Michael C. Schatz, Anton Nekrutenko or Giulio Formenti.

Ethics declarations

Competing interests

The authors declare no competing interests.

About this article

Cite this article

Larivière, D., Abueg, L., Brajuka, N. et al. Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-023-02100-3

  • Published:

  • DOI: https://doi.org/10.1038/s41587-023-02100-3
