The Earth BioGenome Project aims to produce reference genomes for all ~1.8 million known eukaryotic species over the next decade [1,2,3,4]. Achieving this goal will require the current pace of reference genome production to increase by at least two orders of magnitude [1]. This speed-up will require automating the assembly process with a pipeline that is accessible to any research group. Reaching the goal also requires sustained effort in three major areas: genome assembly optimization and best-practice development, computational infrastructure provisioning, and dissemination and training.
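The scale of that target can be made concrete with a back-of-envelope calculation. The sketch below uses only the figures quoted above (about 1.8 million species, roughly ten years, a speed-up of at least 100-fold). It assumes that production is spread evenly across the decade, which is a simplification, and the ceiling on the current pace is an inference from those figures rather than a number reported in the article. The function and variable names are illustrative.

```python
# Back-of-envelope throughput implied by the stated goal.
# Inputs come from the text; the even spread over ten years is an assumption.
KNOWN_SPECIES = 1_800_000   # ~1.8 million known eukaryotic species
YEARS = 10                  # "over the next decade"
MIN_SPEEDUP = 100           # "at least two orders of magnitude"


def required_rate_per_year(species: int, years: int) -> float:
    """Average number of reference genomes needed per year."""
    return species / years


def implied_current_rate_ceiling(required: float, speedup: float) -> float:
    """If at least `speedup`-fold acceleration is needed to reach `required`,
    the current pace can be at most required / speedup."""
    return required / speedup


required = required_rate_per_year(KNOWN_SPECIES, YEARS)
ceiling = implied_current_rate_ceiling(required, MIN_SPEEDUP)

print(f"Required: {required:,.0f} genomes/year (~{required / 365:,.0f} per day)")
print(f"Implied current pace: at most ~{ceiling:,.0f} genomes/year")
```

Under these assumptions the target works out to about 180,000 genomes per year, or several hundred per day, which helps explain why manual, group-by-group assembly cannot reach it.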
Data availability
The workflows, their descriptions and instructions on how to use them can be found at https://galaxyproject.org/projects/vgp/workflows/. The requisite tools are installed on usegalaxy.org and usegalaxy.eu, and are in the process of being installed on usegalaxy.org.au. These genomes were supported by collaborators of the VGP and ERGA. The QC analyses reported here, which test the VGP Galaxy pipeline, do not release genomes that are under specific embargo policies for genome-wide analyses (e.g., https://genome10k.ucsc.edu/data-use-policies/). New genome assemblies are available in the GenomeArk repository: https://www.genomeark.org/. After manual curation, the assemblies are submitted to the US National Center for Biotechnology Information (NCBI) under the BioProject Vertebrate Genome Project: https://www.ncbi.nlm.nih.gov/bioproject/489243 [17].
References
1. Hotaling, S., Kelley, J. L. & Frandsen, P. B. Proc. Natl Acad. Sci. USA 118, e2109019118 (2021).
2. Formenti, G. et al. Trends Ecol. Evol. 37, 197–202 (2022).
3. Theissinger, K. et al. Trends Genet. 39, 545–559 (2023).
4. Lewin, H. A. et al. Proc. Natl Acad. Sci. USA 119, e2115635118 (2022).
5. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Genome Biol. 21, 245 (2020).
6. Galaxy Community. Nucleic Acids Res. 50, W345–W351 (2022).
7. Lander, E. S. & Waterman, M. S. Genomics 2, 231–239 (1988).
8. Bray, S. & Maier, W. Automating Galaxy workflows using the command line. Galaxy Training Network (2023).
9. Galaxy Community. Galaxy Server administration. Galaxy Training Network https://github.com/galaxyproject/training-material (2019).
10. Formenti, G. et al. Genome Biol. 22, 120 (2021).
11. Uliano-Silva, M. et al. BMC Bioinform. 24, 288 (2023).
12. Wenger, A. M. et al. Nat. Biotechnol. 37, 1155–1162 (2019).
13. Batut, B. et al. Cell Syst. 6, 752–758.e1 (2018).
14. Lariviere, D., Ostrovsky, A., Gallardo, C., Pickett, B. & Abueg, L. VGP assembly pipeline – short version. Galaxy Training Network (2023); https://gxy.io/GTN:T00040
15. Rautiainen, M. et al. Nat. Biotechnol. 41, 1474–1482 (2023).
16. Cheng, H., Asri, M., Lucas, J., Koren, S. & Li, H. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.03399 (2023).
17. BioProject Vertebrate Genome Project. NCBI BioProject PRJNA489243 (accessed 18 January 2024); https://www.ncbi.nlm.nih.gov/bioproject/489243
Acknowledgements
We thank Yagoub Adam, Tyler Alioto, Jun Aruga, Diego De Panis, Sagane Dind, Diego Fuentes, Shilpa Garg and Jèssica Gómez for contributing to the initial implementation during ELIXIR Biohackathon 2021. We also thank Nate Jue for help testing and developing the pipeline tutorials and Andrea Guarracino for their useful comments on the manuscript. This work was supported in part by the Intramural Research Program of the US National Human Genome Research Institute (NHGRI), the US National Institutes of Health (NIH) and the Howard Hughes Medical Institute (HHMI). The authors are grateful to the broader Galaxy community for their support and software development efforts. This work is funded by NIH grants U41 HG006620, U24 HG010263, U24 CA231877 and U01 CA253481, along with US National Science Foundation grants 1661497, 1758800 and 2216612. The work was also supported in part by The Human Frontier Science Program (HFSP) RGP0025/2021, the Swiss National Science Foundation (SNSF) grants 202669 and 198691, the Swiss State Secretariat for Education, Research and Innovation (SERI) grant 22.00173 and Horizon Europe under the Biodiversity, Circular Economy and Environment program (REA.B.3, BGE 101059492). Usegalaxy.eu is supported by German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to B.G. Computational resources are provided by the Advanced Cyberinfrastructure Coordination Ecosystem (ACCESS-CI), Texas Advanced Computing Center, and the JetStream2 scientific cloud.
Ethics declarations
Competing interests
The authors declare no competing interests.
Cite this article
Larivière, D., Abueg, L., Brajuka, N. et al. Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy.
Nat Biotechnol (2024). https://doi.org/10.1038/s41587-023-02100-3