Abstract
Introduction/Objectives: Short-read Illumina sequencing generates DNA fragments that must be assembled into a completed genome or into shorter contiguous regions (contigs) before submission to the National Center for Biotechnology Information (NCBI) GenBank. Options for assembly include purchased services, Python-based programs available from GitHub, and open-source web-based platforms with graphical user interfaces. The goal of the present study was to evaluate assembly and annotation of a single strain of Serratia marcescens using programs available on four web-based platforms: BV-BRC (BV-BRC.org), Galaxy (https://genome.usegalaxy.org.au/), KBase (kbase.us) and Proksee (proksee.ca/users/). Each platform offers a free account with online data storage and export of results, but each also has programs unique to it. Common to all are programs to evaluate read quality (such as FastQC), trim reads (such as Trimmomatic), assemble reads (Velvet, SPAdes and Unicycler), and analyze the assembly (Quast). The four platforms differ in the annotation programs available, which include Prokka, Bakta, RAST and DRAM.
Methods: SeqCenter performed Illumina sequencing of a clinically isolated strain of S. marcescens, 131 Watkins. Read files were imported into accounts at each of the platforms. Two platforms, Proksee and BV-BRC, use an automated pipeline of programs to assemble reads into genomes. In Galaxy and KBase, each program was chosen and run individually. Once assembly and annotation were complete, metrics such as assembly length, number of contigs, N50 and number of coding sequences were compared across the assemblies and annotations.
Results: Nine assemblies (5 on Galaxy, 1 on BV-BRC, 1 on Proksee, and 2 on KBase) were completed. The SPAdes run on KBase could not be completed. Each of these 9 assemblies was then annotated using the annotation programs available on its platform: Prokka (Galaxy, KBase, Proksee), RAST (BV-BRC), DRAM (KBase) and Bakta (Proksee). Assembly length varied from 4.9 to 5.3 million base pairs (Mbp), with 7 assemblies at 5.2 Mbp. The longest contigs, 3.1 Mbp long, were obtained with Unicycler and Velvet on Galaxy, but the assembly with the fewest contigs was created with Unicycler on BV-BRC. The largest N50, a measure of assembly contiguity, was 3.1 Mbp with Galaxy-based Unicycler. Results of the annotation programs also varied across platforms. Six of the 10 assembly annotations found ~4700 coding sequences, while one (Prokka of the KBase Unicycler assembly) located 7100 coding sequences.
Conclusions: These comparisons supported the conclusion that one version of the Galaxy Unicycler program would be the most appropriate assembly program to use for submission to GenBank, while several of the annotation programs (DRAM on KBase, Bakta on Proksee and Prokka on Galaxy or Proksee) will yield comparable results.
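The contiguity metrics compared in the Results (number of contigs, total assembly length, longest contig and N50) can be illustrated with a minimal Python sketch. This is not the code used in the study, where these values were reported by Quast; the function names `n50` and `assembly_stats` and the example contig lengths are hypothetical and chosen only for illustration. N50 is taken here in its usual sense: the length of the contig at which the running total, summed from the longest contig downward, first reaches half of the total assembly length.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in base pairs)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines N50.
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


def assembly_stats(contig_lengths):
    """Summarize an assembly with the metrics compared in the abstract."""
    lengths = list(contig_lengths)
    return {
        "num_contigs": len(lengths),
        "total_length": sum(lengths),
        "longest_contig": max(lengths),
        "n50": n50(lengths),
    }


# Hypothetical contig lengths, not data from the study.
example_lengths = [50, 40, 30, 20, 10]
stats = assembly_stats(example_lengths)
print(stats)
```

For the hypothetical lengths above, the total is 150, half is 75, and the running total reaches 90 at the second contig, so N50 is 40. An assembly dominated by one very long contig, as described for Galaxy-based Unicycler, would have an N50 equal to that contig's length.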
Original language | American English |
---|---|
Pages | 52 |
State | Published - 16 Feb 2024 |
Event | Oklahoma State University Center for Health Sciences Research Week 2024, Oklahoma State University Center for Health Sciences, Tulsa, United States. Duration: 13 Feb 2024 → 17 Feb 2024. https://medicine.okstate.edu/research/research_days.html |
Conference
Conference | Oklahoma State University Center for Health Sciences Research Week 2024 |
---|---|
Country/Territory | United States |
City | Tulsa |
Period | 13/02/24 → 17/02/24 |
Keywords
- genome assembly and annotation
- Galaxy
- Proksee
- KBase
- BV-BRC