Dataset details
United States
GIAB Benchmarking of HG002 Assemblies from HPRC Year 1 Bakeoff
The Human Pangenome Reference Consortium (HPRC) tested which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Assemblies were generated for GIAB HG002. NIST evaluated twenty-nine of these assemblies by aligning them to GRCh38 and calling variants with dipcall v0.3 (https://github.com/lh3/dipcall). The resulting small variant calls were then benchmarked against GIAB benchmark v4.2.1 using hap.py v3.12 (https://github.com/Illumina/hap.py).
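Benchmarking small variant calls against a truth set is usually summarized as precision, recall, and F1. The sketch below shows that arithmetic only; it is not hap.py's implementation, the function name `benchmark_metrics` is hypothetical, and the counts are made-up values rather than results from this dataset. It also simplifies by using a single true-positive count for both metrics.

```python
def benchmark_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive,
    and false-negative variant counts. Hypothetical helper for illustration."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, not taken from the HG002 benchmark.
precision, recall, f1 = benchmark_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```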
Related data
Data from: Phased Genotyping-by-Sequencing Enhances Analysis of Genetic Diversity and Reveals Divergent Copy Number Variants in Maize
Public Data Portal
High-throughput sequencing (HTS) of reduced representation genomic libraries has ushered in an era of genotyping-by-sequencing (GBS), where genome-wide genotype data can be obtained for nearly any species. However, there remains a need for imputation-free GBS methods for genotyping large samples taken from heterogeneous populations of heterozygous individuals. This requires that a number of issues encountered with GBS be considered, including the sequencing of nonoverlapping sets of loci across multiple GBS libraries, a common missing data problem that results in low call rates for markers per individual, and a tendency for applicability only in inbred line samples with sufficient linkage disequilibrium for accurate imputation. We addressed these issues while developing and validating a new, comprehensive platform for GBS. This study supports the notion that GBS can be tailored to particular aims, and using Zea mays our results indicate that large samples of unknown pedigree can be genotyped to obtain complete and accurate GBS data. Optimizing size selection to sequence a high proportion of shared loci among individuals in different libraries and using simple in silico filters, a GBS procedure was established that produces high call rates per marker (>85%) with accuracy exceeding 99.4%. Furthermore, by capitalizing on the sequence-read structure of GBS data (stacks of reads), a new tool for resolving local haplotypes and scoring phased genotypes was developed, a feature that is not available in many GBS pipelines. Using local haplotypes reduces the marker dimensionality of the genotype matrix while increasing the informativeness of the data. Phased GBS in maize also revealed the existence of reproducibly inaccurate (apparent accuracy) genotypes that were due to divergent copy number variants (CNVs) unobservable in the underlying single nucleotide polymorphism (SNP) data.
GenBank
Public Data Portal
NIH Genetic sequence database; an annotated collection of all publicly available DNA sequences.
Hyalella azteca Official Gene Set v1.0
Public Data Portal
The Hyalella azteca genome was recently sequenced and annotated as part of the i5k pilot project by the Baylor College of Medicine. The Hyalella azteca research community has manually reviewed and curated the computational gene predictions and generated an official gene set, OGSv1.0. The OGS is an integration of automatic gene predictions from Maker with manual annotations by the research community (via the Apollo manual annotation software). If you wish to use this dataset, please follow the Baylor College of Medicine's conditions for data use: https://www.hgsc.bcm.edu/bcm-hgsc-conditions-use
Jana Sperschneider - Melampsora lini genome assembly and RNA-seq data
Public Data Portal
The Melampsora lini genome was sequenced to improve its genome assembly. PacBio HiFi and Hi-C data were generated, as well as RNA-seq data.
NIST test dataset for assessing baseline nucleic acid sequence screening
Public Data Portal
This repository contains the dataset used in the manuscript "Inter-tool analysis of a NIST dataset for assessing baseline nucleic acid sequence screening". NIST constructed the test dataset based on the current screening recommendations from HHS. The dataset is a FASTA formatted file with blinded numerical sequence headers. The dataset was sent to sequence screening tool developers for initial testing and to obtain feedback about its utility for assessing baseline sequence screening. An additional metadata file provides the NIST-assigned label for each sequence, along with a more detailed description derived from the source database.
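A FASTA file with blinded numeric headers only becomes interpretable once it is joined to the accompanying metadata. The sketch below shows one way to do that join, under stated assumptions: the metadata is taken to be a CSV with columns named `id` and `label`, which the description above does not specify, and the helper names and the tiny inline sample are hypothetical.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {header_id: sequence}, joining wrapped lines."""
    records = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def attach_labels(records, metadata_handle):
    """Join sequences to labels. Assumes CSV columns 'id' and 'label'
    (an assumption about the metadata layout, not a documented format)."""
    labels = {row["id"]: row["label"] for row in csv.DictReader(metadata_handle)}
    return {seq_id: (seq, labels.get(seq_id)) for seq_id, seq in records.items()}


# Made-up sample data standing in for the real files.
fasta_text = ">1\nACGT\nAC\n>2\nGGCC\n"
metadata_text = "id,label\n1,example_label_a\n2,example_label_b\n"
joined = attach_labels(read_fasta(io.StringIO(fasta_text)), io.StringIO(metadata_text))
for seq_id, (seq, label) in joined.items():
    print(seq_id, label, len(seq))
```

Sequences whose identifier is missing from the metadata get a label of `None`, which makes gaps in the join easy to spot.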
Data from: A High-Quality Genome Assembly from a Single, Field-collected Spotted Lanternfly (Lycorma delicatula) using the PacBio Sequel II System
Public Data Portal
A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female Spotted Lanternfly (Lycorma delicatula) using a single PacBio SMRT Cell. The Spotted Lanternfly is an invasive species recently discovered in the northeastern United States, threatening to damage economically important crop plants in the region. The DNA from one individual female specimen collected in Reading, Berks County, Pennsylvania was used to make one standard, size-selected library with an average DNA fragment size of ~20 kb. The library was run on one Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing approximately 38x coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Further, it was possible to segregate more than half of the diploid genome into the two separate haplotypes. The assembly also recovered two microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig.
We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species. Supporting files for the manuscript "A High-Quality Genome Assembly from a Single, Field-collected Spotted Lanternfly (Lycorma delicatula) using the PacBio Sequel II System" include several intermediate versions of the assembly (raw output from Falcon, raw output from Falcon unzip, etc.) as well as the final assembly primary contigs and haplotigs (for the regions of the genome that were phased).
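The contig N50 quoted above is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and are not taken from the lanternfly assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```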
HomoloGene
Public Data Portal
System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes.
Trace Archive
Public Data Portal
A repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.
Analysis of a human brain transcriptome map
Public Data Portal
Background: Genome-wide transcriptome maps can provide tools to identify candidate genes that are over-expressed or silenced in certain disease tissue and increase our understanding of the structure and organization of the genome. Expressed Sequence Tags (ESTs) from the public dbEST and proprietary Incyte LifeSeq databases were used to derive a transcript map in conjunction with the working draft assembly of the human genome sequence. Results: Examination of ESTs derived from brain tissues (excluding brain tumor tissues) suggests that these genes are distributed on chromosomes in a non-random fashion. Some regions on the genome are dense with brain-enriched genes while some regions lack brain-enriched genes, suggesting a significant correlation between distribution of genes along the chromosome and tissue type. ESTs from brain tumor tissues have also been mapped to the human genome working draft. We reveal that some regions enriched in brain genes show a significant decrease in gene expression in brain tumors, and, conversely, that some regions lacking in brain genes show an increased level of gene expression in brain tumors. Conclusions: This report demonstrates a novel approach for tissue-specific transcriptome mapping using EST-based quantitative assessment.
Genetic Testing Registry (GTR)
Public Data Portal
Genetic Testing Registry (GTR) is a free, centralized voluntary registry of comprehensive genetic test information covering clinical and research tests for Mendelian disorders and drug responses including multigenic, array-based, biochemical, cytogenetic, and molecular tests.