Dataset details
United States
Vector algebra in the analysis of genome-wide expression data
Background: Data from thousands of transcription-profiling experiments in organisms ranging from yeast to humans are now publicly available. How best to analyze these data remains an important challenge. A variety of tools have been used for this purpose, including hierarchical clustering, self-organizing maps and principal components analysis. In particular, concepts from vector algebra have proven useful in the study of genome-wide expression data. Results: Here we present a framework based on vector algebra for the analysis of transcription profiles that is geometrically intuitive and computationally efficient. Concepts in vector algebra such as angles, magnitudes, subspaces, singular value decomposition, bases and projections have natural and powerful interpretations in the analysis of microarray data. Angles in particular offer a rigorous method of defining 'similarity' and are useful in evaluating the claims of a microarray-based study. We present a sample analysis of cells treated with rapamycin, an immunosuppressant whose effects have been extensively studied with microarrays. In addition, the algebraic concept of a basis for a space affords the opportunity to simplify data analysis and uncover a limited number of expression vectors to span the transcriptional range of cell behavior. Conclusions: This framework represents a compact, powerful and scalable construction for analysis and computation. As the amount of microarray data in the public domain grows, these vector-based methods are relevant in determining statistical significance. These approaches are also well suited to extract biologically meaningful information in the analysis of signaling networks.
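As a rough illustration of the angle and projection ideas mentioned in the abstract, the following minimal sketch treats each expression profile as a vector of log-ratios. The profile values and the function names (`angle_degrees`, `project`) are hypothetical and chosen only for this example; this is not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length expression vectors."""
    return sum(a * b for a, b in zip(u, v))


def magnitude(u):
    """Euclidean length of an expression vector."""
    return math.sqrt(dot(u, u))


def angle_degrees(u, v):
    """Angle between two expression vectors, in degrees.

    A small angle means the two profiles change in similar directions,
    90 degrees means they are unrelated (orthogonal), and an angle near
    180 degrees means they are opposed.
    """
    cos_theta = dot(u, v) / (magnitude(u) * magnitude(v))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def project(u, v):
    """Component of u that lies along v (projection of u onto v)."""
    scale = dot(u, v) / dot(v, v)
    return [scale * b for b in v]


# Hypothetical log-ratio profiles over four genes (illustrative values only).
treatment_a = [1.0, 2.0, 0.0, -1.0]
treatment_b = [2.0, 4.0, 0.0, -2.0]   # same direction as treatment_a, larger magnitude
treatment_c = [-2.0, 1.0, 3.0, 0.0]   # orthogonal to treatment_a

print(angle_degrees(treatment_a, treatment_b))
print(angle_degrees(treatment_a, treatment_c))
print(project(treatment_b, treatment_a))
```

In this sketch the angle ignores overall magnitude, so two treatments that shift the same genes in the same proportions count as similar even if one response is stronger.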
Data information
Related data
Improved analytical methods for microarray-based genome-composition analysis
Public Data Portal
Genome-composition analysis using microarrays can be used to categorize genes into 'present' and 'divergent' categories. This involves selecting a signal value that is used as a cutoff to discriminate present and divergent genes, but this can result in the misclassification of many genes. A method is described that depends on the shape of the signal-ratio distribution and does not require empirical determination of a cutoff. Many genes previously classified as present using static methods are in fact divergent on the basis of microarray signal; this is corrected by our algorithm.
Expression profiling of
Public Data Portal
A combination of linear RNA amplification and DNA microarray hybridization has allowed the determination of expression profiles of individual imaginal discs and larval tissues and the identification of genes expressed in tissue-specific patterns.
DNA loops and semicatenated DNA junctions
Public Data Portal
Background: Alternative DNA conformations are of particular interest as potential signals to mark important sites on the genome. The structural variability of CA microsatellites is particularly pronounced; these are repetitive poly(CA) · poly(TG) DNA sequences spread throughout all eukaryotic genomes as tracts up to 60 base pairs long. Many in vitro studies have shown that the structure of poly(CA) · poly(TG) can vary markedly from the classical right-handed DNA double helix and adopt diverse alternative conformations. Here we have studied the mechanism of formation and the structure of an alternative DNA structure, named Form X, which was observed previously by polyacrylamide gel electrophoresis of DNA fragments containing a tract of the CA microsatellite poly(CA) · poly(TG) but had not yet been characterized. Results: Formation of Form X was found to occur upon reassociation of the strands of a DNA fragment containing a tract of poly(CA) · poly(TG), in a process strongly stimulated by the nuclear proteins HMG1 and HMG2. By inserting Form X into DNA minicircles, we show that the DNA strands do not run fully side by side but instead form a DNA knot. When present in a closed DNA molecule, Form X becomes resistant to heating to 100°C and to alkaline pH. Conclusions: Our data strongly support a model of Form X consisting of a DNA loop at the base of which the two DNA duplexes cross, with one of the strands of one duplex passing between the strands of the other duplex, and reciprocally, to form a semicatenated DNA junction, also called a DNA hemicatenane.
Full-length messenger RNA sequences greatly improve genome annotation
Public Data Portal
Background: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism. Results: Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation. Conclusions: Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.
Within the fold: assessing differential expression measures and reproducibility in microarray assays
Public Data Portal
'Fold-change' cutoffs have been widely used in microarray assays to identify genes that are differentially expressed. More accurate measures are required to identify high-confidence sets of genes with biologically meaningful changes in transcription. A general procedure for analyzing cDNA microarray data is proposed and validated. It is shown that pooled reference samples should be based not only on the expression of individual genes in each cell line but also on the expression levels of genes within cell lines.
A simple method for statistical analysis of intensity differences in microarray-derived gene expression data
Public Data Portal
Background: Microarray experiments offer a potent solution to the problem of making and comparing large numbers of gene expression measurements either in different cell types or in the same cell type under different conditions. Inferences about the biological relevance of observed changes in expression depend on the statistical significance of the changes. In lieu of many replicates with which to determine accurate intensity means and variances, reliable estimates of statistical significance remain problematic. Without such estimates, overly conservative choices for significance must be enforced. Results: A simple statistical method is presented for estimating variances from microarray control data without requiring multiple replicates. Comparison of datasets from two commercial entities using this difference-averaging method demonstrates that the standard deviation of the signal scales at a level intermediate between the signal intensity and its square root. Application of the method to a dataset related to the β-catenin pathway yields a larger number of biologically reasonable genes whose expression is altered than the ratio method. Conclusions: The difference-averaging method enables determination of variances as a function of signal intensities by averaging over the entire dataset. The method also provides a platform-independent view of important statistical properties of microarray data.
Mining microarray expression data by literature profiling
Public Data Portal
The lack of efficient techniques for assessing the biological implications of microarray gene-expression data remains an important obstacle in exploiting this information. To address this need, a mining technique has been developed based on the analysis of literature profiles generated by extracting the frequencies of certain terms from thousands of abstracts stored in the Medline literature database.
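Korea Forest Service, National Institute of Forest Science: mushroom genome transcriptome
Public Data Portal
Mushroom gene transcriptome analysis and sequence alignment information (provides information for establishing genetic markers of major molecules through genome sequencing and functional annotation of each sequence)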
FOUNTAIN: A JAVA open-source package to assist large sequencing projects
Public Data Portal
Background: Better automation, lower cost per reaction and a heightened interest in comparative genomics have led to a dramatic increase in DNA sequencing activities. Although the large sequencing projects of specialized centers are supported by in-house bioinformatics groups, many smaller laboratories face difficulties managing the appropriate processing and storage of their sequencing output. The challenges include documentation of clones, templates and sequencing reactions, and the storage, annotation and analysis of the large number of generated sequences. Results: We describe here a new program, named FOUNTAIN, for the management of large sequencing projects. FOUNTAIN uses the JAVA computer language and data storage in a relational database. Starting with a collection of sequencing objects (clones), the program generates and stores information related to the different stages of the sequencing project using a web browser interface for user input. The generated sequences are subsequently imported and annotated based on BLAST searches against the public databases. In addition, simple algorithms to cluster sequences and determine putative polymorphic positions are implemented. Conclusions: A simple but flexible and scalable software package is presented to facilitate data generation and storage for large sequencing projects. FOUNTAIN is open source and largely platform and database independent, and we wish it to be improved and extended in a community effort.