Dataset details
United States
New feature subset selection procedures for classification of expression profiles
Background: Methods for extracting useful information from the datasets produced by microarray experiments are at present of much interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods are based on evaluating genes in pairs and evaluating how well a pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared the performance relative to two standard methods. To assess the ability to generalize class differences, we studied how well the gene sets we select are suited for learning a classifier. Results: We show that the gene sets selected by our methods outperform the standard methods, in some cases by a large margin, in terms of cross-validation prediction accuracy of the learned classifier. We show that on two public datasets, accurate diagnoses can be made using only 15-30 genes. Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes. Conclusion: When looking for differential expression between experiment classes, it may not be sufficient to look at each gene in a separate universe. Evaluating combinations of genes reveals interesting information that will not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information.
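The idea of scoring genes in pairs can be illustrated with a minimal sketch. The abstract does not state the scoring function, so the one used here is an assumption: each pair is scored by how many samples lie closest to their own class centroid in the two-gene plane (a resubstitution score, not the cross-validated accuracy reported in the paper). The names `pair_score`, `rank_pairs`, and the toy data are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pair_score(expr, labels, g1, g2):
    """Fraction of samples nearest to their own class centroid in the (g1, g2) plane.

    Assumed scoring rule for illustration only; ties go to the first class
    in sorted order.
    """
    pts = [(expr[g1][s], expr[g2][s]) for s in range(len(labels))]
    classes = sorted(set(labels))
    cents = {c: centroid([p for p, l in zip(pts, labels) if l == c]) for c in classes}
    correct = sum(
        1
        for p, l in zip(pts, labels)
        if min(classes, key=lambda c: sq_dist(p, cents[c])) == l
    )
    return correct / len(labels)


def rank_pairs(expr, labels):
    """Score every gene pair and return (score, gene_a, gene_b), best first."""
    genes = sorted(expr)
    scored = [(pair_score(expr, labels, a, b), a, b) for a, b in combinations(genes, 2)]
    return sorted(scored, key=lambda t: -t[0])


# Toy data: g1 and g2 overlap between classes when viewed alone,
# but the classes separate along the diagonal when viewed together.
expr = {
    "g1": [1, 2, 3, 4, 2, 3, 4, 5],
    "g2": [2, 3, 4, 5, 1, 2, 3, 4],
    "g3": [1, 0, 1, 0, 1, 0, 1, 0],  # uninformative gene
}
labels = ["A"] * 4 + ["B"] * 4
ranking = rank_pairs(expr, labels)
```

In this toy example the pair (`g1`, `g2`) ranks first even though neither gene separates the classes on its own, which is the kind of information a gene-by-gene evaluation would miss.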
Data information
Related data
Improved analytical methods for microarray-based genome-composition analysis
Public Data Portal
Genome-composition analysis using microarrays can be used to categorize genes into 'present' and 'divergent' categories. This involves selecting a signal value that is used as a cutoff to discriminate present and divergent genes, but this can result in the misclassification of many genes. A method is described that depends on the shape of the signal-ratio distribution and does not require empirical determination of a cutoff. Many genes previously classified as present using static methods are in fact divergent on the basis of microarray signal; this is corrected by our algorithm.
A simple method for statistical analysis of intensity differences in microarray-derived gene expression data
Public Data Portal
Background: Microarray experiments offer a potent solution to the problem of making and comparing large numbers of gene expression measurements either in different cell types or in the same cell type under different conditions. Inferences about the biological relevance of observed changes in expression depend on the statistical significance of the changes. In lieu of many replicates with which to determine accurate intensity means and variances, reliable estimates of statistical significance remain problematic. Without such estimates, overly conservative choices for significance must be enforced. Results: A simple statistical method for estimating variances from microarray control data which does not require multiple replicates is presented. Comparison of datasets from two commercial entities using this difference-averaging method demonstrates that the standard deviation of the signal scales at a level intermediate between the signal intensity and its square root. Application of the method to a dataset related to the β-catenin pathway yields a larger number of biologically reasonable genes whose expression is altered than the ratio method. Conclusions: The difference-averaging method enables determination of variances as a function of signal intensities by averaging over the entire dataset. The method also provides a platform-independent view of important statistical properties of microarray data.
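A rough sketch of how a variance could be estimated as a function of signal intensity from a single pair of control measurements is given below. The details are assumptions, not the published procedure: genes are sorted by mean intensity, grouped into fixed-size bins, and the standard deviation in each bin is taken as sqrt(mean(d^2) / 2), where d is the difference between the two control measurements (the factor 2 follows from the variance of a difference of two independent measurements with equal variance). The function name `sd_by_intensity` and the `bin_size` parameter are hypothetical.

```python
import math


def sd_by_intensity(a, b, bin_size):
    """Estimate signal SD per intensity bin from two control measurements.

    a, b: intensities of the same genes measured twice under control conditions.
    Returns a list of (mean intensity, estimated SD), one entry per full bin.
    """
    # Order genes by their mean intensity so each bin covers a narrow range.
    pairs = sorted(zip(a, b), key=lambda p: (p[0] + p[1]) / 2)
    out = []
    for i in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[i:i + bin_size]
        mean_intensity = sum((x + y) / 2 for x, y in chunk) / bin_size
        mean_sq_diff = sum((x - y) ** 2 for x, y in chunk) / bin_size
        # Var(x - y) = 2 * sigma^2 for independent x, y with equal variance.
        out.append((mean_intensity, math.sqrt(mean_sq_diff / 2)))
    return out


# Toy control data: larger differences at higher intensity.
control_a = [10, 12, 100, 104]
control_b = [12, 10, 104, 100]
curve = sd_by_intensity(control_a, control_b, bin_size=2)
```

Averaging within intensity bins across the whole dataset is what stands in for replicates here; the bin size trades resolution along the intensity axis against noise in each estimate.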
Permutation-validated principal components analysis of microarray data
Public Data Portal
Background: In microarray data analysis, the comparison of gene-expression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of gene-selection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutation-validated principal components analysis (PCA). The approach is designed for microarray data with a group structure. Results: We used PCA to detect the major sources of variance underlying the hybridization conditions followed by gene selection based on PCA-derived and permutation-based test statistics. We validated our method by applying it to well characterized yeast cell-cycle data and to two datasets from our laboratory. We could describe the major sources of variance, select informative genes and visualize the relationship of genes and arrays. We observed differences in the level of the explained variance and the interpretability of the selected genes. Conclusions: Combining data visualization and permutation-based gene selection, permutation-validated PCA enables one to illustrate gene-expression variance between several conditions and to select genes by taking into account the relationship of between-group to within-group variance of genes. The method can be used to extract the leading sources of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as gene-specific scatter. Visualization of the data can support a straightforward biological interpretation.
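The permutation part of such a procedure can be sketched for a single gene as follows. This is only an illustration under stated assumptions: the PCA step is omitted entirely, the test statistic is assumed to be the ratio of between-group to within-group sum of squares, and all label permutations are enumerated exactly (feasible only for small sample counts). The names `variance_ratio` and `permutation_p_value` are hypothetical.

```python
from itertools import permutations


def variance_ratio(values, labels):
    """Between-group sum of squares divided by within-group sum of squares."""
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(l, []).append(v)
    grand_mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return between / within if within > 0 else float("inf")


def permutation_p_value(values, labels):
    """Share of label permutations whose statistic reaches the observed one."""
    observed = variance_ratio(values, labels)
    hits = total = 0
    for shuffled in permutations(labels):
        total += 1
        # Small tolerance guards against floating-point round-off.
        if variance_ratio(values, shuffled) >= observed - 1e-12:
            hits += 1
    return hits / total


group_labels = list("AAABBB")
signal_gene = [1, 2, 3, 10, 11, 12]   # groups clearly separated
noise_gene = [1, 10, 2, 11, 3, 12]    # groups interleaved
p_signal = permutation_p_value(signal_gene, group_labels)
p_noise = permutation_p_value(noise_gene, group_labels)
```

With three samples per group the smallest attainable p-value is limited by the number of distinct groupings, which is one reason replicate count matters for permutation-based selection.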
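Infoboss - Native species enzyme gene lineage probability data
Public Data Portal
● Data keywords
- Genome, gene, NGS, DNA
● Data product information
- This product provides gene-family lineage probability information for genes obtained through genome analysis of native species.
- Functional domains are classified and processed into gene families by function: utility, enzyme, protein, and disease resistance.
- Expression probabilities and lineage probabilities are calculated for each gene family through comparative analysis of the data.
● Column information
- FASTA format
● Usage examples
- With this data product, users can obtain the following information.
1) Basic reference material for fields related to the development of new drugs, functional foods, and cosmetics
● Data and period
- July 2019 to December 2019
The [original data](https://www.bigdata-forest.kr/product/GNM200801) can be downloaded after logging in and purchasing it.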
NIST test dataset for assessing baseline nucleic acid sequence screening
Public Data Portal
This repository contains the dataset used in the manuscript "Inter-tool analysis of a NIST dataset for assessing baseline nucleic acid sequence screening". NIST constructed the test dataset based on the current screening recommendations from HHS. The dataset is a FASTA formatted file with blinded numerical sequence headers. The dataset was sent to sequence screening tool developers for initial testing and to obtain feedback about its utility for assessing baseline sequence screening. An additional metadata file provides the NIST-assigned label for each sequence, along with a more detailed description derived from the source database.
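As a minimal sketch of how the two files might be combined, the following reads FASTA records keyed by their numerical headers and attaches a label from a metadata table. The metadata layout is an assumption: the description does not give its format, so a CSV with hypothetical columns `id` and `label` is used, and the sequences and labels below are made-up placeholders rather than data from the NIST set.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            # Use the first whitespace-delimited token as the record key.
            header = line[1:].split()[0]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def attach_labels(records, metadata_handle):
    """Join sequences to labels; assumes CSV columns 'id' and 'label' (hypothetical)."""
    labels = {row["id"]: row["label"] for row in csv.DictReader(metadata_handle)}
    return {h: (seq, labels.get(h)) for h, seq in records.items()}


# Placeholder inputs standing in for the FASTA file and the metadata file.
fasta_text = ">1\nACGT\nACGT\n>2\nTTGA\n"
meta_text = "id,label\n1,example_label_a\n2,example_label_b\n"
joined = attach_labels(read_fasta(io.StringIO(fasta_text)), io.StringIO(meta_text))
```

Keeping the labels in a separate lookup mirrors the blinded design: a screening tool sees only the numerical header, and the label is joined afterwards for scoring.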
Cluster-Rasch models for microarray gene expression data
Public Data Portal
Background: We propose two different formulations of the Rasch statistical models to the problem of relating gene expression profiles to the phenotypes. One formulation allows us to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes; this model can also be used for future prediction. The other formulation provides an alternative way of identifying genes that are over- or underexpressed from their expression levels in tissue or cell samples of a given tissue or cell type. Results: We illustrate the methods on available datasets of a classification of acute leukemias and of 60 cancer cell lines. For tumor classification, the results are comparable to those previously obtained. For the cancer cell lines dataset, we found four clusters of genes that are related to drug response for many of the 90 drugs that we considered. In addition, for each type of cell line, we identified genes that are over- or underexpressed relative to other genes. Conclusions: The cluster-Rasch model provides a probabilistic model for describing gene expression patterns across samples and can be used to relate gene expression profiles to phenotypes.
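Infoboss - Native species protein gene expression probability data
Public Data Portal
● Data keywords
- Genome, gene, NGS, DNA
● Data product information
- This product provides gene-family expression probability information for genes obtained through genome analysis of native species.
- Functional domains are classified and processed into gene families by function: utility, enzyme, protein, and disease resistance.
- Expression probabilities and lineage probabilities are calculated for each gene family through comparative analysis of the data.
● Column information
- FASTA format
● Usage examples
- With this data product, users can obtain the following information.
1) Basic reference material for fields related to the development of new drugs, functional foods, and cosmetics
● Period and scope
- July 2019 to December 2019
The [original data](https://www.bigdata-forest.kr/product/GNM201201) can be downloaded after logging in and purchasing it.
Infoboss - Native species disease-resistance gene expression probability data
Public Data Portal
● Data keywords
- Genome, NGS, DNA
● Data product information
- This product provides gene-family expression probability information for genes obtained through genome analysis of native species.
- Functional domains are classified and processed into gene families by function: utility, enzyme, protein, and disease resistance.
- Expression probabilities and lineage probabilities are calculated for each gene family through comparative analysis of the data.
● Column information
- FASTA format
● Usage examples
- With this data product, users can obtain the following information.
1) Basic reference material for fields related to the development of new drugs, functional foods, and cosmetics
● Period and scope
- July 2019 to December 2019
The [original data](https://www.bigdata-forest.kr/product/GNM201501) can be downloaded after logging in and purchasing it.
Infoboss - Native species utility gene expression probability data
Public Data Portal
● Data keywords
- Genome, gene, NGS, DNA
● Data product information
- This product provides gene-family expression probability information for genes obtained through genome analysis of native species.
- Functional domains are classified and processed into gene families by function: utility, enzyme, protein, and disease resistance.
- Expression probabilities and lineage probabilities are calculated for each gene family through comparative analysis of the data.
● Column information
- FASTA format
● Usage examples
- With this data product, users can obtain the following information.
1) Basic reference material for fields related to the development of new drugs, functional foods, and cosmetics
● Period and scope
- July 2019 to December 2019
The [original data](https://www.bigdata-forest.kr/product/GNM200601) can be downloaded after logging in and purchasing it.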