데이터셋 상세
미국
Judson Mansouri Automated Chemical Curation QSAREnvRes Data
Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publically available PHYSPROP physico-chemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers, and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest quality subset of the original dataset was compared to the larger curated and corrected data set. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publically available for further usage and integration by the scientific community. This dataset is associated with the following publication: Mansouri, K., C. Grulke, A. Richard, R. Judson, and A. Williams. (SAR AND QSAR IN ENVIRONMENTAL RESEARCH) An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH. Taylor & Francis, Inc., Philadelphia, PA, USA, 27(11): 911-937, (2016).
데이터 정보
연관 데이터
Judson Mansouri Automated Chemical Curation QSAREnvRes Data
공공데이터포털
Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publically available PHYSPROP physico-chemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers, and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest quality subset of the original dataset was compared to the larger curated and corrected data set. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publically available for further usage and integration by the scientific community. This dataset is associated with the following publication: Mansouri, K., C. Grulke, A. Richard, R. Judson, and A. Williams. (SAR AND QSAR IN ENVIRONMENTAL RESEARCH) An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH. Taylor & Francis, Inc., Philadelphia, PA, USA, 27(11): 911-937, (2016).
Designing QSARs for Parameters of High-Throughput Toxicokinetic Models Using Open-Source Descriptors
공공데이터포털
Additional details used in the methods are found in the MS Word file “S1_Dawson et al._Supporting_Information.docx”. The MS Excel file “S2_Dawson et al. Supporting Information.xlsx” contains datasets and graphical results. The Excel file sheets are as follows: S2.1 illustrates Clint hepatic flow calculations, S2.2 - 5 include training and test data sets; S2.6-7 include figures illustrating Clint model selection criteria and assemblages of model descriptors; S2.8 includes confusion matrices for evaluation Clint model, S2.9-10 include figures illustrating fup model selection criteria and assemblages of model descriptors (with ranges); S2.11 includes tables of model assessments of the Clint test set, S2.12 includes information relevant to BER calculations for the ToxCast test set, S2.13 includes information relevant to BER calculations for Tox21 chemicals, and S2.14 provides information on different transformations for fup. This dataset is associated with the following publication: Dawson, D., B. Ingle, K. Phillips, J. Nichols, J. Wambaugh, and R. Tornero-Velez. Designing QSARs for Parameters of High-Throughput Toxicokinetic Models Using Open-Source Descriptors. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 55(9): 6505, (6517).
Quantitative Structure-Use Relationship (QSUR) Model Descriptors
공공데이터포털
This data set contains ToxPrint finger prints for all chemicals in FUse that had QSAR-ready SMILES strings as well as select physicochemical properties from the Estimation Program Interface Suite (EPI Suite) program. This dataset is associated with the following publication: Phillips, K., J. Wambaugh, C. Grulke, K. Dionisio, and K. Isaacs. High-throughput screening of chemicals as functional substitutes using structure-based classification models. GREEN CHEMISTRY. Royal Society of Chemistry, Cambridge, UK, 19: 1063-1074, (2017).
Grulke EPA’s DSSTox Chemical Structure Database
공공데이터포털
The US Environmental Protection Agency’s (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database, launched publicly in 2004, currently exceeds 875 K substances spanning hundreds of lists of interest to EPA and environmental researchers. From its inception, DSSTox has focused curation efforts on resolving chemical identifier errors and conflicts in the public domain towards the goal of assigning accurate chemical structures to data and lists of importance to the environmental research and regulatory community. In 2014, the legacy, manually curated DSSTox_V1 content was migrated to a MySQL data model, with modern cheminformatics tools supporting both manual and automated curation processes to increase efficiencies. Currently, DSSTox serves as the core foundation of EPA’s CompTox Chemicals Dashboard [https://comptox.epa.gov/dashboard], which provides public access to DSSTox content in support of a broad range of modeling and research activities within EPA and, increasingly, across the field of computational toxicology. This dataset is associated with the following publication: Grulke, C., A. Williams, I. Thillainadarajah, and A. Richard. EPA’s DSSTox database: History of development of a curated chemistry resource supporting computational toxicology research. Computational Toxicology. Elsevier B.V., Amsterdam, NETHERLANDS, 12: 100096, (2019).
Designing QSARs for parameters of high throughput toxicokinetic models using open-source descriptors
공공데이터포털
The MS Excel file (Dawson et al S2 Supporting information.xlsx) contains multiple sheets containing the training sets, test sets, and predictions for intrinsic metabolic clearance (Clint), fraction unbound in plasma (fup), and bioactivity-exposure ratios (BER), for ToxCast and pharmaceutical-like chemicals. The Word file (Dawson et al S1 Supporting Information.docx) provides additional supporting information on assembly of the training and test sets for Clint, fup, and BER. The data dictionary describes the terms used in the supporting information, S1 and S2. This dataset is associated with the following publication: Dawson, D., B. Ingle, K. Phillips, J. Nichols, J. Wambaugh, and R. Tornero-Velez. Designing QSARs for Parameters of High-Throughput Toxicokinetic Models Using Open-Source Descriptors. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 55(9): 6505-6517, (2021).
Designing QSARs for parameters of high throughput toxicokinetic models using open-source descriptors
공공데이터포털
The MS Excel file (Dawson et al S2 Supporting information.xlsx) contains multiple sheets containing the training sets, test sets, and predictions for intrinsic metabolic clearance (Clint), fraction unbound in plasma (fup), and bioactivity-exposure ratios (BER), for ToxCast and pharmaceutical-like chemicals. The Word file (Dawson et al S1 Supporting Information.docx) provides additional supporting information on assembly of the training and test sets for Clint, fup, and BER. The data dictionary describes the terms used in the supporting information, S1 and S2. This dataset is associated with the following publication: Dawson, D., B. Ingle, K. Phillips, J. Nichols, J. Wambaugh, and R. Tornero-Velez. Designing QSARs for Parameters of High-Throughput Toxicokinetic Models Using Open-Source Descriptors. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 55(9): 6505-6517, (2021).
Metadata Dictionary Describing Data to Model a Chemical's Conditions of Use
공공데이터포털
This data dictionary describes relevant fields from secondary data sources that can assist with modeling the conditions of use for a chemical when performing a chemical assessment. Information on how to access the secondary data sources are included. This dataset is associated with the following publication: Chea, J.D., D.E. Meyer, R.L. Smith, S. Takkellapati, and G.J. Ruiz-Mercado. Exploring automated tracking of chemicals through their conditions of use to support life cycle chemical assessment. JOURNAL OF INDUSTRIAL ECOLOGY. Berkeley Electronic Press, Berkeley, CA, USA, 29(2): 413-616, (2025).
Quantitative structure activity relationships (QSARs) and machine learning models for abiotic reduction of organic compounds by an aqueous Fe(II) complex
공공데이터포털
Due to the increasing diversity of organic contaminants discharged into anoxic water environments, reactivity prediction is necessary for chemical persistence evaluation for water treatment and risk assessment purposes. Almost all quantitative structure activity relationships (QSARs) that describe rates of contaminant transformation apply only to narrowly-defined, relatively homogenous families of reactants (e.g., dechlorination of alkyl halides). In this work, we develop predictive models for abiotic reduction of 60 organic compounds with diverse reducible functional groups, including nitroaromatic compounds (NACs), aliphatic nitro-compounds (ANCs), aromatic N-oxides (ANOs), isoxazoles (ISXs), polyhalogenated alkanes (PHAs), sulfoxides and sulfones (SOs), and others. Rate constants for their reduction were measured using a model reductant system, Fe(II)-tiron. Qualitatively, the rates followed the order NACs > ANOs  ISXs  PHAs > ANCs > SOs. To develop QSARs, both conventional chemical descriptor-based and machine learning (ML)-based approaches were investigated. Conventional univariate QSARs based on a molecular descriptor ELUMO (energy of the lowest-unoccupied molecular orbital) gave good correlations within classes. Multivariate QSARs combining ELUMO with Abraham descriptors for physico-chemical properties gave slightly improved correlations within classes for NCs and NACs, but little improvement in correlation within other classes or among classes. The ML model obtained covers reduction rates for all classes of compounds and all of the conditions studied with the prediction accuracy similar to those of the conventional QSARs for individual classes (r2 = 0.41-0.98 for univariate QSARs, 0.71-0.94 for multivariate QSARs, and 0.83 for the ML model). Both approaches required a scheme for a priori classification of the compounds for model training. This work offers two alternative modelling approaches to comprehensive abiotic reactivity prediction for persistence evaluation of organic compounds in anoxic water environments. This dataset is associated with the following publication: Gao, Y., S. Zhong, T. Torralba-Sanchez, P. Tratnyek, E. Weber, Y. Chen, and H. Zhang. Quantitative structure activity relationships (QSARs) and machine learning models for abiotic reduction of organic compounds by an aqueous Fe(II) complex. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 192: 116843, (2021).
Functional Use Database (FUse)
공공데이터포털
There are five different files for this dataset: 1. A dataset listing the reported functional uses of chemicals (FUse) 2. All 729 ToxPrint descriptors obtained from ChemoTyper for chemicals in FUse 3. All EPI Suite properties obtained for chemicals in FUse 4. The confusion matrix values, similarity thresholds, and bioactivity index for each model. 5. The functional use prediction, bioactivity index, and prediction classification (poor prediction, functional substitute, candidate alternative) for each Tox21 chemical. This dataset is associated with the following publication: Phillips, K., J. Wambaugh, C. Grulke, K. Dionisio, and K. Isaacs. High-throughput screening of chemicals as functional substitutes using structure-based classification models. GREEN CHEMISTRY. Royal Society of Chemistry, Cambridge, UK, 19: 1063-1074, (2017).
Datasets for manuscript "Predicting chemical end-of-life scenarios using structure-based classification models"
공공데이터포털
As described in the README.md file, the GitHub repository github.com/USEPA/PRTR-QSTR-models/tree/data-driven are Python scripts written to run Quantitative Structure–Transfer Relationship (QSTR) models based on chemical structure-based machine learning (ML) models for supporting environmental regulatory decision-making. Using features associated with annual chemical transfer amounts, chemical generator industry sectors, environmental policy stringency, gross value added by industry sectors, chemical descriptors, and chemical unit prices, as in the GitHub repository PRTR_transfers, the QSTR models developed here can predict potential EoL activities for chemicals transferred to off-site locations for EoL management. Also, this contribution shows that QSTR models aid in estimating the mass fraction allocation of chemicals of concern transferred off-site for EoL activities. Also, it describes the Python libraries required for running the code, how to use it, the obtained outputs files after running the Python script, and how to obtain all manuscript figures and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Predicting Chemical End-of-Life Scenarios Using Structure-Based Classification Models. ACS Sustainable Chemistry & Engineering. American Chemical Society, Washington, DC, USA, 11(9): 3594-3602, (2023).