Dataset Details
United States
Data and Model Archive for Preliminary Machine Learning Models of Manganese and 1,4-Dioxane in Groundwater on Long Island, New York
Data and preliminary machine-learning models used to predict manganese and 1,4-dioxane in groundwater on Long Island are documented in this data release. Concentration data used to develop the models were from 910 wells for manganese and 553 wells for 1,4-dioxane, primarily public supply wells, from U.S. Geological Survey, U.S. Environmental Protection Agency (USEPA), and Suffolk County Water Authority sources. Thirty-two explanatory variables describe depth, groundwater flow, land use, soil properties, and other features of the aquifer system. The models use XGBoost, an ensemble tree machine learning method. Four models are documented for manganese, predicting the probability of concentrations relative to four thresholds: 10 micrograms per liter (detection), 50 micrograms per liter (the USEPA Secondary Maximum Contaminant Level), 150 micrograms per liter, and 300 micrograms per liter (the USEPA lifetime health advisory). One model is documented for 1,4-dioxane, predicting the probability of concentrations relative to 0.07 micrograms per liter (detection). The models were used to predict concentrations in two layers of the upper glacial aquifer and three layers of the Magothy aquifer. Predictions were made at a 500-square-foot resolution across the entire island for manganese and across Suffolk County, which occupies the eastern two-thirds of Long Island, for 1,4-dioxane. The data are provided in data tables, raster files, and model files. One data table describes the 32 explanatory variables (LI_mn_14dx_exp_vars.txt). One data table describes the well data and includes the manganese and 1,4-dioxane concentrations, explanatory variables, and predictions for the wells (LI_mn_14dx_well_data.txt). 
There is a compressed group (zip file) of five files providing the explanatory variable data used to make predictions for the five aquifer layers (LI_mn_14dx_predinput_griddata.zip) and a zip file of 25 files providing model predictions for each model and aquifer layer (LI_mn_14dx_predoutput_rasters.zip). The data release also contains a tif-format raster file of the prediction grid (LI_mn_14dx_prediction_grid.tif). The models are documented in a zip file (LI_mn_14dx_models.zip) that contains the model object files (R data format) and scripts that can be used to run the models to produce the predictions provided in this data release. Filenames for prediction input and for model output are distinguished by names and numbers as follows: 1_upper_glacial, top layer of the upper glacial aquifer; 3_upper_glacial, bottom layer of the upper glacial aquifer; 5_Magothy, top layer of the Magothy aquifer; 14_Magothy, middle layer of the Magothy aquifer; and 23_Magothy, bottom layer of the Magothy aquifer.
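As a rough illustration of how the threshold models and aquifer layers described above fit together, the following Python sketch labels a concentration against each documented threshold and enumerates the model-by-layer combinations that would account for 25 prediction files. The release itself ships R scripts; the function and variable names here are hypothetical, and treating a value exactly equal to a threshold as an exceedance is an assumption, not something the description states.

```python
from itertools import product

# Thresholds in micrograms per liter, as listed in the description.
MN_THRESHOLDS_UG_L = [10, 50, 150, 300]
DIOXANE_THRESHOLDS_UG_L = [0.07]

# Layer codes and meanings from the filename convention in the description.
LAYERS = {
    "1_upper_glacial": "top layer of the upper glacial aquifer",
    "3_upper_glacial": "bottom layer of the upper glacial aquifer",
    "5_Magothy": "top layer of the Magothy aquifer",
    "14_Magothy": "middle layer of the Magothy aquifer",
    "23_Magothy": "bottom layer of the Magothy aquifer",
}


def exceedance_labels(conc_ug_l, thresholds):
    """Return {threshold: bool} for one measured concentration.

    Assumption: a concentration equal to the threshold counts as an
    exceedance (>=); the description does not specify boundary handling.
    """
    return {t: conc_ug_l >= t for t in thresholds}


def model_layer_pairs():
    """Enumerate every (model, layer code) combination.

    Five models (four manganese thresholds plus one 1,4-dioxane threshold)
    times five layers gives 25 combinations.
    """
    models = [("manganese", t) for t in MN_THRESHOLDS_UG_L]
    models += [("1,4-dioxane", t) for t in DIOXANE_THRESHOLDS_UG_L]
    return list(product(models, LAYERS))


if __name__ == "__main__":
    print(exceedance_labels(120, MN_THRESHOLDS_UG_L))
    print(len(model_layer_pairs()), "model-layer combinations")
```

Each manganese model corresponds to one binary label per well, which is why four separate models are documented rather than a single multi-class model.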
Data Information
Related Data
Data used to model and map manganese in the Northern Atlantic Coastal Plain aquifer system, eastern USA
Public Data Portal
Data used to model and map manganese concentrations in groundwater in the Northern Atlantic Coastal Plain (NACP) aquifer system, eastern USA, are documented in this data release. The model predicts manganese concentration within four classes and is based on concentration data from 4,492 wells. The well data were compiled from U.S. Geological Survey, U.S. Environmental Protection Agency, Suffolk County Water Authority (Suffolk County, New York), and state agency sources. The four concentration classes are based on guidelines for drinking water quality: below detection (class 1, less than 10 micrograms per liter (ug/L)); detected but less than the aesthetic guideline of 50 ug/L (class 2); greater than the aesthetic guideline but less than the health guideline of 300 ug/L (class 3); and greater than the health guideline of 300 ug/L (class 4). The thresholds of 50 ug/L and 300 ug/L are a Secondary Maximum Contaminant Level and a lifetime health advisory, respectively, from the U.S. Environmental Protection Agency for public water supplies. The model is built with the XGBoost machine learning method. Explanatory variables (predictors) include well depth, soil characteristics, hydrologic variables, groundwater residence time, and predicted values of pH and of the probability of low dissolved oxygen from previous machine learning models of the aquifer system. The data are provided in data tables, raster files, and model files, organized as follows. One data table describes the 27 explanatory variables used in the model (NACP_Mn_explanatory_variables.csv). There is a data table for the well data used to develop the models, which includes the manganese concentrations, concentration classes, regional aquifer, explanatory variables, and predicted concentration class for the wells (NACP_Mn_well_data.csv).
There is a compressed group (zip file) of 10 files (one for each regional aquifer) for explanatory variable data used to make predictions for the regional aquifers (NACP_Mn_prediction_input_aquifers.zip). There are two zip files providing model output, one for predictions made for each aquifer in text format and one for tif-format rasters of predictions for each aquifer. The data release also contains a tif-format raster file of the prediction grid and a zip file with the model object file (R data format) and a script that can be used to run the model to produce the predictions provided in this data release. Filenames for prediction input and for model output are distinguished by codes abbreviating the aquifer name and position in the vertical stack of 19 regional aquifers and confining units, as follows: Surficial aquifer, 1surf; Upper Chesapeake aquifer, 3upch; Lower Chesapeake aquifer, 5loch; Piney Point aquifer, 7pipt; Aquia aquifer, 9aqia; Monmouth - Mt. Laurel Aquifer, 11moml; Matawan aquifer, 13mtwn; Magothy Aquifer, 15mgty; Potomac-Patapsco aquifer, 17popt; Potomac-Patuxent aquifer, 19popx. The nine confining units are not represented in the model or predictions.
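The class definitions and filename codes above can be sketched in Python as follows. This is a minimal illustration, not the release's R code: the function names are hypothetical, and assigning a value of exactly 50 ug/L to class 3 and exactly 300 ug/L to class 4 is an assumption, because the description leaves those boundary cases open.

```python
import re

# Filename codes and aquifer names as listed in the description.
AQUIFER_CODES = {
    "1surf": "Surficial aquifer",
    "3upch": "Upper Chesapeake aquifer",
    "5loch": "Lower Chesapeake aquifer",
    "7pipt": "Piney Point aquifer",
    "9aqia": "Aquia aquifer",
    "11moml": "Monmouth - Mt. Laurel Aquifer",
    "13mtwn": "Matawan aquifer",
    "15mgty": "Magothy Aquifer",
    "17popt": "Potomac-Patapsco aquifer",
    "19popx": "Potomac-Patuxent aquifer",
}


def mn_class(conc_ug_l):
    """Map a manganese concentration (ug/L) to concentration class 1-4.

    Assumption: values exactly at 50 or 300 ug/L fall in the higher class.
    """
    if conc_ug_l < 10:
        return 1  # below detection
    if conc_ug_l < 50:
        return 2  # detected, below the aesthetic guideline
    if conc_ug_l < 300:
        return 3  # above the aesthetic, below the health guideline
    return 4      # above the health guideline


def stack_position(code):
    """Extract the numeric position in the 19-unit vertical stack."""
    match = re.fullmatch(r"(\d+)([a-z]+)", code)
    if match is None:
        raise ValueError(f"unrecognized aquifer code: {code!r}")
    return int(match.group(1))


if __name__ == "__main__":
    positions = sorted(stack_position(c) for c in AQUIFER_CODES)
    # Positions not used by an aquifer correspond to the confining units.
    confining = [p for p in range(1, 20) if p not in positions]
    print(positions, confining)
    print([mn_class(c) for c in (5, 25, 120, 450)])
```

The ten aquifers occupy the odd positions of the stack, which leaves the nine even positions for the confining units that the model does not represent.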
Machine-learning model predictions and rasters of arsenic and manganese in groundwater in the Mississippi River Valley alluvial aquifer
Public Data Portal
Groundwater from the Mississippi River Valley alluvial aquifer (MRVA) is a vital resource for agriculture and drinking-water supplies in the central United States. Water availability can be limited in some areas of the aquifer by high concentrations of trace elements, including manganese and arsenic. Boosted regression trees, a type of ensemble-tree machine-learning method, were used to predict manganese concentration and the probability of arsenic concentration exceeding a 10 µg/L threshold throughout the MRVA. Explanatory variables for the BRT models included attributes associated with well location and construction, surficial variables (such as hydrologic position and recharge), variables extracted from a MODFLOW-2005 groundwater-flow model for the Mississippi embayment, and variables from an airborne electromagnetic survey of the aquifer. This data release provides the R scripts to tune and reproduce the BRT models and final prediction rasters. For a full description of modeling workflow and final model selection see the companion journal article.
Data for Elevated Manganese Concentrations in United States Groundwater, Role of Land Surface-Soil-Aquifer Connections
Public Data Portal
Chemical data from 43,334 wells were used to examine the role of land surface-soil-aquifer connections in producing elevated manganese concentrations (>300 µg/L) in United States (U.S.) groundwater. Elevated manganese and dissolved organic carbon (DOC) concentrations were associated with shallow water tables and organic-carbon rich soils, suggesting soil-derived DOC supported manganese reduction. Manganese and DOC concentrations were higher near rivers than farther from rivers, suggesting river-derived DOC also supported manganese reduction. Anthropogenic nitrogen may also affect manganese concentrations in groundwater. In parts of the northeastern U.S. containing poorly buffered soils, ~40% of the samples with elevated manganese concentrations had pH values <6 and elevated concentrations of dissolved oxygen and nitrate relative to samples with pH ≥6, suggesting acidic recharge produced by the oxidation of ammonium in fertilizer helped mobilize manganese. An estimated 2.6 million people potentially consume groundwater with elevated manganese concentrations, the highest densities of which occur near rivers and in areas with organic-carbon rich soil. Results from this study indicate land surface-soil-aquifer connections play an important role in producing elevated manganese concentrations in groundwater used for human consumption.
Deep learning classification of manganese and iron mines and prospects in the Lewisburg 30 x 60 minute quadrangle
Public Data Portal
Manganese is a designated critical mineral, being industrially utilized for producing steel and batteries, including in the production of electric vehicles (Rozelle and others, 2021). The central Appalachian Valley and Ridge hosts hundreds of manganese and iron oxide mines that served steel production until their abandonment in the mid-twentieth century (Lesure, 1957; Pegau, 1958). Many relict mines still feature accessible pits, waste rock, and unmined ore materials to varying degrees. Preliminary assessments of supergene manganese oxides in the Appalachian Mountains have revealed extensive enrichment in critical minerals and rare earth elements (REE) (Carmichael and others, 2017; Odom, 2020). The Appalachian Manganese Oxide Research Effort (AMORE) was established to: (1) characterize the locations and extents of Appalachian manganese oxide mines using artificial intelligence mapping applied to high-resolution lidar elevation models and (2) assess the geochemical nature of remnant manganese oxide ore in the context of critical minerals. Here, we present digital vector data of manganese and iron mines and prospects within the Lewisburg 30 x 60 minute quadrangle generated by a semantic segmentation deep learning artificial intelligence model. The Lewisburg 30 x 60 minute quadrangle hosts several large areas of historic iron and manganese mines on federal lands (Lesure, 1957). Previously published maps (Lesure, 1957) document mining at the excavation to large trench scale and lack the resolution to document prospect pits and small trenches smaller than ~30 square meters (~323 square feet), which are often characteristic of mine workings throughout the quadrangle. Mineralization within the quadrangle is documented in the following scenarios: (1) at the contact between the Devonian Oriskany Sandstone and limestones of the Silurian and Devonian Helderberg Group, (2) disseminated within the Silurian Rose Hill Formation, and (3) within the damage zones of faults. 
Model outputs identified the locations of probable prospect pits, trenches, and large excavations which were then evaluated based on geologic context and subsequently culled based on their co-occurrence with known anthropogenic and karst features. The resulting dataset contains 2,054 features ranging in size from ~16 square meters to ~183,224 square meters (~172-1,972,207 square feet), documents probable mining and prospecting features in much greater detail than in previously published resources (for example, Lesure, 1957), and is a valuable resource for future work documenting existing and abandoned mine lands on federal lands.
Delaware River Basin Stream Salinity Machine Learning Models and Data
Public Data Portal
This model archive contains the input data, model code, and model outputs for machine learning models that predict daily non-tidal stream salinity (specific conductance) for a network of 459 modeled stream segments across the Delaware River Basin (DRB) from 1984-09-30 to 2021-12-31. There are a total of twelve models from combinations of two machine learning models (Random Forest and Recurrent Graph Convolution Neural Networks), two training/testing partitions (spatial and temporal), and three input attribute sets (dynamic attributes, dynamic and static attributes, and dynamic attributes and a minimum set of static attributes). In addition to the inputs and outputs for non-tidal predictions provided on the landing page, we also provide example predictions for models trained with additional tidal stream segments within the model archive (TidalExample folder), but we do not recommend our models for this use case. Model outputs contained within the model archive include performance metrics, plots of spatial and temporal errors, and Shapley (SHAP) explainable artificial intelligence plots for the best models. The results of these models provide insights into DRB stream segments with elevated salinity, and processes that drive stream salinization across the DRB, which may be used to inform salinity management. This data compilation was funded by the USGS.
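The count of twelve models follows from crossing the three design choices named above (2 x 2 x 3). A small Python sketch of that enumeration is given below; the dictionary keys and label strings are hypothetical shorthand for the options in the description and do not correspond to identifiers in the archive.

```python
from itertools import product

# Options as named in the description; the short labels are illustrative.
ALGORITHMS = ["Random Forest", "Recurrent Graph Convolution Neural Network"]
PARTITIONS = ["spatial", "temporal"]
ATTRIBUTE_SETS = [
    "dynamic",
    "dynamic + static",
    "dynamic + minimum static",
]


def model_configurations():
    """Return one dict per combination of algorithm, partition, and inputs."""
    return [
        {"algorithm": a, "partition": p, "attributes": s}
        for a, p, s in product(ALGORITHMS, PARTITIONS, ATTRIBUTE_SETS)
    ]


if __name__ == "__main__":
    for config in model_configurations():
        print(config)
```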
Probability distribution grids of dissolved oxygen and dissolved manganese concentrations at selected thresholds in drinking water depth zones, Central Valley, California
Public Data Portal
The ASCII grids represent regional probabilities that groundwater at a particular location will have dissolved oxygen (DO) concentrations less than selected threshold values representing anoxic groundwater conditions, or dissolved manganese (Mn) concentrations greater than selected threshold values representing secondary drinking water-quality contaminant levels (SMCL) and health-based screening levels (HBSL) for water quality. The probability models were constrained by the alluvial boundary of the Central Valley to a depth of approximately 300 meters (m). We used boosted regression trees (BRT) with a Bernoulli error distribution, within a statistical learning framework implemented in R (http://www.r-project.org/), to produce two-dimensional probability grids at selected depths throughout the modeling domain. The statistical learning framework seeks to maximize the predictive performance of machine learning methods through model tuning by cross validation. Models were constructed using measured dissolved oxygen and manganese concentrations sampled from 2,767 wells within the alluvial boundary of the Central Valley and more than 60 predictor variables from 7 sources (see metadata), which together represent regional-scale soil properties, soil chemistry, land use, aquifer textures, and aquifer hydrology. Previously developed Central Valley model outputs of textures (Central Valley Textural Model, CVTM; Faunt and others, 2010) and MODFLOW-simulated vertical water fluxes and predicted depth to water table (Central Valley Hydrologic Model, CVHM; Faunt, 2009) were used to represent aquifer textures and groundwater hydraulics, respectively. The wells used in the BRT models were attributed with predictor variable values in ArcGIS using a 500-m buffer.
The response variable data consisted of measured DO and Mn concentrations from 2,767 wells within the alluvial boundary of the Central Valley. The data were compiled from two sources: U.S. Geological Survey (USGS) National Water Information System (NWIS) database (all data are publicly available from the USGS at http://waterdata.usgs.gov/ca/nwis/nwis) and the California State Water Resources Control Board Division of Drinking Water (SWRCB-DDW) database (water-quality data are publicly available from the SWRCB at http://geotracker.waterboards.ca.gov/gama/). Only wells with well depth data were selected, and for wells with multiple records, only the most recent sample in the period 1993–2014 that had the required water-quality data was used. Data were available for 932 wells from the NWIS dataset and 1,835 wells from the SWRCB-DDW dataset; models were trained on the NWIS wells and evaluated on the SWRCB-DDW wells as an independent hold-out dataset. We used cross-validation to assess the predictive performance of models of varying complexity as a basis for selecting the final models used to create the prediction grids. Trained models were applied to cross-validation testing data and a separate hold-out dataset to evaluate model predictive performance by emphasizing three model metrics of fit: Kappa, accuracy, and the area under the receiver operating characteristic (ROC) curve. The final trained models were used for mapping predictions at discrete depths to a depth of approximately 300 m. Trained DO and Mn models had accuracies of 86–100 percent, Kappa values of 0.69–0.99, and ROC values of 0.92–1.0. Model accuracies for cross-validation testing datasets were 82–95 percent, and ROC values were 0.87–0.91, indicating good predictive performance. Kappa values for the cross-validation testing dataset were 0.30–0.69, indicating fair to substantial agreement between testing observations and model predictions.
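To make the relationship between accuracy and Kappa concrete, the sketch below computes both from a binary confusion matrix in plain Python. It is an illustration only: the study's models were built in R, the function name is hypothetical, and the counts in the example are invented to show the effect of class imbalance rather than taken from the data release.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for binary confusion-matrix counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives.
    """
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")
    observed = (tp + tn) / n  # observed agreement, i.e. accuracy
    # Agreement expected by chance from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined in this case
    return observed, (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented, heavily imbalanced example: exceedances are rare, so a model
    # that mostly predicts "no exceedance" is accurate but has low kappa.
    acc, kappa = accuracy_and_kappa(tp=5, fp=2, fn=48, tn=945)
    print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Because Kappa discounts the agreement expected by chance, a model can score high accuracy on a rare-exceedance dataset while its Kappa stays low, so the two metrics are best read together.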
Hold-out data were available for the manganese model only and indicated accuracies of 89–97 percent, ROC values of 0.73–0.75, and Kappa values of 0.06–0.30. The