데이터셋 상세
미국
Distributed Monitoring of the R2 Statistic for Linear Regression
The problem of monitoring a multivariate linear regression model is relevant in studying the evolving relationship between a set of input variables (features) and one or more dependent target variables. This problem becomes challenging for large scale data in a distributed computing environment when only a subset of instances is available at individual nodes and the local data changes frequently. Data centralization and periodic model recomputation can add high overhead to tasks like anomaly detection in such dynamic settings. Therefore, the goal is to develop techniques for monitoring and updating the model over the union of all nodes' data in a communication-efficient fashion. Correctness guarantees on such techniques are also often highly desirable, especially in safety-critical application scenarios. In this paper we develop DReMo --- a distributed algorithm with very low resource overhead, for monitoring the quality of a regression model in terms of its coefficient of determination (R2 statistic). When the nodes collectively determine that R2 has dropped below a fixed threshold, the linear regression model is recomputed via a network-wide convergecast and the updated model is broadcast back to all nodes. We show empirically, using both synthetic and real data, that our proposed method is highly communication-efficient and scalable, and also provide theoretical guarantees on correctness.
데이터 정보
연관 데이터
A Scalable Local Algorithm for Distributed Multivariate Regression
공공데이터포털
This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm can be used for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer-nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the quality of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.
An Efficient Local Algorithm for Distributed Multivariate Regression
공공데이터포털
This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm is designed for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer-nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the “quality ” of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.
Block-GP: Scalable Gaussian Process Regression for Multimodal Data
공공데이터포털
Regression problems on massive data sets are ubiquitous in many application domains including the Internet, earth and space sciences, and finances. In many cases, regression algorithms such as linear regression or neural networks attempt to fit the target variable as a function of the input variables without regard to the underlying joint distribution of the variables. As a result, these global models are not sensitive to variations in the local structure of the input space. Several algorithms, including the mixture of experts model, classification and regression trees (CART), and others have been developed, motivated by the fact that a variability in the local distribution of inputs may be reflective of a significant change in the target variable. While these methods can handle the non-stationarity in the relationships to varying degrees, they are often not scalable and, therefore, not used in large scale data mining applications. In this paper we develop Block-GP, a Gaussian Process regression framework for multimodal data, that can be an order of magnitude more scalable than existing state-of-the-art nonlinear regression algorithms. The framework builds local Gaussian Processes on semantically meaningful partitions of the data and provides higher prediction accuracy than a single global model with very high confidence. The method relies on approximating the covariance matrix of the entire input space by smaller covariance matrices that can be modeled independently, and can therefore be parallelized for faster execution. Theoretical analysis and empirical studies on various synthetic and real data sets show high accuracy and scalability of Block-GP compared to existing nonlinear regression techniques.
Distributed Prognostic Health Management with Gaussian Process Regression
공공데이터포털
Distributed prognostics architecture design is an enabling step for efficient implementation of health management systems. A major challenge encountered in such design is formulation of optimal distributed prognostics algorithms. In this paper, we present a distributed GPR based prognostics algorithm whose target platform is a wireless sensor network. In addition to challenges encountered in a distributed implementation, a wireless network poses constraints on communication patterns, thereby making the problem more challenging. The prognostics application that was used to demonstrate our new algorithms is battery prognostics. In order to present trade-offs within different prognostic approaches, we present comparison with the distributed implementation of a particle filter based prognostics for the same battery data.
Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems
공공데이터포털
In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly. Mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detect when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data – such as the data’s k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary) few resources are required. When the data undergoes an epoch change – a change in the underlying distribution – and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort ” methods for constructing the model; if an ill-fit model is built the feedback loop would indicate so, and the model would be rebuilt.
Making Predictions using Large Scale Gaussian Processes
공공데이터포털
One of the key problems that arises in many areas is to estimate a potentially nonlinear function [tex] G(x, \theta)[/tex] given input and output samples [tex] ( X,y ) [/tex] so that [tex]y approx G(x, \theta)[/tex]. There are many approaches to addressing this regression problem. Neural networks, regression trees, and many other methods have been developed to estimate [tex]$G$[/tex] given the input output pair [tex] ( X,y ) [/tex]. One method that I have worked with is called Gaussian process regression. There many good texts and papers on the subject. For more technical information on the method and its applications see: http://www.gaussianprocess.org/ A key problem that arises in developing these models on very large data sets is that it ends up requiring an [tex]O(N^3)[/tex] computation where N is the number of data points and the training sample. Obviously this becomes very problematic when N is large. I discussed this problem with Leslie Foster, a mathematics professor at San Jose State University. He, along with some of his students, developed a method to address this problem based on Cholesky decomposition and pivoting. He also shows that this leads to a numerically stable result. If ou're interested in some light reading, I’d suggest you take a look at his [recent paper]( ) (which was accepted in the Journal of Machine Learning Research) posted on dashlink. We've also posted code for you to try it out. Let us know how it goes. If you are interested in applications of this method in the area of prognostics, check out our [new paper](/dashlink/resources/51/) on the subject which was published in IEEE Transactions on Systems, Man, and Cybernetics.
Solving a prisoner's dilemma in distributed anomaly detection
공공데이터포털
Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data providing participants' incentive to cooperate.This creates a metaphorical 'prisoners' dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.
A Distributed Approach to System-Level Prognostics
공공데이터포털
Prognostics, which deals with predicting remaining useful life of components, subsystems, and systems, is a key tech- nology for systems health management that leads to improved safety and reliability with reduced costs. The prognostics problem is often approached from a component-centric view. However, in most cases, it is not specifically component life- times that are important, but, rather, the lifetimes of the sys- tems in which these components reside. The system-level prognostics problem can be quite difficult due to the increased scale and scope of the prognostics problem and the rela- tive lack of scalability and efficiency of typical prognostics approaches. In order to address these issues, we develop a distributed solution to the system-level prognostics prob- lem, based on the concept of structural model decomposi- tion. The system model is decomposed into independent submodels. Independent local prognostics subproblems are then formed based on these local submodels, resulting in a scalable, efficient, and flexible distributed approach to the system-level prognostics problem. We provide a formulation of the system-level prognostics problem and demonstrate the approach on a four-wheeled rover simulation testbed. The re- sults show that the system-level prognostics problem can be accurately and efficiently solved in a distributed fashion.
An Integrated Model-Based Distributed Diagnosis and Prognosis Framework
공공데이터포털
Diagnosis and prognosis are necessary tasks for system reconfiguration and fault-adaptive control in complex systems. Diagnosis consists of detec- tion, isolation and identification of faults, while prognosis consists of prediction of the remain- ing useful life of systems. This paper presents an integrated model-based distributed diagnosis and prognosis framework, where system decomposi- tion is used to perform the diagnosis and prog- nosis tasks in a distributed way. We show how different submodels can be automatically con- structed to solve the local diagnosis and prog- nosis problems. We illustrate our approach us- ing a simulated four-wheeled rover for different fault scenarios. Our experiments show that our approach correctly performs fault diagnosis and prognosis in a robust manner.
An Integrated Framework for Model-Based Distributed Diagnosis and Prognosis
공공데이터포털
Diagnosis and prognosis are necessary tasks for system re- configuration and fault-adaptive control in complex systems. Diagnosis consists of detection, isolation and identification of faults, while prognosis consists of prediction of the remain- ing useful life of systems. This paper presents a novel inte- grated framework for model-based distributed diagnosis and prognosis, where system decomposition is used to enable the diagnosis and prognosis tasks to be performed in a distributed way. We show how different submodels can be automati- cally constructed to solve the local diagnosis and prognosis problems. We illustrate our approach using a simulated four- wheeled rover for different fault scenarios. Our experiments show that our approach correctly performs distributed fault diagnosis and prognosis in an efficient and robust manner.