Dataset details
United States
Theoretically Optimal Distributed Anomaly Detection
A novel general framework for distributed anomaly detection with theoretical performance guarantees is proposed. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. Under a Gaussian assumption, our distributed algorithm is guaranteed to perform as well as its centralized counterpart, a condition we call 'zero information loss'. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach.
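The idea of computing global statistics from local sufficient statistics can be illustrated with a minimal sketch. It assumes a univariate Gaussian model and a simple "more than k standard deviations from the mean" rule; the function names, the threshold, and the detection rule itself are illustrative assumptions, not the authors' algorithm. Each site shares only its count, sum, and sum of squares, and the merged values yield the same mean and variance as pooling all raw data.

```python
from math import sqrt


def local_stats(values):
    """Sufficient statistics of one site's data for a univariate Gaussian:
    (count, sum, sum of squares). Only these three numbers leave the site."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge_stats(stats):
    """Combine per-site statistics by element-wise addition."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    return n, total, total_sq


def gaussian_params(n, total, total_sq):
    """Maximum-likelihood mean and (population) variance from merged statistics.
    Note: the sum-of-squares formula can lose precision on data with a large mean."""
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, var


def flag_anomalies(values, mean, var, threshold=3.0):
    """Hypothetical detection rule: flag values farther than
    `threshold` standard deviations from the global mean."""
    std = sqrt(var)
    return [v for v in values if abs(v - mean) > threshold * std]


# Usage: three sites, none of which shares raw data.
sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, var = gaussian_params(*merge_stats([local_stats(s) for s in sites]))
# Each site can now score its own data against the global parameters.
local_flags = [flag_anomalies(s, mean, var, threshold=1.0) for s in sites]
```

Because the three statistics are additive, the merged parameters match those of a centralized fit up to floating-point rounding, which is the sense in which this toy example loses no information.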
Data information
Related data
DISTRIBUTED ANOMALY DETECTION USING SATELLITE DATA FROM MULTIPLE MODALITIES
Public Data Portal
Kanishka Bhaduri, Kamalika Das, and Petr Votava. Abstract. There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
Anomaly Detection for Complex Systems
Public Data Portal
In performance maintenance of large, complex systems, sensor information from sub-components tends to be readily available, and can be used to make predictions about the system's health and diagnose possible anomalies. However, existing methods can only use predictions of individual component anomalies to guess at systemic problems; they cannot accurately estimate the magnitude of the problem or prescribe good solutions. Since physical complex systems usually have well-defined semantics of operation, we here propose using anomaly detection techniques drawn from data mining in conjunction with an automated theorem prover working on a domain-specific knowledge base to perform systemic anomaly detection on complex systems. For clarity of presentation, the remaining content of this submission is presented compactly in Fig. 1.
An Efficient Local Algorithm for Distributed Multivariate Regression
Public Data Portal
This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm is designed for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer-nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and the dynamic nature of the data/network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the "quality" of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.
Solving a prisoner's dilemma in distributed anomaly detection
Public Data Portal
Anomaly detection has recently become an important problem in many industrial and financial applications. In several instances, the data to be analyzed for possible anomalies is located at multiple sites and cannot be merged due to practical constraints such as bandwidth limitations and proprietary concerns. At the same time, the size of data sets affects prediction quality in almost all data mining applications. In such circumstances, distributed data mining algorithms may be used to extract information from multiple data sites in order to make better predictions. In the absence of theoretical guarantees, however, the degree to which data decentralization affects the performance of these algorithms is not known, which reduces the data-providing participants' incentive to cooperate. This creates a metaphorical 'prisoner's dilemma' in the context of data mining. In this work, we propose a novel general framework for distributed anomaly detection with theoretical performance guarantees. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. We show that the performance of such a distributed approach is indistinguishable from that of a centralized instantiation of the same anomaly detection algorithm, a condition that we call zero information loss. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach. The remaining content of this presentation is presented in Fig. 1.
Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data
Public Data Portal
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the 'Commercial Modular Aero-Propulsion System Simulation' (CMAPSS).
A Scalable Local Algorithm for Distributed Multivariate Regression
Public Data Portal
This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm can be used for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer-nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and the dynamic nature of the data/network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the quality of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.
Comparative Analysis of Data-Driven Anomaly Detection Methods
Public Data Portal
This paper provides a review of three different advanced machine learning algorithms for anomaly detection in continuous data streams from a ground-test firing of a subscale Solid Rocket Motor (SRM). This study compares Orca, one-class support vector machines, and the Inductive Monitoring System (IMS) for anomaly detection on the data streams. We measure the performance of the algorithms with respect to the detection horizon for situations where fault information is available. These algorithms have also been studied by the present authors (and other co-authors) as applied to liquid propulsion systems. The trade space will be explored between these algorithms for both types of propulsion systems.
Improving Distributed Diagnosis Through Structural Model Decomposition
Public Data Portal
Complex engineering systems require efficient fault diagnosis methodologies, but centralized approaches do not scale well, and this motivates the development of distributed solutions. This work presents an event-based approach for distributed diagnosis of abrupt parametric faults in continuous systems, by using the structural model decomposition capabilities provided by Possible Conflicts. We develop a distributed diagnosis algorithm that uses residuals, computed by extending Possible Conflicts, to build local event-based diagnosers based on global diagnosability analysis that generate globally correct local diagnosis results. The proposed approach is applied to a multi-tank system, and results demonstrate an improvement in the design of local diagnosers. Since local diagnosers use only a subset of the residuals, and use subsystem models to compute residuals (instead of the global system model), the local diagnosers are more efficient than previously developed distributed approaches.
Detection and Prognostics on Low Dimensional Systems
Public Data Portal
This paper describes the application of known and novel prognostic algorithms on systems that can be described by low dimensional, potentially nonlinear dynamics. The methods rely on estimating the conditional probability distribution of the output of the system at a future time given knowledge of the current state of the system. We show how to estimate these conditional probabilities using a variety of techniques, including bagged neural networks and kernel methods such as Gaussian Process Regression (GPR). The results are compared with standard methods such as the nearest neighbor algorithm. We demonstrate the algorithms on a real-world data set and a simulated data set. The real-world data set consists of the intensity of an NH3 laser. The laser data set has been shown by other authors to exhibit low-dimensional chaos with sudden drops in intensity. The simulated data set is generated from the Lorenz attractor and has known statistical characteristics. On these data sets, we show the evolution of the estimated conditional probability distribution, the way it can act as a prognostic signal, and its use as an early warning system. We also review a novel approach to perform Gaussian Process Regression with large numbers of data points.
Distributed Diagnosis in Uncertain Environments Using Dynamic Bayesian Networks
Public Data Portal
This paper presents a distributed Bayesian fault diagnosis scheme for physical systems. Our diagnoser design is based on a procedure for factoring the global system bond graph (BG) into a set of structurally observable bond graph factors (BG-Fs). Each BG-F is systematically translated into a corresponding DBN Factor (DBN-F), which is then used in its corresponding local diagnoser for quantitative fault detection, isolation, and identification. By construction, the random variables in each DBN-F are conditionally independent of the random variables in all other DBN-Fs, given a subset of communicated measurements considered as system inputs. Each DBN-F and BG-F pair is used to derive a local diagnoser that generates globally correct diagnosis results by local analysis. Together, the local diagnosers diagnose all single faults of interest in the system. We demonstrate on an electrical system how our distributed diagnosis scheme is computationally more efficient than its centralized counterpart, without compromising the accuracy of the diagnosis results.