Identification

Title

Using K-means clustering to detect anomalous file removes

Abstract

One of the purposes of a data archive is to preserve irreplaceable data for future studies and generations. There are a number of ways that data can be lost from an archive, including accidental or malicious deletion of data. While there is a lot of software that can check for specific known threats or problems on a system, detecting non-specific anomalous behavior, such as unusual file removal patterns, is harder. One approach to detecting this kind of problem is machine learning. Machine learning algorithms can build a statistical model of what constitutes normal behavior and then flag data points that are outliers. To help protect the 87 petabytes of data in the National Center for Atmospheric Research's data archive, we explored our file removal patterns and implemented a k-means clustering solution to detect anomalous file removes. This approach can also be used to detect other anomalies, such as operational inconsistencies.
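
The idea summarized in the abstract (model normal behavior with k-means, then flag points far from every cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the two features (files removed and gigabytes removed per user per day), the synthetic training data, the choice of k = 2, and the mean-plus-three-standard-deviations threshold are all assumptions made for illustration.

```python
import math
import random
import statistics


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def nearest_distance(point, centroids):
    """Distance from a point to the closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def fit_threshold(points, centroids, n_sigma=3.0):
    """Assumed cutoff: mean + n_sigma * std of the training distances."""
    dists = [nearest_distance(p, centroids) for p in points]
    return statistics.mean(dists) + n_sigma * statistics.pstdev(dists)


# Hypothetical history of "normal" days: (files removed, GB removed).
_rng = random.Random(42)
history = [(_rng.gauss(10, 2), _rng.gauss(1, 0.3)) for _ in range(100)]
history += [(_rng.gauss(200, 20), _rng.gauss(50, 5)) for _ in range(100)]

centroids = kmeans(history, k=2)
threshold = fit_threshold(history, centroids)


def is_anomalous(point):
    """Flag a new observation that lies far from every learned cluster."""
    return nearest_distance(point, centroids) > threshold


print(is_anomalous((12, 1.2)))     # a typical small cleanup
print(is_anomalous((5000, 900)))   # a very large removal burst
```

In practice the features would likely need scaling so that no single dimension dominates the distance, and both k and the cutoff would have to be tuned against real removal logs.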

Resource type

document

Resource locator

Unique resource identifier

code

https://n2t.org/ark:/85065/d7f47s15

codeSpace

Dataset language

eng

Spatial reference system

code identifying the spatial reference system

Classification of spatial data and services

Topic category

geoscientificInformation

Keywords

Keyword set

keyword value

Text

originating controlled vocabulary

title

Resource Type

reference date

date type

publication

effective date

2016-01-01T00:00:00Z

Geographic location

West bounding longitude

East bounding longitude

North bounding latitude

South bounding latitude

Temporal reference

Temporal extent

Begin position

End position

Dataset reference date

date type

publication

effective date

2018-09-01T00:00:00Z

Frequency of update

Quality and validity

Lineage

Conformity

Data format

name of format

version of format

Constraints related to access and use

Constraint set

Use constraints

Limitations on public access

None

Responsible organisations

Responsible party

contact position

OpenSky Support

organisation name

UCAR/NCAR - Library

full postal address

PO Box 3000

Boulder

80307-3000

email address

opensky@ucar.edu

web address

http://opensky.ucar.edu/

name: homepage

responsible party role

pointOfContact

Metadata on metadata

Metadata point of contact