1 Introduction

A data breach is a serious issue for every organization. A data breach is any incident in which data is viewed, deleted, altered, or transferred by an unauthorized party, or by an authorized person acting intentionally or unintentionally [1]. A data breach can be caused by a variety of factors, including hardware issues, software crashes, phishing, malware, ransomware, distributed denial-of-service, human error, misplaced or lost data storage devices (such as USB drives, laptops, and portable drives), malicious insiders, and external issues such as power outages [2]. Our research focuses on malicious insider threats.

Insider data breaches are growing more common and have a higher financial impact on organizations. A recent report states that insider threats are responsible for 60% of data breaches [3]. An insider is typically someone who has been granted access to company resources and who damages the company, intentionally or unintentionally. Current or former employees, contractors, and partners who have access to an organization’s systems or data may all pose such a threat [4].

Since the General Data Protection Regulation (GDPR) went into effect in May 2018, there has been a paradigm shift in data privacy [5]. The GDPR specifies processes and rules to address the challenges of insider threats and data protection. When a data breach occurs, the Data Controller (DC) must notify the Data Protection Authority (DPA) and the affected Data Owner (DO). A DC that fails to report a breach within the prescribed time frame faces significant fines. According to the GDPR, organizations that experience a data breach may be fined up to 4% of their annual revenue or €20 million, whichever is greater [6]. A system that can detect data breaches is therefore required to avoid severe penalties.

In contrast to traditional internet technology, which simply provides a “network of information,” blockchain is a technology that provides a “network of value” [7]. The Ethereum blockchain is fully programmable through languages such as Solidity [8], which allows modern decentralized applications to be built. These decentralized applications use smart contracts. Smart contracts are scripts that enable users to execute transactions without the possibility of fraud or third-party interference [9].

To address the issue of data breaches, this article develops a GDPR-compliant detection system that takes advantage of semantic web, smart contract, and blockchain technologies. The methodology of the proposed system and the functioning of its smart contracts are detailed in the subsequent sections.

2 Related Work

Malicious insider threats have recently been identified as one of the most harmful types of breach attack against organizations. Data breaches are security incidents that occur when an attacker gains access to a company’s network, application, or database and performs malicious activity. Numerous studies have been conducted to address this issue.

In [10], the authors presented a data leakage prevention system that uses document semantic signatures to detect breaches. When the semantic signatures of an outgoing document match those of the original document, the system reports a data leak. A sensitive file can, however, avoid detection if an attacker encrypts it before sending it via email, because the detection system cannot recognize the encrypted data as a sensitive file. As a result, sensitive data may be leaked. In [11], an anomaly detection model is proposed for database protection. A Hidden Markov Model (HMM) was used for prediction, and the authors achieved minimal false-positive rates. The HMM-based system, however, depends on the training dataset. If the training dataset is insufficient, the system may generate false-positive alarms.

The authors suggested a three-tiered data protection strategy in [12] in response to the information leakage concern created by cloud indexing. However, the process requires a pre-defined data classification. Data that has been misclassified may be leaked. The authors of [13] presented a data leak prevention strategy based on Named Entity Recognition (NER). However, the approach did not use semantic technologies to provide meaning to entities. As a result, spelling errors and related words could impact NER.

To detect insider attacks in relational database systems, the authors proposed a blockchain-based framework in [14]. However, their solution only addresses private data under a centralized control system, in which a private blockchain network is built in a privately controlled environment with no democratic participants. Furthermore, attackers who gain control inside an organization can manipulate any data or network there, regardless of whether it is built on blockchain technology. Storing all proof within the same centrally controlled system can therefore increase the attack risk. Because the authors employed a private blockchain network, anyone with access to that company can modify it, so the stored evidence cannot be relied on if the entire organization is compromised.

In [15], a blockchain-based event-driven data alteration detection system is presented. However, the study does not clarify how the framework will function technically, and it lacks practical application examples or solutions. Storing data in any structure on the blockchain requires a smart contract, yet the paper gives no details of the smart contract. It also does not specify how data evidence or fingerprints are kept on the blockchain network. Without an appropriate structure, this approach remains ambiguous and impractical. More generally, existing research lacks practical GDPR-compliant methods for data leak detection that Data Protection Authorities and Data Controllers can use to determine whether affected Data Owners must be notified.

To summarize the above discussion, we find that existing blockchain-based data breach detection solutions have several limitations. As a result, developing a system capable of addressing the issues mentioned above is challenging. Considering the limitations of previous studies, we present a novel Personal Data Breach Detection (PDBD) technique in this paper. The following are the main contributions of our work.

  1. A GDPR-compliant PDBD model is developed. It enables the DC to quickly determine the necessary mitigation measures for data breach events.

  2. Semantic Web Rule Language (SWRL) rules are developed for the Data Breach Severity Assessment (DBSA) mechanism. This gives the DC a computable tool to assess the severity of data breaches. It also helps the DC notify the data protection authorities and the affected DO about breaches accordingly.

  3. A severity level detection ontology is developed to calculate the breach severity index score. The ontology also indicates the breach severity level using SWRL rules.

  4. A Hash Variance Algorithm (HVA) is introduced to reduce the computational overhead of both DBSA and Ethereum.

3 Use Case Scenario

This section discusses a use case scenario from the health industry to show how the system performs. In today's hospitals, collecting and processing patients' personal data has become mandatory. Almost every hospital department handles protected health information and personally identifiable information about patients. When an insider attack discloses a patient’s private information, it is hard to restore privacy or repair the psychosocial damage. Furthermore, compromised information can interfere with hospital operations and negatively impact the health and well-being of the patient. The resulting operating delays may lead to death or permanent disability if a patient does not receive immediate treatment.

The use case scenario assumes that John, the data processor, is a medical specialist who frequently requests patients’ medical records for operational needs, and Michael, the Data Owner, is the patient. Michael agrees to have his medical data preserved on the blockchain. John can get the required patient data from the patient database by submitting a request to Robert, the Data Controller. Robert uses our proposed system for tasks such as data verification and consent validation. Before John receives any data, the proposed approach allows Robert to detect any alterations to the database record and confirm its authenticity.

Fig. 1. Proposed System Model

3.1 System Design and Methodology

Figure 1 illustrates the proposed data breach detection model and its components with operation flow on DBSA and Ethereum layers. The main components of the proposed model are as follows.

Data Consumer:

Data consumers, that is, supposedly trusted third parties that request Data Owners’ personal information, are important entities of the proposed model. An example is the medical specialist from the previous section, who frequently requests patients’ medical records for operational purposes.

Ethereum:

Ethereum is a blockchain-based platform. A blockchain is a collection of blocks containing transaction data that are linked to each other in a chain. It is a secure, cryptography-based digital ledger distributed across a network, which allows transactions to be secure, anonymous, and fast without any central authority. The proposed model uses the Ethereum (ETH) network together with shared database records. Rather than storing all the data on the blockchain, we create a Cell Signature (CS) for each cell of a data table and store only that on the blockchain. The cell signatures are generated with SHA256 [16], a cryptographic hash function, so it is practically impossible to reverse a signature and find a message or data that hashes to a given digest. For each row in the table, we generate a cell pointer CnRn. For example, if row 1 (R1) of Table X has N columns, the Cell Pointer (CP) is generated as shown in Eq. 1.

$$\begin{aligned} \mathrm{TableX\_CP\_R1} = R1C1(FLH),\ R1C2(FLH),\ \ldots,\ R1CN(FLH) \end{aligned}$$
(1)
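
As a minimal sketch of Eq. 1, assuming each cell is serialized as a UTF-8 string before hashing (the paper does not state the encoding), a cell pointer can be built as an ordered list of SHA-256 digests. The helper names `cell_signature` and `cell_pointer` and the sample row are hypothetical and serve only as illustration.

```python
import hashlib
from typing import List


def cell_signature(value) -> str:
    """Return the SHA-256 hex digest (a fixed-length hash) of one table cell.

    Assumption: the cell is converted to text and encoded as UTF-8.
    """
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def cell_pointer(row) -> List[str]:
    """Build the cell pointer of a row: one signature per column, in order."""
    return [cell_signature(cell) for cell in row]


# Hypothetical row R1 of Table X with three columns.
row_r1 = ["P-1001", "Michael", "O+"]
table_x_cp_r1 = cell_pointer(row_r1)

# Every signature has the same length (64 hex characters) whatever the cell size.
print(len(table_x_cp_r1), {len(sig) for sig in table_x_cp_r1})
```

Only `table_x_cp_r1` would be written to the blockchain in this sketch; the row values themselves stay in the database.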

The sequence of CP with cell signature is depicted in Fig. 2. These CPs are then stored on a blockchain using a private key. Any modifications to a CP get logged on the blockchain with a new cell signature of the respective row. Any previous CPs of the modified data cell are also preserved in the blockchain.
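
The append-only behavior described here can be sketched with an in-memory stand-in for the on-chain store. This is not a smart contract and it does not model private-key signing; the class `CellPointerLog`, its methods, and the sample data are hypothetical names introduced for illustration.

```python
import hashlib
from typing import Dict, List


def cell_pointer(row) -> List[str]:
    """One SHA-256 hex digest per cell (assumes UTF-8 text serialization)."""
    return [hashlib.sha256(str(cell).encode("utf-8")).hexdigest() for cell in row]


class CellPointerLog:
    """In-memory stand-in for the blockchain: entries are appended, never overwritten."""

    def __init__(self) -> None:
        self._history: Dict[str, List[List[str]]] = {}

    def record(self, row_id: str, cp: List[str]) -> None:
        """Log a new cell pointer for a row while keeping all earlier ones."""
        self._history.setdefault(row_id, []).append(list(cp))

    def latest(self, row_id: str) -> List[str]:
        """Return the most recently logged cell pointer of a row."""
        return list(self._history[row_id][-1])

    def versions(self, row_id: str) -> List[List[str]]:
        """Return every logged cell pointer of a row, oldest first."""
        return [list(cp) for cp in self._history[row_id]]


log = CellPointerLog()
log.record("TableX_R1", cell_pointer(["P-1001", "Michael", "O+"]))
log.record("TableX_R1", cell_pointer(["P-1001", "Michael", "AB-"]))  # legitimate update
print(len(log.versions("TableX_R1")))  # both versions remain available
```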

HVA:

In the previous phase, we created a cell signature for each data cell and stored the resulting CPs on the blockchain. The next phase is the Hash Variance Algorithm (HVA) phase, which takes the SHA256 cell signatures created in the Ethereum phase as inputs. The function of the HVA mechanism is shown in Algorithm 1. This phase mainly uses semantic web technologies such as SPARQL, SWRL rules, a reasoning engine, and the Jena framework. It fetches the stored cell signatures from the previous phase, retrieves the shared records from the shared part of the database applications with a SPARQL query, and calculates runtime cell signatures of the CPs from those records. In other words, the CPs from Ethereum and from the shared database are compared one by one according to the Fixed-Length Hash (FLH) threshold set in advance. If a difference is found, the record is considered modified and the event is treated as an attack. This can reduce the computational overhead of both DBSA and Ethereum and increase the system throughput.
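
The comparison step can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: the FLH threshold is interpreted here as the fixed digest length of SHA-256 (64 hex characters), the SPARQL retrieval is replaced by a plain Python list, and the function name `modified_columns` is hypothetical.

```python
import hashlib
from typing import List

FLH_LENGTH = 64  # assumption: hex length of a SHA-256 digest


def cell_signature(value) -> str:
    """SHA-256 hex digest of one cell (assumes UTF-8 text serialization)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def modified_columns(stored_cp: List[str], current_row: list) -> List[int]:
    """Return the 1-based column numbers whose runtime signature differs
    from the signature stored on the blockchain."""
    if len(stored_cp) != len(current_row):
        raise ValueError("column count differs from the stored cell pointer")
    changed = []
    for column, (stored, cell) in enumerate(zip(stored_cp, current_row), start=1):
        runtime = cell_signature(cell)
        # A malformed stored hash or any mismatch marks the cell as modified.
        if len(stored) != FLH_LENGTH or stored != runtime:
            changed.append(column)
    return changed


# Stored CP (from the Ethereum phase) versus the row currently in the database.
stored = [cell_signature(v) for v in ["P-1001", "Michael", "O+"]]
print(modified_columns(stored, ["P-1001", "Michael", "O+"]))   # []
print(modified_columns(stored, ["P-1001", "Michael", "AB-"]))  # [3]
```

A non-empty result would be forwarded to the DBSA phase as a detected modification.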

Fig. 2. Classes of Severity Level Detection Ontology

DBSA:

The HVA phase above describes the basic structure and mechanism of data breach detection. The final output of HVA is forwarded to the DBSA phase to calculate the severity score. In this phase, a severity level detection ontology is developed to calculate the breach severity index score. The DBSA mechanism presented in this paper uses the severity assessment methodology [17] provided by the European Union Agency for Network and Information Security (ENISA). ENISA introduced a severity level assessment formula [17] to calculate the overall severity score, shown in Eq. 2.

$$\begin{aligned} Severity\_level\_score = DPF \times ER + SB \end{aligned}$$
(2)

where DPF is the data processing factor, ER is the ease of recognition, and SB denotes the breach situation. DPF classifies the breached data as simple, behavioral, financial, or sensitive. ER evaluates how easily a specific person can be identified from the breached data and can be negligible, limited, significant, or maximum. SB covers malicious intent and security loss in terms of confidentiality, integrity, and availability.
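
Eq. 2 can be illustrated with a short calculation. The numeric values assigned to each category below are assumptions chosen for illustration and should be checked against the methodology in [17]; the paper itself does not list them. The names `DPF_SCORES`, `ER_SCORES`, and `severity_level_score` are hypothetical.

```python
# Assumed numeric scores per category (illustrative; verify against [17]).
DPF_SCORES = {"simple": 1, "behavioral": 2, "financial": 3, "sensitive": 4}
ER_SCORES = {"negligible": 0.25, "limited": 0.5, "significant": 0.75, "maximum": 1.0}


def severity_level_score(dpf: str, er: str, sb: float) -> float:
    """Evaluate Eq. 2: Severity_level_score = DPF * ER + SB.

    `sb` is the additive breach-situation adjustment (assumed numeric).
    """
    return DPF_SCORES[dpf] * ER_SCORES[er] + sb


# Simple data, negligible ease of recognition, no breach-situation adjustment.
low_case = severity_level_score("simple", "negligible", 0.0)
# Sensitive data, maximum ease of recognition, assumed adjustment of 0.5.
high_case = severity_level_score("sensitive", "maximum", 0.5)
print(low_case, high_case)  # 0.25 4.5
```

Mapping the resulting score to a severity level is left to the ontology and its rules.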

The methodology described above is implemented in the DBSA phase. The severity level detection ontology is developed to apply these guidelines using the recommended methodology [17]. The main classes and subclasses are shown in Fig. 2. In addition, SWRL rules are developed to indicate the severity level of a data breach. Two rules are modeled as a proof of concept, as shown below.

Rule1: Affected_DO(?ado), Breach_Detected(?bd) (?ado dc:hasSB Min) (?ado dc:hasER Negligible) (?ado dc:hasDPF Simple) -> (?bd dc:setFlag Low)

Rule2: Affected_DO(?ado), Breach_Detected(?bd) (?ado dc:hasSB Confidentiality_loss) (?ado dc:hasER Maximum) (?ado dc:hasDPF Sensitive) -> (?bd dc:setFlag High)
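
The logic of the two rules can be mirrored in plain Python to make the conditions explicit. This is only a sketch of the rule conditions, not a replacement for a reasoning engine; the function name `breach_flag` is hypothetical, and returning `None` when neither rule fires is an assumption, since only two rules are modeled.

```python
from typing import Optional


def breach_flag(sb: str, er: str, dpf: str) -> Optional[str]:
    """Return the flag set by Rule1 or Rule2, or None when neither applies."""
    # Rule1: minimal breach situation, negligible recognition, simple data.
    if (sb, er, dpf) == ("Min", "Negligible", "Simple"):
        return "Low"
    # Rule2: confidentiality loss, maximum recognition, sensitive data.
    if (sb, er, dpf) == ("Confidentiality_loss", "Maximum", "Sensitive"):
        return "High"
    return None


print(breach_flag("Min", "Negligible", "Simple"))                   # Low
print(breach_flag("Confidentiality_loss", "Maximum", "Sensitive"))  # High
print(breach_flag("Min", "Maximum", "Sensitive"))                   # None
```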

4 Conclusion and Future Works

Data Controllers are obliged to implement measures that facilitate compliance with the GDPR and to notify the Data Protection Authorities and every affected party (Data Owner) of any data breach or possible risk of data privacy violation without undue delay (within 72 h). Failure to issue a breach notification in time can result in a heavy fine. However, detecting a data breach effectively is still a critical and challenging task. Data Controllers therefore need an efficient system that detects data breaches in time, reports their severity level, and supports appropriate management of personal information within organizations and smart devices. This paper presented a novel semantic-blockchain-based model for rapid data breach detection that protects personal data from breaches and reduces direct and indirect damage to personal data. The proposed model generates alerts against data breaches by taking severity assessment details into account and grading the breach incident according to its impact on the Data Owner and the significance of the breach. In the future, we will implement this system using semantic web and blockchain technologies.