1 Introduction

1.1 Problem statement

A source code vulnerability is a weakness in a program that creates security risks for the system. Successfully exploiting such vulnerabilities allows attackers to steal data or take control of the system. According to statistics from CVE 2022 (http://cve.mitre.org), there has been a significant increase in both the number of vulnerabilities and their severity. Likewise, the CWE Top 25 report (https://cwe.mitre.org/top25/archive/2021/2021_cwe_top25.html) shows that an increasing number of Common Vulnerabilities and Exposures (CVEs) are being discovered and disclosed, especially those with very high severity levels. If these CVEs are exploited, the consequences for an organization's systems can be severe.

There are two main methods for detecting source code vulnerabilities: manual and automated. The manual approach is effective but requires highly skilled analysts and substantial time. Automated detection of source code vulnerabilities is an active area of research, driven by the availability of powerful computational technologies and supporting methods. Recent approaches have focused on analyzing source code in semantic and syntactic forms and then using machine learning or deep learning algorithms to analyze and predict vulnerabilities. Accordingly, recent studies have often concentrated on applying graph analysis techniques in conjunction with machine learning or deep learning algorithms (Chen and Guestrin 2016; Cho et al. 2022; Corinna and Vladimir 1995; Devlin et al. 2018; Ferrante et al. 1989; Gascon et al. 2013; Handa et al. 2019). These approaches typically parse source code into graph forms such as the Abstract Syntax Tree (AST) (Harer et al. 2018), Control Flow Graph (CFG) (Haridas et al. 2020), or Program Dependence Graph (PDG) (He et al. 2016), and then employ machine learning or deep learning techniques to distinguish normal source code from source code containing security vulnerabilities. Other approaches utilize NLP methods such as Word2vec and Doc2vec to extract features from source code, and then apply machine learning algorithms or deep learning models to classify these features.

1.2 Motivation and objectives

1.2.1 Motivation

It is evident that the accuracy of the classification process in detecting source code vulnerabilities heavily relies on the feature extraction phase. However, based on our observations, the current approaches used in this context encounter issues related to insufficient features and attributes concerning security vulnerabilities. Specifically:

  • Regarding the graph-based approaches for source code extraction: upon analyzing these approaches, we notice that representing the source code as an AST, CFG, or PDG fails to adequately express the relationships between functional components and source code variables. The reason is that each of these representations captures only certain relationships among the source code components, so relying on any single one of them yields an incomplete and ambiguous representation of the source code, and the extracted information is not captured efficiently. In other words, the traditional methods of constructing source code features do not encompass all the necessary features. In particular, when these methods are applied to different datasets, the false alarm rate is exceedingly high, which indicates that they are only suitable for specific datasets and are difficult to transfer to others. The attributes and features suggested by prior research have not fully accounted for the variety and differences in software vulnerabilities. Consequently, when these methods are employed in practice, both the false negative rate on vulnerable source code and the false positive rate on normal source code are high, and in some instances no software vulnerabilities can be detected at all due to model overfitting.

  • Regarding the approaches using NLP methods: we observe that these approaches do not fully extract the information contained in the source code. Specifically, conventional methods based on models like Doc2vec, Word2vec, and BERT cannot handle long source code segments: these models process only relatively short sequences, such as 64 or 128 tokens, whereas in practice many source code segments are significantly longer. Consequently, if these models are employed, important data will be missed, making detection more difficult. Furthermore, standardizing source code segments to the same length introduces significant inaccuracies, as the system becomes unable to determine which code segments genuinely contain vulnerabilities. Moreover, enforcing the same length requires truncating or concatenating certain code segments, resulting in incorrect classifications.

1.2.2 Proposed solutions

To address the issues described in Sect. 1.2.1, we propose a new vulnerability detection model based on the combination of RoBERTa and machine learning methods. Specifically, RoBERTa is used to extract features from the source code, while a machine learning algorithm classifies the source code based on the feature vectors obtained after normalizing the source code with RoBERTa. The use of RoBERTa overcomes the limitations of the NLP approaches discussed above. Furthermore, given the characteristics of the source code dataset and of the feature vectors extracted by RoBERTa, supervised machine learning algorithms are the most suitable and optimal choice in this case, for several reasons. First, supervised machine learning algorithms can learn patterns and make predictions or decisions based on data; they can automatically identify complex relationships and extract valuable information from large and diverse datasets, even when the underlying patterns are not explicitly programmed. Since the input data is the output of the RoBERTa network, the most important features of the code snippets have already been extracted before being fed into the machine learning models, allowing the algorithms to capture the behavior of the original data for better classification. Second, machine learning models can flexibly adjust their internal parameters or structures based on observed patterns in the data; this adaptability helps the model handle changing conditions and make more accurate predictions when new information emerges. Finally, machine learning models can analyze historical data, recognize patterns, and make probabilistic predictions or decisions based on the available evidence. This ability is particularly useful when the input is incorrect or incomplete source code, as the model can rely on learned parameters to make relatively accurate decisions. Therefore, the proposed model combining RoBERTa and machine learning algorithms is a sound proposition for enhancing the accurate detection of source code vulnerabilities.

1.2.3 Operating principles of the proposed model

The model researched and proposed in this study operates as follows: first, the source code undergoes data preprocessing to remove white spaces and line breaks. Before constructing the source code features, a tokenizer suited to the source code dataset must be retrained; its output is then fed into the pre-trained RoBERTa model for feature extraction, and finally into the classification model. Specifically, the required processing stages are as follows:

  • Stage 1: data preprocessing: The purpose of this stage is to standardize the source code by removing redundant data and characters such as comments, abbreviations, and inconsistent capitalization. The details of this process are presented in Sect. 3.2 of the article.

  • Stage 2: tokenization: This stage will train a new tokenizer on the entire dataset containing the source codes. To do this, the article proposes using the RoBERTa model. The operating principles and process of the RoBERTa model in tokenizing the source code will be described in detail in Sect. 3.3 of the article. Based on this proposal, we will obtain a sequence of tokens corresponding to each source code segment.

  • Stage 3: vectorizing token sequences using the RoBERTa model: This process uses the token sequences from Stage 2 as input to represent the source code as a feature vector. Using the RoBERTa model to vectorize token sequences is a new direction that has not been suggested in previous research. With this approach, the features of the source code are synthesized and extracted more completely, thereby enhancing the ability to predict vulnerabilities in the source code.

  • Stage 4: classifying source code using ML algorithms: Based on the results of Stage 3, we propose using supervised ML algorithms such as Random Forest, Decision Tree, and Support Vector Machine to determine which source code contains vulnerabilities and which source code is normal (a sketch of the full pipeline follows this list).
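To make the four stages concrete, the following is a minimal sketch of the end-to-end pipeline. It assumes the generic "roberta-base" checkpoint, mean-pooled token embeddings, and a two-sample toy corpus; the actual system retrains the tokenizer and fits the models on the full dataset as described in Sects. 3.2-3.5.

```python
import torch
from transformers import RobertaTokenizerFast, RobertaModel
from sklearn.ensemble import RandomForestClassifier

# Stage 2 prerequisites: a tokenizer and a pre-trained RoBERTa encoder
# ("roberta-base" is a stand-in for the retrained tokenizer/model).
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base").eval()

def embed(source_code: str) -> torch.Tensor:
    # Stages 2-3: tokenize the (preprocessed) code, run the encoder,
    # and mean-pool the per-token vectors into one feature vector.
    inputs = tokenizer(source_code, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # 768-d feature vector

# Toy labeled corpus (Stage 1 preprocessing assumed already done).
codes = ["int main(void) { return 0; }",
         "void f(char *s) { char buf[8]; strcpy(buf, s); }"]
labels = [0, 1]  # 0 = normal, 1 = vulnerable

# Stage 4: supervised classification on the extracted feature vectors.
X = torch.stack([embed(c) for c in codes]).numpy()
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(X))
```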

1.3 Contributions of the paper

The new and scientific features in this study include:

  • Proposing a model for classifying vulnerabilities in source code written in C and C++ based on a combination of the NLP model RoBERTa and supervised ML algorithms. This combined detection model has not been proposed in any other research. The paper demonstrates the superior effectiveness of this combined model compared to other approaches, showing that the proposed model is both practical and scientifically sound.

  • Proposing a method for tokenization and feature extraction of source code based on the RoBERTa model. Using NLP methods to standardize and analyze source code features is not a new approach; however, using the RoBERTa model for this problem is a new proposal. With this approach, this study successfully extracts and synthesizes the features contained in the source code. This proposal has both scientific and practical significance for the task of detecting source code vulnerabilities: it contributes to resolving the difficulty of finding correlations between functions and parameters of the source code, and it partially addresses the lack of general features of source code.

  • Experimenting with, comparing, and evaluating the effectiveness of source code vulnerability detection using various combinations of RoBERTa with machine learning models and other approaches. In this paper, we conducted experiments and evaluations to select the optimal parameters and models for the source code vulnerability detection problem. The experimental results for model selection are presented in Sect. 4.4 of the paper. These results provide a foundation for choosing models and parameters that balance performance against detection effectiveness.

Organization of the paper: Section 2 presents related research. Section 3 describes the Transformer network architecture, its application in RoBERTa, and the improvements of RoBERTa and its application to the task of classifying source code vulnerabilities. Section 4 presents the experiments and evaluation of the proposed model. Finally, the conclusion and future work are given.

2 Related works

In the study by Chen et al. (2017), it was pointed out that current vulnerability detection methods often rely on AST graph traversal algorithms and rule-based node matching, leading to a large number of unnecessary rules that can affect detection speed. That study proposes an optimized rule-checking algorithm to improve the speed of rule matching on AST trees, yielding a 28.7% improvement over the original PMD. In the study by Tian et al. (2009), an improved Apriori algorithm is applied to detect web application vulnerabilities, improving the ability to detect unknown vulnerabilities. This method uses combined rules to analyze the logical relationships between application components and can detect most vulnerabilities present on all pages of the target site. The study by Hu et al. (2020) describes memory leak, double-free, and use-after-free vulnerabilities using the CFG and the Pointer-related Control Flow Graph (PCFG), combined with two algorithms, Vulnerability Feature (VJVF) and Feature Judging (FJ), for detection. The results show that the proposed MRADAVF method is more effective than tools such as cppcheck, flawfinder, and splint in detecting memory leak, double-free, and use-after-free vulnerabilities, although the range of vulnerabilities it can detect is limited. The research by Lee et al. (2006) points out the drawback of current rule-based vulnerability detection tools: a lack of adaptability to software in general, usually being limited to specific techniques and technologies. Based on this, the authors propose a model consisting of three components: an Analysis module, a Rule process, and a Detection module. The Analysis module and Rule process mitigate the mentioned problem but still exhibit many limitations in terms of complexity and time.

The research by Lee et al. (2006) also suggests that popular source code vulnerability detection tools only detect a limited set of weaknesses based on predefined rules. Applying ML and DL to vulnerability detection, however, enables attributes to be learned directly from the source code, which allows a wider range of vulnerabilities to be detected. The research by Harer et al. (2018) presents a data-driven approach to vulnerability detection using machine learning, specifically for C and C++ programs; its results show that the DL model outperforms traditional ML models with a ROC AUC of 0.87. The research by Tang et al. (2021) proposes an Extreme Learning Machine (ELM) model combined with doc2vec to represent the program; the authors demonstrate that doc2vec provides faster training and better generalization than word2vec. The research by Wu et al. (2020) emphasizes that source code attribute extraction is an extremely important step that significantly affects detection results; since previous methods had not fully exploited this aspect, the authors extracted attributes using CPG graphs for the vulnerability detection task. The research by Li et al. (2021) points out that current DL-based methods ignore the connections between semantic graphs and do not effectively process structured graph information, whereas applying GNN models provides deeper insight into the problem of security vulnerability detection. Their experimental results also demonstrate that graph-based models outperform sequence-based models, especially with semantic graphs, achieving up to 5.01% higher accuracy. The research by Wang et al. (2020) shows that regression networks such as the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) do not yield good results for vulnerability detection tasks: these models are designed for sequential data and are not suited to current source code representations based on complex graph forms such as AST, CFG, and CPG. Graph neural networks, by contrast, operate directly on graph data, enabling the complex relationships in this data type to be exploited to enhance detection efficiency.

3 Proposed model

3.1 Architecture of the proposed model

Figure 1 illustrates the architecture of the proposed model for detecting source code vulnerabilities. Based on the introductions in Sect. 1.2, along with Fig. 1, we can observe the operational principle of the proposed model as follows: Initially, the source code files undergo preprocessing in the "Data preprocessing" block to eliminate noise and remove redundant characters. Subsequently, the processed data is passed through the "Tokenization" block to convert the textual data into numerical vector format. The "Pretrain Model RoBERTa" block then employs algorithms to synthesize, normalize, and extract numerical vectors, resulting in a feature vector. Finally, supervised machine learning algorithms in the "Classification" block are employed to predict and detect normal source code files as well as source code files that contain security vulnerabilities. In the subsequent sections of the paper, we will provide detailed explanations of the operational principles, algorithms, and methods used in each block.

Fig. 1 Architecture of the source code vulnerability detection model proposed in the paper

3.2 Data preprocessing

The input source code may contain numerous redundant characters, excessive whitespace, and comments, all of which can impact the subsequent steps, so it must be processed first. The preprocessing operations on the source code include converting each .c file to a .txt file, reading the .txt file, eliminating redundant characters, line breaks, and comments, and saving the result to a .csv file along with the corresponding label for that code file. Once the source code has been cleaned, we can proceed to fine-tuning the tokenizer.
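As a concrete illustration, the following is a minimal preprocessing sketch under the assumptions above; the comment-stripping regex and the vulnerable/normal directory layout are illustrative stand-ins, not the exact implementation.

```python
import csv
import re
from pathlib import Path

# Naive removal of C/C++ line and block comments (does not handle
# comment markers inside string literals).
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

def clean_source(text: str) -> str:
    text = COMMENT_RE.sub(" ", text)            # strip comments
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace and line breaks

def append_to_csv(src_dir: str, label: int, out_csv: str) -> None:
    # Read each .c file, clean it, and store the code with its label.
    with open(out_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for path in Path(src_dir).glob("*.c"):
            writer.writerow([clean_source(path.read_text(errors="ignore")), label])

append_to_csv("vulnerable/", 1, "dataset.csv")  # hypothetical directory names
append_to_csv("normal/", 0, "dataset.csv")
```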

3.3 New tokenizer training

3.3.1 Introduction to transformer

The Transformer is an artificial neural network announced by Google in Vaswani et al. (2017). It is a neural network architecture designed to process sequential data, such as natural language (NL), more efficiently than previous models, and it serves various tasks in NLP and speech, such as machine translation, language generation, classification, entity recognition, speech recognition, and text-to-speech conversion. Unlike RNNs, however, the Transformer does not process the elements of a sequence one by one: if the input is a natural language sentence, the Transformer does not need to process the beginning of the sentence before the end. Because of this, the Transformer can take advantage of the parallel computing capabilities of GPUs and significantly reduce processing time. Figure 2 below depicts the architecture of the Transformer model.

Fig. 2 Architecture of the Transformer model

From Fig. 2, we can see that the architecture of the Transformer consists of 2 main phases:

  • Encoder: consists of 6 consecutive layers. Each layer includes a sub-layer called Multi-Head Attention combined with a fully connected layer, as described in the left branch of the encoder in the diagram. The model uses residual connections (Martínez Torres et al. 2019) around the two sub-layers to let gradients flow more easily through the model, improving its speed and performance. Normalization layers (Russell et al. 2018a) are then used to normalize the activations of each layer, reducing the variance of the output and making the network more robust to the effects of weight initialization. The output of each sub-layer is therefore LayerNorm(x + Sublayer(x)). At the end of the encoding process, we obtain an embedded output vector for each word. To enable these residual connections, all sub-layers and embedding layers have a size of \(d_{model} = 512\).

  • Decoder: the decoder also consists of 6 consecutive layers. Each decoder layer has sub-layers similar to those of the encoder, but with an additional first sub-layer called Masked Multi-Head Attention that removes future words from the attention process. Like the encoder, the decoder uses residual connections around the two sub-layers and normalization layers on the activations of each layer, for the same reasons: easier gradient flow, reduced output variance, and robustness to weight initialization.

Based on the Transformer's architecture depicted in Fig. 2, the Transformer operating principle can be divided into two main layers as follows:

  • Encoder layer: input embedding, self-attention, multi-head attention, and feed forward.

    • The input embedding encodes the input data as numerical structures such as vectors and matrices. Specifically, the word embedding matrix has a number of rows equal to the vocabulary size. The words in the sentence are looked up in this matrix and concatenated into the rows of a two-dimensional matrix carrying the meaning of each word. Word embeddings help represent the meaning of a word, but the same word in different positions of a sentence may have different meanings, and because the Transformer processes words in parallel, word embeddings alone give the model no knowledge of word positions. To address this issue, the input embedding performs a process called positional embedding. The general formula for the positional value of words in the sentence is as follows:

      $$PE_{{\left( {pos,2i} \right)}} = \sin \left( {pos/10000^{{\frac{2i}{{d_{model} }}}} } \right)$$
      (1)
      $$PE_{{\left( {pos,2i + 1} \right)}} = \cos \left( {pos/10000^{{\frac{2i}{{d_{model} }}}} } \right)$$
      (2)

      where \(pos\) is the position of the word in the sentence and \(i\) indexes the elements of the embedding vector of length \(d_{model}\). The resulting PE vector is then added to the embedding vector.

    • Self-attention: self-attention is a mechanism that allows the encoder to look at other words while encoding a specific word, so Transformers can understand the relationships between words in a sentence even when they are far apart. The decoder has a similar structure, but with an attention layer between encoder and decoder so that it can focus on relevant parts of the input. The input of the self-attention module consists of three vectors: Queries (Q), Keys (K), and Values (V). The query vector (Q) holds the information of the word being searched for and compared; the key vector (K) represents the information of the words being compared against it; and the value vector (V) represents the content and meaning of the words. From these three vectors, the attention vector Z for a word is calculated according to the following formula:

      $$Z = softmax\left( {\frac{{Q K^{T} }}{{\sqrt {d_{k} } }}} \right)V$$
      (3)

      where \(d_{k}\) is the dimension of the Q, K, and V vectors.

      In Vaswani et al. (2017), the operating principles and implementation steps of the self-attention network are presented in detail. In this paper, we also apply the self-attention network within the RoBERTa model to extract and build features of source code (a small numerical sketch of Eqs. (1)-(3) is given after this list).

    • Multi-head attention: when studying self-attention networks, we found that the attention of a word always "focuses" on itself, since a word is most related to itself, so its relationships with the surrounding components are not clearly expressed. To solve this problem, Vaswani et al. (2017) proposed multi-head attention. Specifically, the interaction between different words in the sentence is captured by using multiple attention heads instead of a single self-attention head, with each head attending to a different part of the sentence. Each head produces its own attention matrix; concatenating these matrices and multiplying them by the weight matrix \(W^{0}\), which is tuned during training, generates a single combined attention matrix (a weighted sum). The general formula for the multi-head layer is as follows (Vaswani et al. 2017):

      $$MultiHead\left( {Q,K,V} \right) = Concat\left( {head_{1} , \ldots ,head_{n} } \right).W^{0}$$
      (4)

      where

      $$head_{i} = Attention\left( {QW_{i}^{Q} , KW_{i}^{K} , V W_{i}^{V} } \right)$$
      (5)
    • Feed forward: The output vectors of the MultiHead layer after going through the Normalization layer will be fed through a fully connected network before being pushed to the Decoder. Because these vectors do not depend on each other, parallel computation can be utilized for the entire sentence.

  • Decoder layer: the decoding process is similar to encoding, except for the Masked Multi-Head Attention layer, which works like the Multi-Head Attention layer described above but with the decoder's input masked. In a language model that uses an encoder and a decoder, the decoder predicts the final output based on the input from the encoder; its purpose is to generate a token sequence corresponding to the target sentence or source code.
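As a numerical illustration of Eqs. (1)-(3), the following sketch computes sinusoidal positional encodings and scaled dot-product self-attention for a toy sequence; the sequence length and \(d_{model}\) are arbitrary, and the multi-head machinery of Eqs. (4)-(5) is omitted.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Eqs. (1)-(2): sin on even dimensions, cos on odd dimensions.
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    # Eq. (3): softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64)) + positional_encoding(10, 64)  # toy embeddings
Z = attention(x, x, x)   # self-attention: Q, K, V all derive from x
print(Z.shape)           # (10, 64)
```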

3.3.2 CodeT5

CodeT5 (Wang et al. 2021) is a DL model based on the Transformer that understands, discriminates, and generates code, built on the T5 architecture. It uses pre-training objectives to recognize and exploit important token information from code. Specifically, the T5 Seq2Seq objective is extended with two tasks, prediction and tagging of identifier tokens, to allow the model to better leverage token type information from programming languages, namely the identifier tokens designated by developers. To improve the link between natural language (NL) and programming language (PL), a dual-learning objective is used to translate in both directions between NL and PL. CodeT5 is built on an architecture similar to T5 but incorporates code-specific knowledge to help the model better understand source code. It takes source code and the accompanying comments as input sequences.

Because CodeT5 is built on the Transformer architecture, the basic encoder-decoder structure is unchanged, so the way code segments are tokenized does not change. The novelty of CodeT5 lies beyond tokenization, in the following four pre-training tasks that serve many types of NLP problems:

  • Masked span prediction (MSP) randomly masks arbitrary length spans and requires the decoder to recover the original input. It captures the syntax information of the NL and PL input and learns a powerful multilingual representation when pre-trained on multiple PLs using a shared model:

    $$L_{MSP} \left( \theta \right) = \mathop \sum \limits_{t = 1}^{k} - \log P_{\theta } \left( {x_{t}^{mask} |x^{\backslash mask} , x_{ < t}^{mask} } \right)$$
    (6)

    where \(\theta\) are the model parameters, \(x^{\backslash mask}\) is the masked input, \(x^{mask}\) is the masked sequence to be predicted by the decoder, \(k\) is the number of tokens in \(x^{mask}\), and \(x_{ < t}^{mask}\) is the span sequence generated so far.

  • Identifier Tagging is only applied to the encoder to distinguish whether each code token is an identifier (e.g. variable or function names) or not. It works similarly to the syntax highlighting feature in some developer tools.

    $$L_{IT} \left( {\theta_{e} } \right) = \mathop \sum \limits_{i = 1}^{m} - \left[ {y_{i} \log p_{i} + \left( {1 - y_{i} } \right)\log \left( {1 - p_{i} } \right)} \right]$$
    (7)

    where \(\theta_{e}\) is the encoder parameter. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code.

  • Masked Identifier Prediction (MIP) is the opposite of MSP: only identifiers are masked, and the same masked placeholder is used for all occurrences of a given identifier. It works similarly to source code deobfuscation in software engineering and is a more challenging task that requires the model to understand the semantic meaning of the code.

    $$L_{MIP} \left( \theta \right) = \mathop \sum \limits_{j = 1}^{\left| I \right|} - \log P_{\theta } \left( {I_{j} |x^{\backslash I} , I_{ < j} } \right)$$
    (8)

    where \(x^{\backslash I}\) is the masked input. Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on obfuscated code and link the occurrences of the same identifiers together.

  • Bimodal dual generation optimizes the conversion between code and its comments together, creating better links between NL and PL counterparts.

3.3.3 RoBERTa

3.3.3.1 Introduction to BERT

The Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al. 2018) enhances the attention mechanism to better understand language context. BERT is a stack of encoder blocks. The input text is tokenized as in the Transformer model, and each token is transformed into a vector at the output of BERT. The BERT model is trained using the masked language modeling and next sentence prediction objectives simultaneously.

Each training sample for BERT is a pair of sentences from a text, where the two sentences may or may not be adjacent in the text. A [CLS] token is added before the first sentence (to represent the class) and a [SEP] token is added after each sentence (as a separator). The two sentences are then concatenated into a token sequence to form a training sample. A small proportion of tokens in the training sample are masked with a special [MASK] token or replaced by a random token.

Before the training samples are fed into the BERT model, their tokens are transformed into embedding vectors and encoded for position; specifically for BERT, segment embeddings are also added to mark whether a token belongs to the first or second sentence. Each input token generates an output vector. The output corresponding to a masked token can reveal what the original token was, and the output corresponding to the [CLS] token at the beginning can reveal whether the two sentences are contiguous in the text. The BERT model can be used for many NLP tasks such as classification, question answering, named entity recognition, sentiment analysis, and text summarization.

3.3.3.2 Introduction to RoBERTa

RoBERTa (Liu et al. 2019) is a Facebook project that inherits the architecture and algorithm of the BERT model on the PyTorch framework (PyTorch is also a framework developed by Facebook). The project supports retraining BERT-style models on new datasets for languages beyond the most popular ones, and many pre-trained models for different languages have since been trained with RoBERTa. In terms of operating principle, RoBERTa has a training procedure similar to BERT's, with the change that the model is trained longer, with larger batch sizes, and on more data. As presented by Liu et al. (2019), BERT's original recipe, from which RoBERTa starts, uses two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Masked language modeling (MLM) is a type of language modeling that involves randomly masking certain tokens in a sentence and then asking a model to predict the original token given the context of the surrounding words.

The MLM formula can be expressed as follows:

$$P\left( {w_{i} \mid w_{1} ,w_{2} , \ldots ,w_{i - 1} ,w_{i + 1} , \ldots ,w_{n} } \right)$$
(9)

where P is the probability of the masked word \(w_{i}\) given the context of the other words in the sentence, \(w_{1}\) through \(w_{n}\). To calculate this probability, a neural network model is typically trained on a large corpus of text using maximum likelihood estimation (MLE). The model is trained to maximize the probability of the correct token being predicted given the masked input. During inference, the model is given a sentence with one or more masked tokens and generates a probability distribution over the possible words that could fill each mask; it then selects the word with the highest probability as the prediction for that mask.
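As a brief illustration of MLM inference, the following uses the Hugging Face fill-mask pipeline; the generic "roberta-base" checkpoint and the example sentence are stand-ins, not the fine-tuned model of this paper.

```python
from transformers import pipeline

# RoBERTa's mask token is written as <mask>.
fill = pipeline("fill-mask", model="roberta-base")
for candidate in fill("A buffer overflow is a common <mask> in C programs."):
    print(candidate["token_str"], round(candidate["score"], 3))
```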

NSP is a task in NLP where a model is trained to predict whether a given pair of sentences are contiguous or not. The NSP formula can be described as follows:

$$P\left( {is\_next \mid s_{1} ,s_{2} } \right)$$
(10)

where P is the probability of the second sentence \({\text{s}}_{2}\) being the next sentence following the first sentence \({\text{s}}_{1}\), and is_next is a binary variable that takes on the value 1 if \({\text{s}}_{2}\) is the next sentence and 0 otherwise. To compute this probability, a neural network model is typically trained using a binary cross-entropy loss function. The model is given pairs of sentences and labels indicating whether the second sentence is contiguous with the first sentence or not. The model then predicts a probability of the second sentence being contiguous based on the input sentences. During inference, the model is given a pair of sentences and generates a probability indicating whether the second sentence is likely to be contiguous with the first sentence or not. If the probability is above a certain threshold, the model predicts that the second sentence is the next sentence following the first sentence.

In addition, to improve the accuracy of word representations, RoBERTa removes the next sentence prediction task and trains on longer sequences. The model also applies dynamic masking to the training data, varying which tokens are hidden across training passes. Furthermore, research (Tang et al. 2021) has shown that RoBERTa improves the tokenization process. Specifically, RoBERTa uses a Byte Pair Encoding (BPE) tokenizer (Sennrich et al. 2015) with a vocabulary of about 50,000 tokens. BPE is a basic word compression technique that indexes all words, including open words (those not appearing in the dictionary), by encoding words as strings of subwords. Its operating principle rests on the observation that most words can be broken down into subcomponents: BPE counts the frequency with which subwords appear together and merges the most frequent pair. This merging process continues until no more subwords can be merged, yielding a set of subwords for the entire source code through which every word can be represented.
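To illustrate the merge loop just described, here is a toy character-level sketch: adjacent subword pairs are counted across the corpus and the most frequent pair is merged, repeatedly. The four-word corpus and merge budget are illustrative; RoBERTa's actual tokenizer applies the same idea at the byte level.

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    # Each word starts as a sequence of characters ending in </w>.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of subwords occurs.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# "str" emerges early because it recurs across strcpy/strlen/strcat.
print(bpe_merges(["strcpy", "strlen", "strcat", "memcpy"], num_merges=5))
```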

3.3.3.3 Using RoBERTa model to tokenize source codes

The RoBERTa model has been applied to many different NLP tasks. However, since source code has its own specific characteristics, adapting the RoBERTa model for source code feature extraction requires adjusting the model. The process involves dividing code lines or source code into tokens and encoding them as numbers to be processed by the model. For a source code repository containing many untrained characters, the tokenizer breaks an unknown word into smaller units, and the maximum length of a sequence is limited to 512 tokens. Since the default subword segmentation does not fully represent the information in source code, in this study we train a new tokenizer based on RoBERTa's BPE algorithm. The training process is as follows (Liu et al. 2019): first, a vocabulary is initialized, with each word in the source code repository represented as a combination of characters with the </w> token at the end to mark the end of a word. Then, the frequency of all token pairs in the vocabulary is counted, and the most frequent pairs are merged to create new character-level n-grams for the vocabulary. In this paper, the input dataset is about 3 GB, and processing is fast (3 min 16 s on a GPU). After tokenization, words and special characters in the source code are divided into tokens and stored in an array; in RoBERTa, special characters are marked with "#" to indicate their status. The tokenized output is fed into the trained RoBERTa tokenizer to obtain the corresponding tokens and token indices. The tokens are then passed through an embedding layer to convert them into feature vectors and through the encoder layers to train the model; the resulting output is fed into the RoBERTa encoder to build a sequence of feature vectors.
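A minimal sketch of this retraining step using the Hugging Face tokenizers library is shown below; the corpus path, the vocabulary size of 50,000, and the RoBERTa-style special tokens are assumptions for illustration.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on the source code corpus
# ("corpus/source_code.txt" is a hypothetical path to the cleaned code).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus/source_code.txt"],
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("code_tokenizer", exist_ok=True)
tokenizer.save_model("code_tokenizer")   # writes vocab.json and merges.txt

encoded = tokenizer.encode("char buf[16]; strcpy(buf, input);")
print(encoded.tokens)
print(encoded.ids)
```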

Figure a. An example of a source code segment after tokenization with both the old and the new tokenizer

In Table 1, the variations in token IDs between different embeddings can be attributed to the diverse methodologies employed by underlying embedding models. Each model, such as Word2Vec, GloVe, or BERT, adopts distinct approaches to numerical representation of words, capturing unique semantic relationships. These discrepancies can arise from variances in tokenization and text preprocessing techniques utilized by different models, affecting the assignment of token IDs. Additionally, contextual embeddings, like those generated by RoBERTa, consider not only the individual word but also its surrounding context, leading to nuanced differences in token representations. Training data discrepancies, stemming from diverse datasets or corpus sizes, contribute to the divergence in learned patterns and relationships between words. Moreover, hyperparameters chosen during model training, such as embedding dimensionality and context window size, play a crucial role in shaping token IDs.

Table 1 Illustration of some words in the source code after tokenization

3.4 Build a sequence of feature vectors

After obtaining the tokenized strings of the source code as presented in Sect. 3.3, we need to convert the sentences (of different lengths) into fixed-length real-valued vectors to serve the classification step. As mentioned earlier, the purpose of this phase is to build a feature profile of the source code using the RoBERTa model; in other words, to convert the sentences in the source code into numerical vectors that carry semantic and syntactic meaning. In this article, we propose using the pre-trained RoBERTa model. RoBERTa is a language model trained on a very large dataset with Transformer technology; it can learn language representations for words and sentences in NL, and it can likewise generate embeddings (contextual language representations) for words and sentences in source code. Each embedding is a vector representing a word in the model's vector space. To obtain these embeddings, the article feeds the encoded output of the RoBERTa tokenizer through the model, which returns a tensor of size (batch_size, sequence_length, hidden_size) containing the embedding of each token in the sentence. The output of the embedding process is thus a tensor of size (number of tokens x representation size), where each row is a contextual representation of one token in the source code segment. After the data passes through the encoder layers of the Transformer model, the output is a sequence of feature vectors with depth equal to the number of encoder layers, which represents the input sentences or source code in the feature space of the model.
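The following sketch illustrates this embedding step with the generic "roberta-base" checkpoint as a stand-in for the model used in the paper: the tokenizer's output is passed through the encoder, yielding one contextual vector per token.

```python
import torch
from transformers import RobertaTokenizerFast, RobertaModel

tok = RobertaTokenizerFast.from_pretrained("roberta-base")
enc = RobertaModel.from_pretrained("roberta-base").eval()

inputs = tok("int idx = get_index(); buf[idx] = 0;", return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs)

# One contextual embedding per token: (batch_size, sequence_length, hidden_size)
print(out.last_hidden_state.shape)
```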

3.5 Classification

ML algorithms are considered suitable classification algorithms for current classification models (Cho et al. 2022). Therefore, in this article, we also propose using supervised ML algorithms to predict source code vulnerabilities. The process of classifying source code with a supervised ML algorithm is as follows (Handa et al. 2019; Martínez Torres et al. 2019): the classification task starts with a training set of source code D = {d1,…, dn}, where each document \(d_{i}\) is labeled with a class \(c_{j}\) from a set of labels C = {c1,…, cm}. The goal is to build a classification model so that any given document \(d_{k}\) can be accurately assigned to one of the classes in C. In other words, the objective is to find the function f:

$$f: D \times C \to \left\{ {true,\;false} \right\}$$

$$f\left( {d,c} \right) = \begin{cases} true, & \text{if } d \text{ belongs to } c \\ false, & \text{otherwise} \end{cases}$$

Many algorithms can be used to classify source code. In this paper, we experimented with five algorithms: XGBoost (XGB) (Chen and Guestrin 2016), Decision Tree (DT), Logistic Regression (LR), Support Vector Machine (SVM) (Corinna and Vladimir 1995; Shai and Shai 2014), and Random Forest (RF) (Leo 2001). These algorithms have been evaluated by many studies as effective in classification tasks.
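A hedged sketch of this classification stage is shown below; synthetic 768-dimensional vectors stand in for the RoBERTa feature vectors of Sect. 3.4, and all classifiers use default hyperparameters rather than the tuned settings of Sect. 4.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for the RoBERTa feature vectors and labels.
X, y = make_classification(n_samples=500, n_features=768, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

for clf in (XGBClassifier(), DecisionTreeClassifier(),
            LogisticRegression(max_iter=1000), SVC(),
            RandomForestClassifier()):
    score = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{type(clf).__name__}: {score:.3f}")
```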

4 Experiment and evaluation

4.1 Description of the experimental dataset

Table 2 below summarizes the experimental datasets used to evaluate the effectiveness of the proposed method.

Table 2 Statistics on experimental data

Accordingly, the experimental dataset used in the paper is FFmpeg + QEMU (https://ffmpeg.org/download.html). This dataset is considered highly representative of real-world data, and current mainstream approaches often rely on it for vulnerability detection experiments. After preprocessing and vectorization, the source code dataset is trained using automatic classification algorithms. The dataset is split automatically, with around 80% (20,396 samples) used for training and the remaining 20% (5099 samples) used for testing.

4.2 Evaluation measurement

Below are four metrics and their formulas used in this paper to evaluate the effectiveness of the proposed model.

$$accuracy = \frac{TP + TN}{{TP + TN + FP + FN}} \times 100\%$$
$$precision = \frac{TP}{{TP + FP}} \times 100\%$$
$$recall = \frac{TP}{{TP + FN}} \times 100\%$$
$$F1 = \frac{{2 \times precision \times recall}}{{precision + recall}}$$

where:

  • Accuracy: the ratio of correctly classified samples to the total number of samples.

  • Precision: the ratio of true positive points to the total number of points classified as positive (TP + FP).

  • Recall: the ratio of true positive points to the total number of real positive points (TP + FN).

  • F1-score: the harmonic mean of precision and recall.

  • TP (true positive): the number of vulnerable samples classified correctly.

  • FN (false negative): the number of vulnerable samples classified as non-vulnerable.

  • TN (true negative): the number of non-vulnerable samples classified correctly.

  • FP (false positive): the number of non-vulnerable samples classified as vulnerable.
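For reference, the following snippet computes the four metrics from predicted labels using scikit-learn; the label vectors are toy stand-ins (1 = vulnerable, 0 = normal).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```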

4.3 Experiment scenario

To demonstrate the effectiveness of the model, the study conducted several scenarios as follows:

  • Scenario 1: an experiment detecting source code vulnerabilities using the RoBERTa model and ML algorithms such as XGB, DT, LR, SVM, and RF. The purpose of this scenario is to test and evaluate the effectiveness of the proposed method in building a feature profile of source code as well as predicting code vulnerabilities. During the experiment, the paper will adjust the parameters of the RoBERTa model and ML algorithms to find the best and most optimal prediction model.

  • Scenario 2: replace the RoBERTa model with other models. The purpose of this scenario is to compare the effectiveness of the RoBERTa model against other NLP methods and models. To this end, we use models such as BERT (Devlin et al. 2018) and CodeT5 (Wang et al. 2021) to analyze and extract source code features, and then use supervised ML algorithms to predict source code vulnerabilities.

  • Scenario 3: replace the ML algorithms with DL algorithms such as the MultiLayer Perceptron (MLP) and the Convolutional Neural Network (CNN). With this scenario, the paper aims to demonstrate the suitability of supervised ML algorithms for source code vulnerability classification tasks; to that end, we use DL models such as CNN and MLP to detect source code vulnerabilities.

  • Scenario 4: compare the proposed model in the paper with other approaches on the same dataset.

4.4 Experimental results

4.4.1 Experimental results of scenario 1

The experimental results in Table 3 show that changing the token length significantly affects the accuracy of the predictive model, although not uniformly. Specifically, when the token length is set to 32 or 64, there is little difference in the results. Moreover, at these lengths the experiments do not single out any classification algorithm as significantly superior: even supervised ML algorithms with many advantages and strong performance, such as SVM, RF, and XGB, do not outperform the more classical algorithms. This indicates that with short token lengths the RoBERTa model cannot adequately extract or summarize the features of the source code. Nevertheless, when the token length of RoBERTa is increased to 512, all metrics increase significantly, especially recall (rising from 61% with RoBERTa 32 to 77% with RoBERTa 512); the differences in the other metrics also fluctuate around 10%. Among the ML algorithms used for classification, the RF approach achieves better results than the other algorithms across all metrics. With an accuracy of 54.5% and a recall of 77%, the rate of missed vulnerable source code is low. This result is very meaningful for the task of detecting vulnerabilities in source code, because the most important requirement in this problem is not to miss source code containing vulnerabilities. Comparing the results across token lengths, RoBERTa 512 provides the best results on all metrics. Although in theory increasing the vector length should improve results, since more of each sample is represented, RoBERTa 1024 does not perform significantly better than RoBERTa 512. This may be because very few source code samples contain more than 512 tokens, so tokenizing with a length of 1024 merely appends many zeros, which does not help the model classify better. This shows that the RoBERTa model only delivers good performance at a reasonable token length: too short, and performance suffers; too long, and it becomes slow without delivering better results. Figure 3 below shows the ROC AUC graph of the model combining RoBERTa 512 with the RF algorithm.

Table 3 Experimental results of detecting source code vulnerabilities using RoBERTa
Fig. 3 ROC graph of the RoBERTa 512 and RF model

The graph in Fig. 3 shows that the True Positive Rate (TPR), i.e., recall, reached 77% (the highest among all scenarios), indicating that the proposed model works very effectively on the experimental dataset for detecting security vulnerabilities in source code. Figure 4 below shows the test results of the RoBERTa 512 model with the RF algorithm.

Fig. 4 Confusion matrix of the RoBERTa 512 and RF model

Based on the confusion matrix in Fig. 4, the model combining RoBERTa 512 with RF operates very effectively on the experimental dataset for detecting source code vulnerabilities. Specifically, the model correctly detected 1951 vulnerable source code files out of a total of 2532, and it also correctly predicted 1916 normal source code files. For general classification problems, the results in Fig. 4 would be considered modest; for the task of classifying source code vulnerabilities, however, this result is very good, as few research directions in source code vulnerability detection achieve the efficiency of the proposed model.

4.4.2 Experimental results of scenario 2

As mentioned above, in this scenario the paper replaces the RoBERTa model with other NLP methods, namely BERT (Devlin et al. 2018) and CodeT5 (Wang et al. 2021). Table 4 below shows some experimental results of this scenario.

Table 4 Experimental results of detecting source code vulnerabilities using some alternative methods to RoBERTa

From the results in Table 4, the general trend of the combined models using NLP and ML algorithms is that results improve as the token length increases; when the token length is short, the ML algorithms can hardly demonstrate their classification effectiveness. Comparing the BERT and CodeT5 models, their results are roughly equivalent. The BERT model is most effective with the RF algorithm and a token length of 256, with Accuracy, Precision, Recall, and F1_score values of 46%, 44.7%, 74.8%, and 56%, respectively. The CodeT5 model combined with SVM gives its best results at a token length of 256. CodeT5 is more effective than BERT only on the Recall measure (by about 1.1%), while it is inferior on the other measures. Comparing Table 3 with Table 4, it is clear that CodeT5's tokenization of source code is not as effective as RoBERTa's in Scenario 1: with the same ML algorithm and corresponding parameters, the experimental results are not as good as RoBERTa's. The Recall scores are close (75% with CodeT5 versus 77% with RoBERTa), but the other scores, such as Accuracy and Precision, are much lower. One possible reason is that CodeT5 is designed for source-related problems such as code generation and code detection, so its dictionary and tokenization algorithms are designed differently from RoBERTa's; hence tokenizing source code and training a classifier on it may be less effective than with RoBERTa. Figure 5 below shows the ROC curve of CodeT5 combined with the RF model at the best-performing parameters.

Fig. 5 ROC graph of the CodeT5 and RF model

Comparing the ROC curve in Fig. 5 with the ROC curve in Fig. 3, the scenario of using CodeT5 in place of RoBERTa for source code tokenization did not perform well. Specifically, with an AUC of 50%, the model performs poorly at separating normal source code from source code containing security vulnerabilities. Figure 6 below presents the test results of the model combining CodeT5 and RF through the values of the confusion matrix.

Fig. 6 Confusion matrix of the CodeT5 and RF model

Based on the confusion matrix in Fig. 6, the CodeT5 model used in place of RoBERTa 512 in Scenario 2 did not perform as well as in Scenario 1. The model correctly predicted only 1847 vulnerable source code files out of a total of 2532, an accuracy of about 73% (4% lower than in Scenario 1). In addition, the number of normal code segments misclassified as containing vulnerabilities was very high, at 2252 out of a total of 2990, far worse than the 1916 correctly classified normal files in Scenario 1.

4.4.3 Experimental results of scenario 3

In this scenario, the paper replaces the ML algorithms with DL algorithms, namely CNN and MLP. Table 5 below shows the experimental results of this scenario.

Table 5 Experimental results of source code vulnerability detection using some DL models

The results of Scenario 3 in Table 5 indicate that when the ML algorithms are replaced with DL algorithms, the prediction results for normal source code and for source code containing security vulnerabilities change significantly, and also vary with the token length of the RoBERTa model. For each RoBERTa parameter setting, MLP is more effective than CNN on all metrics. Comparing Table 5 with Table 3, even though RoBERTa is still used to tokenize the source code, using DL algorithms for classification does not match the results of the ML algorithms: the recall only reaches 55.5%, much lower than in Scenario 1, although some metrics such as accuracy and precision are slightly higher. Overall, the numbers are balanced with each other, because during training the parameters are optimized to fit the labels overall rather than to favor a specific class, and the experimental dataset is also fairly balanced between the two labels, so the model's predictions fluctuate in the range of 50-55%. This result is not good, because almost half of the vulnerable source code would be overlooked while the correct prediction rate for it is only just over 50%. This is also clearly reflected in the ROC curve in Fig. 7, where the AUC is 50%, similar to Scenario 2.

Fig. 7 ROC graph of the RoBERTa 512 and CNN model

For the confusion matrix, it is unsurprising that TPR, FPR, TNR, and FNR are all at similar levels. The model correctly predicted only 1308 of the 2532 vulnerable source code samples, lower than in both previous scenarios. However, the number of normal source code samples wrongly predicted as containing security vulnerabilities is only 1601 out of 2532, the lowest among the three scenarios. Comparing the results in Fig. 8 with Figs. 4 and 6, it is clear that combining RoBERTa with a DL network did not yield effective results.

Fig. 8 Confusion matrix of the RoBERTa 512 and CNN model

4.4.4 The experimental results of scenario 4

In this scenario, we compared the approach proposed in this study with some other approaches on the same FFmpeg + QEMU dataset. The experimental results of this scenario are presented in Table 6.

Table 6 Experimental results of some other approaches of the models on the same dataset

As shown in Table 6, the proposed method demonstrates superior effectiveness compared to other approaches in accurately classifying source code vulnerabilities. Specifically, with a recall of 77.4%, our study outperforms Devign (Zhou et al. 2019), Russell (Russell et al. 2018a, 2018b), SySeVR (Li et al. 2018b), and VulDeePecker (Li et al. 2018a), whose recall values range from 13 to 50%. On the F1_score measure, the experimental results show that the proposed model is more effective by 5% to 26%. On the Precision measure, the results are relatively comparable. These results once again affirm that improving the effectiveness of source code vulnerability detection based on RoBERTa and ML algorithms is correct and reasonable.

4.5 Discussion

Based on the experimental results from the four scenarios above, it can be seen that the proposed approach in this paper has achieved good effectiveness, meeting the objectives stated in the paper. Such results were obtained through the flexible integration of various components in the proposed model.

4.5.1 The suitability of RoBERTa for the source code feature extraction task

We believe there are two reasons for these good results. First, RoBERTa has outstanding advantages over other embedding models. Based on the analyses above, RoBERTa possesses advantages that other NLP techniques do not offer:

  • Training process: RoBERTa utilizes a large amount of text data and a longer training time. This allows the model to learn more sophisticated language representations, resulting in improved performance on various NLP tasks. Using these pre-trained models to train on the proposed dataset in the paper optimizes performance and enhances the ability to detect vulnerabilities in source code.

  • Masked language model (MLM): During the training process, certain code tokens are masked, and the model needs to predict them based on the surrounding context. This helps the model learn how code tokens interact with each other, as well as the structure and meaning of the code segment. MLM enables RoBERTa to comprehensively represent code tokens. By predicting masked tokens, the model must understand the meaning of the token in context and its correlation with other code tokens. This helps the model learn accurate representations of code tokens and utilize them for code predictions and classifications.

  • Dynamic masking: Dynamic masking in RoBERTa has several important benefits. First, it diversifies the training process by varying the masked patterns, preventing the model from relying solely on the specific position information. This helps the model learn more generalized representations. Second, dynamic masking enhances generalization by challenging the model to understand and predict masked code tokens based on the surrounding context. It also helps avoid overfitting by not allowing the model to memorize specific position information or static patterns. Finally, dynamic masking improves contextual understanding by requiring the model to rely on the surrounding context to predict masked code tokens, enabling a better understanding of the correlations between code tokens in the code segment.

  • Fine-tuning: Fine-tuning allows RoBERTa to adapt to specific tasks through training on task-specific data, such as the detection of security vulnerabilities in code segments presented in this paper. This process enhances RoBERTa's performance on those tasks by refining its understanding of the requirements and improving its ability to process the specific language. Additionally, fine-tuning transfers previously learned knowledge, using RoBERTa's existing language understanding to adapt quickly and achieve good performance on new tasks with less data. Given these advantages, we believe that the pre-trained capabilities of RoBERTa improve the performance of the source code vulnerability detection task, and the experimental results in the paper demonstrate this assertion; a minimal feature-extraction sketch follows this list.
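
To make this concrete, the following is a minimal sketch of extracting a fixed-size feature vector for a code snippet with a pre-trained RoBERTa model via the Hugging Face transformers library; the checkpoint name (roberta-base) and the pooling choice (the first-token hidden state) are illustrative assumptions, not necessarily the paper's exact configuration.

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

code = "int copy(char *dst, const char *src) { strcpy(dst, src); return 0; }"

# Truncate/pad to the 512-token limit discussed in the paper
inputs = tokenizer(code, truncation=True, max_length=512,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the first token's hidden state as the feature vector (shape: (1, 768))
feature_vector = outputs.last_hidden_state[:, 0, :]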

The second reason is the suitability of the dataset's characteristics. Table 7 below describes the features of the experimental dataset.

Table 7 Statistics on the length of source code

Given these snippet lengths, RoBERTa is well suited to extracting features from the source code.
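
As a hedged sketch of how the length statistics in Table 7 can be computed, the snippet below counts RoBERTa tokens per code segment; the two inline snippets are toy stand-ins for iterating over the real dataset.

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Toy stand-ins; in practice, iterate over the FFmpeg + QEMU samples
snippets = [
    'int add(int a, int b) { return a + b; }',
    'void log_msg(const char *m) { printf("%s\\n", m); }',
]
lengths = [len(tokenizer(s)["input_ids"]) for s in snippets]

print("min:", min(lengths), "max:", max(lengths),
      "mean:", sum(lengths) / len(lengths),
      "over 512:", sum(l > 512 for l in lengths))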

4.5.2 The suitability of machine learning for source code vulnerability detection tasks

The experimental results show that machine learning algorithms yield very good results in accurately detecting source code vulnerabilities, which demonstrates that choosing machine learning algorithms is correct and reasonable. There are several reasons why machine learning is more effective than deep learning methods here:

Data suitability for classification algorithms: deep learning models perform well when given a large amount of labeled data because they can automatically learn complex patterns and representations. However, the dataset used in this paper contains only 25,495 samples, so deep learning models may generalize poorly due to the risk of overfitting. Machine learning algorithms, by contrast, remain effective with limited data by leveraging feature engineering and statistics. Furthermore, deep learning models, especially deep neural networks with many layers, are essentially black boxes: they learn complex hierarchical representations, which makes it difficult to explain how specific input features contribute to the results. Machine learning models such as decision trees or linear regression provide clear rules or coefficients because they rely on explicit mathematics and statistics.

Relevance of the input data: deep learning models have the advantage of automatically learning features from raw data, minimizing the need for manual feature engineering, and can extract useful representations from unprocessed input. In this paper, however, the input to the classification models consists of features already extracted by the RoBERTa model. Deep learning models therefore re-learn and transform these features, which can discard information produced by RoBERTa. Machine learning models, on the other hand, have a significant advantage when the input has already been extracted or transformed by complex processing: they simply apply rules, mathematical formulas, and statistics to compute the classification results. This is why the deep learning models are less effective than the machine learning models in this setting.
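
The following minimal sketch illustrates this division of labor by training a classical scikit-learn classifier on pre-extracted feature vectors; the random matrix stands in for real RoBERTa features, and the choice of classifier is an illustrative assumption.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder for real RoBERTa feature vectors of shape (n_samples, 768)
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 768))
y = rng.integers(0, 2, size=400)  # 0 = normal, 1 = vulnerable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))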

4.5.3 Threats to validity

The experimental results show that the proposed model indeed provides good effectiveness for accurately detecting source code vulnerabilities: the recall and F1_score metrics outperform other approaches. However, the accuracy and precision metrics are lower than those of the RoBERTa model combined with deep learning. This indicates that, although the proposed model effectively addresses the task of detecting vulnerable source code, it still misclassifies some normal source code. There are three reasons for these results. First, the lack of data: the experimental data in the paper is limited, which makes training RoBERTa more challenging. Second, the machine learning algorithms do not provide mechanisms that help RoBERTa represent or highlight the important and meaningful information in the feature vector of the source code. In our research, we noticed that the normal code segments in this experimental dataset often have highly asymmetric lengths. Through observation and analysis, we found that the average length of normal code segments is 482, with the longest segment being 90,858 and the shortest 13. The number of segments longer than 512 is 2975 (21.48%) out of 13,852 normal segments. Thus, although long segments (over 512) account for less than a quarter of the data, some segments in this group have abnormal lengths compared to the rest. The model must therefore reduce the dimensionality of the data when extracting features, losing information for these long segments, as the retained data may not contain the information needed for classification. Deep learning models are more effective in this case.

Similarly, for source code containing vulnerabilities, there are cases where the vulnerability lies in a position that is difficult to identify even for humans, or where a very long code segment contains only a short vulnerable portion located in a hard-to-detect position. This can confuse the model during classification. For example, one source code segment contains 72,257 tokens, but the maximum token length allowed by our proposed model is 512, so the model must truncate the trailing tokens. Crucially, the vulnerability may lie in the truncated part, about which the model has no information, leading to incorrect predictions.
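
A minimal sketch of this truncation effect follows; the synthetic function and its trailing strcpy call are assumptions chosen only to show that tokens beyond the 512-token limit, where the flaw sits, are dropped.

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# A long synthetic function whose flaw (strcpy) sits at the very end
long_code = ("void f(char *in) {\n"
             + "    int x = 0;\n" * 300
             + "    char buf[8]; strcpy(buf, in);\n}")

ids_full = tokenizer(long_code)["input_ids"]
ids_trunc = tokenizer(long_code, truncation=True, max_length=512)["input_ids"]

print(len(ids_full), "tokens before truncation")   # well over 512
print(len(ids_trunc), "tokens after truncation")   # exactly 512
# The strcpy statement falls in the truncated tail, so the model never
# sees the vulnerable code and cannot flag it.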

In the future, to address this issue, we will apply new methods and techniques to optimize the feature extraction and classification processes. Specifically, for feature extraction, because source code segments have varying lengths, we can process each function of the source code separately and then compute and represent the correlations between functions using algorithms such as attention mechanisms, inference… For the vulnerability classification process, we can combine additional data balancing techniques with algorithms such as representation learning and contrastive learning.

4.5.4 The scientific significance of the research results

This article makes two scientific contributions to the problem of detecting source code vulnerabilities, as follows:

Firstly, it offers a completely new solution for selecting and extracting source code features. As mentioned above, traditional approaches extract source code in two directions: NLP methods such as Word2Vec and Doc2Vec…, or graph representations such as AST, CPG, and CFG… In this article, we also use NLP to represent source code, but not through traditional methods; instead, we use RoBERTa. Based on the explanations in Sect. 4.5.1, this is a reasonable and scientific proposal. The scientific contribution of the article therefore lies in providing solutions for extracting source code features. Specifically, the article presents several new choices to replace existing ones, including BERT, RoBERTa, and CodeT5. These solutions have been widely applied in other fields but are still novel in the context of source code extraction. Based on the characteristics of the source code, these solutions can be selected to fit practical needs: for source code segments shorter than 512 tokens, CodeT5 or BERT is recommended; for longer source code, especially segments exceeding 512 tokens, RoBERTa should be used.

Secondly, it provides a scientific foundation for source code classification solutions. Based on the experimental results, the article offers two choices for the source code classification task: deep learning models and supervised machine learning algorithms. Their advantages and disadvantages are analyzed in Sect. 4.5.2, which provides a scientific basis for selecting a classification algorithm. The article addresses when to choose machine learning algorithms and when to choose deep learning models; two aspects should be considered when selecting a classification solution: the characteristics of the experimental dataset and the research objective. Furthermore, based on the analysis in Sect. 4.5.3, the article introduces several new models and solutions that can be used to optimize and improve the accuracy of source code classification.

In addition to the scientific contributions listed above, the research results also have great practical significance for the source code vulnerability research community. Specifically:

  • For researchers, this study provides a new approach and solution for selecting and extracting source code features, especially for code segments of great length. Based on our proposal, they have an additional, experimentally supported basis for using RoBERTa in feature extraction tasks for source code, malicious code, mobile applications (malicious APKs), phishing and spear-phishing emails, text summarization… Previously, to extract the characteristics and behavior of these objects, researchers typically focused on extracting functions (for malicious code and malicious APKs) or text segments (for phishing and spear-phishing emails and text summaries) and used traditional text embedding techniques such as Word2vec, Doc2vec, BERT, CodeT5… These traditional embedding techniques can only handle small pieces of text; for long documents, they cannot synthesize all the information, losing important attributes and meaning and thereby degrading the classification results.

  • For developers, based on the results of this research, they can use the RoBERTa model trained for the C and C++ languages to continue developing models for other languages such as Java, Python… In addition, developers can use the trained RoBERTa model to build functions, standards, safety libraries, and extensions, and then integrate them into programming platforms such as Visual Studio or Dev-C… With this approach, programming support tools will not only detect faulty source code segments but can also detect segments that violate security policies.

  • For tool builders, the article provides an additional technique for building not only source code vulnerability detection tools but also tools for other text processing tasks. In practice, to detect source code vulnerabilities quickly and accurately, tools often rely on detection techniques based on previously known vulnerabilities (available CVEs and CWEs). With this approach, tools cannot detect new vulnerabilities that have not yet been defined or disclosed. However, tool builders can add the RoBERTa model trained for the C and C++ languages to their tools as a second detection layer. The vulnerability detection tool then operates through two successive layers: the first performs vulnerability scanning against CVE and CWE entries, so known vulnerabilities are detected quickly and accurately; the suspected source code is then passed to the second layer, which detects new vulnerabilities using our trained RoBERTa model. By combining these two layers (see the sketch after this list), tools not only detect known vulnerabilities quickly and accurately but also detect some new vulnerabilities through the model we propose. In addition, tool builders can use this article's approach to develop text processing tools for other tasks such as detecting phishing and spear-phishing emails, text summarization, chatbots, detecting malicious code, and classifying malicious APKs.

  • For educators, minimizing the risks of information insecurity and source code vulnerabilities requires training and raising human awareness. The results of this research can therefore be used by educators when training human resources to minimize source code vulnerabilities. Specifically, educators can collaborate with developers and tool builders to create safe functions, libraries, and extensions for programming support tools such as Visual Studio or Dev-C, based on the RoBERTa model trained for the C and C++ languages. When users finish writing code and want to run and check it for bugs, these libraries or extensions can run in parallel with the other libraries of the programming environment, supporting programmers in checking for errors and source code vulnerabilities. With this approach, users such as students, developers, and tool builders can quickly discover source code vulnerabilities, such as faulty code segments, thereby minimizing errors and malicious software…
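
As a hedged sketch of the two-layer tool architecture described above, the following outlines how a signature scan could hand suspect code to the trained model; the pattern list and all function names are illustrative assumptions, and the model layer is stubbed out so the sketch runs standalone.

# Toy stand-in for CVE/CWE signature scanning (layer 1)
KNOWN_BAD_PATTERNS = ["strcpy(", "gets(", "sprintf("]

def layer1_known_vulns(code: str) -> bool:
    """Fast scan for previously catalogued vulnerability patterns."""
    return any(p in code for p in KNOWN_BAD_PATTERNS)

def layer2_model_predict(code: str) -> bool:
    """Stub for the RoBERTa feature extraction + ML classification step."""
    return False  # placeholder so the sketch is self-contained

def detect(code: str) -> str:
    if layer1_known_vulns(code):
        return "known vulnerability (layer 1)"
    if layer2_model_predict(code):
        return "suspected new vulnerability (layer 2)"
    return "no vulnerability detected"

print(detect("char buf[8]; strcpy(buf, user_input);"))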

5 Conclusion and future development directions

Detecting source code vulnerabilities always poses many challenges for system analysis and evaluation. In this paper, to improve the effectiveness of the source code vulnerability detection process, we proposed a model combining RoBERTa and ML algorithms. Through the experimental scenarios, we have demonstrated the correctness and scientific significance of the paper. Specifically, scenarios 2 and 3 showed the superior effectiveness of the proposal compared with other combined models: only the combination of RoBERTa with ML algorithms achieves the most effective results. In addition, in scenario 4, when comparing the proposed model with other research directions and papers, the proposed model clearly provides significantly better results, especially on the recall metric. With these results, we believe the RoBERTa model performs very well in synthesizing and extracting source code features, giving the ML algorithms a basis for accurately classifying normal code and code with security vulnerabilities. These experimental results open a new approach to detecting source code vulnerabilities based on NLP techniques and ML algorithms; previously, NLP models were mainly applied to natural spoken and written language rather than to programming languages. In the future, to improve this research, we believe three main issues need to be addressed: i) apply clustering and classification methods to standardize source code before inputting it into NLP models; ii) improve NLP models to better suit programming languages instead of using them as-is; iii) use graph representation techniques for source code to establish correlations between code components and thereby find and synthesize more meaningful information and features.