Bidirectional Long Short-Term Memory and Word Embedding Feature for Improvement Classification of Cancer Clinical Trial Document

In recent years, deep learning methods have become increasingly popular, especially for big data, which is very large in size and requires accurate prediction. One example of such data is the text of cancer clinical trial documents. Clinical trials are studies involving human participants that aim to improve people's safety and health. The aim of this paper is to classify cancer clinical trial texts from a public data set. The proposed approach combines Bidirectional Long Short-Term Memory (BiLSTM) with a Word Embedding (WE) feature. The study contributes a new classification model for clinical trial documents and improves classification performance. Two experiments were conducted: BiLSTM without WE, and BiLSTM with WE. BiLSTM without WE achieved accuracy = 86.2, precision = 85.5, recall = 87.3, and F1 score = 86.4. BiLSTM with WE performed considerably better on clinical trial text, with accuracy = 92.3, precision = 92.2, recall = 92.9, and F1 score = 92.5.


Introduction
Clinical trials (CT) are essential for determining the safety and effectiveness of medical treatments. These trials form the basis of clinical practice guidelines [1] and greatly assist medical teams in health practice and in making treatment-related decisions. Clinical documents containing clinical statements require a review of whether clinical trial results are eligible (feasible or unfit). The eligibility criteria used in clinical trials are very restrictive, so this review calls for a computerized process, namely text classification [2] [3].
Text classification is considered a fundamental problem in the information sciences. Known as an effective method for organizing and managing text information, it is widely employed in information sorting [4], sentiment analysis [5] [6], spam filtering [7] [8], clinical text [9], and other fields. Deep learning is regarded as an effective approach to classification [10]. Moreover, an increasing number of scholars have applied commonly used models, including the conditional random field [11] and the recurrent neural network (RNN), to text classification [12]. Recently, neural networks have had tremendous success in text classification and are showing more significant advances than other models. One of the challenges of building a text classification model is how to capture features for different units of text, such as phrases, sentences, and documents. Owing to its iterative structure, the RNN is particularly suitable for processing text of variable length [13] [14]. Because its recurrent structure can maintain information, an RNN can better integrate information in a given context [15]. To avoid the exploding or vanishing gradient problem of a standard RNN, long short-term memory (LSTM) [16] [17] was designed to improve how information is remembered and accessed. LSTM has shown a remarkable ability in natural language processing. Regarding performance and computation, Zuheros [18] described the challenges of one deep learning method.
One challenge of deep learning is highlighted by Vincent Menger [19], who states that deep learning techniques applied to clinical text classification can in some cases produce conclusions in line with expectations, but that results differ when tested on datasets from different clinical settings, domains, and sizes. They propose a neural network architecture to improve system performance and computing efficiency. Zhipeng Jiang [20] used LSTM and obtained good accuracy, but with low computing performance. BiLSTM is a development of LSTM. Beakcheol Jang [21] used BiLSTM to improve accuracy in text classification. Other studies discuss word embedding features [22] [23], which can also improve text classification. For instance, Zeynep H. Kilimci used word embedding with Random Forest and CNN [24], Lihao Ge used word embedding with Naive Bayes and support vector machine [25], and Duyu Tang used word embedding with support vector machine and N-gram [26]. Jasmir et al. used a bidirectional long short-term memory recurrent neural network to improve eligibility classification of clinical trial documents [27]. Several studies apply BiLSTM in other settings: Guixian Xu [28] discussed the use of BiLSTM in sentiment analysis, while Annisa Darmawahyuni [28] performed myocardial infarction classification with BiLSTM. The latter is a preliminary study containing only a brief analysis and plan; however, it presents another point of view on processing cardiac rhythm across timesteps with a deep learning approach.

Research Methods
The labeled clinical text dataset was augmented by inserting words or data about several other types of cancer for each criterion. In this paper, this insertion is called the word embedding process. The result of the word embedding process is integrated with the TF-IDF algorithm to produce a weighted word vector. The weighted word vector is fed into the BiLSTM, which yields a BiLSTM model for this clinical dataset. The next stage evaluates classification performance with a confusion matrix. The authors compare the evaluation results of BiLSTM without Word Embedding and BiLSTM with Word Embedding.

A. Construction of Weighted Word Vector
In this paper, the Word2vec model is used to obtain a distributed word representation. Word2vec comprises two models, CBOW and Skip-gram. Its advantage is the ability to predict words based on the distribution of their context [29]. Each model contains an input layer, a projection layer, and an output layer. In the CBOW model, the word wk is predicted from its context, which is stated in formula (1). In contrast, the Skip-gram model predicts the context from the target word wk.
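As a minimal sketch of the two training setups, the snippet below takes the context of a word wk to be the words within a fixed window around it, then forms CBOW-style (context, target) examples and Skip-gram-style (target, context word) pairs. The window size, the helper names, and the sample tokens are illustrative assumptions and are not taken from the paper.

```python
def context(tokens, k, window=2):
    """Return the words within `window` positions of tokens[k], excluding tokens[k]."""
    left = tokens[max(0, k - window):k]
    right = tokens[k + 1:k + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: each example is (context words, target word to predict)."""
    return [(context(tokens, k, window), tokens[k]) for k in range(len(tokens))]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: each example is (target word, one context word to predict)."""
    return [(tokens[k], c)
            for k in range(len(tokens))
            for c in context(tokens, k, window)]


# Hypothetical tokenized criterion, for illustration only.
tokens = ["patients", "with", "metastatic", "breast", "cancer"]
cbow = cbow_examples(tokens, window=1)
pairs = skipgram_pairs(tokens, window=1)
```

The same window produces both kinds of training example; only the direction of prediction differs.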
TF-IDF is the most commonly used weight calculation method in text categorization. It considers both the frequency of a word within a document and the distribution of the word across documents, so that features useful for classification are reflected in the weight. The TF-IDF formula is given in formulas (2) and (3), where the term frequency is the frequency of the word ti in the document d, N is the number of documents, and the document frequency is the number of documents in which the word ti appears. The word vectors are then weighted using these TF-IDF values.
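One common way to realize such a weighting is sketched below. It assumes term frequency is the relative frequency of a word in a document, inverse document frequency is log(N / n) with n the number of documents containing the word, and each word vector is scaled by the product of the two. The exact variant used by the authors may differ, and the toy corpus, the 2-dimensional embeddings, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Relative frequency of `term` in the tokenized document `doc`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """log(N / n), with N documents in total and n documents containing `term`."""
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n)


def weighted_vectors(doc, docs, embeddings):
    """Scale each word's embedding by that word's TF-IDF weight in `doc`."""
    out = []
    for term in doc:
        w = tf(term, doc) * idf(term, docs)
        out.append([w * x for x in embeddings[term]])
    return out


# Hypothetical toy corpus and 2-dimensional embeddings (assumed values).
docs = [["breast", "cancer", "stage"], ["lung", "cancer"], ["breast", "biopsy"]]
embeddings = {
    "breast": [0.2, 0.7], "cancer": [0.9, 0.1], "stage": [0.4, 0.4],
    "lung": [0.1, 0.8], "biopsy": [0.5, 0.3],
}
vectors = weighted_vectors(docs[0], docs, embeddings)
```

Words that occur in fewer documents, such as "stage" here, receive a larger weight than words spread across the corpus, such as "cancer".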

B. Word Embedding
Word embedding is a technique for converting a word into a vector or array consisting of a set of numbers.
With word embedding, a word can be converted into a numeric vector that is small in size yet carries a large amount of information, enough for the vectors to capture meaning. Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network, which is why the technique is often grouped with deep learning [30].
In simple terms, word embedding is the process of converting text into numbers. Most machine learning algorithms and deep learning architectures cannot analyze input in the form of strings or text, so they require numbers as input. A simple example of converting a word into a numeric vector is as follows. Given the sentence "Word Embeddings are Word Converted into numbers", a dictionary contains a list of all its unique words: ["Word", "Embeddings", "are", "Converted", "into", "numbers"].
One-hot encoding generates a vector in which 1 marks the position of the word and 0 marks all other positions. With the dictionary above, the vector representation of the word "numbers" is [0,0,0,0,0,1] and that of the word "Converted" is [0,0,0,1,0,0]. This is one example of representing text as numbers.
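The one-hot example can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the sentence is written with "Embeddings" (matching the dictionary) and without its final period so that splitting on whitespace yields the six dictionary entries.

```python
def build_vocab(sentence):
    """List the unique words of the sentence in order of first appearance."""
    vocab = []
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)
    return vocab


def one_hot(word, vocab):
    """Vector with 1 at the word's position in the vocabulary and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


vocab = build_vocab("Word Embeddings are Word Converted into numbers")
numbers_vec = one_hot("numbers", vocab)      # [0, 0, 0, 0, 0, 1]
converted_vec = one_hot("Converted", vocab)  # [0, 0, 0, 1, 0, 0]
```

Each one-hot vector is as long as the dictionary, which is why dense embeddings are preferred once the vocabulary grows.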

C. BiLSTM Layer
Long Short-Term Memory (LSTM) [31] evolved from the recurrent neural network (RNN). The main idea is to add gates to the recurrent neural network to control the data passing through it. In general, the LSTM architecture consists of memory cells, input gates, output gates, and forget gates. The LSTM takes the form of a chain built from repeated neural network modules. The memory cell carries the stored information along the chain, and the three gates control whether information is added to or blocked from the memory cell.
The LSTM transition functions are defined as follows:
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \]
\[ o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{9} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{10} \]
Here $\sigma$ is the logistic sigmoid function, with output in [0, 1]; $\tanh$ is the hyperbolic tangent function, with output in [-1, 1]; and $\odot$ denotes element-wise multiplication. At the current time $t$, $x_t$ is the input, $c_t$ is the memory cell, $h_t$ is the hidden state, $f_t$ is the forget gate, $i_t$ is the input gate, and $o_t$ is the output gate. $W_i$, $W_o$, and $W_f$ denote the weights of the input, output, and forget gates (written with an additional subscript for the term they multiply, as in $W_{xo}$), while $b_i$, $b_o$, and $b_f$ are the corresponding biases. BiLSTM is an extension of the unidirectional LSTM: it adds a second hidden layer that processes the sequence in the opposite temporal order to the first. Because of this structure, BiLSTM can use information from both the past and the future.
Therefore, BiLSTM is adopted in this paper to capture the information in the text input. The general BiLSTM architecture is shown in Figure 2.
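To make the gate mechanics and the two reading directions concrete, the sketch below runs a toy LSTM cell with hidden size 1, so every weight is a scalar. Several choices are assumptions: the input and forget gates use the standard sigmoid form, the output gate carries a term on the current cell state as formula (9) appears to, and all parameter values and names are hypothetical. It is not the authors' implementation, which would use learned matrices in a deep learning framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with hidden size 1, so every weight is a scalar."""
    # Input and forget gates: standard sigmoid form (assumed).
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["b_i"])
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["b_f"])
    # Cell state: keep part of the old memory, add gated new content.
    c = f * c_prev + i * math.tanh(p["w_xc"] * x + p["w_hc"] * h_prev + p["b_c"])
    # Output gate with a term on the current cell state, then the hidden state.
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["w_co"] * c + p["b_o"])
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, p):
    """Run the cell over a sequence and return the hidden state at each step."""
    h, c, hs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hs.append(h)
    return hs


def bilstm(xs, p_fwd, p_bwd):
    """Pair forward states with backward states computed on the reversed input."""
    forward = run_lstm(xs, p_fwd)
    backward = run_lstm(xs[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))


# Hypothetical parameters and inputs; a real model learns the parameters.
names = ["w_xi", "w_hi", "b_i", "w_xf", "w_hf", "b_f",
         "w_xc", "w_hc", "b_c", "w_xo", "w_ho", "w_co", "b_o"]
p_fwd = {n: 0.5 for n in names}
p_bwd = {n: -0.3 for n in names}
states = bilstm([0.1, 0.4, -0.2], p_fwd, p_bwd)
```

Each element of `states` pairs a forward state, which has seen the inputs up to that position, with a backward state, which has seen the inputs from that position to the end.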

D. Accuracy, Precision, Recall, and F1 Score
Measures of classification performance can be defined based on the confusion matrix [32], as shown in Table 1.
The confusion matrix compares the classification results produced by the system (model) with the actual classes. It takes the form of a table that describes the performance of the classification model on a set of test data whose true values are known. Accuracy is the proportion of all test documents that are classified correctly. Precision is the proportion of documents predicted as positive that are actually positive.
Recall is a measure of the success of a system in finding and retrieving relevant information, that is, the proportion of actual positives that the system identifies. Recall and precision can carry different weights depending on the situation. The F-Measure, or F1 score, is an evaluation measure from information retrieval that combines the two and expresses the trade-off between them: it is the weighted harmonic mean of recall and precision.
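The four measures follow directly from the counts in a binary confusion matrix. The sketch below uses the usual definitions, with the F1 score as the equally weighted harmonic mean; the function name and the counts are hypothetical and are not the paper's data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts.

    Assumes every denominator is non-zero (no guard for empty classes).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts for a binary eligible / not-eligible task.
accuracy, precision, recall, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 score by excelling at only one of precision or recall.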

Model Evaluation and Validation
This study uses public free-text data about cancer from the National Library of Medicine. The data were taken from ClinicalTrials.gov, with a total of 500,000 records.
In this paper, two experiments are carried out for text classification. The first experiment used the BiLSTM model without Word Embedding (WE). The second experiment, the proposed Bidirectional LSTM (BiLSTM) with Word Embedding (WE), combined the two (see Table 3). The difference in evaluation results between BiLSTM without WE and BiLSTM with WE on the clinical trial text dataset is presented in Table 4 and Figure 5.
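As a quick consistency check, the F1 scores can be recomputed from the precision and recall values reported in the abstract, assuming the F1 score is the harmonic mean of the two. The function and variable names below are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1_without_we = round(f1_score(85.5, 87.3), 1)
f1_with_we = round(f1_score(92.2, 92.9), 1)
```

Both recomputed values agree with the reported F1 scores of 86.4 for BiLSTM without WE and 92.5 for BiLSTM with WE.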

Discussion
The aim of this paper is to address the lack of a standard word embedding in NLP (compared to vision) by introducing a series of simple operations that may serve as a basis for future investigations. Given the rate at which NLP research has progressed in recent years, we suspect that researchers will soon discover a higher-performance word embedding technique that is also easy to use.
In particular, much of the recent work in NLP has focused on making neural models larger or more complex, whereas this study introduces simple operations. We hope that it can inspire new approaches to universal or task-specific word embedding.

Conclusion
In this paper, a BiLSTM model with WE is introduced for a clinical text case. The experiments compare BiLSTM without WE against BiLSTM with WE and find that the evaluation scores improve when WE is used. Both models were applied to the classification of clinical trial texts and their results were compared; our model proved more suitable for clinical text. Our evaluation also shows that the BiLSTM with WE model achieves excellent results and outperforms various basic models.