Abstractive Text Summarization

One of the challenges in natural language processing and understanding is text generation, which can be applied, for example, to text summarization. The task consists of understanding the information conveyed by a given text and reducing it to a concise summary that contains the main important and relevant information. In this context, several studies have been conducted to build extractive summaries based on a frequentist approach. The main idea of those algorithms is to set up a scoring system that assigns high values to the sentences assumed to be important and vice versa. However, the complexity lies in conceptualizing a model that goes through the whole text, keeps the important information in its memory and skips noisy parts such as redundancies and details. In short, the model must create a context vector from a long text.

A model obviously performs better on a condensed text that contains only concise information with little noise. In this context, we can extract the important parts of a given text with an extractive summarizer and use them as the input of the abstractive summarization model. I chose to work with a TF-IDF vectorizer to reduce the size of the text and keep the most important sentences according to the following scoring system. To test this step, we chose a text about “Artificial intelligence” from Wikipedia, since its condensed version will then be the input of the summarization model.
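The exact scoring formula is not reproduced here; a minimal sketch of this extractive step, assuming the score of a sentence is the mean TF-IDF weight of its terms (an illustrative assumption, not necessarily the exact rule used), could look like this:

```python
# Minimal sketch of the TF-IDF extractive step. The scoring rule
# (mean TF-IDF weight per sentence) is an assumption for illustration.
import numpy as np
from nltk.tokenize import sent_tokenize          # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_extract(text, n_sentences=5):
    sentences = sent_tokenize(text)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(matrix.mean(axis=1)).ravel()   # one score per sentence
    keep = sorted(np.argsort(scores)[-n_sentences:])   # top sentences, original order
    return " ".join(sentences[i] for i in keep)

# e.g. condensed = tfidf_extract(wikipedia_article_text, n_sentences=10)
```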

Reading the generated summary, we notice that there is little coherence between the sentences, even though they will be fed into the abstractive summarization model built next. If we leave this unaddressed, we introduce biased data into our model: the model then has to build its context vector from an incomprehensible, non-homogeneous text, and we consequently get a biased summary.

As mentioned before, the dataset consists of Amazon customer reviews. It contains about 500,000 reviews with their summaries, which would require a huge training capacity. Limited by this constraint, we fix the number of rows to 100,000 so that training stays manageable. Once we load the data, we feed it into a preprocessing stage that consists of several steps. First, we drop duplicated rows as well as missing values in order to ensure that all observations are distinct and complete. Then, we define a processing function that bundles several cleaning subfunctions.
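The exact subfunctions are not listed here; a minimal sketch of that kind of loading and cleaning pipeline, assuming the usual "Text" and "Summary" column names of the Amazon reviews file (file path and column names are assumptions), might look like:

```python
# Sketch of the loading and cleaning steps; 'Reviews.csv', 'Text' and
# 'Summary' are assumed names for illustration.
import re
import pandas as pd

data = pd.read_csv("Reviews.csv", nrows=100000)         # cap at 100,000 rows
data = data.drop_duplicates(subset=["Text"]).dropna()   # distinct, complete rows

def clean(text):
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters only
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

data["clean_text"] = data["Text"].apply(clean)
# Summaries get the -Start- / -END- markers used later by the decoder.
data["clean_summary"] = data["Summary"].apply(lambda s: "-Start- " + clean(s) + " -END-")
```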

Below are some samples of customer reviews and their summaries after preprocessing. As you can notice, each summary starts with a -Start- token and ends with an -END- token to mark its beginning and its end.

After cleaning the data, we proceed to the modeling part, which consists of building an analytics model on top of the tokenized data.

For text tokenization, we use the Keras tokenizer through three functions, illustrated in the snippet below.

The fitted tokenizer exposes a word-to-index dictionary, so every word gets a unique integer value. 0 is reserved for padding. A lower integer means a more frequent word (often the first few entries are stop words because they appear so often).

We take as an example: [“The earth is an awesome place live”,”Earth protection is our common goal”]

The position where padding or truncation happens is determined by the arguments padding and truncating, respectively.
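The three calls involved are presumably Tokenizer.fit_on_texts, texts_to_sequences and the pad_sequences utility; applying them to the example above gives something like:

```python
# Tokenizing and padding the two example sentences with the Keras utilities.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["The earth is an awesome place live",
             "Earth protection is our common goal"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                # builds the word -> index dictionary
print(tokenizer.word_index)                      # frequent words get the lowest indices

sequences = tokenizer.texts_to_sequences(sentences)          # words -> integer ids
padded = pad_sequences(sequences, maxlen=8,
                       padding="post", truncating="post")    # pad/cut to a fixed length
print(padded)                                    # 0s fill the unused positions
```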

We are going to implement a seq2seq model on the reviews and summaries preprocessed before. Seq2seq is a deep learning model that converts an input sequence into an output sequence. Our input and output are denoted by X and Y respectively; X refers to the customer review and Y refers to its summary.

The purpose of seq2seq is to understand and model the probability of generating the output Y when the input sequence X is given. Since the model generates the output word by word, this conditional probability can be seen as the probability of generating the j-th word given the words of the summary generated so far and the input.

The seq2seq consists of 2 processing steps:

a- Generating the fixed size vector Z from the given input sequence X: Z=f(X)

b- Generating the summary output from the fixed size vector Z.

K: the function that generates the hidden vector hj

g: the function that calculates the generative probability of the one-hot vector yj, which reflects the recurrent aspect mentioned above.
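Putting the pieces together, the factorization described above can be written as follows (this exact notation is an assumption based on the definitions of f, K and g):

\begin{align}
  P(Y \mid X) &= \prod_{j=1}^{|Y|} P\!\left(y_j \mid y_1,\dots,y_{j-1},\, X\right), \\
  Z &= f(X), \qquad h_j = K\!\left(h_{j-1},\, y_{j-1},\, Z\right), \\
  P\!\left(y_j \mid y_{<j},\, X\right) &= g\!\left(h_j,\, y_{j-1},\, Z\right).
\end{align}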

2. Model Architecture:

The general architecture of seq2seq consists of 2 major parts: Encoder and Decoder. Each one of them is constructed by a combination of layers in a given order.

The encoder consists of two layers: the embedding layer and the recurrent layer(s), and the decoder consists of three layers: the embedding layer, the recurrent layer(s), and the output layer.

The encoder recurrent layer generates the hidden state hi from the embedding vectors; F refers to an activation function, usually non-linear (sigmoid, tanh, …).

The decoder embedding layer converts each word in the output sequence into an embedding vector.

The decoder recurrent layer generates the hidden vectors from the embedding vectors.

The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector.
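One common way to write these layer equations, with weight names assumed here for illustration:

\begin{align}
  \text{Encoder recurrent layer:} \quad & h_i = F\!\left(W^{(e)} x_i + U^{(e)} h_{i-1}\right), \\
  \text{Decoder recurrent layer:} \quad & s_j = F\!\left(W^{(d)}\, e(y_{j-1}) + U^{(d)} s_{j-1}\right), \\
  \text{Decoder output layer:}    \quad & P\!\left(y_j \mid y_{<j},\, X\right) = \operatorname{softmax}\!\left(W^{(o)} s_j\right).
\end{align}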

After feeding the input sequences into the encoder part, the model has to recognize all the important parts of the text and skip all noisy information including redundancy.

Consequently, “Attention” is an interface between the encoder and decoder that provides the decoder with information from every encoder hidden state. With this setting, the model is able to selectively focus on useful parts of the input sequence and hence, learn the alignment between them. This helps the model to cope effectively with long input sentences.

RMSprop optimizer:

In order to optimize the cost function of our seq2seq model, we opted for the RMSprop optimizer, mainly for its adaptive, per-weight learning rate.

Each weight is updated separately, using the following quantities:

𝜂: Initial learning rate

𝑣𝑡: Exponential average of squares of gradients

𝑔𝑡: Gradient at time t along 𝑤𝑗
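For reference, the per-weight update these symbols describe is the standard RMSprop rule (β and ε, the decay rate and the numerical-stability constant, are added here by assumption):

\begin{align}
  v_t &= \beta\, v_{t-1} + (1-\beta)\, g_t^{2}, \\
  w_{t+1} &= w_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t.
\end{align}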

We trained the seq2seq model on 100,000 reviews and their summaries. In the encoder part, we used 3 stacked LSTMs followed by the attention layer. We fit the model on the preprocessed data using the early stopping technique to avoid overfitting: if the model does not improve after an epoch, we stop training it.
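A minimal sketch of this setup in Keras, with illustrative layer and vocabulary sizes (the hyperparameters below are assumptions, not the values actually used):

```python
# Sketch: 3 stacked encoder LSTMs, dot-product attention, early stopping.
from tensorflow.keras import layers, Model, callbacks

x_vocab, y_vocab, emb_dim, units = 30000, 8000, 128, 256
max_text_len = 100

# --- Encoder: embedding + 3 stacked LSTMs ---
enc_in = layers.Input(shape=(max_text_len,))
enc_emb = layers.Embedding(x_vocab, emb_dim)(enc_in)
e1 = layers.LSTM(units, return_sequences=True)(enc_emb)
e2 = layers.LSTM(units, return_sequences=True)(e1)
enc_out, state_h, state_c = layers.LSTM(units, return_sequences=True,
                                        return_state=True)(e2)

# --- Decoder: embedding + LSTM initialised with the encoder states ---
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(y_vocab, emb_dim)(dec_in)
dec_out, _, _ = layers.LSTM(units, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])

# --- Attention over the encoder hidden states ---
context = layers.Attention()([dec_out, enc_out])
dec_concat = layers.Concatenate()([dec_out, context])

# --- Output layer: probability of each target word at every step ---
out = layers.TimeDistributed(layers.Dense(y_vocab, activation="softmax"))(dec_concat)

model = Model([enc_in, dec_in], out)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

# Early stopping: halt when the validation loss no longer improves.
es = callbacks.EarlyStopping(monitor="val_loss", patience=1)
# model.fit([x_tr, y_tr[:, :-1]], y_tr[:, 1:], validation_data=...,
#           epochs=50, batch_size=128, callbacks=[es])
```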

The training phase consists of setting the weights of the layers, which then allows us to make accurate predictions on unseen reviews by inference. Below are the predicted summaries we generated on new customer reviews:

Even though the original and predicted summaries do not match word for word, both of them convey the same meaning and give us a general idea of the customer's judgement of the product. However, we need to quantify this closeness between the original summaries and the predicted ones.

In this part, we will evaluate the quality of our predictions using a semantic similarity approach: we will quantify how close in meaning the predicted and the original summaries are using a pretrained word2vec model.

Suppose we want to calculate the similarity between these two sentences:

S1 = ‘obama speaks media illinois’ and S2 = ‘president greets press chicago’

It is true that these sentences do not share the same words, but they may still convey the same information. At this stage, each sentence is represented as a matrix, and the question that arises is how to match the similar words belonging to each sentence. For example, how could you associate “Obama” with “president”?

Consequently, we use the Word Mover's Distance (WMD) optimization algorithm, inspired by the classical Earth Mover's Distance problem. It considers transferring every word of sentence 1 to sentence 2, because the algorithm does not know in advance that “obama” should map to “president”. In the end, it picks the assignment with the minimum transportation cost for moving every word from sentence 1 to sentence 2.

We assume that we have a vocabulary of n words {1, . . . , n} and a collection of documents. We assume that the sets of distinct words of D and D′ are, respectively, {w_1, …, w_|D|} and {w′_1, …, w′_|D′|}. Moreover, we use D_i to denote the normalized frequency of w_i, that is, the number of occurrences of w_i in D divided by the total number of words in D. We use D′_j to refer to the normalized frequency of w′_j analogously. Note that these normalized frequencies each sum to one: Σ_i D_i = Σ_j D′_j = 1.

x_i denotes the embedding of word i in a vector space of dimension d, and c(i, j) denotes the Euclidean distance between the embeddings of words i and j, that is, c(i, j) = ‖x_i − x_j‖.

Our vocabulary: V = {1: ‘John’, 2: ‘likes’, 3: ‘algorithms’, 4: ‘Mary’, 5: ‘too’, 6: ‘also’, 7: ‘data’, 8: ‘structures’}

The sets of distinct words of D and D’ are:

Therefore, the normalized document representations are:

where c(i, j) is the cost of transporting word i to word j.
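For completeness, the transportation problem these quantities define is the standard WMD linear program (T_{ij}, the amount of word i's mass moved to word j, is a name assumed here for the flow matrix):

\begin{align}
  \mathrm{WMD}(D, D') \;=\; \min_{T \,\ge\, 0} \; & \sum_{i=1}^{|D|} \sum_{j=1}^{|D'|} T_{ij}\, c(i,j) \\
  \text{subject to} \quad & \sum_{j=1}^{|D'|} T_{ij} = D_i \;\; \forall i,
  \qquad \sum_{i=1}^{|D|} T_{ij} = D'_j \;\; \forall j.
\end{align}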

After loading the pretrained vectors, we solve this optimization problem, which finally gives us the semantic similarity between each review and both its original and predicted summaries.
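In practice, a library such as gensim can handle both the pretrained vectors and the optimization. A minimal sketch of the sentence example above (the downloader model name is one of gensim's published pretrained word2vec models; wmdistance() additionally needs the POT or pyemd package):

```python
# Sketch of the WMD computation with gensim's pretrained word2vec vectors.
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")       # pretrained word2vec (large download)

s1 = "obama speaks media illinois".split()
s2 = "president greets press chicago".split()

distance = w2v.wmdistance(s1, s2)                # lower distance = closer in meaning
print(f"WMD = {distance:.4f}")
```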

The variations of semantic similarity between the reviews and their original and predicted summaries are shown in blue and orange respectively. We can notice that the two quantities vary almost identically.

If we look at the semantic similarity between the reviews and their summaries, both original and predicted, the box plots below show that the two quantities have almost the same distribution. In fact, 50% of the predicted summaries are about 52% semantically similar to their reviews, while 50% of the original summaries are about 48% similar in terms of meaning to their own reviews.

The summaries generated by the seq2seq model and the techniques listed above capture the product the customer is reviewing and their judgement in a concise, short sentence. In fact, the model is able to go through the entire text, keep the important information, notably the product and the judgement, and skip the noisy details, thanks to recurrent neural networks such as LSTMs. However, the model sometimes generates biased summaries, especially when customers start comparing the product with another one that serves the same purpose or is a substitute. In that case, the model misidentifies the main product the customer is reviewing, and the judgement as well. We can overcome this problem by training the model on more data and by questioning the assumptions already made, especially in the data preprocessing step.
