In this diagram, {s0, s1, …, sN} are the words of the caption we are trying to predict, and {wes0, wes1, …, wesN-1} are the word embedding vectors for each of those words. The LSTM's outputs {p1, p2, …, pN} are the probability distributions the model produces for the next word, given the words of the sentence so far. The model is trained to maximize the sum of the log-probabilities of the caption's words.

def build_model(self):
    # declare the placeholders for our extracted image feature vectors, our caption,
    # and our mask (describes how long our caption is with an array of 0/1 values
    # of length `maxlen`)
    img = tf.placeholder(tf.float32, [self.batch_size, self.dim_in])
    caption_placeholder = tf.placeholder(tf.int32, [self.batch_size, self.n_lstm_steps])
    mask = tf.placeholder(tf.float32, [self.batch_size, self.n_lstm_steps])

    # get an initial LSTM embedding from our image embedding
    image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias

    # set the initial state of our LSTM
    state = self.lstm.zero_state(self.batch_size, dtype=tf.float32)

    total_loss = 0.0
    with tf.variable_scope("RNN"):
        for i in range(self.n_lstm_steps):
            if i > 0:
                # if this isn't the first iteration of our LSTM we need to get the
                # word embedding corresponding to the (i-1)th word in our caption
                with tf.device("/cpu:0"):
                    current_embedding = tf.nn.embedding_lookup(
                        self.word_embedding, caption_placeholder[:, i-1]) + self.embedding_bias
            else:
                # if this is the first iteration of our LSTM we use the embedded
                # image as our input
                current_embedding = image_embedding
            if i > 0:
                # allows us to reuse the LSTM tensor variables on each iteration
                tf.get_variable_scope().reuse_variables()

            out, state = self.lstm(current_embedding, state)

            if i > 0:
                # get the one-hot representation of the next word in our caption
                labels = tf.expand_dims(caption_placeholder[:, i], 1)
                ix_range = tf.range(0, self.batch_size, 1)
                ixs = tf.expand_dims(ix_range, 1)
                concat = tf.concat([ixs, labels], 1)
                onehot = tf.sparse_to_dense(
                    concat, tf.stack([self.batch_size, self.n_words]), 1.0, 0.0)

                # perform a softmax classification to generate the next word in the caption
                logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias
                xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logit, labels=onehot)
                xentropy = xentropy * mask[:, i]

                loss = tf.reduce_sum(xentropy)
                total_loss += loss

        total_loss = total_loss / tf.reduce_sum(mask[:, 1:])
        return total_loss, img, caption_placeholder, mask
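For orientation, a minimal training loop over the tensors returned by build_model could look like the sketch below. It is illustrative only: `model` (an instance of the captioning class), `get_batches`, `features`, `captions`, and the hyperparameters are assumed names, not part of the tutorial's code.

import tensorflow as tf

# `model` is an assumed instance of the captioning class defined above
total_loss, img, caption_placeholder, mask = model.build_model()
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(total_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(100):
        # `get_batches`, `features`, and `captions` are hypothetical helpers/data
        for feat_batch, cap_batch, mask_batch in get_batches(features, captions):
            # mask_batch holds 1.0 at real caption positions and 0.0 at padding,
            # so padded timesteps contribute nothing to the masked loss above
            _, loss_value = sess.run(
                [train_op, total_loss],
                feed_dict={img: feat_batch,
                           caption_placeholder: cap_batch,
                           mask: mask_batch})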
Generating Captions via Inference

After training, we have a model that, given an image and the words of a caption produced so far, can give the probability of each word appearing next. How, then, can we use this network to generate new captions?

The simplest approach is to take an input image and iteratively output the next most probable word, building up a single caption.

def build_generator(self, maxlen, batchsize=1):
    # same setup as the `build_model` function
    img = tf.placeholder(tf.float32, [self.batch_size, self.dim_in])
    image_embedding = tf.matmul(img, self.img_embedding) + self.img_embedding_bias
    state = self.lstm.zero_state(batchsize, dtype=tf.float32)

    # declare a list to hold the words of our generated caption
    all_words = []
    with tf.variable_scope("RNN"):
        # in the first iteration we have no previous word, so we directly pass in
        # the image embedding and set `previous_word` to the embedding of the
        # start token ([0]) for the future iterations
        output, state = self.lstm(image_embedding, state)
        previous_word = tf.nn.embedding_lookup(self.word_embedding, [0]) + self.embedding_bias

        for i in range(maxlen):
            tf.get_variable_scope().reuse_variables()

            out, state = self.lstm(previous_word, state)

            # get a one-hot word encoding from the output of the LSTM
            logit = tf.matmul(out, self.word_encoding) + self.word_encoding_bias
            best_word = tf.argmax(logit, 1)

            with tf.device("/cpu:0"):
                # get the embedding of the best_word to use as input to the next
                # iteration of our LSTM
                previous_word = tf.nn.embedding_lookup(self.word_embedding, best_word)

            previous_word += self.embedding_bias

            all_words.append(best_word)

    return img, all_words

In many cases this works, but by greedily picking the most probable word at each step we may not end up with a coherent, fluent caption.
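Once the generator graph is built, producing a caption comes down to a single session run followed by mapping word indices back to vocabulary tokens. The sketch below is illustrative: `model`, `checkpoint_path`, `ixtoword` (an index-to-word dictionary), and `feat` (one image's extracted feature vector, shaped [1, dim_in]) are assumed names, not from the original code.

import tensorflow as tf

img_ph, generated_words = model.build_generator(maxlen=15)  # `model` is assumed

with tf.Session() as sess:
    tf.train.Saver().restore(sess, checkpoint_path)  # `checkpoint_path` is assumed
    # each element of the result is a length-1 array holding one word index
    word_ids = sess.run(generated_words, feed_dict={img_ph: feat})
    caption = ' '.join(ixtoword[int(ix)] for ix in word_ids)  # `ixtoword` is assumed
    print(caption)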
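A common remedy for this shortsightedness is beam search: rather than committing to the single most probable word at each step, keep the k highest-scoring partial captions and extend each of them. The tutorial code above does not implement this; the sketch below only illustrates the idea, assuming a hypothetical function next_word_log_probs(prefix, feat) that returns the model's log-probabilities over the vocabulary for the next word.

import numpy as np

def beam_search(feat, next_word_log_probs, beam_width=5, maxlen=15, end_token=0):
    # each beam entry is a pair: (summed log-probability, word-id sequence)
    beams = [(0.0, [])]
    for _ in range(maxlen):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == end_token:
                # finished captions pass through unchanged
                candidates.append((score, seq))
                continue
            log_probs = next_word_log_probs(seq, feat)  # hypothetical model call
            # extend this prefix with its `beam_width` most probable next words
            for w in np.argsort(log_probs)[-beam_width:]:
                candidates.append((score + log_probs[w], seq + [int(w)]))
        # keep only the `beam_width` best partial captions overall
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])[1]

With beam_width=1 this degenerates to the greedy procedure above, so it strictly generalizes greedy decoding at the cost of roughly beam_width times more model evaluations per step.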