Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks

Amirmohammad Rooshenas, Dongxu Zhang, Gopal Sharma, and Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA 01003
{pedram, dongxuzhang, gopalsharma, mccallum}@cs.umass.edu
Abstract

In structured output prediction tasks, labeling ground-truth training output is often expensive. However, for many tasks, even when the true output is unknown, we can evaluate predictions using a scalar reward function, which may be easily assembled from human knowledge or from non-differentiable pipelines. But searching through the entire output space to find the best output with respect to this reward function is typically intractable. In this paper, we instead use efficient truncated randomized search in this reward function to train structured prediction energy networks (SPENs), which provide efficient test-time inference using gradient-based search on a smooth, learned representation of the score landscape, and which have previously yielded state-of-the-art results in structured prediction. In particular, this truncated randomized search in the reward function yields previously unknown local improvements, providing effective supervision to SPENs and avoiding their traditional need for labeled training data.

1 Introduction

Structured output prediction tasks are common in computer vision, natural language processing, robotics, and computational biology. The goal is to find
a function from an input vector x to multiple coordinated output variables y. For example, such coordination can represent constrained structures, such as natural language parse trees, foreground-background pixel maps in images, or intertwined binary labels in multi-label classification. Structured prediction energy networks (SPENs) (Belanger and McCallum, 2016) address such tasks. This paper considers settings where the reward function is derived from human domain knowledge (Maes et al., 2009), or where it is the result of evaluating a non-differentiable pipeline over the predicted output (Sharma et al., 2018). In these settings, the reward function is often non-differentiable or has a low-quality continuous relaxation (or surrogate),
making end-to-end training inaccurate with respect to the task loss. Interestingly, we can also rely on easily accessible human domain knowledge to develop such reward functions, since one can easily express output constraints with which to evaluate structured outputs (e.g., predicted outputs are penalized if they violate the constraints). For example, in dependency parsing each sentence should have a verb, so parse outputs without a verb can be assigned a low score. More recently, Rooshenas et al. (2018) introduced a method that uses such reward functions to supervise the training of SPENs by leveraging rank-based training and SampleRank (Rohanimanesh et al., 2011). Rank-based training shapes the energy landscape such that the energy ranking of alternative pairs of outputs y is consistent with their score ranking under the reward function.

The key question is how to sample the pairs of y's for ranking. We don't want to train on all pairs, because we would waste the energy network's representational capacity on ranking many unimportant pairs irrelevant to inference (nor could we tractably train on all pairs if we wanted to). We do, however, want to train on pairs that lie in regions of the output space that are misleading for gradient-based inference as it traverses the energy landscape toward its target. Previous methods have sampled pairs guided by the thus-far-learned energy function, but the flawed, preliminarily-trained energy function is a weak guide on its own. Moreover, reward functions often include many wide plateaus containing most of the sampled pairs, especially at early stages of training, thus providing no supervision signal.

In this paper we present a new method providing efficient, light supervision of SPENs with margin-based training. We describe a new method of obtaining training pairs using a combination of the model's energy function and the reward function. In particular, at training time we run the test-time energy-gradient inference procedure to obtain the first element of the pair; then we obtain the second element using randomized search driven by the reward function to find a local true improvement over the first. Using this search-guided approach we have successfully performed lightly-supervised training of SPENs with reward functions and improved accuracy over previous state-of-the-art baselines.

2 Structured Prediction Energy Networks

A SPEN parametrizes
the energy function Ew(y, x) using deep neural networks over the input x and output variables y, where w denotes the parameters of the deep neural networks. SPENs rely on parameter learning to find the correlations among variables, which is significantly more efficient than learning the structure of factor graphs. One can still add task-specific bias to the learned structure by designing the general shape of the energy function. For example, Belanger and McCallum (2016) separate the energy function into global and local terms. The role of the local terms is to capture the dependency between the input x and each individual output variable yi, while the global term aims to capture long-range dependencies among output variables. Gygli et al. (2017) define a convolutional neural network over the joint input and output.

Inference in SPENs is defined as finding argmin_{y ∈ Y} Ew(y, x) for a given input x. Structured outputs are represented using discrete variables, however, which makes inference an NP-hard combinatorial optimization problem. SPENs achieve efficient approximate inference by relaxing each discrete variable to a probability simplex over the possible outcomes of that variable. In this relaxation, the vertices of the simplex represent the exact values. The simplex relaxation reduces the combinatorial optimization to a continuous constrained optimization that can be solved numerically using either projected gradient descent or exponentiated gradient descent, both of which return a valid probability distribution for each variable after every update iteration. In practice, we found that exponentiated gradient descent, with updates of the form

yi^{t+1} = (1/Zi^t) · yi^t · exp(−η ∂E/∂yi)

(where Zi^t is the partition function of the unnormalized distribution over the values of variable i at iteration t), improves the convergence of inference and finds better outputs. This is in agreement with similar results reported by Belanger et al. (2017) and Hoang et al. (2017). Exponentiated gradient descent is equivalent to defining yi = Softmax(Ii), where Ii are the logits corresponding to variable yi, and performing gradient descent on Ii but with gradients taken with respect to yi (Kivinen and Warmuth, 1997).
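As an illustration, the exponentiated-gradient update above can be sketched on a toy two-variable problem; the energy function, its analytic gradient, and the step size η here are illustrative assumptions, not the paper's actual models.

```python
import math

def energy_grad(y):
    # Analytic gradient of a toy "agreement" energy
    # E(y) = -(y[0][0]*y[1][0] + y[0][1]*y[1][1]) w.r.t. each relaxed variable.
    return [[-y[1][0], -y[1][1]], [-y[0][0], -y[0][1]]]

def eg_inference(y, eta=1.0, steps=50):
    # Exponentiated-gradient descent on the simplex relaxation:
    # y_i^{t+1} = (1/Z_i^t) * y_i^t * exp(-eta * dE/dy_i)
    for _ in range(steps):
        g = energy_grad(y)
        new_y = []
        for yi, gi in zip(y, g):
            unnorm = [p * math.exp(-eta * gk) for p, gk in zip(yi, gi)]
            z = sum(unnorm)                  # partition function Z_i^t
            new_y.append([u / z for u in unnorm])
        y = new_y
    return y

# Start from near-uniform simplex points; each iterate remains a valid
# distribution, and inference sharpens toward an agreeing configuration.
y = eg_inference([[0.6, 0.4], [0.55, 0.45]])
```

Note that, as the text states, every update returns a valid probability distribution for each variable, so no projection step is needed.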
(See Appendix B for a comparison of reward margin values.) Of course, many randomized search procedures are possible, both simple and complex. In the experiments of this paper we find that a simple randomized search works
well: we start from the gradient-descent inference output; iteratively select a random output variable and uniformly sample a new state for the selected variable; if the reward increases by more than the margin, return the new sample; if the reward increases by less than the margin, similarly change an additional randomly selected variable; and if the reward decreases, undo the change and begin the sampling again. (If readily available, domain knowledge could be injected into the search to better explore the reward function; this is a target of future work.) We truncate the randomized search by bounding the number of times that it can query the reward function to evaluate structured outputs for each input x at every training step. As a result, the search procedure may not be able to find a local improvement (this may also happen if ỹ is already near-optimal), in which case we simply ignore that training example in the current training iteration. Note that the next time we visit an ignored example, the inference procedure may provide a better starting point, or the truncated randomized search may find a local improvement. In practice we observe that, as training continues, the truncated randomized search finds local improvements for every training point (see Appendix C).
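The truncated search just described can be sketched as follows; the reward signature, the flat integer encoding of outputs, and the budget value are illustrative assumptions.

```python
import random

def truncated_search(y0, reward, num_states, margin, budget=50, rng=random):
    # Randomized search for a local improvement of y0 under the reward
    # function, truncated after `budget` reward-function queries.
    base = reward(y0)
    y = list(y0)
    for _ in range(budget):
        i = rng.randrange(len(y))             # pick a random output variable
        y[i] = rng.randrange(num_states)      # uniformly resample its state
        r = reward(y)
        if r > base + margin:
            return y                          # local improvement found
        if r < base:
            y = list(y0)                      # reward decreased: undo, restart
        # otherwise: gained less than the margin; keep the change and
        # perturb an additional randomly selected variable next iteration
    return None                               # budget exhausted: skip example

# Toy reward: number of positions agreeing with a hidden target sequence.
target = [1, 0, 2, 1]
found = truncated_search([0, 0, 0, 0],
                         lambda y: sum(a == b for a, b in zip(y, target)),
                         num_states=3, margin=0, budget=200,
                         rng=random.Random(0))
```

Because equal-reward moves are kept rather than undone, the walk can drift across plateaus of the reward function instead of stalling on them.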
Figure 1: Search-guided training: the solid and dashed lines show a schematic landscape of the energy and reward functions, respectively. The blue circles indexed by yi represent the gradient-descent inference trajectory
with five iterations over the energy function. Dashed arrows represent the mapping between the energy and reward functions, while the solid arrows show the direction of updates.

Intuitively, we are sampling ỹ from the energy function E(y, x) by adding Gaussian noise (with standard deviation σ) to the gradient descent on the logits: Ii^{t+1} = Ii^t − η ∂E/∂yi + N(0, σ), which is similar to using Langevin dynamics to sample from a Boltzmann distribution. Via the search procedure, we find some S(x, ỹ) that is a better solution than ỹ with respect to the reward function.
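The noisy sampling step can be sketched as follows; the toy gradient function, step size, and noise scale are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def noisy_inference_sample(logits, grad_fn, eta=0.1, sigma=0.5, steps=20,
                           rng=random):
    # Langevin-style sampling: gradient descent on the logits plus Gaussian
    # noise, I^{t+1} = I^t - eta * dE/dy + N(0, sigma).
    for _ in range(steps):
        y = softmax(logits)
        g = grad_fn(y)                       # dE/dy at the current point
        logits = [l - eta * gk + rng.gauss(0.0, sigma)
                  for l, gk in zip(logits, g)]
    return softmax(logits)

# Toy gradient pulling probability mass toward the first state.
sample = noisy_inference_sample([0.0, 0.0, 0.0],
                                lambda y: [-1.0, 0.0, 0.0],
                                rng=random.Random(0))
```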
Therefore, we have to train the SPEN model such that, conditioned on x, gradient-descent inference returns S(x, ỹ), thus guiding the model toward predicting a better output at each step. Figure 1 depicts an example of such a scenario. For gradient-descent inference to find yn = S(x, ỹ), the energy of (x, yn) must be lower than the energy of (x, ỹ) by a margin M. We define the margin using the scaled difference of their rewards:

M(x, ỹ, yn) = α (R(x, yn) − R(x, ỹ)),   (2)

where α ≥ 1 is a task-dependent scalar. Now, we define at most one constraint for each training example x:

ξw(x) = M(x, ỹ, yn) − Ew(x, ỹ) + Ew(x, yn) ≤ 0.   (3)

As a result, our objective is to minimize the magnitude of the violations, regularized by the L2 norm:

min_w Σ_{x ∈ D} max(ξw(x), 0) + c‖w‖²,   (4)

where c is the regularization hyper-parameter. Algorithm 1 shows the search-guided training procedure.

Algorithm 1: Search-guided training of SPENs
  Input: D, an unlabeled mini-batch of training data
         R(·,·), the reward function
         Ew(·,·), the input SPEN
  repeat
    L ← 0
    for each x in D do
      ỹ ← sample from Ew(y, x)
      yn ← S(x, ỹ)                  // search in reward function R starting from ỹ
      ξw(x) ← M(x, ỹ, yn) − Ew(x, ỹ) + Ew(x, yn)
      L ← L + max(ξw(x), 0)
    end for
    L ← L + c‖w‖²
    w ← w − λ∇w L                   // λ is the learning rate
  until convergence
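A single parameter update from the training loop above can be sketched for a linear energy model as follows; the feature map, reward, search, and inference routines here are illustrative stand-ins for the learned components.

```python
def feats(x, y):
    # Illustrative joint feature map for a linear energy E_w(x, y) = w . f(x, y).
    return [xi * yi for xi, yi in zip(x, y)]

def energy(w, x, y):
    return sum(wi * fi for wi, fi in zip(w, feats(x, y)))

def sg_spen_step(w, batch, reward, search, infer, alpha=1.0, c=1e-3, lr=0.1):
    # One SG-SPEN update: contrast the inferred output y_hat with a
    # search-found local improvement y_n via the margin constraint
    # xi_w(x) = M - E_w(x, y_hat) + E_w(x, y_n), M = alpha*(R(y_n) - R(y_hat)).
    grad = [0.0] * len(w)
    for x in batch:
        y_hat = infer(w, x)          # gradient-descent inference (stand-in)
        y_n = search(x, y_hat)       # local improvement from the reward
        if y_n is None:
            continue                 # search found no improvement: skip example
        m = alpha * (reward(x, y_n) - reward(x, y_hat))
        xi = m - energy(w, x, y_hat) + energy(w, x, y_n)
        if xi > 0:                   # violated constraint: hinge subgradient
            f_hat, f_n = feats(x, y_hat), feats(x, y_n)
            grad = [g - fh + fn for g, fh, fn in zip(grad, f_hat, f_n)]
    # L2 regularization and gradient step
    return [wi - lr * (g + 2 * c * wi) for wi, g in zip(w, grad)]

# Demo: the reward prefers y[0] = 1; after one step, the energy of the
# search-found output is lower than that of the originally inferred one.
w = sg_spen_step([0.0, 0.0], [[1.0, 1.0]],
                 reward=lambda x, y: y[0],
                 search=lambda x, y: (1, 0),
                 infer=lambda w, x: (0, 1))
```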
4 Related Work

Peng et al. (2017) introduce
maximum margin reward networks (MMRNs), which also use indirect supervision from reward functions for margin-based training. Our work has two main advantages over MMRNs. First, MMRNs use search-based inference, while SPENs provide efficient gradient-descent inference. Search-based inference, such as beam search, is more likely to find a poor locally-optimal structured output rather than the most likely one, especially when the output space is very large. Second, SG-SPEN gradually trains the energy function to output better predictions by contrasting the predicted output with a local improvement of that output found using search, while MMRNs use search-based inference twice: once for finding the global optimum, which may not be accessible, and again for loss-augmented inference. Their method therefore depends heavily on finding the best points using search, while SG-SPEN only requires search to find more accessible local improvements.

Learning to search (Chang et al., 2015) also explores learning from a reward function for structured prediction tasks where the output structure is a sequence. The training algorithm includes a roll-in and a roll-out policy. It uses the so-far-learned policy to fill in some steps, then randomly picks one action, and fills out the rest of the sequence with a roll-out policy that is a mix of a reference policy and the learned policy. Finally, it observes the reward of the whole sequence and constructs a cost-augmented tuple for the randomly selected action to train the policy network using a cost-sensitive classifier. In the absence of ground-truth labels, the reference policy can be replaced by a sub-optimal policy or the learned policy; in the latter case, the training algorithm reduces to reinforcement learning. Although it is possible to use search as the sub-optimal policy, we believe that in the absence of ground-truth labels, our policy-gradient baselines are a good representative of the algorithms in this category.

For some tasks, it is possible to define differentiable reward functions, in which case we can directly train the prediction model end-to-end. For example, Stewart and Ermon (2017) train a neural network using a differentiable reward function that guides training based on the physics of moving objects. However, differentiable reward functions are rare, limiting their applicability in practice.
Generalized expectation (GE) (Mann and McCallum, 2010) similarly uses human-provided constraints as light supervision. Reward functions have also been used to train policy networks for structured prediction (Bahdanau et al., 2017; Ranzato et al., 2016); these methods access ground-truth labels to compute the task loss, pretrain the policy network, or train the critic. These approaches benefit from mixing strong supervision with the supervision from the reward function (the task loss), while the reward functions used to train SG-SPEN do not assume access to ground-truth labels. Moreover, when the action space is very large and the reward function includes plateaus, training policy networks without pretraining on supervised data is very difficult. Daumé et al. (2018) address the issue of sparse rewards by learning a decomposition of the reward signal; however, they still assume access to a reference policy pre-trained on supervised data for the structured prediction problems. In Daumé et al. (2018), the reward function is also the task loss. SG-SPEN addresses these problems differently: first, it effectively trains SPENs, which provide joint inference, so it does not require partial rewards; second, the randomized search can easily avoid the plateaus in the reward function, which is essential for learning at the early stages. Our policy-gradient baselines are a strong representative of reinforcement learning algorithms for structured prediction problems without any assumption about ground-truth labels.

5 Experiments

We have conducted training of SPENs in three settings with different reward functions:
1) Multi-label classification, with the reward function defined as the F1 score between predicted labels and target labels. 2) Citation-field extraction, with a human-written reward function. 3) Shape parsing, with a task-specific reward function. Except for the oracle reward function that we used for multi-label classification, our reward functions for citation-field extraction and shape parsing do not rely on any labeled data. In none of our experiments do the models have access to any labeled data (for a comparison to fully-supervised models see Appendix A).

5.1 Multi-label Classification

We first evaluate the ability of search-guided training of SPENs (SG-SPEN) to learn from the light supervision provided by truncated randomized search. We consider the task of multi-label classification on the Bibtex dataset, with 159 labels and 1,839 input variables, and the Bookmarks dataset, with 208 labels and 2,150 input variables. We define the reward function as the F1 score between the true label set and the predicted label set at training time; none of the methods have access to the true labels directly, which makes this scenario different from fully-supervised training.
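The oracle F1 reward for this task can be sketched as a straightforward set-based F1; the handling of empty label sets below is our own convention, not necessarily the paper's.

```python
def f1_reward(predicted, target):
    # Set-based F1 between a predicted label set and the target label set.
    predicted, target = set(predicted), set(target)
    if not predicted and not target:
        return 1.0                   # both empty: perfect match by convention
    tp = len(predicted & target)     # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(target)
    return 2 * precision * recall / (precision + recall)
```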
We also trained R-SPEN (Rooshenas et al., 2018) and DVN (value-matching training of SPENs; Gygli et al., 2017) with the same oracle reward function and energy function. In this case, DVN matches the energy value with the value of the reward function at different structured output points generated by gradient-descent inference. Like SG-SPEN, R-SPEN and DVN do not have direct access to the ground truth. In general, DVNs require access to ground-truth labels to generate adversarial examples located in the vicinity of the ground-truth labels, and this restriction significantly hurts the performance of DVNs. To alleviate this problem, we also add Gaussian noise to gradient-descent inference in DVN, so that it matches the energy of samples from the energy function with their reward values, giving it the means to better explore the energy function in the absence of ground-truth labels. See Appendix D for more details on this experimental setup.

Table 1.B shows the performance of SG-SPEN, R-SPEN, and DVN on this task. We observed that R-SPEN has difficulty finding violations (optimization constraints) as training progresses. This is attributable to the fact that R-SPEN only explores regions of the reward function based on samples from the gradient-descent trajectory on the energy function, so if gradient-descent inference is confined within local regions, R-SPEN cannot generate informative constraints.

5.2 Citation Field Extraction

Citation field extraction is a structured
prediction task in which the structured output is a sequence of tags, such as Author, Editor, Title, and Date, that distinguishes the segments of a citation text. We used the Cora citation dataset (Seymore et al., 1999), including 100 labeled examples as the validation set and another 100 labeled examples for the test set. We discarded the labels of 300 examples in the training data and added another 700 unlabeled citation texts acquired from the web. The citation texts, including the validation set, test set, and unlabeled data, have a maximum length of 118 tokens, and each token can be labeled with one of 13 possible tags. We fixed the length of the input data by padding all citation texts to the maximum citation length in the dataset. We report token-level accuracy measured on non-pad tokens. Our knowledge-based reward function is equivalent to that of Rooshenas et al. (2018); it takes input citation text and predicted tags and evaluates the consistency of the prediction with about 50 given rules describing human domain knowledge about citation text. We compare SG-SPEN with R-SPEN (Rooshenas et al., 2018)