Structured Prediction with Projection Oracles

Mathieu Blondel
NTT Communication Science Laboratories, Kyoto, Japan

Abstract

We propose in this paper a general framework for deriving loss functions for structured prediction. In our framework, the user chooses a convex set including the output space and provides an oracle for projecting onto that set. Given that oracle, our framework automatically generates a corresponding convex and smooth loss function. As we show, adding a projection as output layer provably makes the loss smaller. We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow the use of any convex superset instead, with a potentially cheaper-to-compute projection. Since efficient projection algorithms are available for numerous convex sets, this allows us to construct loss functions for a variety of tasks. On the theoretical side, when combined with calibrated decoding, we prove that our loss functions can be used as a consistent surrogate for a (potentially non-convex) target loss function of interest. We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.

1 Introduction

The goal of supervised learning is to learn a mapping that links an input to an output, using examples of such pairs. This task is noticeably more difficult when the output objects have a structure, i.e., when they are not mere vectors. This is the so-called structured prediction setting [4] and has numerous applications in natural language processing, computer vision and computational biology. We focus in this paper on the surrogate loss framework, in which a convex loss is used as a proxy for a (potentially non-convex) target loss of interest. Existing convex losses for structured prediction come with different trade-offs.
On one hand, the structured perceptron [16] and hinge [52] losses only require access to a maximum a-posteriori (MAP) oracle for finding the highest-scoring structure, while the conditional random field (CRF) loss [29] requires access to a marginal inference oracle, for evaluating the expectation under a Gibbs distribution. Since marginal inference is generally considered harder than MAP inference, for instance containing #P-complete counting problems, this makes the CRF loss less widely applicable. On the other hand, unlike the structured perceptron and hinge losses, the CRF loss is smooth, which is crucial for fast convergence, and comes with a probabilistic model, which is important for dealing with uncertainty. Unfortunately, when combined with MAP decoding, these losses are typically inconsistent, meaning that their optimal estimator does not converge to the target loss function's optimal estimator. Recently, several works [15, 26, 39, 31] showed good results and obtained consistency guarantees by combining a simple squared loss with calibrated decoding. Since these approaches only require a decoding oracle at test time and no oracle at train time, this questions whether structural information is even beneficial during training.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we propose loss functions for structured prediction using a different kind of oracle: projections. Kullback-Leibler projections onto various polytopes have been used to derive online algorithms [24, 56, 49, 1], but it is not obvious how to extract a loss from these works. In our framework, the user chooses a convex set containing the output space and provides an oracle for projecting onto that set. Given that oracle, we automatically generate an associated loss function. As we show, incorporating a projection as output layer provably makes the loss smaller.
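The claim that inserting a projection as output layer makes the loss smaller can be checked concretely for the squared loss: the Euclidean projection onto a convex set C is nonexpansive and leaves every point of C (in particular the target embedding φ(y)) fixed, so ‖proj_C(θ) − φ(y)‖ ≤ ‖θ − φ(y)‖. A minimal numpy sketch, using the unit cube [0, 1]^k as C (its Euclidean projection is coordinate-wise clipping); the function and variable names here are ours, not the paper's:

```python
import numpy as np

def proj_unit_cube(theta):
    """Euclidean projection onto the unit cube [0, 1]^k: coordinate-wise clipping."""
    return np.clip(theta, 0.0, 1.0)

rng = np.random.default_rng(0)
theta = rng.normal(scale=3.0, size=8)             # raw model scores
phi_y = rng.integers(0, 2, size=8).astype(float)  # a target vertex, lies in [0, 1]^k

loss_raw = 0.5 * np.sum((theta - phi_y) ** 2)
loss_proj = 0.5 * np.sum((proj_unit_cube(theta) - phi_y) ** 2)

# The projection is nonexpansive and phi_y is one of its fixed points,
# so inserting the projection can only shrink the squared loss.
assert loss_proj <= loss_raw
```

The same inequality holds for any convex set containing φ(y), whichever projection oracle is used.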
We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow the use of any convex superset instead, with a potentially cheaper-to-compute projection. When using the marginal polytope as the convex set, our loss comes with an implicit probabilistic model. Our contributions are summarized as follows:

- Based upon Fenchel-Young losses [11, 12], we introduce projection-based losses in a broad setting.
- We give numerous examples of useful convex polytopes and their associated projections.
- We study the consistency w.r.t. a target loss of interest when combined with calibrated decoding, extending a recent analysis [38] to the more general projection-based losses. We exhibit a trade-off between computational cost and statistical estimation.
- We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.

Notation. We denote the probability simplex by △^p := {q ∈ R^p_+ : ‖q‖_1 = 1}, and the domain of Ω : R^p → R ∪ {∞} by dom(Ω) := {u ∈ R^p : Ω(u) < ∞}.

[...]

Budget polytope. We now set Y = {y ∈ 2^[k] : l ≤ |y| ≤ u}, the subsets of [k] of bounded size, where we assume 0 ≤ l ≤ u ≤ k. This is useful in a multilabel setting with known lower bound l ∈ N and upper bound u ∈ N on the number of labels per sample. Setting again φ(y) = Σ_{i=1}^{|y|} e_{y_i} ∈ {0,1}^k, MAP inference is equivalent to the integer linear program argmax_{φ(y) ∈ {0,1}^k} ⟨θ, φ(y)⟩ s.t. l ≤ ⟨φ(y), 1⟩ ≤ u. Let σ be a permutation sorting θ in descending order. An optimal solution is

    φ(y)_{σ_i} = 1 if l > 0 and i ∈ {1, …, l},
                 1 else if i ∈ {1, …, u} and θ_{σ_i} ≥ 0,
                 0 else.

The marginal polytope is a generalization of the budget polytope [2] and is now equal to M = {μ ∈ [0,1]^k : l ≤ ⟨μ, 1⟩ ≤ u}. The next proposition, proved in Appendix C.2, shows how to project efficiently.

Proposition 3 (Euclidean and KL projections onto the budget polytope). Let μ′ be the projection of θ onto the unit cube (cf. the "unit cube" paragraph). If l ≤ ⟨μ′, 1⟩ ≤ u, then μ′ is optimal. Otherwise, return the projection of θ onto {μ ∈ [0,1]^k : ⟨μ, 1⟩ = m}, where m = u if ⟨μ′, 1⟩ > u and m = l otherwise. The total cost is O(k) in the Euclidean case and O(k log k) in the KL case (cf. Appendix C.2 for details).
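Proposition 3 translates directly into code for the Euclidean case: clip θ to the unit cube; if the coordinate sum already lies in [l, u] that is the projection, otherwise project onto the slice {μ ∈ [0,1]^k : ⟨μ, 1⟩ = m}. For the second step we use bisection on a dual variable τ, since μ(τ) = clip(θ − τ, 0, 1) has a sum that is nonincreasing in τ; this is a simple stand-in for the exact O(k) routine the paper defers to its Appendix C.2, and the function name is ours:

```python
import numpy as np

def project_budget(theta, l, u, tol=1e-9):
    """Euclidean projection of theta onto {mu in [0,1]^k : l <= <mu, 1> <= u}."""
    mu = np.clip(theta, 0.0, 1.0)   # step 1: project onto the unit cube
    s = mu.sum()
    if l <= s <= u:                 # already feasible: done
        return mu
    m = u if s > u else l           # step 2: fix the coordinate sum to m
    # Bisection on tau: sum(clip(theta - tau, 0, 1)) is nonincreasing in tau,
    # equals k at lo and 0 at hi, so the target sum m is bracketed.
    lo, hi = theta.min() - 1.0, theta.max()
    while hi - lo > tol:
        tau = 0.5 * (lo + hi)
        if np.clip(theta - tau, 0.0, 1.0).sum() > m:
            lo = tau
        else:
            hi = tau
    return np.clip(theta - 0.5 * (lo + hi), 0.0, 1.0)

# With l = 2 the lower bound forces a second label on despite its low score:
print(project_budget(np.array([2.0, 0.4, -1.0]), l=2, u=3))  # ≈ [1., 1., 0.]
```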
Birkhoff polytope. We view ranking as a structured prediction problem and let Y be the set of permutations σ of [k]. Setting φ(σ) ∈ {0,1}^{k×k} as the permutation matrix associated with σ, MAP inference becomes the linear assignment problem MAP(θ) = argmax_{σ ∈ Y} Σ_{i=1}^k θ_{i,σ_i}, which can be computed exactly using the Hungarian algorithm [24]. The marginal polytope M becomes the Birkhoff polytope [7], the set of doubly stochastic matrices M = {P ∈ R^{k×k} : P^⊤1_k = 1_k, P1_k = 1_k, 0 ≤ P ≤ 1}. Noticeably, marginal inference is known to be #P-complete [48; 45, §3.5], since it corresponds to computing a matrix permanent. In contrast, the KL projection onto the Birkhoff polytope can be computed using the Sinkhorn algorithm [43, 15]. The Euclidean projection can be computed using Dykstra's algorithm [17] or dual approaches [10]. For both kinds of projections, the cost of obtaining an ε-approximate solution is O(k²/ε). To obtain cheaper projections, one can also consider [10, 33] the set of row-stochastic matrices, a strict superset of the Birkhoff polytope: △^{k×k} := △^k × ⋯ × △^k = {P ∈ R^{k×k} : P1_k = 1_k, 0 ≤ P ≤ 1} ⊋ M.

[Figure 2: Examples of polytopes — (a) probability simplex, (b) unit cube, (c) budget polytope, (d) Birkhoff polytope, (e) permutahedron, (f) order simplex.]
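The Sinkhorn algorithm mentioned above for the KL projection onto the Birkhoff polytope alternates between normalizing the rows and the columns of the entrywise exponential of θ until the matrix is (approximately) doubly stochastic. A minimal numpy sketch; the stopping criterion and the stabilizing shift are our choices, not prescribed by the paper:

```python
import numpy as np

def sinkhorn_birkhoff(theta, n_iter=1000, tol=1e-8):
    """Approximate KL projection of exp(theta) onto the Birkhoff polytope
    (doubly stochastic matrices) via alternating row/column normalization."""
    P = np.exp(theta - theta.max())          # shift for numerical stability
    for _ in range(n_iter):
        P = P / P.sum(axis=1, keepdims=True)  # make rows sum to 1
        P = P / P.sum(axis=0, keepdims=True)  # make columns sum to 1
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P

P = sinkhorn_birkhoff(np.random.default_rng(0).normal(size=(4, 4)))
```

Each iteration costs O(k²), consistent with the O(k²/ε) approximation cost quoted in the text.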
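For concreteness, MAP inference over permutations picks the σ maximizing ⟨θ, φ(σ)⟩ = Σᵢ θ_{i,σᵢ}. The sketch below does this by brute-force enumeration, which is feasible only for tiny k; in practice one would use the Hungarian algorithm cited in the text (e.g. scipy.optimize.linear_sum_assignment with maximize=True):

```python
import numpy as np
from itertools import permutations

def map_inference(theta):
    """MAP over permutations of [k]: argmax_sigma sum_i theta[i, sigma[i]].
    Brute-force enumeration (k! candidates) for illustration only; the
    Hungarian algorithm solves the same problem in O(k^3)."""
    k = theta.shape[0]
    return max(permutations(range(k)),
               key=lambda s: sum(theta[i, s[i]] for i in range(k)))

theta = np.array([[3.0, 1.0, 0.0],
                  [1.0, 0.0, 2.0],
                  [0.5, 2.0, 1.0]])
print(map_inference(theta))  # (0, 2, 1): score 3.0 + 2.0 + 2.0 = 7.0
```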