====== Structured Prediction Energy Networks ======

  
<blockquote>
Structured prediction energy networks (SPENs) are trained to assign global energy scores to output structures, and gradient descent is used during inference to minimize the global energy (Belanger and McCallum, 2016). As gradient descent involves iterative optimization, its steps can be viewed as iterative refinement. In particular, Belanger et al. (2017) build a SPEN for SRL, but for the span-based formalism, not the dependency one we consider in this work. While they improve over their baseline model, their baseline uses a multi-layer perceptron to encode local factors, so its encoding power is limited. Moreover, their refined model performs worse in the out-of-domain setting than their baseline model, indicating overfitting (Belanger et al., 2017).
  
In follow-up work, Tu and Gimpel (2018, 2019) introduce inference networks to replace gradient descent. Their inference networks directly refine the output. Improvements over competitive baselines are reported on part-of-speech tagging, named entity recognition, and CCG supertagging (Tu and Gimpel, 2019). However, their inference networks distill knowledge from a tractable linear-chain conditional random field (CRF) model, so these methods do not provide direct performance gains. More importantly, the interactions captured in these models are likely local, as they learn to mimic Markov CRFs.
</blockquote>
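The gradient-descent inference described in the quote can be sketched in a few lines. The quadratic energy and the matrices ''A'' and ''b'' below are illustrative stand-ins for the learned energy networks of Belanger and McCallum (2016), not their actual model:

```python
import numpy as np

def spen_inference(A, b, steps=200, lr=0.1):
    """Minimize a toy global energy E(y) = y^T A y + b^T y over relaxed
    labels y in [0, 1]^n by projected gradient descent, then round.
    A and b stand in for the learned energy network's factors."""
    y = np.full(b.shape[0], 0.5)              # relaxed initialization
    for _ in range(steps):
        grad = (A + A.T) @ y + b              # analytic energy gradient
        y = np.clip(y - lr * grad, 0.0, 1.0)  # project back onto [0, 1]^n
    return (y > 0.5).astype(int)              # round to a discrete structure

# Hypothetical energy: labels 0 and 1 attract, label 2 is penalized
A = np.array([[0.0, -2.0, 0.0],
              [0.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
b = np.array([-1.0, 0.5, 0.5])
print(spen_inference(A, b))  # → [1 1 0]
```

Each descent step is exactly the "iterative refinement" the quote refers to: the relaxed labels move downhill on the energy surface before being rounded.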
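The inference-network alternative can likewise be sketched: instead of running gradient descent per input, a small network is trained so that its output directly minimizes the energy. The one-layer sigmoid net and the toy energy with matrices ''A'' and ''B'' are a hypothetical miniature in the spirit of Tu and Gimpel (2018), not their architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_inference_net(A, X, B, epochs=500, lr=0.5, seed=0):
    """Amortized inference sketch: train a tiny net y = sigmoid(W x)
    so its output minimizes the energy E(x, y) = y^T A y + (B x)^T y.
    A, B, and the one-layer net are illustrative stand-ins."""
    rng = np.random.default_rng(seed)
    n, d = A.shape[0], X.shape[1]
    W = rng.normal(scale=0.1, size=(n, d))
    for _ in range(epochs):
        for x in X:
            y = sigmoid(W @ x)
            dE_dy = (A + A.T) @ y + B @ x             # energy gradient wrt y
            dE_dW = np.outer(dE_dy * y * (1 - y), x)  # chain rule through sigmoid
            W -= lr * dE_dW                           # update the net, not y
    return W

# One hypothetical input; at test time inference is a single forward pass
A = np.array([[0.0, -2.0], [0.0, 1.0]])
B = np.array([[-1.0, 0.0], [0.0, 1.0]])
X = np.array([[1.0, 0.5]])
W = train_inference_net(A, X, B)
y = sigmoid(W @ X[0])
print(np.round(y, 2))
```

The key contrast with the previous approach: the optimization cost is paid once at training time, and test-time inference is a single forward pass rather than an inner descent loop.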
  
===== Papers =====
  * [[https://arxiv.org/pdf/1511.06350.pdf|Belanger & McCallum 2015 - Structured Prediction Energy Networks]]
  * [[https://arxiv.org/pdf/1703.05667.pdf|Belanger et al 2017 - End-to-End Learning for Structured Prediction Energy Networks]]
  * [[https://www.aclweb.org/anthology/N18-2021.pdf|Rooshenas et al 2018 - Training Structured Prediction Energy Networks with Indirect Supervision]]
  * [[https://arxiv.org/pdf/1812.09603.pdf|Rooshenas et al 2018 - Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks]]
  * Kevin Gimpel's papers
    * [[https://arxiv.org/pdf/1803.03376.pdf|Tu & Gimpel 2018 - Learning Approximate Inference Networks for Structured Prediction]]
    * [[https://www.aclweb.org/anthology/N19-1335.pdf|Tu & Gimpel 2019 - Benchmarking Approximate Inference Methods for Neural Structured Prediction]]
    * [[https://www.aclweb.org/anthology/2020.spnlp-1.8.pdf|Tu et al 2020 - Improving Joint Training of Inference Networks and Structured Prediction Energy Networks]]
    * [[https://www.aclweb.org/anthology/2020.acl-main.251.pdf|Tu et al 2020 - ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation]]
  * [[https://www.aclweb.org/anthology/W19-4109.pdf|Trinh et al 2019 - Energy-Based Modelling for Dialogue State Tracking]]
  * [[https://www.aclweb.org/anthology/2020.spnlp-1.5.pdf|Trinh et al 2020 - Energy-based Neural Modelling for Large-Scale Multiple Domain Dialogue State Tracking]]
  * [[https://arxiv.org/pdf/2009.13267.pdf|Bhattacharyya et al 2020 - Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models]] Interesting paper, but it has some flaws. First, the energy-based models (EBMs) use BERT; for a fair comparison of the merits of EBMs, they should compare the baseline to EBMs without BERT. Second, they use reranking and do not attempt to use the ideas of SPENs to improve the decoding.
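For contrast with SPEN-style decoding, reranking only re-scores a fixed candidate list rather than searching the output space. A minimal sketch, with a made-up length-based energy standing in for the learned EBM:

```python
def energy_rerank(candidates, energy_fn):
    """Energy-based reranking in sketch form: a base model proposes
    candidates, a separately trained energy model re-scores them, and
    the lowest-energy one wins. Decoding itself is untouched, which is
    exactly the limitation noted above."""
    return min(candidates, key=energy_fn)

# Toy energy preferring candidates close to a target length of 4 tokens
candidates = ["a b c", "a b c d", "a b c d e f"]
best = energy_rerank(candidates, lambda s: abs(len(s.split()) - 4))
print(best)  # → "a b c d"
```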
  
===== Related Pages =====
nlp/structured_prediction_energy_networks.1617269976.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
