Loss Functions
- Cross-entropy (aka log loss, conditional log-likelihood, CRF loss)
- Lots of different ways to write this loss function. One way is to minimize $L(\mathcal{D}) = -\sum_{i=1}^{N} \log p(y_i|x_i)$, where $p(y|x) = \frac{e^{score(x,y)}}{\sum_{y'} e^{score(x,y')}}$
- The cross-entropy version writes it as $L(\mathcal{D}) = -\sum_{i=1}^{N}\sum_{y} p(y|x_i) \log p_\theta(y|x_i)$, but usually we plug in the empirical distribution $p(y|x_i) = \mathbb{I}[y=y_i]$, which recovers the log-loss above.
- The minimum of cross-entropy loss does not always exist; in particular, it does not exist if the training data can be completely separated. See, for example, Section 1.1 of this paper.
- Perceptron loss
- Hinge (SVM) loss
- Softmax margin
- Large-Margin Softmax Loss for Convolutional Neural Networks (L-Softmax). Doesn't cite Gimpel & Smith. I suspect it may be different, but need to check.
- Ramp loss
- Soft ramp loss
- Infinite ramp loss
- Squared error loss
- Squentropy (Cross-entropy + squared error)
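A minimal NumPy sketch of several of the losses above for a single example, where `scores` is the vector of $score(x,y)$ over all labels and `y` is the gold label index. The function names, the uniform cost of 1 for wrong labels in the margin losses, and the squentropy formulation (cross-entropy plus the mean squared logit of the incorrect classes, per my reading) are my choices, not anything fixed by the notes above:

```python
import numpy as np

def log_softmax(scores):
    # numerically stable log-softmax: log p(y|x) = score(x,y) - log sum_y' exp(score(x,y'))
    s = scores - scores.max()
    return s - np.log(np.exp(s).sum())

def cross_entropy_loss(scores, y):
    # log loss: -log p(y|x) under the softmax distribution over scores
    return -log_softmax(scores)[y]

def perceptron_loss(scores, y):
    # max_{y'} score(x,y') - score(x,y); zero when the gold label already wins
    return scores.max() - scores[y]

def hinge_loss(scores, y, cost=1.0):
    # multiclass SVM loss: cost-augmented max, with cost 0 for the gold label
    costs = np.full_like(scores, cost)
    costs[y] = 0.0
    return (scores + costs).max() - scores[y]

def softmax_margin_loss(scores, y, cost=1.0):
    # softmax margin: hinge loss with the max replaced by log-sum-exp
    costs = np.full_like(scores, cost)
    costs[y] = 0.0
    a = scores + costs
    return (a.max() + np.log(np.exp(a - a.max()).sum())) - scores[y]

def ramp_loss(scores, y, cost=1.0):
    # ramp loss: cost-augmented max minus the plain max (non-convex)
    costs = np.full_like(scores, cost)
    costs[y] = 0.0
    return (scores + costs).max() - scores.max()

def squared_error_loss(scores, y):
    # squared error against a one-hot target
    target = np.zeros_like(scores)
    target[y] = 1.0
    return ((scores - target) ** 2).sum()

def squentropy_loss(scores, y):
    # squentropy, as I read it: cross-entropy plus the average squared
    # logit over the incorrect classes (pushing wrong logits toward 0)
    mask = np.ones_like(scores, dtype=bool)
    mask[y] = False
    return cross_entropy_loss(scores, y) + (scores[mask] ** 2).mean()
```

Note the ordering this makes visible: since log-sum-exp upper-bounds max, softmax margin upper-bounds hinge, which upper-bounds perceptron loss.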
Related Pages
ml/loss_functions.1686814574.txt.gz · Last modified: 2023/06/15 07:36 by 127.0.0.1