See also NLP Progress - Coreference Resolution.
Many papers use $B^3$ (B-cubed), MUC, CEAF, or an average of these three as the metric. The CoNLL-2011/2012 shared tasks (Pradhan et al 2011, Pradhan et al 2012) used an average. See also these papers: