Mewsic Bench Leaderboard

I run model evaluations via API on request.

To request a model evaluation, click the Request Evaluation tab and enter the model ID.

Rankings

About the Metrics

  • Meter - How closely the model sticks to the meter of the lines.
  • Verse - How closely the model's lines follow the verse and chorus breakdown.
  • Focus - How much of the response is extraneous commentary instead of the song. (Focus makes only a very minor contribution to the final score.)
  • Thinking - The estimated average number of thinking tokens per response. Zero means it's not a reasoning model (or is a hybrid model with reasoning off).
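
As a rough illustration of the two count-based metrics above (Thinking and Focus), here is a minimal sketch. It is not the leaderboard's actual scoring code: the function names, the character-based measure of commentary, and the input shapes are all assumptions made for the example.

```python
from statistics import mean


def average_thinking_tokens(thinking_token_counts):
    """Mean thinking tokens per response (hypothetical helper).

    A model that never emits thinking tokens averages zero, which matches
    the "Zero means it's not a reasoning model" reading above.
    """
    if not thinking_token_counts:
        return 0.0
    return mean(thinking_token_counts)


def commentary_share(song_chars, total_chars):
    """Fraction of a response, by characters, that is not the song itself.

    Assumption: commentary is measured in characters. The real Focus metric
    may use tokens, lines, or a different definition entirely.
    """
    if total_chars <= 0:
        return 0.0
    return (total_chars - song_chars) / total_chars


if __name__ == "__main__":
    # Three responses from a non-reasoning model, two from a reasoning model.
    print(average_thinking_tokens([0, 0, 0]))
    print(average_thinking_tokens([100, 300]))
    # A 100-character response in which 80 characters are the song.
    print(commentary_share(80, 100))
```

How these values are combined into the final score is not specified here, beyond Focus carrying very little weight.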