Mewsic Bench Leaderboard
I run model evaluations via API on request.
To request a model evaluation, click the Request Evaluation tab and enter the model ID.
Rankings
About the Metrics
- Meter - How closely the model sticks to the meter of the lines.
- Verse - How closely the model aligns the lines to the verse and chorus breakup.
- Focus - How much of the response is extraneous commentary rather than the song itself. (Focus makes only a very minor contribution to the final score.)
- Thinking - The estimated average number of thinking tokens per response. Zero means it's not a reasoning model (or is a hybrid model with reasoning off).
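As a rough illustration of what "a very minor contribution" could look like, here is a minimal sketch of a weighted final score. The actual formula and weights are not published here, so everything in it is an assumption: the weight values, the `final_score` name, the idea that each metric is normalized to [0, 1] with higher being better, and the choice to leave Thinking out of the score because it is a token count rather than a quality measure.

```python
# Hypothetical weights for illustration only; the leaderboard's real
# weighting is not stated. Focus is given a deliberately small share.
WEIGHTS = {"meter": 0.475, "verse": 0.475, "focus": 0.05}


def final_score(metrics, weights=WEIGHTS):
    """Weighted average of per-metric scores.

    Assumes every score in `metrics` is already normalized to [0, 1],
    with higher meaning better. Thinking is not included (assumption).
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * metrics[name] for name in weights)
    return weighted_sum / total_weight


# Two made-up responses that differ only in Focus.
clean = final_score({"meter": 0.9, "verse": 0.8, "focus": 1.0})
chatty = final_score({"meter": 0.9, "verse": 0.8, "focus": 0.0})
print(clean - chatty)  # the gap can never exceed the Focus weight share
```

With weights like these, even a response that is heavily padded with commentary loses only a few points, so Meter and Verse dominate the ranking.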