Don’t aim high! Test the waters as low as you can. Take just a single query, like this current one, check the recs, and give a score to each of the top 5. Weight each score by 1/X, with X the position in the list, and there’s your score. Then do another query. Average those and you’re stable.
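A minimal sketch of that first step, assuming X is the 1-based position of a rec in the list and that the grades are numbers you assign by hand. The function names and the division by the sum of weights (so a perfect list scores the same as a perfect single grade) are assumptions for illustration, not something the note specifies.

```python
def query_score(grades, top_k=5):
    """Rank-weighted score for one query.

    grades: manual grades for the recs, in the order the system returned them.
    Each of the first top_k grades is weighted by 1/position (assumed meaning of 1/X).
    """
    top = grades[:top_k]
    if not top:
        raise ValueError("need at least one graded rec")
    weights = [1.0 / position for position in range(1, len(top) + 1)]
    # Normalising by the weight sum is an added assumption; drop it for a raw weighted sum.
    return sum(w * g for w, g in zip(weights, top)) / sum(weights)


def eval_score(per_query_grades, top_k=5):
    """Plain average of the per-query scores."""
    if not per_query_grades:
        raise ValueError("need at least one query")
    scores = [query_score(grades, top_k) for grades in per_query_grades]
    return sum(scores) / len(scores)
```

Start with one query, then append more lists to `per_query_grades` as you grade them; the average moves less with each query you add.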
Next step: grade fitting and tasty separately.
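Then, in the third step, alter it to be invariant to rec changes: for each query, store pairs of graded dish -> preferred rank. Calculate the actual distance between the preferred ranks and the ranks the system returns. That way, changes made during rec system development are reflected in the score.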
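The note does not say which distance to use. One simple option is the sum of absolute rank differences (Spearman's footrule), sketched below. The function name, the dish names in the usage example, and the penalty rank for dishes the system did not return are all assumptions.

```python
def rank_distance(preferred, recommended, missing_rank=None):
    """Sum of |preferred rank - actual rank| over the graded dishes (Spearman's footrule).

    preferred: dict mapping dish -> preferred 1-based rank for this query.
    recommended: list of dishes in the order the system returned them.
    missing_rank: rank assigned to graded dishes that were not returned;
                  defaults to one past the end of the list (an assumption).
    """
    actual = {dish: position for position, dish in enumerate(recommended, start=1)}
    penalty = len(recommended) + 1 if missing_rank is None else missing_rank
    return sum(abs(rank - actual.get(dish, penalty)) for dish, rank in preferred.items())


# Hypothetical graded pairs for one query.
preferred = {"ramen": 1, "pho": 2, "udon": 3}
same_order = rank_distance(preferred, ["ramen", "pho", "udon"])
reversed_order = rank_distance(preferred, ["udon", "pho", "ramen"])
```

Since the stored pairs depend only on the query and not on what the system currently outputs, the same file can be rerun after every change; a lower distance means the recs moved closer to the preferred order.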
To be honest, any eval is a great one. We just need to drag your ass out of this pit of no evals, where there is no visible progress on how well Mealeon development and new ideas are working.