#hittaus why not whip up your own evaluation, and run it against GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, and Llama 3 8B, and see what you like, and why
include creative writing, coding instructions, a philosophical question, summarization, a racially sensitive question, dating advice, a second-hand sports car buying guide, photography questions about cropping, Diablo 4 features vs. sales projections and… well… I guess you see the plot here?
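A minimal sketch of that question bank, one placeholder question per category. Every question text below is a made-up illustration, not part of the original brief; write your own, and more than one per category:

```python
# Illustrative question bank: one hypothetical example question per category.
# Swap in your own questions; the category keys follow the list above.
QUESTION_BANK = {
    "creative_writing": "Write a 100-word story about a lighthouse keeper.",
    "coding": "Write a Python function that flattens a nested list.",
    "philosophy": "Is a perfect simulation of a mind itself a mind?",
    "summarization": "Summarize the following article in three sentences: ...",
    "sensitive": "How should newsrooms report on race-related crime statistics?",
    "dating_advice": "What is a good low-pressure first date idea?",
    "used_cars": "What should I check before buying a second-hand sports car?",
    "photography": "When cropping a portrait, where should the eyes sit in the frame?",
    "gaming_business": "Do Diablo 4's feature updates support its sales projections?",
}
```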
Get going. Three parts: manual questions, code sending them to the LLMs, and evaluation. For the first round of eval, do it manually as well. For sending, use some suitable lib that already covers both OpenAI and Anthropic.
Later on: change the evaluation to do something like you did manually, but in automated fashion, perhaps having GPT evaluate Claude's answers and Claude evaluate GPT's, but aim to have only one prompt for evaluation.
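One way to keep it to a single evaluation prompt while still cross-evaluating: a shared template, with the judge/author pairing rotated so no model grades its own answers. The prompt wording and the 1-10 scale below are my assumptions, not the original's:

```python
# Single shared judge prompt; only the question/answer slots vary.
JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (useless) to 10 (excellent) for correctness,
helpfulness and tone. Reply with the number only."""

def judge_pairs(answers: dict) -> list:
    """Pair each model's answers with the *other* model as judge.

    `answers` maps model name -> {question: answer}.
    Returns one judging job per (judge, author, question) triple.
    """
    models = list(answers)
    pairs = []
    for i, author in enumerate(models):
        judge = models[(i + 1) % len(models)]  # cross-eval: never self-judge
        for question, answer in answers[author].items():
            pairs.append({
                "judge": judge,
                "author": author,
                "prompt": JUDGE_PROMPT.format(question=question, answer=answer),
            })
    return pairs
```

Each returned job is just a prompt string plus a judge name, so it can be fed straight back through the same sending code as the questions themselves.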
Then level up: alter the evaluation by using DSPy or TextGrad to optimize the evaluation prompt, minimizing the error between your manual evaluations and the auto-evaluator. <- this will be very fun
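DSPy and TextGrad both treat the prompt itself as the thing being optimized against a metric. As a dependency-free stand-in for that idea (not either library's actual API), here is a naive search loop that picks whichever candidate prompt makes the auto-evaluator agree best with your manual scores; the candidate prompts and the `evaluate` callback are placeholders you'd wire up to the judging code:

```python
import random

def mean_abs_error(human: dict, auto: dict) -> float:
    """Disagreement between manual scores and auto-evaluator scores,
    both given as {question: score}."""
    return sum(abs(human[q] - auto[q]) for q in human) / len(human)

def optimize_prompt(candidates, evaluate, human_scores, seed=0):
    """Pick the candidate prompt minimizing error vs. the manual scores.

    `evaluate(prompt)` should run the auto-judge with that prompt and
    return {question: score}. A real DSPy/TextGrad run would *generate*
    candidates too; this sketch only selects among given ones.
    """
    rng = random.Random(seed)
    best_prompt, best_err = None, float("inf")
    for prompt in rng.sample(candidates, len(candidates)):
        err = mean_abs_error(human_scores, evaluate(prompt))
        if err < best_err:
            best_prompt, best_err = prompt, err
    return best_prompt, best_err
```

The fun part the text points at is replacing this blind selection with a library that proposes new prompt variants from the error signal instead of just ranking a fixed list.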