Moral of the story: keep prompts that fail and retry them every 3-6 months on newer models. You will be surprised by the progress. Some of it could be "data leakage" or training that targets your use case, but if your use case is obscure enough, that's unlikely.
Fascinating work as always! But Step Brothers is a modern classic. Give it a chance!
Ok, ok, people may have convinced me
So glad others are making the point about the rapid changes in the models. Many (if not most) people's impression of AI was fixed in the first 6 months of ChatGPT's release, and few have kept up (understandably - the explosion of models and LLMs has been significant in just 30 months). This is a great example of the progression of improvement.
The contrast really is striking. I wonder if we'll be able to say the same a year from now.
It seems like the field keeps finding ways to scrape together new gains. The dream that data alone would solve everything turned out to be just a dream, but CoT reasoning and grounding techniques produced different kinds of gains (and, to my way of thinking, more exciting ones). I imagine some of those benefits are slowing too, so I guess the question will be "What next?"