This is so helpful; your prompting example definitely taught me that I can make a prompt quite long and complex and still expect the directions to be followed...
Having come to prompting a bit late, I never realized that early versions of these systems stalled out on complex prompts, so I just started talking to them the way I would talk to undergraduate researchers who needed to classify things systematically. But it turns out that in the past, by the time a prompt pushed 500 words, the instructions at the beginning really were being forgotten as the session progressed.
Yes, we were seeing this with feedback prompts, but we upgraded models. My colleague thinks it's still a problem; I don't think so, but we probably need to test a bit more. I don't know how you manage the volume of testing you do! It's so time-consuming to evaluate the results properly.
The secret is a well-developed test prompt library from years as a misinformation researcher, plus a habit of binging TV shows on the weekend while lazily running tests against a prompt and comparing the outputs to the reference spec (roughly the loop sketched below). Also, never leak your test prompt library, because prompts get spoiled so fast.
But everyone can benefit from even a small test prompt library; maybe some day I'll write about how to make one.
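To make that weekend loop concrete, here is a minimal sketch of one way to run it: read test prompts and their expected content from a CSV (the spreadsheet), send each prompt to a model, and flag any output that misses a required phrase. It assumes the Anthropic Python SDK for the model call; the file name, the column names, and the call_model helper are made up for illustration, not anyone's actual setup.

```python
import csv

import anthropic  # assumes the official Anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def call_model(prompt: str) -> str:
    """Send one test prompt to the model and return its text reply (hypothetical helper)."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # swap in whatever model you are testing
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


def run_suite(path: str = "test_prompts.csv") -> None:
    """Run every prompt in the CSV and report which ones miss their reference spec.

    The CSV is assumed to have two columns: 'prompt' and 'must_contain',
    where 'must_contain' lists phrases (separated by '|') that a correct
    answer should include -- a crude stand-in for a fuller reference spec.
    """
    failures = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            output = call_model(row["prompt"])
            missing = [phrase for phrase in row["must_contain"].split("|")
                       if phrase.lower() not in output.lower()]
            if missing:
                failures.append((row["prompt"][:60], missing))

    for prompt, missing in failures:
        print(f"FAIL: {prompt!r} -- missing {missing}")
    print(f"{len(failures)} failing prompt(s)")


if __name__ == "__main__":
    run_suite()
```

A substring check is obviously a blunt proxy for a real reference spec, but even something this simple is enough to let a suite tick along in the background while the TV is on.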
I'd love to know more! (I'm mystified about prompts being spoiled fast).
I'm toying with the idea of trying a platform set up for testing rather than just me and my spreadsheet. Maybe PromptLayer, Anthropic's Workbench, or OpenAI's Playground.