forked from openai/evals
tasks and metrics for Biomedicine from lmj #13
Open: Linmj-Judy wants to merge 668 commits into main from biomedicine
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑
### Eval name
russian-verse

### Eval description
The most popular Russian poems that nearly every Russian speaker can recall

### What makes this a useful eval?
Understanding a basic Russian poem, or any foreign literature, is significant for a large language model (LLM) like GPT-4 because it enhances multilingual ability, provides cultural context, and improves understanding of language structure. It makes the model globally useful and culturally sensitive.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval
```jsonl
{"input": [{"role": "system", "content": "Continue verse with no punctuation marks:\nМороз и солнце день чудесный\nЕще ты дремлешь друг прелестный \nПора красавица проснись\nОткрой сомкнуты негой взоры\nНавстречу северной Авроры"}], "ideal": "Звездою севера явись"}
{"input": [{"role": "system", "content": "Continue verse with no punctuation marks:\nУ лукоморья дуб зелёный\nЗлатая цепь на дубе том\nИ днём и ночью кот учёный\nВсё ходит по цепи кругом\nИдёт направо песнь заводит"}], "ideal": "Налево сказку говорит"}
{"input": [{"role": "system", "content": "Continue verse with no punctuation marks:\nЯ к вам пишу чего же боле\nЧто я могу еще сказать\nТеперь я знаю в вашей воле\nМеня презреньем наказать\nНо вы к моей несчастной доле"}], "ideal": "Хоть каплю жалости храня"}
{"input": [{"role": "system", "content": "Continue verse with no punctuation marks:\nЯ помню чудное мгновенье\nПередо мной явилась ты\nКак мимолетное виденье\nКак гений чистой красоты\nВ томленьях грусти безнадежной"}], "ideal": "В тревогах шумной суеты"}
{"input": [{"role": "system", "content": "Continue verse with no punctuation marks:\nЛюбви надежды тихой славы\nНедолго нежил нас обман\nИсчезли юные забавы\nКак сон как утренний туман\nНо в нас горит еще желанье"}], "ideal": "Под гнетом власти роковой"}
```
</details>
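
For anyone checking samples like these before submitting, here is a minimal sketch of a shape check plus a whitespace-tolerant exact-match comparison. The helper names `validate_sample` and `exact_match` are hypothetical and written for illustration; this is not the evals framework's own matching logic, which may compare completions differently.

```python
import json

# Keys every sample line above carries.
REQUIRED_KEYS = {"input", "ideal"}


def validate_sample(line: str) -> dict:
    """Parse one JSONL line and check it has the shape of the samples above.

    Hypothetical helper, not part of the evals library.
    """
    sample = json.loads(line)
    if not isinstance(sample, dict):
        raise ValueError("each line must be a JSON object")
    missing = REQUIRED_KEYS - set(sample)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for message in sample["input"]:
        if not {"role", "content"} <= set(message):
            raise ValueError("each input message needs 'role' and 'content'")
    if not isinstance(sample["ideal"], str) or not sample["ideal"].strip():
        raise ValueError("'ideal' must be a non-empty string")
    return sample


def exact_match(completion: str, ideal: str) -> bool:
    """Assumed comparison: equal after trimming surrounding whitespace."""
    return completion.strip() == ideal.strip()


# Build a line from the second sample above (prompt shortened for brevity).
line = json.dumps(
    {
        "input": [{"role": "system", "content": "Continue verse with no punctuation marks:\n..."}],
        "ideal": "Налево сказку говорит",
    },
    ensure_ascii=False,
)
sample = validate_sample(line)
print(exact_match(" Налево сказку говорит\n", sample["ideal"]))
```

Building the line with `json.dumps` keeps the `\n` escapes and the Cyrillic text valid without hand-escaping.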
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

## Eval details 📑
### Eval name
cybersecurity-filepaths

### Eval description
Assesses cybersecurity skills by identifying the malicious Windows filepath in a given collection of filepaths. Over time, experienced threat analysts learn to recognize the patterns of malicious filepaths, and this eval tests that ability with a set of tricky clean applications and malicious attacks that an expert human can identify based on years of experience. The eval includes 93 tests constructed from 44 clean filepaths and 32 malicious filepaths. Only malicious filepaths that stand out to an expert security researcher based on context were included, and the clean paths were also chosen to be challenging and realistic. Usernames are stripped from the filepaths. Each test is constructed by selecting a single malicious filepath and mixing it with 5, 10, or 20 clean filepaths to model increasing levels of difficulty.
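
The construction described above can be sketched roughly as follows. This is an illustrative reconstruction, not the script used to generate the dataset: `build_sample` and its parameters are hypothetical, the prompt strings are copied from the samples in this PR, and the handling of the drive prefix (the prompt ends with `Answer: C:\` and `ideal` omits the first three characters of the path) is inferred from those samples.

```python
import random

SYSTEM_PROMPT = (
    "You are acting as an expert cybersecurity researcher reviewing telemetry "
    "logs of filepaths in your organization. The task is to identify the "
    "most-malicious looking filepath given a set of filepaths."
)
FENCE = "`" * 3  # the samples wrap the path list in triple backticks
USER_PREFIX = (
    "Here are a bunch of filepaths in my organization. "
    "Please tell me which one looks the most malicious. "
)
ANSWER_SUFFIX = " Answer: C:\\"


def build_sample(malicious: str, clean_pool: list, n_clean: int, rng: random.Random) -> dict:
    """Mix one malicious path with n_clean clean paths (5, 10, or 20 in the eval)."""
    distractors = rng.sample(clean_pool, n_clean)
    paths = distractors + [malicious]
    rng.shuffle(paths)  # assumption: the malicious path's position is randomized
    user_content = USER_PREFIX + FENCE + ",".join(paths) + FENCE + ANSWER_SUFFIX
    return {
        "input": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        # Drop the three-character drive prefix, since the prompt already ends with it.
        "ideal": malicious[3:],
    }


# Paths taken from the samples in this PR.
clean_pool = [
    "C:\\Windows\\System32\\drivers\\iaStorAV.sys",
    "C:\\Windows\\SysWOW64\\mfc120deu.dll",
    "C:\\Program Files\\Sharp\\File-Copy\\File-Copy.exe",
    "C:\\Windows\\Installer\\a8a9a8f.msi",
    "C:\\Program Files\\qBittorrent\\qbittorrent.exe",
    "C:\\Windows\\System32\\cmd.exe",
]
sample = build_sample("C:\\Windows\\svchost.exe", clean_pool, n_clean=5, rng=random.Random(0))
print(sample["ideal"])
```

Passing a seeded `random.Random` keeps the generated set reproducible, which helps when comparing accuracy across the 5, 10, and 20 clean-path difficulty levels.
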
gpt-3.5-turbo scores 18.3% accuracy, and based on some manual tests it looks like ChatGPT 4 might score around 80% (though I'll need access to check).

### What makes this a useful eval?
Getting an accurate understanding of the cybersecurity space poses a unique challenge. Content scraped from the web might over-index on certain filepaths being related to malware. For example, SolarWinds-related filepaths are generally clean but may be incorrectly understood as malware because of the amount of web content about the supply chain attack that leveraged their software. Applying ChatGPT 4 to the cybersecurity landscape to protect the world against attacks is an important emerging space for these large models. They are starting to show human-like insight into telemetry. If we can deploy the type of intelligence demonstrated by this eval at a cheaper cost point, we'd be able to make a big difference in protecting the world from attacks.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval
```jsonl
{"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Program Files (x86)\\Google\\Chrome\\Application\\113.0.5672.64\\Installer\\chrome.7z,C:\\Program Files (x86)\\Prococious Technology Inc\\ClearDent\\txtools.dll,C:\\ProgramData\\checkmk\\agent\\modules\\python-3\\.venv\\Scripts\\python.exe,C:\\Program Files (x86)\\EaseUS\\Todo Backup\\bin\\Loader.exe,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\AppData\\Roaming\\chrome\\chrome.exe``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\chrome\\chrome.exe"}
{"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. 
Please tell me which one looks the most malicious. ```C:\\Windows\\assembly\\NativeImages_v2.0.50727_64\\mscorlib\\b40ad7167488f81c80abc2de7e8c3b9c\\mscorlib.ni.dll,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar,C:\\Program Files (x86)\\Prococious Technology Inc\\ClearDent\\txtools.dll,C:\\Users\\username\\Desktop\\rufus-3.20p (1).exe,C:\\Windows\\svchost.exe,C:\\Windows\\SysWOW64\\mfc120deu.dll``` Answer: C:\\"}], "ideal": "Windows\\svchost.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Program Files (x86)\\YouKu\\YoukuClient\\plugins\\accs.dll,C:\\Program Files (x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\ProgramData\\muIWiyPVLEpuZPVB\\QVEVjLh.wsf,C:\\windows\\system32\\fsprocsvc.exe,C:\\LDPlayer\\LDPlayer9\\libssl-1_1.dll``` Answer: C:\\"}], "ideal": "ProgramData\\muIWiyPVLEpuZPVB\\QVEVjLh.wsf"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\ProgramData\\{2FA73671-7C6C-4bf8-9F0C-E1A75A3860ED}.tmp\\HardwareProtectEx_x64.sys,C:\\Program Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe,C:\\Program Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar,C:\\Users\\username\\AppData\\Roaming\\Microsoft\\unityhub.exe``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\Microsoft\\unityhub.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Windows\\SysWOW64\\drivers\\lsass.exe,C:\\Users\\username\\Downloads\\Inventory Update Tool GDSM v5.docx,C:\\Users\\username\\Downloads\\NTW3-9.2-Campaigns.exe,C:\\Windows\\Redirected\\Sync\\CCC_Certificates.vbs,C:\\Program Files (x86)\\EaseUS\\Todo Backup\\bin\\Loader.exe,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar``` Answer: C:\\"}], "ideal": "Windows\\SysWOW64\\drivers\\lsass.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Program Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Windows\\Redirected\\Sync\\CCC_Certificates.vbs,C:\\Program Files (x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll,C:\\Windows\\Installer\\a8a9a8f.msi,C:\\Windows\\Resources\\spoolsv.exe,C:\\Program Files (x86)\\Zscaler\\ZSATunnel\\ZSATunnel.exe``` Answer: C:\\"}], "ideal": "Windows\\Resources\\spoolsv.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Program Files (x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Users\\username\\AppData\\Roaming\\imageclass\\Library.dll,C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Program Files (x86)\\Steam\\steamapps\\common\\Grand Theft Auto V\\GTA5.exe,C:\\Program Files (x86)\\EaseUS\\Todo Backup\\bin\\Loader.exe,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\imageclass\\Library.dll"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\windows\\system32\\fsprocsvc.exe,C:\\Users\\username\\Downloads\\Inventory Update Tool GDSM v5.docx,C:\\Users\\username\\Downloads\\Order_Verication_363900421 (1).docm,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Program Files (x86)\\Steam\\steamapps\\common\\Grand Theft Auto V\\GTA5.exe``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Order_Verication_363900421 (1).docm"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Users\\username\\Downloads\\Invoice_1232.docm,C:\\Users\\username\\Downloads\\Rotation Schedule Planner for April 2023.doc,C:\\Windows\\System32\\svchost.exe,C:\\Windows\\System32\\SECOMN64.exe,C:\\Windows\\Redirected\\Sync\\CCC_Certificates.vbs,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Invoice_1232.docm"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Users\\username\\Downloads\\validationDocument (1).docm,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\ProgramData\\checkmk\\agent\\modules\\python-3\\.venv\\Scripts\\python.exe,C:\\Program Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\validationDocument (1).docm"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Users\\username\\Downloads\\Rotation Schedule Planner for April 2023.doc,C:\\windows\\system32\\fsprocsvc.exe,C:\\Users\\username\\Downloads\\PaymentReceipt.docm,C:\\Program Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Program Files (x86)\\SolarWinds\\Agent\\Plugins\\ADMProbe\\SolarWinds.ADM.AgentPlugin.exe``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\PaymentReceipt.docm"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\assembly\\NativeImages_v2.0.50727_64\\mscorlib\\b40ad7167488f81c80abc2de7e8c3b9c\\mscorlib.ni.dll,C:\\Program Files\\SoftEther VPN Client\\vpnsetup.exe,C:\\Users\\username\\Downloads\\scan.docm,C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\Windows\\Installer\\a8a9a8f.msi``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\scan.docm"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Program Files\\SoftEther VPN Client\\vpnsetup.exe,C:\\Program Files\\qBittorrent\\qbittorrent.exe,C:\\Users\\username\\AppData\\Local\\2345Explorer\\User Data\\Default\\Extensions\\lfjjlbddikjohbgnamejecaegefncbli\\1.0_0\\static\\js\\bfb.js,C:\\Users\\username\\Downloads\\Receipt(7682).docm,C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Receipt(7682).docm"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Program Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe,C:\\LDPlayer\\LDPlayer9\\libssl-1_1.dll,C:\\Program Files (x86)\\Zscaler\\ZSATunnel\\ZSATunnel.exe,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\Downloads\\Advice_Paymet_IBGC825120123123254 Pdf.html,C:\\Windows\\System32\\cmd.exe``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Advice_Paymet_IBGC825120123123254 Pdf.html"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Program Files (x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Program Files (x86)\\SolarWinds\\Agent\\Plugins\\ADMProbe\\SolarWinds.ADM.AgentPlugin.exe,C:\\Windows\\System32\\cmd.exe,C:\\Program Files (x86)\\YouKu\\YoukuClient\\plugins\\accs.dll,C:\\Users\\username\\Downloads\\Complete_Setup_Use_2023_As_Passwrd.rar,C:\\Program Files\\OpeniT\\Core\\bin\\curl.exe``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Complete_Setup_Use_2023_As_Passwrd.rar"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\AppData\\Roaming\\3545695800.exe,C:\\Users\\username\\Desktop\\rufus-3.20p (1).exe,C:\\Program Files\\OpeniT\\Core\\bin\\curl.exe,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Windows\\System32\\svchost.exe``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\3545695800.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\PROGRA~1\\COMMON~1\\System\\symsrv.dll,C:\\Users\\username\\Downloads\\Inventory Update Tool GDSM v5.docx,C:\\Program Files (x86)\\YouKu\\YoukuClient\\plugins\\accs.dll,C:\\Program Files (x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Program Files (x86)\\EaseUS\\Todo Backup\\bin\\Loader.exe,C:\\Windows\\System32\\cmd.exe``` Answer: C:\\"}], "ideal": "PROGRA~1\\COMMON~1\\System\\symsrv.dll"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Program Files\\qBittorrent\\qbittorrent.exe,C:\\Windows\\assembly\\NativeImages_v2.0.50727_64\\mscorlib\\b40ad7167488f81c80abc2de7e8c3b9c\\mscorlib.ni.dll,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Program Files (x86)\\SolarWinds\\Agent\\Plugins\\ADMProbe\\SolarWinds.ADM.AgentPlugin.exe,C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe,C:\\Windows\\SysWOW64\\WMIScriptingAPI\\nssy.exe``` Answer: C:\\"}], "ideal": "Windows\\SysWOW64\\WMIScriptingAPI\\nssy.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Windows\\System32\\cmd.exe,C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Windows\\Temp\\analyse.exe,C:\\windows\\system32\\fsprocsvc.exe,C:\\ProgramData\\checkmk\\agent\\modules\\python-3\\.venv\\Scripts\\python.exe,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar``` Answer: C:\\"}], "ideal": "Windows\\Temp\\analyse.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```D:\\RECYCLER.BIN\\hex.dll,C:\\Program Files (x86)\\Google\\Chrome\\Application\\113.0.5672.64\\Installer\\chrome.7z,C:\\ProgramData\\Overwolf\\Setup\\0.221.109.14\\OverwolfSetup.exe,C:\\Program Files (x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll,C:\\Windows\\System32\\cmd.exe,C:\\Windows\\SysWOW64\\mfc120deu.dll``` Answer: C:\\"}], "ideal": "RECYCLER.BIN\\hex.dll"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\lmc-1.2.35-win32\\mingwm10.dll,C:\\Program Files (x86)\\EaseUS\\Todo Backup\\bin\\Loader.exe,C:\\Program Files (x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Program Files (x86)\\Google\\Update\\GoogleUpdate.exe,C:\\Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\ui.exe,C:\\Windows\\System32\\svchost.exe``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\ui.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\System32\\SECOMN64.exe,C:\\Program Files (x86)\\SolarWinds\\Agent\\SolarWinds.Agent.Service.exe,C:\\Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\drive.exe,C:\\Users\\username\\AppData\\Local\\2345Explorer\\User Data\\Default\\Extensions\\lfjjlbddikjohbgnamejecaegefncbli\\1.0_0\\static\\js\\bfb.js,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Program Files (x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\drive.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\ProgramData\\MslBooster\\WindowsPaint-Ver3.8.8.3.exe,C:\\Users\\username\\Desktop\\0-848\\Shotcut\\lib\\mlt\\libmltplus.dll,C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe,C:\\Windows\\Installer\\a8a9a8f.msi,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\ProgramData\\Overwolf\\Setup\\0.221.109.14\\OverwolfSetup.exe``` Answer: C:\\"}], "ideal": "ProgramData\\MslBooster\\WindowsPaint-Ver3.8.8.3.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Program Files (x86)\\Steam\\steamapps\\common\\Grand Theft Auto V\\GTA5.exe,C:\\Program Files\\Sharp\\File-Copy\\File-Copy.exe,C:\\Users\\username\\Downloads\\Inventory Update Tool GDSM v5.docx,C:\\Windows\\System32\\drivers\\iaStorAV.sys,C:\\Users\\username\\AppData\\Roaming\\1z6E2nwrGf05i3P\\jChGsIxjHw.exe,C:\\Windows\\System32\\cmd.exe``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Roaming\\1z6E2nwrGf05i3P\\jChGsIxjHw.exe"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Program Files (x86)\\Steam\\steamapps\\common\\Grand Theft Auto V\\GTA5.exe,C:\\Windows\\Temp\\wk_4996_28103\\wk.dll,C:\\Program Files (x86)\\Zscaler\\ZSATunnel\\ZSATunnel.exe,C:\\ProgramData\\modPackages\\mod\\AgentFix Client\\1.0\\Source\\Setup.exe,C:\\Users\\username\\Desktop\\rufus-3.20p (1).exe,C:\\Program Files\\qBittorrent\\qbittorrent.exe``` Answer: C:\\"}], "ideal": "Windows\\Temp\\wk_4996_28103\\wk.dll"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Windows\\SysWOW64\\mfc120deu.dll,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\Program Files (x86)\\Prococious Technology Inc\\ClearDent\\txtools.dll,C:\\Program Files\\Geni\\Genian\\Patch\\08D5ACEBCC0FADDF17FEDCC87ED25A4CC845A707.exe,C:\\WINDOWS\\system32\\drivers\\v7JViu6IUh.sys,C:\\Program Files (x86)\\VideoLAN\\VLC\\plugins\\libfilesystem_plugin.dll``` Answer: C:\\"}], "ideal": "WINDOWS\\system32\\drivers\\v7JViu6IUh.sys"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. ```C:\\Users\\username\\AppData\\Local\\Microsoft\\OneDrive\\Secur32.dll,C:\\Users\\username\\Desktop\\Pinball-Arcade.rar,C:\\ProgramData\\{2FA73671-7C6C-4bf8-9F0C-E1A75A3860ED}.tmp\\HardwareProtectEx_x64.sys,C:\\Users\\username\\AppData\\Local\\Temp\\Rar$DRa0.960\\1.5.2\\jPortable\\lib\\ext\\sunec.jar,C:\\Windows\\system32\\drivers\\SkyRGDrv.sys,C:\\Program Files (x86)\\YouKu\\YoukuClient\\plugins\\accs.dll``` Answer: C:\\"}], "ideal": "Users\\username\\AppData\\Local\\Microsoft\\OneDrive\\Secur32.dll"} {"input": [{"role": "system", "content": "You are acting as an expert cybersecurity researcher reviewing telemetry logs of filepaths in your organization. The task is to identify the most-malicious looking filepath given a set of filepaths."}, {"role": "user", "content": "Here are a bunch of filepaths in my organization. Please tell me which one looks the most malicious. 
```C:\\Users\\username\\Downloads\\NTW3-9.2-Campaigns.exe,C:\\Users\\username\\Downloads\\Set-up.exe,C:\\Program Files (x86)\\Google\\Chrome\\Application\\113.0.5672.64\\Installer\\chrome.7z,C:\\Users\\username\\Desktop\\rufus-3.20p (1).exe,C:\\Users\\username\\AppData\\Local\\Temp\\nsm4108.tmp\\UAC.dll,C:\\Windows\\System32\\cmd.exe``` Answer: C:\\"}], "ideal": "Users\\username\\Downloads\\Set-up.exe"} ``` </details>
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name medication_dose ### Eval description This tests an LLM's ability to identify medication doses that are inappropriate for therapeutic use. ### What makes this a useful eval? For LLMs to have a role in medical applications, the ability to recognize medication doses that are outside the therapeutic range is a very important function. I have found that these models frequently fail at this task. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value LLMs have tremendous potential for medical applications. Many future applications will need the LLM to be able to identify appropriate medication dose ranges. I have found that current LLMs frequently fail at this task, and this is an area where improvement would be important. ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? 
Metronidazole 500 mg, Icatibant 30 mg, Haegarda 3000 IU, Docusate 100 mg, Levofloxacin 750 mcg, Famotidine 40 mg, Fentanyl 100 mcg, Budesonide 0.25 mg, Metoprolol 12.5 mg, Atenolol 50 mg, Labetalol 100 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Levofloxacin"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Docusate 100 mg, Diazepam 5 mg, Lanoxin 250 mcg, Privigen 25 g, Doxycycline 100 mg, CellCept 1.5 g, Xolair 300 mcg, Hydrocortisone 1%, Keflex 250 mg, Naloxone 400 mcg, Famotidine 40 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Xolair"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Budesonide 0.25 mg, Theophylline 450 mg, Humira 160 mg, Haegarda 3000 IU, Neoral 50 mg, Metronidazole 500 mg, Formoterol 20 mcg, Advair 500 mcg, Zosyn 3.375 mg, Furosemide 20 mg, Dilantin 100 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Zosyn"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Breo 200/25 mcg, Acetaminophen 1000 mg, Claratin 10 mg, Gammunex 100 gram, Levothyroxine 125 mg, Albuterol 108 mcg, Lovenox 40 mg, Betapace 120 mg, Levofloxacin 500 mg, Nystatin 100000 U, Warfarin 6.5 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Levothyroxine"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. 
Of the following medications and doses which is incorrect for a standard dose for that medication? Azithromycin 500 mg, Heparin 5000 U, Atenolol 50 mg, Betapace 120 mg, Budesonide 0.25 mg, Privigen 25 g, Furosemide 20 mg, Humira 160 mg, Keflex 250 mg, Verapamil 40 mg, Symbicort 500/4.5 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Symbicort"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Famotidine 40 mg, Digoxin 0.125 mg, Rifampin 150 mg, Albuterol 108 mcg, Allegra 60 mg, Azithromycin 5 mg, Spiriva 1.25 mcg, Warfarin 6.5 mg, Nasacort 220 mcg, Cetirizine 10 mg, Azithromycin 500 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Azithromycin"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Acyclovir 200 mg, Patanol 0.2%, Famotidine 40 mg, Heparin 5000 U, Gammunex 100 gram, Prednisone 40 mg, Amitriptyline 25 mg, Betapace 120 mg, Sotolol 80 mg, Cetirizine 100 mg, Loperamide 2 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Cetirizine"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Fasenra 30 mg, Sotolol 80 mg, Levothyroxine 125 mcg, Digoxin 0.125 mg, Budesonide 1 gm, Loperamide 2 mg, Humira 160 mg, Patanol 0.2%, Dilantin 100 mg, Rifampin 150 mg, Keflex 250 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Budesonide"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. 
Of the following medications and doses which is incorrect for a standard dose for that medication? Fasenra 130 mg, Doxycycline 100 mg, Patanol 0.2%, Pulmicort 1 mg, Flonase 50 mcg, Tiotropium 2.5 mcg, Nasacort 220 mcg, Acetaminophen 1000 mg, Icatibant 30 mg, Cetirizine 10 mg, Theophylline 450 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Fasenra"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Plaquenil 400 mg, Atorvastatin 2000 mg, Lovenox 40 mg, Keflex 250 mg, Pulmicort 1 mg, Versed 5 mg, Spiriva 1.25 mcg, Cyclosporine 100 mg, Ventolin 90 mcg, Icatibant 30 mg, Loperamide 2 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Atorvastatin"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Tiotropium 2.5 mcg, Plaquenil 400 mg, Azelastine 137 mg, Haegarda 3000 IU, Albendazole 400 mg, Phenytoin 30 mg, Naloxone 400 mcg, Symbicort 160/4.5 mcg, Isoniazid 100 mg, Diazepam 5 mg, Metformin 500 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Azelastine"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? 
Budesonide 0.25 mg, Fexofenadine 180 mg, Keflex 250 mg, Fexofenadine 1.8 mg, Allegra 60 mg, Flonase 50 mcg, Rhinocort 32 mcg, Azithromycin 500 mg, Sotolol 80 mg, Fluoxetine 20 mg, Pulmicort 1 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Fexofenadine"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Flecainide 50 mg, Claratin 10 mg, Sotolol 80 mg, Atenolol 50 mg, Metformin 500 mg, Flonase 50 mcg, Advair 5 mcg, Labetalol 100 mg, Nasacort 220 mcg, Fentanyl 100 mcg, Budesonide 0.25 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Advair"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Prednisone 40 g, Digoxin 0.125 mg, Budesonide 0.25 mg, Icatibant 30 mg, Betapace 120 mg, Fluconazole 50 mg, Patanol 0.2%, Heparin 5000 U, Coumadin 3.5 mg, Metformin 500 mg, Fasenra 30 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Prednisone"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Fluconazole 50 mg, Furosemide 20 mg, Verapamil 40 mg, Nystatin 100000 U, Augmentin 875 mg, Augmentin 8.75 mg, Theophylline 450 mg, Lanoxin 250 mcg, Neoral 50 mg, Nasacort 220 mcg, Diazepam 5 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Augmentin"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? 
Haegarda 3000 IU, Lithium 600 mg, Acyclovir 200 mg, CellCept 1.5 g, Fasenra 30 mg, Metformin 500 mg, Albendazole 400 mg, Advair 500 mcg, Zofran 4 mg, Ciprofloxacin 500 mg, Haegarda 3000 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Haegarda"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Orapred 15 mg, Carvedilol 3.125 mg, Lanoxin 250 mcg, Lovenox 40 mg, Loperamide 2 mg, Hydrocortisone 1%, Diazepam 5 mg, Zosyn 3.375 g, Warfarin 6.5 mg, Synthroid 88 mg, Cetirizine 10 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Synthroid"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? Patanol 0.2%, Icatibant 30 g, Gammunex 100 gram, Versed 5 mg, Formoterol 20 mcg, Rhinocort 32 mcg, Metformin 500 mg, Motrin 800 mg,, Enoxaprin 30 mg, Metoprolol 12.5 mg, Xolair 300 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Icatibant"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? CellCept 1.5 g, Flonase 50 mcg, Trimethoprim/Sulfamethoxazole 160 mg/800 mg, Fluconazole 50 mg, Verapamil 40 mg, Loperamide 2 mg, Prednisone 40 mg, Pepcid 20 mg, Plaquenil 4 mcg, Nystatin 100000 U, Labetalol 100 mg"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Plaquenil"} {"input": [{"role": "system", "content": "Answer this question as concisely as possible. Of the following medications and doses which is incorrect for a standard dose for that medication? 
Carvedilol 3.125 mg, Digoxin 0.125 mg, Fexofenadine 180 mg, Nasacort 220 mcg, Zofran 4 mg, Fluoxetine 20 mg, Mupirocin 2%, CellCept 1.5 g, Keflex 250 mg, Atorvastatin 20 mg, Amitriptyline 25 g"}, {"role": "user", "content": "The medication that is not within a standard dose range is: "}], "ideal": "Amitriptyline"} ``` </details>
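Because each `ideal` in the medication_dose samples is a bare drug name, a quick consistency check is to confirm that the name actually appears in the list given in the system prompt. The following is a minimal sketch of such a check, not part of the submitted eval; the helper names `medication_names` and `ideal_is_listed` are hypothetical, and it assumes every entry starts with the drug name followed by a space and the dose.

```python
import json

# Assumption: the list of medications follows this phrase in the system prompt.
QUESTION_END = "for that medication?"


def medication_names(system_content):
    """Return the drug names listed after the question, in order."""
    _, _, listing = system_content.partition(QUESTION_END)
    names = []
    for item in listing.split(","):
        item = item.strip()
        if item:  # skip empty entries produced by a doubled comma
            names.append(item.split()[0])
    return names


def ideal_is_listed(jsonl_line):
    """True if the sample's ideal answer is one of the listed drug names."""
    sample = json.loads(jsonl_line)
    system = next(m["content"] for m in sample["input"] if m["role"] == "system")
    return sample["ideal"] in medication_names(system)
```

Running `ideal_is_listed` over each line of the JSONL file would flag samples whose expected answer is misspelled relative to the prompt. It does not judge whether the dose itself is wrong.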
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name context-free-grammar ### Eval description This tests the ability of GPT-4 to determine whether a string can be produced by a given context-free grammar. ### What makes this a useful eval? This is an interesting computational task. Context-free languages are important in linguistics, and it will be interesting to see how a language model fares in handling this task. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be: - [✅] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [✅] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [✅] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [✅] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) I've handcrafted a lot of these examples. Some of them are there to 'trick' the model; I think it will be a useful test of how well the language model handles those. ## Eval structure 🏗️ Your eval should - [✅] Check that your data is in `evals/registry/data/{name}` - [✅] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [✅] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).
- [✅] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [✅] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [✅] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [✅] I have filled out all required fields of this form - [✅] I have used **Git LFS** for the Eval JSON data - [✅] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system","content": "You will be provided with a context-free grammar, and a string. Respond whether or not that string can be produced by the given grammar. Respond only with the word true or false; you do not need to show your work. 
The empty string is represented with ϵ, and productions are separated with the symbol ;. The alphabet is always {0, 1}."}, {"role": "user", "content": "Given the grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '01010101' in the language?"}], "ideal": "true"} {"input": [{"role": "system","content": "You will be provided with a context-free grammar, and a string. Respond whether or not that string can be produced by the given grammar. Respond only with the word true or false; you do not need to show your work. The empty string is represented with ϵ, and productions are separated with the symbol ;. The alphabet is always {0, 1}."}, {"role": "user", "content": "Given the grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '00011101' in the language?"}], "ideal": "true"} {"input": [{"role": "system","content": "You will be provided with a context-free grammar, and a string. Respond whether or not that string can be produced by the given grammar. Respond only with the word true or false; you do not need to show your work. The empty string is represented with ϵ, and productions are separated with the symbol ;. The alphabet is always {0, 1}."}, {"role": "user", "content": "Given the grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '00110101' in the language?"}], "ideal": "true"} {"input": [{"role": "system","content": "You will be provided with a context-free grammar, and a string. Respond whether or not that string can be produced by the given grammar. Respond only with the word true or false; you do not need to show your work. The empty string is represented with ϵ, and productions are separated with the symbol ;. The alphabet is always {0, 1}."}, {"role": "user", "content": "Given the grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '01001101' in the language?"}], "ideal": "true"} {"input": [{"role": "system","content": "You will be provided with a context-free grammar, and a string. Respond whether or not that string can be produced by the given grammar. 
Respond only with the word true or false; you do not need to show your work. The empty string is represented with ϵ, and productions are separated with the symbol ;. The alphabet is always {0, 1}."}, {"role": "user", "content": "Given the grammar 'S -> 01 | SS | 0S1 | ϵ', is the string '01010011' in the language?"}], "ideal": "true"} ``` </details> --------- Co-authored-by: Arjun Taneja <[email protected]>
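The `ideal` labels in the context-free-grammar samples can be cross-checked mechanically. The following is a minimal sketch of a membership test, not the submitter's own tooling: the function names `parse_grammar` and `derives` are hypothetical, and it assumes every grammar symbol is a single character, productions are separated by `;`, alternatives by `|`, and the empty string is written `ϵ`, as described in the system prompt. It computes, by fixed-point iteration, which substrings each nonterminal can derive, which stays well-behaved even with rules such as `S -> SS` and `S -> ϵ`.

```python
def parse_grammar(text):
    """Parse 'S -> 01 | SS | 0S1 | ϵ; A -> ...' into {nonterminal: [bodies]}."""
    rules = {}
    for production in text.split(";"):
        if not production.strip():
            continue
        head, _, bodies = production.partition("->")
        alternatives = [b.strip() for b in bodies.split("|")]
        # The empty string is stored as an empty body.
        rules.setdefault(head.strip(), []).extend(
            "" if b in ("ϵ", "ε") else b for b in alternatives
        )
    return rules


def derives(rules, start, s):
    """Return True if `start` can derive the string `s` under `rules`."""
    n = len(s)
    # table[A] holds every (i, j) such that A is known to derive s[i:j].
    table = {a: set() for a in rules}
    changed = True
    while changed:  # repeat until no new (i, j) span is discovered
        changed = False
        for head, bodies in rules.items():
            for body in bodies:
                for i in range(n + 1):
                    ends = {i}  # positions reachable after matching a body prefix
                    for sym in body:
                        nxt = set()
                        for k in ends:
                            if sym in rules:  # nonterminal: reuse known spans
                                nxt |= {j for (a, j) in table[sym] if a == k}
                            elif k < n and s[k] == sym:  # terminal: match one char
                                nxt.add(k + 1)
                        ends = nxt
                        if not ends:
                            break
                    for j in ends:
                        if (i, j) not in table[head]:
                            table[head].add((i, j))
                            changed = True
    return (0, n) in table[start]
```

For example, `derives(parse_grammar("S -> 01 | SS | 0S1 | ϵ"), "S", "00011101")` can be compared against the `ideal` field of the corresponding sample. The table is finite and only grows, so the loop terminates.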
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name 3D Globe Movement ### Eval description This eval tests an LLM's ability to understand 3D movement through space, in particular movement on or through planet Earth. Each example provides a starting point and a path consisting of one or two movements, and the expected answer is a state/province or ocean. Similar to the evals from openai#462 and openai#1060, this eval shows how difficult movement is for LLMs to understand, and builds upon those by showing how the problem is seemingly magnified by 3D movement and/or by requesting a region as the answer rather than numerical positions. Testing on gpt-3.5-turbo, accuracy ranges from ~0.24 to ~0.31. ### What makes this a useful eval?
This eval demonstrates that a long series of steps is not necessary in order to create a path that GPT is unable to follow, and that a simple trip to the planet's core and back again, with a slight offset in any direction, will often get it lost. Whereas it can often handle "travel 15 degrees East", hiding that actual travel behind a 2-step 3D path significantly hurts the model's performance. Interestingly, this eval seems to demonstrate the statistical rather than reasoning nature of GPT, as regardless of starting point or path, it shows a bias towards answering with oceans, especially the Pacific Ocean. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. 
(Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. 
### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 15 degree offset to the East"}], "ideal": "Colorado"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of San Francisco and travel due East 15 degrees"}], "ideal": "Colorado"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel due East 0.1 degrees"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel due East 1 meter"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of San Francisco and travel 0 meter"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and make one complete revolution around the planet"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 1 degree offset to the East"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 0.1 degree offset to the East"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 0.01 degree offset to the East"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 1 degree offset to the West"}], "ideal": "Pacific Ocean"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. 
Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 90 degree offset to the East"}], "ideal": "Atlantic Ocean"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 180 degree offset to the East"}], "ideal": "Ahal"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of San Francisco and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 1 degree offset to the North"}], "ideal": "California"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. 
For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of Kansas City and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 15 degree offset to the East"}], "ideal": "West Virginia"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of Kansas City and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 15 degree offset to the West"}], "ideal": "Utah"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of New York City and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 1 degree offset to the East"}], "ideal": "Atlantic Ocean"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of New York City and travel due East 10 degrees"}], "ideal": "Atlantic Ocean"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of New York City and travel due East 500 kilometers"}], "ideal": "Atlantic Ocean"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of New York City and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 1 degree offset to the West"}], "ideal": "New Jersey"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of New York City and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 10 degree offset to the West"}], "ideal": "Ohio"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of New York City and travel due West 10 degrees"}], "ideal": "Ohio"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of New York City and travel due West 15 degrees"}], "ideal": "Illinois"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of New York City and travel due West 25 degrees"}], "ideal": "Nebraska"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of New York City and travel due West 30 degrees"}], "ideal": "Colorado"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. 
Start at the center of New York City and travel due West 35 degrees"}], "ideal": "Colorado"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of Oklahoma City and travel directly downwards to the center of the Earth, then straight up, back to the surface, at a 0.1 degree offset to the East"}], "ideal": "Oklahoma"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of Oklahoma City and travel directly downwards to the center of the Earth, then straight up, back to the surface along the same path that was traveled downards."}], "ideal": "Oklahoma"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. 
Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of Oklahoma City and travel directly downwards to the center of the Earth, then continue onwards to the other side of the planet"}], "ideal": "Xinjiang"} {"input": [{"role": "system", "content": "Pretend you are 3DPathGPT, a model that can accurately track 3D paths on, around, and through planet Earth based on plain language descriptions. For each path, determine the state, province, or ocean of the final destination. You may assume the object traveling the path encounters no resistance, such as if it were a neutrino. Try reasoning through the 3D path one step at a time, and at the end, provide the final answer enclosed in square brackets like [Europe]. Start at the center of Oklahoma City and travel directly downwards through to the other side of the Earth"}], "ideal": "Xinjiang"} ``` </details>
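The samples above pair each "down to the core and back up at an N degree offset" path with the same expected answer as the plain "travel due East N degrees" path (both 15 degree San Francisco samples expect Colorado). A minimal sketch of that reading follows, assuming the offset is a pure longitude shift at constant latitude. The function names and the approximate San Francisco coordinates are illustrative assumptions, not part of the eval, and the sketch does not map coordinates to a state or ocean.

```python
def shift_longitude(lon_deg: float, offset_east_deg: float) -> float:
    """Move east by offset_east_deg (negative means west) and wrap to [-180, 180)."""
    return (lon_deg + offset_east_deg + 180.0) % 360.0 - 180.0


def resurface(lat_deg: float, lon_deg: float, offset_east_deg: float) -> tuple[float, float]:
    """Assumed reading of 'down to the core, back up at an offset':
    same latitude, longitude shifted by the offset."""
    return lat_deg, shift_longitude(lon_deg, offset_east_deg)


# Approximate coordinates for the center of San Francisco (assumption).
SAN_FRANCISCO = (37.77, -122.42)

if __name__ == "__main__":
    for offset in (15.0, -1.0, 180.0):
        lat, lon = resurface(*SAN_FRANCISCO, offset)
        print(f"offset {offset:+.1f} deg east -> lat {lat:.2f}, lon {lon:.2f}")
```

Under this reading the two-step path reduces to a single longitude addition, which is why the "hidden" and direct versions of a sample share the same ideal answer.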
This is a fix for a problem with https://github.com/openai/evals/blob/19352afd7ba6290b3320d2fe60a990efe9aaebd6/evals/registry/data/dutch-lexicon/samples.jsonl that I noted here: openai#616 (comment)

✅ Correct spelling (line 49):

```json
{"input":[{"role":"system","content":"You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."},{"role":"user","content":"faliekant"}],"ideal":"Y"}
```

✅ Incorrect spelling (line 100):

```json
{"input":[{"role":"system","content":"You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."},{"role":"user","content":"falikant"}],"ideal":"N"}
```

❌ But there's a mixup on line 137 - the correct spelling is marked as "not ideal":

```json
{"input":[{"role":"system","content":"You will be prompted with a single word. Does this word exist in the Dutch language? Answer with exactly one letter: Y or N."},{"role":"user","content":"faliekant"}],"ideal":"N"}
```

Co-authored-by: László van den Hoek <[email protected]>
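A mixup like this one (the same input labeled both `Y` and `N`) could be caught mechanically. The following is a minimal sketch, assuming each line of the samples file is a JSON object with `input` and `ideal` keys as in the excerpts above. The helper name `find_conflicts` is hypothetical and not part of the evals repository.

```python
import json
from collections import defaultdict


def find_conflicts(lines):
    """Return {canonical_input: [(line_number, ideal), ...]} for every input
    that appears with more than one distinct ideal answer."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        sample = json.loads(line)
        # Serialize the input with sorted keys so identical prompts compare equal.
        key = json.dumps(sample["input"], sort_keys=True)
        seen[key].append((number, sample["ideal"]))
    return {
        key: hits
        for key, hits in seen.items()
        if len({json.dumps(ideal) for _, ideal in hits}) > 1
    }


if __name__ == "__main__":
    demo = [
        '{"input":[{"role":"user","content":"faliekant"}],"ideal":"Y"}',
        '{"input":[{"role":"user","content":"falikant"}],"ideal":"N"}',
        '{"input":[{"role":"user","content":"faliekant"}],"ideal":"N"}',
    ]
    for key, hits in find_conflicts(demo).items():
        print("conflicting labels at lines", [n for n, _ in hits])
```

To check a real file, pass an open file handle, for example `find_conflicts(open(path, encoding="utf-8"))`, where `path` points at the samples file.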
# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

simple_math logic_and_probability

### Eval description

Eval that checks the ability to answer simple math questions. Eval that checks the ability to answer logical physics and statistics questions.

### What makes this a useful eval?

ChatGPT fails on a simple car-arrival problem. ChatGPT also fails on a simple car elevator probability question and the famous ant-arrival problem.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be: - [ ] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [ ] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [ ] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [ ] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [ ] Check that your data is in `evals/registry/data/{name}` - [ ] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [ ] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [ ] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. 
### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [ ] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [ ] I have filled out all required fields of this form - [ ] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl INSERT_EVAL_HERE ``` </details>
# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

mapping_to_matricies

### Eval description

Given an array of binary values (0 or 1), a request is made for the array to be mapped to a two-dimensional array. The length of the original array must be evenly divisible by the dimensions of the two-dimensional array (e.g. an array of length 12 is evenly mappable onto a 3x4 two-dimensional array). An evaluation is made by comparing the final row of the mapped 2D array with the corresponding values of the original array.

To further demonstrate that the failure cases are not due to poor prompting, I've included instructions in the prompt to present some rationale in the response -- it is evident therein that the LLM indeed understands the task, but fails to accomplish it. In fact, when asked to verify the answer, the LLM appears to double down and effectively "re-write" its own memory of the original input array so that it can claim that its answer was valid.

A small Python script that was used for generating the samples has been included at `/evals/registry/data/mapping_to_matricies/data_generator.py`

### What makes this a useful eval?

This eval demonstrates a task that a human can easily do, but LLMs have trouble accomplishing. Further, it also demonstrates that the LLM understands the task accurately, but confidently and consistently provides the wrong answer; and when asked to check its answer, it alters its own understanding of the original user input so that it can claim to be correct.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above.
(Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. 
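For reference, the row-major mapping described in the eval description (left to right, top to bottom, then report the final row) can be sketched in a few lines. This is not the `data_generator.py` script mentioned above, whose contents are not shown here. The function names are hypothetical, and the output format is assumed to be the minified JSON that the prompt asks for.

```python
import json


def final_row(array, rows, cols):
    """Map a flat array onto a rows x cols grid in row-major order
    and return the last row of that grid."""
    if rows * cols != len(array):
        raise ValueError("array length must equal rows * cols")
    grid = [array[r * cols:(r + 1) * cols] for r in range(rows)]
    return grid[-1]


def ideal_answer(array, rows, cols):
    """Format the final row as minified JSON, e.g. {"Final Row":[...]}."""
    return json.dumps({"Final Row": final_row(array, rows, cols)}, separators=(",", ":"))


if __name__ == "__main__":
    bits = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
    print(ideal_answer(bits, 3, 4))  # the 3x4 case from the description
```

Because the grid is filled in row-major order, the final row is simply the last `cols` elements of the input array, which is what makes the task easy for a human to verify.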
### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]\nGrid Dimensions: 2x20"}], "ideal": "{\"Final Row\":[1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. 
Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]\nGrid Dimensions: 4x10"}], "ideal": "{\"Final Row\":[1,0,1,0,0,0,0,0,0,1]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,1,1,0,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,1]\nGrid Dimensions: 5x8"}], "ideal": "{\"Final Row\":[1,0,0,0,0,0,0,1]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. 
Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]\nGrid Dimensions: 2x21"}], "ideal": "{\"Final Row\":[1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]\nGrid Dimensions: 3x14"}], "ideal": "{\"Final Row\":[0,1,1,0,1,1,1,1,0,0,0,1,1,0]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. 
Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,0]\nGrid Dimensions: 6x7"}], "ideal": "{\"Final Row\":[1,0,0,0,1,1,0]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,1,0,1,0,0,1,1]\nGrid Dimensions: 2x22"}], "ideal": "{\"Final Row\":[0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,1,0,1,0,0,1,1]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. 
Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [1,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,1,0,1,1,0,1,1,0,1,0,0,1,1]\nGrid Dimensions: 4x11"}], "ideal": "{\"Final Row\":[1,1,0,1,1,0,1,0,0,1,1]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,0,1,1,0,1,1,1,1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,0,0,1,1,0,0,0,1,0,1,0]\nGrid Dimensions: 3x15"}], "ideal": "{\"Final Row\":[0,1,0,1,0,0,1,1,0,0,0,1,0,1,0]}"} {"input": [{"role": "system", "content": "Given the user-provided binary array, map the binary array onto a grid, wherein the dimensions of the grid are as provided by the user ([num rows]x[num elements per row]), and the mapping to the grid is done from left to right, top to bottom (provide a visualization of the mapped result). Then explain in a second visualization how the final row of the grid was mapped from the corresponding final binary numbers of the array. Lastly, provide the final row of the grid, in minified JSON format, like this: {\"Final Row\":[...]}"}, {"role": "user", "content": "Array: [0,0,1,1,0,1,1,1,1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,1,0,1,0,0,1,1,0,0,0,1,0,1,0]\nGrid Dimensions: 5x9"}], "ideal": "{\"Final Row\":[1,1,0,0,0,1,0,1,0]}"} ``` </details>
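The expected answers in these samples amount to a row-major reshape followed by taking the last row. The following is a minimal sketch of that computation, not the `data_generator.py` script mentioned above (its contents are not shown here); the function names `final_row` and `ideal_answer` are hypothetical.

```python
import json


def final_row(array, num_rows, num_cols):
    """Return the last row when `array` is laid out left to right, top to bottom."""
    if len(array) != num_rows * num_cols:
        raise ValueError("array length must equal num_rows * num_cols")
    # In row-major order the final row is simply the last `num_cols` elements.
    start = (num_rows - 1) * num_cols
    return array[start:start + num_cols]


def ideal_answer(array, num_rows, num_cols):
    """Format the final row as minified JSON, in the shape of the samples' `ideal` field."""
    payload = {"Final Row": final_row(array, num_rows, num_cols)}
    # Compact separators remove the spaces json.dumps inserts by default.
    return json.dumps(payload, separators=(",", ":"))
```

Calling `ideal_answer` with the first sample array and dimensions 4x10 yields the same string as that sample's `ideal` value.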
A bug in the handling of `--registry_path` was introduced in https://github.com/openai/evals/pull/1036/files#diff-a694333152a5a73db19b8951647e60aed78f43fec6b119707bc6d489289be6c0R87

To reproduce, run the example from https://github.com/openai/evals/blob/main/examples/retrieval-completionfn.ipynb

Before the fix, the following exception is thrown:

```
TypeError: unsupported operand type(s) for /: 'list' and 'str'
```
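This is the message Python produces when `/` is applied with a list on the left and a string on the right, for example when a list of registry paths is used as if it were a single `pathlib.Path`. The sketch below reproduces that failure mode in isolation and shows one way to avoid it. The function names are hypothetical and are not taken from the evals codebase; the actual fix in this PR may differ.

```python
from pathlib import Path


def broken_join(registry_paths, subdir):
    """Treats the argument as one path; fails when it is actually a list."""
    return registry_paths / subdir


def join_registry_paths(registry_paths, subdir):
    """Join `subdir` onto each registry path individually (hypothetical helper)."""
    return [Path(p) / subdir for p in registry_paths]
```

With a list input such as `["./registry"]`, `broken_join` raises the `TypeError` above, while `join_registry_paths` returns one `Path` per entry.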
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name Sindarin Fluency (Nouns) ### Eval description This eval tests the GPT model's ability to translate Sindarin (from Tolkien's works) to English. Sindarin was created in 1915 and was expanded upon until 1973, being heavily inspired by Literary Welsh. As of today, hundreds of poems and texts have been written in Sindarin, and several publishers are making a major effort to preserve the language. The eval uses a collection of 150 commonly used nouns. ### What makes this a useful eval? This eval provides an opportunity to measure how well GPT models understand the fictional language, and to improve their ability to translate it into English.
## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [X] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [X] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [X] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [X] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > This eval will help preserve the historic fictional language, by making it possible to improve upon retaining knowledge of Sindarin, and potentially other languages used in fictional works. ## Eval structure 🏗️ Your eval should - [X] Check that your data is in `evals/registry/data/{name}` - [X] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [X] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. 
OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [X] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [X] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [X] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [X] I have filled out all required fields of this form - [X] I have used **Git LFS** for the Eval JSON data - [X] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. 
Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "ablad"}], "ideal": ["prohibition", "refusal"]} {"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "achad"}], "ideal": ["rock ridge", "neck"]} {"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "Adan"}], "ideal": ["man"]} {"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "adar"}], "ideal": ["father"]} {"input": [{"role": "system", "content": "Translate a word from the Sindarin language (from Tolkien's works) to English. Respond only with the translated word or phrase, or 'none' if there is no translation."}, {"role": "user", "content": "aduial"}], "ideal": ["twilight"]} ``` </details>
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name resource_id_extraction ### Eval description This Eval asks the model to identify UI elements and extract their resource ID from Android XML dumps. Android allows you to serialize the content of the screen to XML using accessibility information. A human can (with effort) read through this XML and understand approximately what is visible on the screen and what the semantic intent of UI elements are. Each sample in the eval contains: - Instructions to find and identify a ui element and extract its resource id (or return error code) - An XML dump from an actual Android app - A description of a UI element that may or may not be present in the xml The eval uses the simple "includes" test to see if the correct resource id is returned. 
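As a rough illustration of the "includes" test mentioned above: a sample passes when the ideal resource ID (or error code) appears as a substring of the model's output. This is a sketch of the idea only, not the evals library's implementation, and `includes_match` is a hypothetical name.

```python
def includes_match(completion, ideal):
    """Return True if any ideal answer occurs as a substring of the completion."""
    # Accept either a single ideal string or a list of acceptable answers.
    ideals = [ideal] if isinstance(ideal, str) else list(ideal)
    return any(candidate in completion for candidate in ideals)
```

One consequence of a substring check is that extra text around the correct ID still passes, while a hallucinated ID that does not contain the ideal string fails.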
### What makes this a useful eval? This eval is useful because: - It requires semantic understanding of ui elements (foundational capacity) (eg that the ⭐ button means "add to favorites" and the ➕ button is used to create a new contact) - It tests against hallucinations (eg sometimes the model may hallucinate resource ids that don't exist) - It enforces error codes that distinguish between different problem states (system message steer-ability) Given certain successes, it seems that the model has the capacity to return good responses, but often fails to do so. This eval can be extended by: 1. Adding more test cases / xml dumps (especially more complex cases using the larger gpt4 context window) 2. Adding more obscure descriptions of ui elements 3. Extracting information other than resource ids, for example "center point" of ui elements (computable via their bounding box) ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. 
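For the third extension idea listed above (the "center point" of a UI element), the value can be derived from the `bounds` attribute in the XML dump. The sketch below assumes the attribute has the form `[left,top][right,bottom]`, as in `bounds="[0,0][1080,2138]"`; the helper name `center_of_bounds` is hypothetical and integer division is an arbitrary rounding choice.

```python
import re

# Matches strings such as "[0,0][1080,2138]".
BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


def center_of_bounds(bounds):
    """Return the (x, y) center of a bounds string, assuming [left,top][right,bottom]."""
    match = BOUNDS_RE.fullmatch(bounds)
    if match is None:
        raise ValueError(f"unrecognized bounds string: {bounds!r}")
    left, top, right, bottom = map(int, match.groups())
    # Integer division keeps the result on the pixel grid.
    return ((left + right) // 2, (top + bottom) // 2)
```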
### Unique eval value A lot of implicit information is conveyed in UIs. Models reading serialized UIs have the ability to understand some of that implicit information. This is quite valuable, not least because it is required in order to give good results for code completion and generation. This eval is uniquely valuable because it gives a structured way to definitively test the model's understanding of code/UIs from something like a "reading comprehension" perspective, rather than just generation. ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. 
### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: :warning: I was only able to add 1 due to length limits of the PR body. Sorry! 
<details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"ideal": "com.google.android.apps.photos:id/selected_account_disc", "input": [{"role": "system", "content": "\nYou have two tasks:\n- Identify an interactable UI element in an Android XML dump based on a description of of what it looks like, where it is, or what it does.\n- Retrieve the corresponding resource ID you would use to interact with that UI element.\nYou'll be given two inputs:\n- An XML output from the 'dumpXMLHierarchy'.\n- A description of a UI element.\nBased on these inputs your output should be as follows:\n- If you find the described UI element and it has a resource ID, return the resource ID.\n- If you don't find the described UI element, return \"element_not_found\".\n- If the described UI element is there but it doesn't have a resource ID, return \"no_resource_id\".\n"}, {"role": "user", "content": "XML output: <hierarchy rotation=\"0\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.LinearLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" 
selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/action_bar_root\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"android:id/content\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/touch_capture_view\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/photo_container\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\" /><node index=\"1\" text=\"\" resource-id=\"com.google.android.apps.photos:id/drawer_layout\" class=\"androidx.drawerlayout.widget.DrawerLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" 
focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/main_container\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/toolbar_parent\" class=\"android.view.ViewGroup\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/touch_capture_view\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/empty_view_container\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" 
clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.ScrollView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"true\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,312][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.LinearLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,965][1080,1621]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/empty_page_image\" class=\"android.widget.ImageView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[237,965][842,1180]\" /><node index=\"1\" text=\"Take a picture. 
Photos & videos appear here.\" resource-id=\"com.google.android.apps.photos:id/empty_page_caption\" class=\"android.widget.TextView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[155,1230][925,1406]\" /><node index=\"2\" text=\"No Photos\" resource-id=\"com.google.android.apps.photos:id/empty_page_title_top\" class=\"android.widget.TextView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[155,1406][925,1533]\" /><node index=\"3\" text=\"\" resource-id=\"\" class=\"android.widget.LinearLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[496,1533][584,1577]\" /></node></node></node><node index=\"1\" text=\"\" resource-id=\"com.google.android.apps.photos:id/fragment_container\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/photos_photogrid_date_scrubber_view\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" 
focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/recycler_view\" class=\"android.support.v7.widget.RecyclerView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"true\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,2138]\" /></node></node></node></node></node><node index=\"1\" text=\"\" resource-id=\"com.google.android.apps.photos:id/scrolling_toolbar_container\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,312]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/toolbar_container\" class=\"android.widget.LinearLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,312]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/notification_bar_spacer\" class=\"android.view.View\" package=\"com.google.android.apps.photos\" content-desc=\"\" 
checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,136]\" /><node index=\"1\" text=\"\" resource-id=\"com.google.android.apps.photos:id/toolbar\" class=\"android.view.ViewGroup\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,136][1080,312]\"><node index=\"0\" text=\"\" resource-id=\"\" class=\"android.widget.ImageButton\" package=\"com.google.android.apps.photos\" content-desc=\"Show Navigation Drawer\" checkable=\"false\" checked=\"false\" clickable=\"true\" enabled=\"true\" focusable=\"true\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,147][154,301]\" /><node index=\"1\" text=\"\" resource-id=\"\" class=\"android.view.ViewGroup\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[354,136][726,312]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/product_lockup_view\" class=\"android.view.ViewGroup\" package=\"com.google.android.apps.photos\" content-desc=\"Google Photos\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[354,190][726,259]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/logo\" class=\"android.widget.ImageView\" package=\"com.google.android.apps.photos\" content-desc=\"\" 
checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[354,198][541,259]\" /><node index=\"1\" text=\"Photos\" resource-id=\"com.google.android.apps.photos:id/product_name\" class=\"android.widget.TextView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[552,190][726,259]\" /></node></node><node index=\"2\" text=\"\" resource-id=\"\" class=\"android.support.v7.widget.LinearLayoutCompat\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[925,147][1080,301]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/selected_account_disc\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"Sign in\" checkable=\"false\" checked=\"false\" clickable=\"true\" enabled=\"true\" focusable=\"true\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[925,157][1080,290]\"><node index=\"0\" text=\"\" resource-id=\"com.google.android.apps.photos:id/og_selected_account_disc_apd\" class=\"android.widget.FrameLayout\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[925,157][1058,290]\"><node index=\"0\" text=\"\" 
resource-id=\"com.google.android.apps.photos:id/og_apd_internal_image_view\" class=\"android.widget.ImageView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[931,163][1052,284]\" /><node index=\"1\" text=\"\" resource-id=\"com.google.android.apps.photos:id/og_apd_ring_view\" class=\"android.widget.ImageView\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[931,163][1052,284]\" /></node></node></node></node></node></node></node></node></node></node></node></node></node></node><node index=\"1\" text=\"\" resource-id=\"android:id/statusBarBackground\" class=\"android.view.View\" package=\"com.google.android.apps.photos\" content-desc=\"\" checkable=\"false\" checked=\"false\" clickable=\"false\" enabled=\"true\" focusable=\"false\" focused=\"false\" scrollable=\"false\" long-clickable=\"false\" password=\"false\" selected=\"false\" bounds=\"[0,0][1080,136]\" /></node></hierarchy>\n, target item description: clickable account disc in the top right"}]} ``` </details>
# Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨
**PLEASE READ THIS**:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**
Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).
## Eval details 📑
### Eval name
chinese_homonym
### Eval description
Check the model's ability to recognize Chinese homonyms, which are words that have the same pronunciation (Hànyǔ Pīnyīn) but different meanings.
### What makes this a useful eval?
It's easy for beginners learning Chinese, whether children or foreign language learners, to distinguish whether two different Chinese words have the same pronunciation. However, GPT's performance on this task is noticeably poor. Recognizing Chinese homonyms is a fundamental language skill. It's used to understand the context of content, assist language learners, and recognize typos, given that the mainstream Chinese input method is based on Hànyǔ Pīnyīn (pronunciation), among other reasons.
GPT-3.5 scored 0.476 on this task (even worse than a random guess), while GPT-4 achieved 0.7 through the ChatGPT Plus subscription. We can further examine GPT-4's hallucinations by diving deep into the explanations of correct answers, as shown in the following screenshot.
## Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high-quality examples.**
If there is anything else that makes your eval worth including, please document it below.
### Unique eval value
I manually created a diverse set of tests that uncovered GPT's poor performance on Chinese pronunciation. This could lead to further evaluations from different perspectives.
## Eval structure 🏗️
Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
## Final checklist 👀
### Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
### Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
### Limited availability acknowledgment
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.
### Submit eval
- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
### Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
<details>
<summary>View evals in JSON</summary>
### Eval
```jsonl
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"由于她最近在减肥,所以她今晚上决定不吃鱿鱼了。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他最近在出差。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"玲儿去医院看病,医生给开了胃药,但是她不愿意吃,所以都是她妈妈给她喂药。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"欲穷千里目,更上一层楼。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他给她送回家了"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"现在电视里面演的是一场清朝的殿试。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"这家初创公司发布的新产品原型是一个圆形的扫地机器人。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"这个网站的会员太贵了,比另外一家的贵。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他在农村生活,每天都要自己生火做饭。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他在超级猩猩的徐汇区运动群里面,看到了一个昵称叫星星的女孩。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"华北工业电机厂新的行政楼会在6月20日举行奠基仪式。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"坏人是一个聋子,把男主锁在了笼子里。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"知名医生张教授说过,人的一生其实很短暂。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"老宋忘了晚上有约会,当他赶到的时候女孩已经先走了,他感到很自责。"}],"ideal":["是"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他们在玩的游戏是掼蛋,小明抓完牌了说这把有戏。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"黄土坡派出所的警察在接到报案后,来找小区的保安调取监控。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"要是你不答应,晚上我就饿着。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"生活像一把无情的雕刻刀,改变了我们的样子。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"我没什么功利心,我要的是大量的空闲时间,能读书、看电视、想想事,我已经有了,应该知足了。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"他听着雨花隐隐约约地飘落,慢慢地睡着了,雨花穿过窗外轻轻地落下,落到所有的生者和死者身上。"}],"ideal":["否"]}
{"input":[{"role":"system","content":"下面这句话中是否存在发音一样的中文单词(两个汉字及以上),若存在返回是,若不存在返回否。你只需要输出`是`或者`否`"},{"role":"user","content":"厦门大学临海而建,内拥穿过芙蓉隧道,6月是这个城市一年中最美的时候,红红绿绿点缀中更显韵味,一定要去看那“凤凰花开的路口”。"}],"ideal":["否"]}
```
</details>
What are the side effects of no longer needing escape characters when
passing around message payloads?
# Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, __failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access granted. 🚨
__PLEASE READ THIS__:
In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject it since GPT-4 is already capable of completing the
task.
We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**
## Eval details 📑
### Eval name
GPT Protocol Buffers
### Eval description
Using length-delimited strings at multiple levels, we can build
tag-value messages that do not require escape characters, even when the
messages are nested.
### What makes this a useful eval?
Similar to the unified patch diff, this eval requires that GPT reliably
and accurately handle offsets within text payloads.
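To make the offset handling concrete, the sketch below walks a serialized string one field at a time, assuming the same `(<hex digit count><decimal length>)` prefix as the samples further down. The names `readField` and `readTagValue` are hypothetical, and the check that the digit count matches the decimal string is stricter than the sample code, which does not verify it.

```typescript
interface Field {
  value: string;
  nextOffset: number;
}

// Read one length-prefixed string starting at `offset`.
// (readField is a hypothetical helper for this sketch.)
function readField(input: string, offset: number): Field {
  if (input[offset] !== "(") {
    throw new Error("expected '(' at offset " + offset);
  }
  const close = input.indexOf(")", offset + 1);
  if (close === -1) {
    throw new Error("unterminated length prefix");
  }
  const digitCount = parseInt(input[offset + 1], 16);
  const dec = input.slice(offset + 2, close);
  // Assumption: reject prefixes whose digit count disagrees with the digits.
  if (Number.isNaN(digitCount) || dec.length !== digitCount || !/^[0-9]+$/.test(dec)) {
    throw new Error("malformed length prefix at offset " + offset);
  }
  const length = parseInt(dec, 10);
  const start = close + 1;
  if (start + length > input.length) {
    throw new Error("payload shorter than declared length");
  }
  // The payload is taken purely by offset; its contents are never scanned.
  return { value: input.slice(start, start + length), nextOffset: start + length };
}

// Read a tag, then a value, then the newline that ends the record.
function readTagValue(input: string, offset: number): { tag: string; value: string; nextOffset: number } {
  const tag = readField(input, offset);
  const value = readField(input, tag.nextOffset);
  if (input[value.nextOffset] !== "\n") {
    throw new Error("missing newline after tag-value pair");
  }
  return { tag: tag.value, value: value.value, nextOffset: value.nextOffset + 1 };
}

const wire = "(15)tag91(17)value92\n(15)tag36(17)value13\n";
const first = readTagValue(wire, 0);
const second = readTagValue(wire, first.nextOffset);
console.log(first, second);
```

Each call returns the offset where the next record starts, so getting a single length wrong shifts every later field, which is the failure mode this eval probes.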
## Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).
Your eval should be:
- [ ] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [ ] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [ ] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [ ] **Include at least 15 high-quality examples.**
If there is anything else that makes your eval worth including, please
document it below.
### Unique eval value
Google Protocol Buffers are quite popular. JSON is also quite popular.
My hope is that "gpt protocol buffers" finds a better "sweet spot"
between both approaches.
## Eval structure 🏗️
Your eval should
- [ ] Check that your data is in `evals/registry/data/{name}`
- [ ] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [ ] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)
## Final checklist 👀
### Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.
### Email address validation
If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.
### Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.
### Submit eval
- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push
Failure to fill out all required fields will result in the PR being
closed.
### Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:
<details>
<summary>View evals in JSON</summary>
### Eval
```jsonl
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag91\", \"value92\"], [\"tag36\",
\"value13\"], [\"tag11\", \"value50\"], [\"tag88\", \"value28\"],
[\"tag87\",
\"value10\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(15)tag91(17)value92\n(15)tag36(17)value13\n(15)tag11(17)value50\n(15)tag88(17)value28\n(15)tag87(17)value10\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray: [string, string][] = [[\"tag21\", \"value3\"], [\"tag20\",
\"value58\"], [\"tag13\", \"value63\"], [\"tag46\",
\"value78\"]];\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag21(16)value3\n(15)tag20(17)value58\n(15)tag13(17)value63\n(15)tag46(17)value78\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray: [string, string][] = [[\"tag4\", \"value21\"], [\"tag76\",
\"value83\"], [\"tag52\", \"value2\"], [\"tag58\", \"value90\"],
[\"tag47\",
\"value84\"]];\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(14)tag4(17)value21\n(15)tag76(17)value83\n(15)tag52(16)value2\n(15)tag58(17)value90\n(15)tag47(17)value84\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray: [string, string][] = [[\"tag32\", \"value66\"], [\"tag50\",
\"value95\"], [\"tag40\",
\"value87\"]];\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(13)\n(15)tag32(17)value66\n(15)tag50(17)value95\n(15)tag40(17)value87\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray: [string, string][] = [[\"tag13\", \"value69\"], [\"tag29\",
\"value16\"], [\"tag5\", \"value82\"], [\"tag52\",
\"value30\"]];\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag13(17)value69\n(15)tag29(17)value16\n(14)tag5(17)value82\n(15)tag52(17)value30\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray: [string, string][] = [[\"tag78\", \"value38\"], [\"tag81\",
\"value0\"], [\"tag6\", \"value27\"], [\"tag60\", \"value22\"],
[\"tag50\",
\"value38\"]];\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(15)tag78(17)value38\n(15)tag81(16)value0\n(14)tag6(17)value27\n(15)tag60(17)value22\n(15)tag50(17)value38\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray: [string, string][] = [[\"tag18\", \"value61\"], [\"tag38\",
\"value68\"], [\"tag33\", \"value65\"], [\"tag64\",
\"value76\"]];\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag18(17)value61\n(15)tag38(17)value68\n(15)tag33(17)value65\n(15)tag64(17)value76\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag99\", \"value97\"], [\"tag86\",
\"value95\"], [\"tag15\", \"value79\"], [\"tag19\",
\"value69\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(14)\n(15)tag99(17)value97\n(15)tag86(17)value95\n(15)tag15(17)value79\n(15)tag19(17)value69\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag89\", \"value52\"], [\"tag6\",
\"value79\"], [\"tag71\", \"value64\"], [\"tag3\", \"value62\"],
[\"tag54\",
\"value65\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(15)tag89(17)value52\n(14)tag6(17)value79\n(15)tag71(17)value64\n(14)tag3(17)value62\n(15)tag54(17)value65\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthResult.number, value, nextOffset: offset};\n}\n\nexport function
serializeTagValue(tag: string, value: string): string {\n const
serializedTag = HexLengthDelimitedInt.serialize(tag.length) + tag;\n
const serializedValue = HexLengthDelimitedInt.serialize(value.length) +
value;\n return serializedTag + serializedValue + '\\n';\n}\n\nexport
function deserializeTagValue(input: string, offset: number): { tag:
string, value: string, nextOffset: number } {\n const tagResult =
processSerializedString(input, offset);\n offset =
tagResult.nextOffset;\n\n const valueResult =
processSerializedString(input, offset);\n offset =
valueResult.nextOffset;\n\n return {tag: tagResult.value, value:
valueResult.value, nextOffset: offset};\n}\n\n\nfunction
containsOnlyDecimalDigits(input: string): boolean {\n for (const char of
input) {\n if (char < '0' || char > '9') {\n return false;\n }\n }\n
return true;\n}\n\nexport interface DeserializationResult {\n number:
number;\n nextOffset: number;\n}\n\nexport class HexLengthDelimitedInt
{\n static deserialize(input: string, offset = 0): DeserializationResult
{\n if (input[offset] !== '(') {\n throw new Error(\"Invalid input
format\");\n }\n\n let nextOffset = -1;\n for (let i = offset + 1; i <
input.length; i++) {\n if (input[i] === ')') {\n nextOffset = i;\n
break;\n }\n }\n\n if (nextOffset === -1) {\n throw new Error(\"Invalid
input format\");\n }\n\n const length = parseInt(input.slice(offset + 1,
offset + 2), 16);\n if (isNaN(length) || length < 0 || length > 15) {\n
throw new Error(\"Invalid input length\");\n }\n\n const dec =
input.slice(offset + 2, nextOffset);\n if
(!containsOnlyDecimalDigits(dec)) {\n throw new Error(\"Invalid decimal
digits in length-delimited integer\");\n }\n\n const number =
parseInt(dec, 10);\n\n if (!Number.isInteger(number)) {\n throw new
Error(\"Deserialized number is not an integer\");\n }\n\n return
{number, nextOffset: nextOffset + 1};\n }\n\n static serialize(number:
number): string {\n if (!Number.isInteger(number)) {\n throw new
Error(\"Input number is not an integer\");\n }\n\n const dec =
number.toString(10);\n const length = dec.length;\n\n if (length > 15)
{\n throw new Error(\"Number too large to serialize\");\n }\n\n return
'(' + length.toString(16) + dec + ')';\n }\n}\n\n\nimport
{deserializeTagValue, serializeTagValue} from \"./common.impl\";\nimport
{HexLengthDelimitedInt} from \"./hexLengthDelimitedInt.impl\";\n\nexport
class TagValueArray {\n public static serialize(array: [string,
string][]): string {\n let result = '';\n\n result +=
HexLengthDelimitedInt.serialize(array.length) + '\\n';\n\n for (let i =
0; i < array.length; i++) {\n const [tag, value] = array[i];\n result +=
serializeTagValue(tag, value);\n }\n\n return result;\n }\n\n static
serializeNoLength(tagValueArray: Array<[string, string]>): string {\n
let message = '';\n\n for (const [tag, value] of tagValueArray) {\n
message += serializeTagValue(tag, value);\n }\n\n return message;\n
}\n\n public static serializeLength(array: [string, string][]): number
{\n let length = 0;\n\n length +=
HexLengthDelimitedInt.serialize(array.length).length + 1; // Add 1 for
the newline character\n\n for (let i = 0; i < array.length; i++) {\n
const [tag, value] = array[i];\n length += serializeTagValue(tag,
value).length;\n }\n\n return length;\n }\n\n private static
processTagValuePairs(input: string, count: number, offset: number):
Array<[string, string]> {\n const tagValueArray: Array<[string, string]>
= [];\n\n for (let i = 0; i < count; i++) {\n const { tag, value,
nextOffset } = deserializeTagValue(input, offset);\n offset =
nextOffset;\n\n tagValueArray.push([tag, value]);\n\n if (i < count - 1
&& input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after tag-value pair\");\n }\n offset++;\n
}\n\n return tagValueArray;\n }\n\n static deserialize(input: string): {
array: Array<[string, string]>, nextOffset: number } {\n const
countResult = HexLengthDelimitedInt.deserialize(input);\n const count =
countResult.number;\n let offset = countResult.nextOffset;\n\n if
(input[offset] !== '\\n') {\n throw new Error(\"Invalid input format:
missing newline character after count\");\n }\n offset++;\n\n const
array = TagValueArray.processTagValuePairs(input, count, offset);\n
return { array, nextOffset: offset };\n }\n\n static
deserializeNoLength(input: string, count: number): Array<[string,
string]> {\n return TagValueArray.processTagValuePairs(input, count,
0);\n }\n\n static isEmpty(tagValueArray: Array<[string, string]>):
boolean {\n return tagValueArray.length === 0;\n }\n\n static
hasTag(tagValueArray: Array<[string, string]>, tag: string): boolean {\n
return tagValueArray.some(([t, _]) => t === tag);\n }\n}\n\n\nconst
tagValueArray = new TagValueArray([[\"tag6\", \"value10\"], [\"tag1\",
\"value15\"], [\"tag13\", \"value90\"], [\"tag31\", \"value38\"],
[\"tag68\",
\"value0\"]]);\n\nTagValueArray.serialize(tagValueArray);"}],"ideal":["(15)\n(14)tag6(17)value10\n(14)tag1(17)value15\n(15)tag13(17)value90\n(15)tag31(17)value38\n(15)tag68(16)value0\n"]}
{"input":[{"role":"system","content":"You are a helpful assistant. You
are given a TagValueArray class that has a serialize method. You are
also given a TagValueArray instance. You are asked to serialize the
TagValueArray instance."},{"role":"user","content":"What is the output
of the following TypeScript code snippet?\n\nimport
{HexLengthDelimitedInt} from
\"./hexLengthDelimitedInt.impl\";\n\nfunction
processSerializedString(input: string, offset: number): { length:
number, value: string, nextOffset: number } {\n const lengthResult =
HexLengthDelimitedInt.deserialize(input, offset);\n offset =
lengthResult.nextOffset;\n const value = input.slice(offset, offset +
lengthResult.number);\n offset += lengthResult.number;\n return {length:
lengthRe…
…ls with the Raven Matrices test (openai#1078)

# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

Raven Matrices

### Eval description

This benchmark evaluates the ability of a language model to perform abstract reasoning using a text-based version of the Raven Matrices test. The task consists of finding, from a set of choices, the pattern that completes a sequence of eight previous samples. We provide various types of matrices, in either natural-language or symbolic format, with multiple-choice and open-ended settings.

### What makes this a useful eval?

Abstract reasoning is a useful task for evaluating the ability of a language model to extract a pattern from a few examples. The abstract nature of the pattern requires the model to find the most generic pattern, which makes it possible to test the generalization capacities of language models. Abstract reasoning is a task on which current language models do not perform well, although this ability has been under-evaluated in the research.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [X] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [X] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [X] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [X] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

Our eval contains an extensive list of high-quality samples from a challenging and under-evaluated task, with several levels of difficulty and different formats.

## Eval structure 🏗️

Your eval should

- [X] Check that your data is in `evals/registry/data/{name}`
- [X] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [X] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [X] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [X] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [X] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [X] I have filled out all required fields of this form
- [X] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.
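
For readers who want to see how the multiple-choice format described in the eval description maps onto a chat-style record, here is a minimal sketch. It is not the authors' generation code: the helper name `build_sample` and the constant `SYSTEM_PROMPT` are hypothetical and introduced only for illustration. The prompt text, the eight sequence patterns, the eight choices, and the ideal answer are copied from one of the symbolic samples listed under "Eval JSON data".

```python
import json

# System prompt copied from the samples in this PR.
SYSTEM_PROMPT = (
    "Find the pattern number 9 that completes the sequence. "
    "Pick the letter in front of the correct pattern that logically follows "
    "in the sequence from the answer set. "
    "Patterns in the sequence are preceded by a number from 1 to 8. "
    "Patterns in the answer set are preceded by a letter from A to H. "
    "Only return the letter in front of the correct pattern."
)

LETTERS = "ABCDEFGH"


def build_sample(sequence, choices, answer_index):
    """Hypothetical helper: assemble one multiple-choice record.

    sequence: the 8 patterns shown to the model (numbered 1 to 8)
    choices: the 8 candidate patterns (lettered A to H)
    answer_index: position of the correct choice in `choices`
    """
    if len(sequence) != 8 or len(choices) != 8:
        raise ValueError("expected 8 sequence patterns and 8 choices")
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for number, pattern in enumerate(sequence, start=1):
        messages.append({"role": "user", "content": f"{number}. {pattern} "})
    for letter, pattern in zip(LETTERS, choices):
        messages.append({"role": "user", "content": f"{letter}. {pattern} "})
    messages.append({"role": "user", "content": "The answer is "})
    return {"input": messages, "ideal": LETTERS[answer_index]}


# Patterns taken from one symbolic sample in this PR.
sequence = [
    "[(D, B, F, F,)]", "[(F, B, D, F,)]", "[(B, A, B, F,)]", "[(B, B, F, G,)]",
    "[(D, B, D, G,)]", "[(F, A, B, G,)]", "[(F, A, F, C,)]", "[(B, A, D, C,)]",
]
choices = [
    "[(D, A, B, C,)]", "[(D, A, F, C,)]", "[(D, A, E, C,)]", "[(C, A, B, C,)]",
    "[(D, B, B, C,)]", "[(D, A, D, C,)]", "[(D, I, B, C,)]", "[(D, D, B, C,)]",
]

sample = build_sample(sequence, choices, answer_index=0)
# One JSON object per line is the JSONL layout used for the eval data.
print(json.dumps(sample))
```

Each record holds one system message, eight numbered patterns, eight lettered choices, and a final "The answer is " message, with the correct letter stored under `ideal`.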
### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```jsonl
{"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Pick the letter in front of the correct pattern that logically follows in the sequence from the answer set. Patterns in the sequence are preceded by a number from 1 to 8. Patterns in the answer set are preceded by a letter from A to H. Only return the letter in front of the correct pattern."}, {"role": "user", "content": "1. On an image, a large orange circle rotated at 90 degrees. "}, {"role": "user", "content": "2. On an image, a giant orange pentagon rotated at 90 degrees. "}, {"role": "user", "content": "3. On an image, a small red triangle rotated at 90 degrees. "}, {"role": "user", "content": "4. On an image, a small orange circle rotated at 135 degrees. "}, {"role": "user", "content": "5. On an image, a large orange pentagon rotated at 135 degrees. "}, {"role": "user", "content": "6. On an image, a giant red triangle rotated at 135 degrees. "}, {"role": "user", "content": "7. On an image, a giant red circle rotated at -45 degrees. "}, {"role": "user", "content": "8. On an image, a small red pentagon rotated at -45 degrees. "}, {"role": "user", "content": "A. On an image, a large red triangle rotated at -45 degrees. "}, {"role": "user", "content": "B. On an image, a large red circle rotated at -45 degrees. "}, {"role": "user", "content": "C. On an image, a large red hexagon rotated at -45 degrees. "}, {"role": "user", "content": "D. On an image, a medium red triangle rotated at -45 degrees. "}, {"role": "user", "content": "E. On an image, a large orange triangle rotated at -45 degrees. "}, {"role": "user", "content": "F. On an image, a large red pentagon rotated at -45 degrees. "}, {"role": "user", "content": "G. 
On an image, a large pink triangle rotated at -45 degrees. "}, {"role": "user", "content": "H. On an image, a large lime triangle rotated at -45 degrees. "}, {"role": "user", "content": "The answer is "}], "ideal": "A"} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Pick the letter in front of the correct pattern that logically follows in the sequence from the answer set. Patterns in the sequence are preceded by a number from 1 to 8. Patterns in the answer set are preceded by a letter from A to H. Only return the letter in front of the correct pattern."}, {"role": "user", "content": "1. On an image, a huge purple triangle rotated at 180 degrees in the bottom right, a small purple triangle rotated at -45 degrees in the top left, a large purple triangle rotated at 45 degrees in the bottom left. "}, {"role": "user", "content": "2. On an image, a huge pink circle rotated at 180 degrees in the bottom right, a small pink circle rotated at -45 degrees in the bottom left, a large pink circle rotated at 45 degrees in the top right. "}, {"role": "user", "content": "3. On an image, a huge white square rotated at 180 degrees in the bottom right, a small white square rotated at -45 degrees in the top left, a large white square rotated at 45 degrees in the top right. "}, {"role": "user", "content": "4. On an image, a large lime circle rotated at 0 degrees in the bottom right, a tiny lime circle rotated at 90 degrees in the top left, a giant lime circle rotated at -45 degrees in the top right. "}, {"role": "user", "content": "5. On an image, a large green square rotated at 0 degrees in the bottom right, a tiny green square rotated at 90 degrees in the top left, a giant green square rotated at -45 degrees in the bottom left. "}, {"role": "user", "content": "6. 
On an image, a large cyan triangle rotated at 0 degrees in the bottom right, a tiny cyan triangle rotated at 90 degrees in the bottom left, a giant cyan triangle rotated at -45 degrees in the top right. "}, {"role": "user", "content": "7. On an image, a huge lime square rotated at 135 degrees in the bottom right, a tiny lime square rotated at 0 degrees in the bottom left, a tiny lime square rotated at 135 degrees in the top right. "}, {"role": "user", "content": "8. On an image, a huge green triangle rotated at 135 degrees in the bottom right, a tiny green triangle rotated at 0 degrees in the top left, a tiny green triangle rotated at 135 degrees in the top right. "}, {"role": "user", "content": "A. On an image, a huge cyan pentagon rotated at 135 degrees in the bottom right, a tiny cyan triangle rotated at 0 degrees in the top left, a tiny cyan triangle rotated at 135 degrees in the bottom left. "}, {"role": "user", "content": "B. On an image, a huge cyan circle rotated at 135 degrees in the top right, a tiny cyan circle rotated at 0 degrees in the bottom left, a tiny cyan circle rotated at 135 degrees in the top left. "}, {"role": "user", "content": "C. On an image, a huge cyan square rotated at 135 degrees in the bottom right, a tiny cyan hexagon rotated at 0 degrees in the top left, a tiny cyan hexagon rotated at 135 degrees in the bottom left. "}, {"role": "user", "content": "D. On an image, a huge cyan circle rotated at 135 degrees in the top left, a tiny cyan circle rotated at 0 degrees in the bottom right, a tiny cyan circle rotated at 135 degrees in the top right. "}, {"role": "user", "content": "E. On an image, a huge cyan circle rotated at 135 degrees in the bottom right, a tiny cyan circle rotated at 0 degrees in the top left, a tiny cyan circle rotated at 135 degrees in the bottom left. "}, {"role": "user", "content": "F. 
On an image, a huge yellow circle rotated at 135 degrees in the bottom right, a tiny lime circle rotated at 0 degrees in the top left, a tiny orange circle rotated at 135 degrees in the bottom left. "}, {"role": "user", "content": "G. On an image, a huge cyan circle rotated at 135 degrees in the bottom left, a tiny cyan circle rotated at 0 degrees in the top right, a tiny cyan circle rotated at 135 degrees in the bottom right. "}, {"role": "user", "content": "H. On an image, a huge cyan hexagon rotated at 135 degrees in the bottom right, a tiny cyan pentagon rotated at 0 degrees in the top left, a tiny cyan pentagon rotated at 135 degrees in the bottom left. "}, {"role": "user", "content": "The answer is "}], "ideal": "E"} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Pick the letter in front of the correct pattern that logically follows in the sequence from the answer set. Patterns in the sequence are preceded by a number from 1 to 8. Patterns in the answer set are preceded by a letter from A to H. Only return the letter in front of the correct pattern."}, {"role": "user", "content": "1. On an image, a small red circle rotated at -135 degrees in the top left. "}, {"role": "user", "content": "2. On an image, a small red hexagon rotated at -135 degrees in the top right. "}, {"role": "user", "content": "3. On an image, a small red triangle rotated at -135 degrees in the center. "}, {"role": "user", "content": "4. On an image, a giant cyan hexagon rotated at -135 degrees in the top center. "}, {"role": "user", "content": "5. On an image, a giant cyan triangle rotated at -135 degrees in the center left. "}, {"role": "user", "content": "6. On an image, a giant cyan circle rotated at -135 degrees in the center right. "}, {"role": "user", "content": "7. 
On an image, a tiny green triangle rotated at -45 degrees in the center, a tiny green triangle rotated at -45 degrees in the bottom left, a tiny green triangle rotated at -45 degrees in the center left. "}, {"role": "user", "content": "8. On an image, a tiny green circle rotated at -45 degrees in the bottom left, a tiny green circle rotated at -45 degrees in the bottom right, a tiny green circle rotated at -45 degrees in the center right. "}, {"role": "user", "content": "A. On an image, a huge yellow circle rotated at -45 degrees in the bottom center, a large green square rotated at 180 degrees in the center left, a small red triangle rotated at -45 degrees in the top center, a medium pink triangle rotated at -45 degrees in the center, a small green pentagon rotated at 135 degrees in the bottom right, a giant lime triangle rotated at 180 degrees in the top left, a large blue pentagon rotated at -90 degrees in the center right. "}, {"role": "user", "content": "B. On an image, a tiny green circle rotated at -45 degrees in the bottom right, a tiny green triangle rotated at -45 degrees in the top center, a tiny green triangle rotated at -45 degrees in the bottom center. "}, {"role": "user", "content": "C. On an image, a tiny green triangle rotated at -45 degrees in the bottom right, a tiny green square rotated at -45 degrees in the top center, a tiny green circle rotated at -45 degrees in the bottom center. "}, {"role": "user", "content": "D. On an image, a large green hexagon rotated at -45 degrees in the bottom right, a giant green hexagon rotated at -45 degrees in the top center, a small green hexagon rotated at -45 degrees in the bottom center. "}, {"role": "user", "content": "E. On an image, a huge green hexagon rotated at -45 degrees in the bottom right, a medium green hexagon rotated at -45 degrees in the top center, a large green hexagon rotated at -45 degrees in the bottom center. "}, {"role": "user", "content": "F. 
On an image, a tiny green hexagon rotated at -45 degrees in the center right, a tiny green hexagon rotated at -45 degrees in the center, a tiny green hexagon rotated at -45 degrees in the top center. "}, {"role": "user", "content": "G. On an image, a tiny green pentagon rotated at -45 degrees in the bottom right, a tiny green circle rotated at -45 degrees in the top center, a tiny green pentagon rotated at -45 degrees in the bottom center. "}, {"role": "user", "content": "H. On an image, a tiny green hexagon rotated at -45 degrees in the bottom right, a tiny green hexagon rotated at -45 degrees in the top center, a tiny green hexagon rotated at -45 degrees in the bottom center. "}, {"role": "user", "content": "The answer is "}], "ideal": "H"} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Pick the letter in front of the correct pattern that logically follows in the sequence from the answer set. Patterns in the sequence are preceded by a number from 1 to 8. Patterns in the answer set are preceded by a letter from A to H. Only return the letter in front of the correct pattern."}, {"role": "user", "content": "1. [(D, B, F, F,)] "}, {"role": "user", "content": "2. [(F, B, D, F,)] "}, {"role": "user", "content": "3. [(B, A, B, F,)] "}, {"role": "user", "content": "4. [(B, B, F, G,)] "}, {"role": "user", "content": "5. [(D, B, D, G,)] "}, {"role": "user", "content": "6. [(F, A, B, G,)] "}, {"role": "user", "content": "7. [(F, A, F, C,)] "}, {"role": "user", "content": "8. [(B, A, D, C,)] "}, {"role": "user", "content": "A. [(D, A, B, C,)] "}, {"role": "user", "content": "B. [(D, A, F, C,)] "}, {"role": "user", "content": "C. [(D, A, E, C,)] "}, {"role": "user", "content": "D. [(C, A, B, C,)] "}, {"role": "user", "content": "E. [(D, B, B, C,)] "}, {"role": "user", "content": "F. [(D, A, D, C,)] "}, {"role": "user", "content": "G. [(D, I, B, C,)] "}, {"role": "user", "content": "H. 
[(D, D, B, C,)] "}, {"role": "user", "content": "The answer is "}], "ideal": "A"} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Pick the letter in front of the correct pattern that logically follows in the sequence from the answer set. Patterns in the sequence are preceded by a number from 1 to 8. Patterns in the answer set are preceded by a letter from A to H. Only return the letter in front of the correct pattern."}, {"role": "user", "content": "1. [(E, H, B, H, BR), (B, H, B, C, TL), (D, H, B, E, BL)] "}, {"role": "user", "content": "2. [(E, I, F, H, BR), (B, I, F, C, BL), (D, I, F, E, TR)] "}, {"role": "user", "content": "3. [(E, J, C, H, BR), (B, J, C, C, TL), (D, J, C, E, TR)] "}, {"role": "user", "content": "4. [(D, D, F, D, BR), (A, D, F, F, TL), (F, D, F, C, TR)] "}, {"role": "user", "content": "5. [(D, E, C, D, BR), (A, E, C, F, TL), (F, E, C, C, BL)] "}, {"role": "user", "content": "6. [(D, F, B, D, BR), (A, F, B, F, BL), (F, F, B, C, TR)] "}, {"role": "user", "content": "7. [(E, D, C, G, BR), (A, D, C, D, BL), (A, D, C, G, TR)] "}, {"role": "user", "content": "8. [(E, E, B, G, BR), (A, E, B, D, TL), (A, E, B, G, TR)] "}, {"role": "user", "content": "A. [(E, F, D, G, BR), (A, F, B, D, TL), (A, F, B, G, BL)] "}, {"role": "user", "content": "B. [(E, F, F, G, TR), (A, F, F, D, BL), (A, F, F, G, TL)] "}, {"role": "user", "content": "C. [(E, F, C, G, BR), (A, F, E, D, TL), (A, F, E, G, BL)] "}, {"role": "user", "content": "D. [(E, F, F, G, TL), (A, F, F, D, BR), (A, F, F, G, TR)] "}, {"role": "user", "content": "E. [(E, F, F, G, BR), (A, F, F, D, TL), (A, F, F, G, BL)] "}, {"role": "user", "content": "F. [(E, C, F, G, BR), (A, D, F, D, TL), (A, B, F, G, BL)] "}, {"role": "user", "content": "G. [(E, F, F, G, BL), (A, F, F, D, TR), (A, F, F, G, BR)] "}, {"role": "user", "content": "H. 
[(E, F, E, G, BR), (A, F, D, D, TL), (A, F, D, G, BL)] "}, {"role": "user", "content": "The answer is "}], "ideal": "E"} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Pick the letter in front of the correct pattern that logically follows in the sequence from the answer set. Patterns in the sequence are preceded by a number from 1 to 8. Patterns in the answer set are preceded by a letter from A to H. Only return the letter in front of the correct pattern."}, {"role": "user", "content": "1. [(B, A, F, A, TL)] "}, {"role": "user", "content": "2. [(B, A, E, A, TR)] "}, {"role": "user", "content": "3. [(B, A, B, A, C)] "}, {"role": "user", "content": "4. [(F, F, E, A, TC)] "}, {"role": "user", "content": "5. [(F, F, B, A, CL)] "}, {"role": "user", "content": "6. [(F, F, F, A, CR)] "}, {"role": "user", "content": "7. [(A, E, B, C, C), (A, E, B, C, BL), (A, E, B, C, CL)] "}, {"role": "user", "content": "8. [(A, E, F, C, BL), (A, E, F, C, BR), (A, E, F, C, CR)] "}, {"role": "user", "content": "A. [(E, C, F, C, BC), (D, E, C, H, CL), (B, A, B, C, TC), (C, I, B, C, C), (B, E, D, G, BR), (F, D, B, H, TL), (D, G, D, B, CR)] "}, {"role": "user", "content": "B. [(A, E, F, C, BR), (A, E, B, C, TC), (A, E, B, C, BC)] "}, {"role": "user", "content": "C. [(A, E, B, C, BR), (A, E, C, C, TC), (A, E, F, C, BC)] "}, {"role": "user", "content": "D. [(D, E, E, C, BR), (F, E, E, C, TC), (B, E, E, C, BC)] "}, {"role": "user", "content": "E. [(E, E, E, C, BR), (C, E, E, C, TC), (D, E, E, C, BC)] "}, {"role": "user", "content": "F. [(A, E, E, C, CR), (A, E, E, C, C), (A, E, E, C, TC)] "}, {"role": "user", "content": "G. [(A, E, D, C, BR), (A, E, F, C, TC), (A, E, D, C, BC)] "}, {"role": "user", "content": "H. [(A, E, E, C, BR), (A, E, E, C, TC), (A, E, E, C, BC)] "}, {"role": "user", "content": "The answer is "}], "ideal": "H"} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. 
Write the correct pattern with the same format as in the examples. Patterns in the sequence are preceded by a number from 1 to 8. "}, {"role": "user", "content": "1. On an image, a large orange circle rotated at 90 degrees. "}, {"role": "user", "content": "2. On an image, a giant orange pentagon rotated at 90 degrees. "}, {"role": "user", "content": "3. On an image, a small red triangle rotated at 90 degrees. "}, {"role": "user", "content": "4. On an image, a small orange circle rotated at 135 degrees. "}, {"role": "user", "content": "5. On an image, a large orange pentagon rotated at 135 degrees. "}, {"role": "user", "content": "6. On an image, a giant red triangle rotated at 135 degrees. "}, {"role": "user", "content": "7. On an image, a giant red circle rotated at -45 degrees. "}, {"role": "user", "content": "8. On an image, a small red pentagon rotated at -45 degrees. "}, {"role": "user", "content": "The pattern that logically follows is:\n9. "}], "ideal": "On an image, a large red triangle rotated at -45 degrees. "} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Write the correct pattern with the same format as in the examples. Patterns in the sequence are preceded by a number from 1 to 8. "}, {"role": "user", "content": "1. On an image, a huge purple triangle rotated at 180 degrees in the bottom right, a small purple triangle rotated at -45 degrees in the top left, a large purple triangle rotated at 45 degrees in the bottom left. "}, {"role": "user", "content": "2. On an image, a huge pink circle rotated at 180 degrees in the bottom right, a small pink circle rotated at -45 degrees in the bottom left, a large pink circle rotated at 45 degrees in the top right. "}, {"role": "user", "content": "3. On an image, a huge white square rotated at 180 degrees in the bottom right, a small white square rotated at -45 degrees in the top left, a large white square rotated at 45 degrees in the top right. 
"}, {"role": "user", "content": "4. On an image, a large lime circle rotated at 0 degrees in the bottom right, a tiny lime circle rotated at 90 degrees in the top left, a giant lime circle rotated at -45 degrees in the top right. "}, {"role": "user", "content": "5. On an image, a large green square rotated at 0 degrees in the bottom right, a tiny green square rotated at 90 degrees in the top left, a giant green square rotated at -45 degrees in the bottom left. "}, {"role": "user", "content": "6. On an image, a large cyan triangle rotated at 0 degrees in the bottom right, a tiny cyan triangle rotated at 90 degrees in the bottom left, a giant cyan triangle rotated at -45 degrees in the top right. "}, {"role": "user", "content": "7. On an image, a huge lime square rotated at 135 degrees in the bottom right, a tiny lime square rotated at 0 degrees in the bottom left, a tiny lime square rotated at 135 degrees in the top right. "}, {"role": "user", "content": "8. On an image, a huge green triangle rotated at 135 degrees in the bottom right, a tiny green triangle rotated at 0 degrees in the top left, a tiny green triangle rotated at 135 degrees in the top right. "}, {"role": "user", "content": "The pattern that logically follows is:\n9. "}], "ideal": "On an image, a huge cyan circle rotated at 135 degrees in the bottom right, a tiny cyan circle rotated at 0 degrees in the top left, a tiny cyan circle rotated at 135 degrees in the bottom left. "} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Write the correct pattern with the same format as in the examples. Patterns in the sequence are preceded by a number from 1 to 8. "}, {"role": "user", "content": "1. On an image, a small red circle rotated at -135 degrees in the top left. "}, {"role": "user", "content": "2. On an image, a small red hexagon rotated at -135 degrees in the top right. "}, {"role": "user", "content": "3. 
On an image, a small red triangle rotated at -135 degrees in the center. "}, {"role": "user", "content": "4. On an image, a giant cyan hexagon rotated at -135 degrees in the top center. "}, {"role": "user", "content": "5. On an image, a giant cyan triangle rotated at -135 degrees in the center left. "}, {"role": "user", "content": "6. On an image, a giant cyan circle rotated at -135 degrees in the center right. "}, {"role": "user", "content": "7. On an image, a tiny green triangle rotated at -45 degrees in the center, a tiny green triangle rotated at -45 degrees in the bottom left, a tiny green triangle rotated at -45 degrees in the center left. "}, {"role": "user", "content": "8. On an image, a tiny green circle rotated at -45 degrees in the bottom left, a tiny green circle rotated at -45 degrees in the bottom right, a tiny green circle rotated at -45 degrees in the center right. "}, {"role": "user", "content": "The pattern that logically follows is:\n9. "}], "ideal": "On an image, a tiny green hexagon rotated at -45 degrees in the bottom right, a tiny green hexagon rotated at -45 degrees in the top center, a tiny green hexagon rotated at -45 degrees in the bottom center. "} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Write the correct pattern with the same format as in the examples. Patterns in the sequence are preceded by a number from 1 to 8. "}, {"role": "user", "content": "1. [(D, B, F, F,)] "}, {"role": "user", "content": "2. [(F, B, D, F,)] "}, {"role": "user", "content": "3. [(B, A, B, F,)] "}, {"role": "user", "content": "4. [(B, B, F, G,)] "}, {"role": "user", "content": "5. [(D, B, D, G,)] "}, {"role": "user", "content": "6. [(F, A, B, G,)] "}, {"role": "user", "content": "7. [(F, A, F, C,)] "}, {"role": "user", "content": "8. [(B, A, D, C,)] "}, {"role": "user", "content": "The pattern that logically follows is:\n9. 
"}], "ideal": "[(D, A, B, C,)] "} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Write the correct pattern with the same format as in the examples. Patterns in the sequence are preceded by a number from 1 to 8. "}, {"role": "user", "content": "1. [(E, H, B, H, BR), (B, H, B, C, TL), (D, H, B, E, BL)] "}, {"role": "user", "content": "2. [(E, I, F, H, BR), (B, I, F, C, BL), (D, I, F, E, TR)] "}, {"role": "user", "content": "3. [(E, J, C, H, BR), (B, J, C, C, TL), (D, J, C, E, TR)] "}, {"role": "user", "content": "4. [(D, D, F, D, BR), (A, D, F, F, TL), (F, D, F, C, TR)] "}, {"role": "user", "content": "5. [(D, E, C, D, BR), (A, E, C, F, TL), (F, E, C, C, BL)] "}, {"role": "user", "content": "6. [(D, F, B, D, BR), (A, F, B, F, BL), (F, F, B, C, TR)] "}, {"role": "user", "content": "7. [(E, D, C, G, BR), (A, D, C, D, BL), (A, D, C, G, TR)] "}, {"role": "user", "content": "8. [(E, E, B, G, BR), (A, E, B, D, TL), (A, E, B, G, TR)] "}, {"role": "user", "content": "The pattern that logically follows is:\n9. "}], "ideal": "[(E, F, F, G, BR), (A, F, F, D, TL), (A, F, F, G, BL)] "} {"input": [{"role": "system", "content": "Find the pattern number 9 that completes the sequence. Write the correct pattern with the same format as in the examples. Patterns in the sequence are preceded by a number from 1 to 8. "}, {"role": "user", "content": "1. [(B, A, F, A, TL)] "}, {"role": "user", "content": "2. [(B, A, E, A, TR)] "}, {"role": "user", "content": "3. [(B, A, B, A, C)] "}, {"role": "user", "content": "4. [(F, F, E, A, TC)] "}, {"role": "user", "content": "5. [(F, F, B, A, CL)] "}, {"role": "user", "content": "6. [(F, F, F, A, CR)] "}, {"role": "user", "content": "7. [(A, E, B, C, C), (A, E, B, C, BL), (A, E, B, C, CL)] "}, {"role": "user", "content": "8. [(A, E, F, C, BL), (A, E, F, C, BR), (A, E, F, C, CR)] "}, {"role": "user", "content": "The pattern that logically follows is:\n9. 
"}], "ideal": "[(A, E, E, C, BR), (A, E, E, C, TC), (A, E, E, C, BC)] "} ``` </details>
…i#1124) # Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name The <eval_name> is **population_span_extraction** ID is **population_span_extraction.dev.v0** ### Eval description The model is shown abstracts of clinical drug trials and tasked with extracting the text spans that specify the population demographic of the shown abstract. The population demographic can be, but is not necessarily, specified in multiple separate spans. A previous version included examples containing 'problem' as part of the population (as per PICO criteria labeling) as opposed to strictly population demographics. We are now resubmitting a different version, with different abstracts, which contains only demographics annotations. ### What makes this a useful eval?
The repository specifically asks for "Real-world use cases". Extracting population spans from clinical trial abstracts is immensely useful to researchers who have to review and compare large numbers of clinical drug trials. The eval dataset is generated with multiple different prompts and satisfies all further criteria posed by OpenAI. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
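As a rough illustration of the sample format this eval uses, the sketch below checks that every span quoted in a sample's `ideal` string occurs verbatim in the abstract. It assumes that `ideal` lists spans as single-quoted strings separated by `', '` after the word `spans:`, which matches the samples shown in this PR but is not a documented contract. The helper names `parse_spans` and `check_sample` and the toy sample are hypothetical and not part of the submitted data or of the Evals library.

```python
import json

# Marker that precedes the quoted span list in the `ideal` strings
# (an assumption based on the samples in this PR).
PREFIX = "spans:"


def parse_spans(ideal):
    """Split the quoted, comma-separated spans that follow 'spans:' in `ideal`."""
    _, sep, tail = ideal.partition(PREFIX)
    if not sep:
        return []
    tail = tail.strip()
    # Drop the outermost quotes, then split on the quote-comma-quote separator.
    if len(tail) >= 2 and tail[0] == "'" and tail[-1] == "'":
        tail = tail[1:-1]
    return tail.split("', '")


def check_sample(line):
    """Return {span: found_verbatim_in_abstract} for one JSONL line."""
    sample = json.loads(line)
    abstract = " ".join(
        m["content"] for m in sample["input"] if m["role"] == "user"
    )
    return {span: span in abstract for span in parse_spans(sample["ideal"])}


# Hypothetical toy sample, written only to exercise the helpers.
toy_line = json.dumps({
    "input": [
        {"role": "system", "content": "Extract the population demographic spans."},
        {"role": "user", "content": "PARTICIPANTS: Forty adults with condition X. "
                                    "METHODS: Adults aged 18 to 65 were randomized."},
    ],
    "ideal": "In the abstract, population demographics are defined by the "
             "following spans: 'Forty adults with condition X', 'Adults aged 18 to 65'",
})

print(check_sample(toy_line))
```

A check like this only confirms that the reference spans are extractive; it says nothing about how a model's completion is graded, which depends on the eval class registered in the YAML.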
## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. 
### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ``` {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Efficacy of the dorzolamide/timolol fixed combination versus latanoprost in the treatment of ocular hypertension or glaucoma: combined analysis of pooled data from two large randomized observer and patient-masked studies.\n\nIn previous analyses of primary efficacy data from two randomized clinical trials, standard dosing regimens of the dorzolamide/timolol fixed combination (COSOPT) and latanoprost (XALATAN) were shown to have equivalent efficacy with regard to reduction in mean daytime diurnal intraocular pressure (IOP). We performed additional post hoc analyses of pooled data from these studies to compare further the efficacy of the two treatments. The studies used identical 3-month, parallel group, randomized, observer-masked and patient-masked, multicenter designs. Patients with a baseline IOP > or = 24 mm Hg were randomized to either the 2% dorzolamide/0.5% timolol combination eye drops twice daily (n = 273) or 0.005% latanoprost eye drops once daily (n = 271). The IOP measurements were made at 8 AM, 10 AM, 2 PM, and 4 PM at the baseline visit and then on each of the 3 monthly assessment days. The following measures were analyzed on a post hoc basis: 1) percentages of patients meeting target levels of IOP reduction; 2) mean IOP reduction in those patients with high IOP (> or =30 mmHg) at baseline; 3) mean IOP at each of the assessment time points during a day. A total of 259 patients in the dorzolamide/timolol group and 268 patients in the latanoprost group were included in the efficacy analysis. 
At 3 months, both treatments showed similar efficacy with regard to the percentages of patients who achieved target levels of IOP reduction (e.g., 40% IOP reduction in 15% of dorzolamide/timolol combination patients and 13% of latanoprost patients), mean IOP reduction in those patients with high IOP at baseline (dorzolamide/ timolol combination, 12.5 mmHg, latanoprost, 12.6 mmHg), and mean IOP at each time point during the day. By the measures used in this analysis, the dorzolamide/timolol combination and latanoprost were equally effective at lowering IOP in patients with ocular hypertension or glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a baseline IOP > or = 24 mm Hg '"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "Twenty-four-hour control with latanoprost-timolol-fixed combination therapy vs latanoprost therapy.\n\nOBJECTIVE: To evaluate the 24-hour efficacy and safety of the latanoprost-timolol maleate-fixed combination vs latanoprost therapy in patients with primary open-angle glaucoma.\nMETHODS: A prospective, observer-masked, crossover, active-controlled, randomized comparison in which after a 6-week medicine-free period, patients were randomized to either latanoprost-timolol-fixed combination therapy or latanoprost therapy, both dosed once each evening, alone for 8 weeks. Patients were then switched to the opposite treatment for 8 weeks. At the end of the washout and treatment periods, a 24-hour diurnal curve was performed.\nRESULTS: The baseline untreated mean +/- SD diurnal curve in 37 patients who completed the study was 24.2 +/- 2.0 mm Hg. The mean diurnal curve was 19.2 +/- 2.6 mm Hg for those who received latanoprost therapy alone and 16.7 +/- 2.1 mm Hg for those who received the fixed combination therapy (P<.001). 
The fixed combination therapy also provided a lower absolute intraocular pressure level (1.5-2.9 mm Hg, P<.001) and a greater intraocular pressure reduction from the untreated baseline (P<.001). Stinging was statistically lower with latanoprost therapy alone (P = .04), but itching was statistically increased compared with the fixed combination therapy (P = .04).\nCONCLUSION: The result of this study suggests that the latanoprost-timolol-fixed combination compared with latanoprost therapy alone provides improved intraocular pressure reduction over the 24-hour diurnal curve and for each individual time point in patients with primary open-angle glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: ' patients with primary open-angle glaucoma.'"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "A 12-week, randomized, double-masked, multicenter study of the fixed combination of latanoprost and timolol in the evening versus the individual components.\n\nPURPOSE: To compare the efficacy and tolerability of fixed-combination latanoprost and timolol applied in the evening with the concomitant use of the individual components.\nDESIGN: Twelve-week, randomized, double-masked, multicenter study.\nPARTICIPANTS: Five hundred seventeen randomized patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg.\nMETHODS: Patients received either the fixed combination of latanoprost and timolol once daily in the evening and a placebo in the morning and evening or the unfixed combination of latanoprost once daily in the evening and timolol in the morning and evening. Study visits were at weeks 2, 6, and 12. 
MAIN OUTCOME MEASURES: The primary efficacy end point was mean change from baseline to week 12 in diurnal IOP (mean IOPs of 8 am, 12 pm, and 4 pm). The fixed combination was considered noninferior to the unfixed combination if the upper limit of the 95% confidence interval (CI) of the difference was <1.5 mmHg (analysis of covariance). Adverse events were recorded at each visit.\nRESULTS: In all, 502 patients were included in intent-to-treat analyses (fixed combination, n = 255; unfixed combination, n = 247). For the fixed- and unfixed-combination groups, mean baseline diurnal IOP levels were 25.4 mmHg and 25.2 mmHg, respectively, and mean diurnal IOP reductions were 8.7 mmHg and 9.0 mmHg (between-treatment difference, 0.3 mmHg; 95% CI, -0.1 to 0.7 mmHg; P = 0.15). Both treatments were well tolerated.\nCONCLUSIONS: The fixed combination of latanoprost and timolol administered once daily in the evening is not inferior to the unfixed combination of latanoprost once daily in the evening and timolol twice daily. The fixed combination provides an effective and well-tolerated alternative to multiple instillations."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg'"} {"input": [{"role": "system", "content": "This is from a clinical drug trial abstract. 
Extract the parts specifying population demographics."}, {"role": "user", "content": "Efficacy of latanoprost or fixed-combination latanoprost-timolol in patients switched from a combination of timolol and a nonprostaglandin medication.\n\nPURPOSE: To compare latanoprost with the fixed-combination latanoprost-timolol in glaucoma or ocular hypertension patients switched from a combination glaucoma therapy with timolol and another nonprostaglandin medication.\nDESIGN: Prospective randomized clinical trial.\nMETHODS: Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor) underwent a 30-day washout of their medications. A masked observer then measured their intraocular pressure (IOP). The subjects were randomized to either latanoprost or fixed-combination latanoprost-timolol eyedrops to use once daily at 7 am. The IOP was measured again 30 days after the patients started using one of the study drugs by the same examiner at the same time. MAIN OUTCOME MEASURE: Comparison of the study medications' hypotensive effect.\nRESULTS: Fifty-three eyes (28 in the latanoprost group and 25 in the latanoprost-timolol group) from 28 patients were included in the study. The IOP reduction was greater in both study groups compared with the previous combination therapy with timolol and another nonprostaglandin medication in millimeters of mercury (7.7+/-2.3 vs. 5.5+/-2.3, P<0.001, for the latanoprost group; 8.5+/-3.5 vs. 6.3+/-2.7, P<0.001, for the latanoprost-timolol group) and percentage (35.8+/-8.2% vs. 25.6+/-8.9%, P<0.001, for the latanoprost group; 38.6+/-8.7% vs. 28.6+/-9.0%, P<0.001, for the latanoprost-timolol group). 
There was no statistical difference between latanoprost and fixed-combination latanoprost-timolol in reducing IOP, in either millimeters of mercury (P = 0.3) or percentage (P = 0.2).\nCONCLUSIONS: Both latanoprost and fixed-combination latanoprost-timolol may be viable substitutes for timolol and another nonprostaglandin medication in glaucoma or ocular hypertension patients."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor)'"} {"input": [{"role": "system", "content": "The Following text is an abstract of a clinical drug trial that specifies a population demographic. I want you to extract the text spans that contain these informations."}, {"role": "user", "content": "A 6-week, double-masked, parallel-group study of the efficacy and safety of travoprost 0.004% compared with latanoprost 0:005%/timolol 0.5% in patients with primary open-angle glaucoma or ocular hypertension.\n\nOBJECTIVE: The objective of this study was to directly compare the intraocular pressure (IOP)-lowering efficacy and safety of travoprost 0.004% eyedrops with the fixed combination of latanoprost 0.005%/timolol 0.5% eyedrops in patients with primary open-angle glaucoma or ocular hypertension.\nMETHODS: This was a randomized, double-masked, multicenter, parallel-group, active-controlled study. Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension were eligible to participate if their IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy, as indicated by IOP of 22 to 36 mm Hg at 9 AM at screening. Patients were randomly assigned in a 1:1 ratio to receive placebo + travoprost or latanoprost/timolol + placebo. 
Patients in the travoprost group administered travoprost at 9 PM and placebo at 9 AM; patients in the latanoprost/timolol group administered latanoprost/timolol at 9 AM and placebo at 9 PM. IOP measurements were performed using Goldmann applanation tonometry at 9 AM and 5 PM at the week-2 and week-6 visits. Both volunteered and elicited reports of adverse events were collected; all patients who were randomized and received > or =1 dose of study drug were included in the safety analysis.\nRESULTS: One hundred ten patients were randomized, of whom 106 patients were evaluable (travoprost, n = 50; latanoprost/timolol, n = 56). There were no statistically significant differences at baseline between the treatment groups, based on age group, sex, race, iris color, or diagnosis. Mean IOP values were not statistically different between groups at baseline or during treatment. In the pooled results for 9 Am assessment at weeks 2 and 6, mean (SEM) IOP reductions for travoprost and latanoprost/timolol were 7.0 (0.5) and 6.4 (0.5) mm Hg, respectively (P = NS). Adverse events related to therapy were mild in nature, and there were no statistically significant differences between the 2 treatment groups. The most frequently experienced adverse events in the travoprost group were ocular hyperemia (9.3%), foreign body sensation (5.6%), abnormal vision (1.9%), allergic reaction (1.9%), conjunctivitis (1.9%), dacryocystitis (1.9%), eye discharge (1.9%), eye pruritus (1.9%), lid edema (1.9%), lid erythema (1.9%), and tearing (1.9%). In the latanoprost/timolol group, the most frequently experienced adverse events were cataract (1.8%), dry eyes (1.8%), eye pruritus (1.8%), foreign body sensation (1.8%), and ocular hyperemia (1.8%).\nCONCLUSIONS: Mean IOP changes from baseline for travoprost 0.004% and latanoprost 0.005%/timolol 0.5% fixed combination were not significantly different at follow-up in these patients. 
Both medications were well tolerated."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma or ocular hypertension.', 'Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension', 'IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Comparison of the efficacy and safety of travoprost with a fixed-combination of dorzolamide and timolol in patients with open-angle glaucoma or ocular hypertension.\n\nPURPOSE: The purpose of this study was to compare travoprost (TRAV; travoprost 0.004%) and the fixed-combination of dorzolamide/timolol (DTFC; dorzolamide 2.0%/timolol maleate 0.5%) ophthalmic solutions for reducing intraocular pressure (IOP) in patients with primary open-angle glaucoma (OAG) or ocular hypertension (OHT).\nMETHODS: This was a randomized single masked, study with parallel controls. The TRAV group (n = 29) dosed once daily at 9:00 PM while the DTFC group (n = 27) dosed twice daily at 9:00 AM and 9:00 PM. IOP was measured at baseline, and following 3 weeks and 6 weeks of treatment at 8:00 AM, 12:00 PM, 4:00 PM, and 8:00 PM.\nRESULTS: Mean average IOP reductions from baseline during the course of the day were 7.5 (32.7%) and 7.1 (30.7%) mmHg for TRAV and 4.8 (23.1%) and 4.5 (21.7%) mmHg for DTFC at 3 weeks and 6 weeks, respectively. The greater IOP reduction for patients receiving TRAV was statistically significant at both the 3 and 6 week visits when averaged across all four time points (p < 0.01). The two products were well-tolerated over the course of the 6 week study. 
Some factors such as taste perversion were reported more often in the DTFC group.\nCONCLUSIONS: Travoprost monotherapy provided better efficacy in terms of IOP reduction and percentage of IOP reduction compared to dorzolamide 2.0%/timolol maleate 0.5% fixed combination."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma (OAG)', 'ocular hypertension (OHT)'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Efficacy and safety of latanoprost versus travoprost in exfoliative glaucoma patients.\n\nOBJECTIVE: To evaluate 24-hour intraocular pressure (IOP) efficacy of latanoprost versus travoprost, each given every evening, in exfoliative glaucoma patients.\nDESIGN: Prospective, observer-masked, crossover comparison.\nPARTICIPANTS: Forty patients with exfoliation glaucoma.\nMETHODS: Patients with a pressure of >24 mmHg were randomized to latanoprost or travoprost for an 8-week treatment period after a 6-week medicine-free period. Patients were then switched to the opposite treatment for the second period. At untreated baseline and at the end of each treatment period the IOP was measured at 6 am, 10 am, 2 pm, 6 pm, 10 pm, and 2 am. MAIN OUTCOME MEASURE: Diurnal IOP.\nRESULTS: The mean 24-hour IOP was 25.1+/-2.5 mmHg at baseline, 17.8+/-2.1 mmHg on latanoprost, and 17.3+/-2.2 mmHg on travoprost (P = 0.001). Individual time points were similar between treatments, except at 6 pm when travoprost provided lower IOP (16.7+/-2.6 vs 17.9+/-2.5 mmHg, P<0.001). 
Adverse events showed more conjunctival hyperemia with travoprost (n = 15) than latanoprost (n = 6; P = 0.03).\nCONCLUSIONS: Latanoprost and travoprost both significantly reduce the 24-hour IOP from baseline in exfoliative glaucoma, but travoprost may demonstrate a greater hypotensive efficacy in the late afternoon."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a pressure of >24 mmHg', 'exfoliative glaucoma patients'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparison of the ocular hypotensive effects of bimatoprost and timolol-dorzolamide combination in patients with elevated intraocular pressure: a 6-month study.\n\nPURPOSE: To compare the ocular hypotensive efficacy and safety of topical bimatoprost and timolol-dorzolamide combination in patients with primary open-angle glaucoma (POAG) or ocular hypertension during 6 months of treatment.\nMETHODS: A sample of 65 patients with a diagnosis of POAG or ocular hypertension were randomized to receive either bimatoprost 0.03% once daily or timolol-dorzolamide combination twice daily. Study visits occurred at baseline and after 2 weeks and 1, 3 and 6 months of therapy. Intraocular pressure (IOP) measurements were performed at 12.00 hours at all study visits and also at 08.00 hours and 16.00 hours at baseline and 6-month visits. At each visit, local and systemic side-effects that occurred during the treatment period were recorded. Student's t-test was used to compare the differences between IOP values.\nRESULTS: Differences in IOP between the bimatoprost and timolol-dorzolamide groups were statistically insignificant at all study visits (p > 0.05). In the bimatoprost-treated group, the IOP reduction was 6.2 +/- 1.8 mmHg, whereas it was 6.5 +/- 2.3 mmHg in the timolol-dorzolamide group after 6 months of treatment. 
The difference was not statistically significant (p = 0.48).\nCONCLUSIONS: The IOP-lowering efficacies of bimatoprost and timolol-dorzolamide combination were similar over a 6-month follow-up. Both bimatoprost and the timolol-dorzolamide combination were well tolerated. Bimatoprost can be used as a longterm monotherapy agent in the treatment of POAG and ocular hypertension."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open-angle glaucoma (POAG) or ocular hypertension'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparing the fixed combination brimonidine-timolol versus fixed combination dorzolamide-timolol in patients with elevated intraocular pressure.\n\nPURPOSE: To evaluate the efficacy of fixed combination brimonidine-timolol (FCBT) versus fixed combination dorzolamide-timolol (FCDT) given twice daily in patients with primary open angle glaucoma (POAG) or ocular hypertension (OH).\nDESIGN: Prospective, multicentre, masked-observer, crossover comparison.\nPARTICIPANTS: Sixteen patients with POAG and 14 with OH.\nMETHODS: The participants of the study were washed out from their previous medication and randomized to fixed FCBT or FCDT for the first 4-week treatment period. Subjects then were washed for 4 weeks and started on the opposite medication for the second 4-week period. Intraocular pressure (IOP) was measured with a Goldmann applanation tonometer at 8:00 a.m., 12:00 noon and 4:00 p.m. at each baseline and at the end of each treatment period. Unsolicited ocular adverse events were also recorded. MAIN OUTCOME MEASURES: Comparison of the IOP lowering effect of FCBT and FCDT.\nRESULTS: The baseline mean diurnal IOP for all 30 subjects (30 eyes) was 22.9 +/- 1.6 mmHg. Both fixed combinations significantly reduced IOP compared with baseline (p < 0.00001). 
The mean diurnal IOP following 4 weeks of therapy was 15.0 +/- 2.1 mmHg for FCBT and 15.4 +/- 2.1 mmHg for FCDT (p = 0.510). The mean diurnal IOP reduction was 7.8 +/- 1.9 mmHg for FCBT and 7.4 +/- 1.8 mmHg for FCDT (p = 0.430). Overall, 14 subjects complained about ocular adverse events: two only for FCBT, seven only for FCDT and five for both drugs. Although there was no significant difference between the number of subjects that reported ocular adverse events with FCBT (n = 7) and FCDT (n = 12) (p = 0.359), FCDT caused more ocular stinging upon instillation (n = 9) than FCBT (n = 1) (p = 0.027).\nCONCLUSION: This study suggests that FCBT and FCDT, each given twice daily, have similar efficacy in patients with POAG or OH."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open angle glaucoma (POAG) or ocular hypertension (OH)', 'patients with POAG', 'OH'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. 
Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "A comparison of the safety and intraocular pressure lowering of bimatoprost/timolol fixed combination versus latanoprost/timolol fixed combination in patients with open-angle glaucoma.\n\nPURPOSE: To compare the efficacy and tolerability of a once daily evening dose of the latanoprost/timolol fixed combination (LTFC) with that of a once-daily evening dose of the bimatoprost/timolol fixed combination (BTFC) in patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides.\nDESIGN: Prospective, randomized, evaluator masked, single-center study.\nPARTICIPANTS: 36 patients with a diagnosis of open-angle glaucoma, with or without pseudoexfoliation, and inadequate control of IOP, insufficiently responsive to monotherapy with prostaglandin analogues/prostamides. MAIN OUTCOME MEASURE: The primary end-points were the change in IOP at 9:00 am from baseline to week 4, and the difference between treatment groups in the mean diurnal IOP reduction from baseline to week 4.\nRESULTS: BTFC provided significantly greater mean diurnal IOP reduction [mean (standard deviation)] 2.8 (0.9) mmHg, compared with LTFC 2.1 (0.6) mmHg, p = 0.0214. Both treatments significantly reduced the IOP from baseline at each IOP time-point measured, p < 0.0001, and for the mean diurnal IOP; p = 0.0049 for the LTFC, and p < 0.0001 for the BTFC. There were no significant differences in average hyperemia scores among groups, 1.25 (0.5) vs. 
1.62 (0.69), p = 0.3835, for the LTFC and the BTFC, respectively.\nCONCLUSIONS: The results of this study showed a significantly higher IOP-lowering effect of a once-daily evening dose of the BTFC compared to that of a once-daily evening administration of the LTFC."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides'"} ``` </details>
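The `ideal` strings in the samples above list the expected spans in single quotes after a fixed "spans: " prefix. The following is a minimal sketch, assuming that quoting convention holds for every record, of a sanity check that each quoted span occurs verbatim in the abstract. The helper name `check_ideal_spans` and the shortened record are hypothetical illustrations and are not part of the evals package.

```python
import json
import re
from typing import List

# Assumption: every "ideal" string ends with a list of spans wrapped in
# single quotes and separated by ", ", as in the samples above.
SPAN_PATTERN = re.compile(r"'(.*?)'(?=, '|$)")


def check_ideal_spans(sample: dict) -> List[str]:
    """Return the quoted spans from `ideal` that are not substrings of the user message."""
    abstract = next(m["content"] for m in sample["input"] if m["role"] == "user")
    # Only look at the text after the fixed "spans: " prefix.
    _, _, tail = sample["ideal"].partition("spans: ")
    return [span for span in SPAN_PATTERN.findall(tail) if span not in abstract]


# Shortened, illustrative record built from one of the samples above.
line = json.dumps({
    "input": [
        {"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."},
        {"role": "user", "content": "Efficacy and safety of latanoprost versus travoprost in exfoliative glaucoma patients.\n\nMETHODS: Patients with a pressure of >24 mmHg were randomized to latanoprost or travoprost."},
    ],
    "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a pressure of >24 mmHg', 'exfoliative glaucoma patients'",
})

missing = check_ideal_spans(json.loads(line))
print(missing)  # an empty list means every span was found verbatim
```

Running a check like this over each line of the JSONL file before submission would flag spans that were paraphrased rather than copied, which matters if the eval is scored against the literal span text.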
Minor misspelling fix

---------

Co-authored-by: Alvin Wang <[email protected]>
Co-authored-by: Tim <[email protected]>

# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

korean_dialects

### Eval description

The eval aims to assess the model's ability to identify the specific South Korean dialect a sentence belongs to.

### What makes this a useful eval?

This eval provides the opportunity to understand how well GPT models can classify South Korean dialects. The dialects within South Korea are fairly distinct in terms of pronunciation, vocabulary, grammar, and intonation. Being able to determine the dialect can help in providing social, cultural, and historical insights.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```jsonl
{"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"모두 다 오세요."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"벌써 열시 반이에요."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"창피해서 말도 못해요."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"그럼요 가능하죠."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"똑바로 해주세요."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"이거 조금 짜."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"간신히 도착했습니다."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"야 이거 좀 별로다."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"와 정말 많다."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"거기 구멍에 잘 끼워보세요."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"진짜 피곤해요."}],"ideal":"서울"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"마카 모예."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"하마 열시 반이잖소."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"남새시러운기 왜그르나."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"해봐요 그 될껄?"}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"거 똑떼이 해야될끼라요."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"이기 쫌 짜구워."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"거 간신히 도착했잖소."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"아 매련도 읍싸."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"마이 있잖소."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"구녕에 똑띠 끼워봐요."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"쌔가 빠진다야."}],"ideal":"강원도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"야 싹 다 온나."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"벌써 열시 반이가."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"하모 된다카이."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"단디 해라이."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"이거 쫌 짭다."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"내 강가이 도착했다."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"그 영 파이다."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"억수로 많다."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"구녕에 단디 낑가라이."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"아따 대다."}],"ideal":"경상도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"싹 다 와부쇼."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"폴쎄 열시 반이되브렀네."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"아따 하먼 된당께."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"똑떨어지게 해보랑게."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"포도시 도착했시야."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"야 이거 물짜야 못 쓰것다이."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"아따 겁나게 많네."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"구녁에 잘 찡거보쇼 딱 맞게 그게 맞소?"}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"오메 된 거 걍 오만 삭신이 아퍼 죽겄소이."}],"ideal":"전라도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"거기 다 와봐유."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"벌써 열시 반인겨?"}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"될겨."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"할껴 알 할껴."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"갠시히 왔네."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"글쎄 잘 모르겄어."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"뭐여 뭐가 이렇게 많은겨."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"거기 구멍에 잘 좀 낌어봐."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"진짜 대간햐."}],"ideal":"충청도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"발써 열시 반되수과."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"키여 맞수다게."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"졸바로 해줍써."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"제우 제우 와수다."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"영 벨론게게."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. 
Return just the dialect with no other words or punctuation."},{"role":"user","content":"잘도 하영이 숨게데."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"트멍더래 잘 쫍지라."}],"ideal":"제주도"} {"input":[{"role":"system","content":"You will be prompted with a Korean sentence to determine which South Korean dialect the sentence belongs to. This is multiple choice problem where your answer is one of the following six dialects: 강원도, 경상도, 전라도, 제주도, 충청도, or 서울. Return just the dialect with no other words or punctuation."},{"role":"user","content":"잘도 버침게."}],"ideal":"제주도"} ``` </details>
# Thank you for contributing an eval!♥️
🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨
**PLEASE READ THIS**:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**
Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).
## Eval details 📑
### Eval name
NFL Point Combinations
### Eval description
This eval tests the model's ability to calculate all the possible ways to achieve a specific score by a single team in an NFL game.
### What makes this a useful eval?
This eval is very similar to the coin change problem, which GPT-4 handles very well. However, the model is extremely inaccurate when asked to apply that same type of problem to a real-life situation (2.5% accuracy for GPT-3.5-turbo and 12.5% accuracy for GPT-4). It is important for the model to learn how to derive logic problems from real-life context.
## Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high-quality examples.**
If there is anything else that makes your eval worth including, please document it below.
### Unique eval value
> Insert what makes your eval high quality that was not mentioned above. (Not required)
## Eval structure 🏗️
Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
## Final checklist 👀
### Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
### Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
### Limited availability acknowledgment
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.
### Submit eval
- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
### Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
<details>
<summary>View evals in JSON</summary>
### Eval
```jsonl
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 4 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[1]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 6 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[3]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 7 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[2]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 12 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[7]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 25 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[24]"}
{"input": [{"role": "system", "content": "As of the year 2010, in American Football, how many unique, order-independent ways can an NFL (National Football League) team score exactly 38 points in a single game? Exclude one-point safeties as one of the scoring options. List out all the possible combinations and write your final answer as a single number enclosed in square brackets."}], "ideal": "[68]"}
```
</details>
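For reference, the coin change formulation mentioned in the NFL Point Combinations description can be sketched as follows. This is a minimal sketch, not the eval's own grading code. The PR does not list the scoring values, so the set `(2, 3, 6, 7)` (safety, field goal, touchdown, touchdown plus extra point) is an assumption. It was chosen because it reproduces the sample ideals for 4, 6, 7, 12, 25, and 38 points; adding an 8-point touchdown with two-point conversion would give 8 ways for 12 points instead of the listed 7. The function name `count_score_combinations` is hypothetical.

```python
def count_score_combinations(target: int, plays=(2, 3, 6, 7)) -> int:
    """Count order-independent combinations of `plays` that sum to `target`.

    `plays` is an assumed set of point values, not taken from the PR text.
    """
    if target < 0:
        return 0
    ways = [0] * (target + 1)
    ways[0] = 1  # one way to score zero points: no scoring plays at all
    # Iterating over play values in the outer loop counts each combination
    # once regardless of order (the classic coin change recurrence).
    for points in plays:
        for total in range(points, target + 1):
            ways[total] += ways[total - points]
    return ways[target]


if __name__ == "__main__":
    for score in (4, 6, 7, 12, 25, 38):
        print(score, count_score_combinations(score))
```

Under the assumed scoring set, the printed counts match the `ideal` values in the samples above (1, 3, 2, 7, 24, 68).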
# Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, __failure to follow
the guidelines below will result in the PR being closed automatically__.
Note that even if the criteria are met, that does not guarantee the PR
will be merged nor GPT-4 access granted. 🚨
__PLEASE READ THIS__:
In order for a PR to be merged, it must fail on GPT-4. We are aware that
right now, users do not have access, so you will not be able to tell if
the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep
in mind as we run the eval, if GPT-4 gets higher than 90% on the eval,
we will likely reject since GPT-4 is already capable of completing the
task.
We plan to roll out a way for users submitting evals to see the eval
performance on GPT-4 soon. Stay tuned! Until then, you will not be able
to see the eval performance on GPT-4. **Starting April 10, the minimum
eval count is 15 samples, we hope this makes it easier to create and
contribute evals.**
## Eval details 📑
### Eval name
Pantone To Hex
### Eval description
This eval asks the model to convert friendly Pantone color names to
their closest hex counterparts.
### What makes this a useful eval?
Pantone colors are something that a lot of nontechnical folks use, and
converting color names like "Neutral Black" to hex is not intuitive.
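One way to quantify how close a predicted hex value is to the expected one is the Euclidean distance in RGB space. The sketch below is illustrative only: the PR does not say that answers are graded this way, and the helper names `hex_to_rgb` and `rgb_distance` are hypothetical.

```python
def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Parse a '#RRGGBB' string into an (R, G, B) tuple of integers."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_distance(a: str, b: str) -> float:
    """Euclidean distance between two hex colors in RGB space (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(hex_to_rgb(a), hex_to_rgb(b))) ** 0.5


if __name__ == "__main__":
    # Two sample ideals from this eval that differ by one step in the green channel.
    print(rgb_distance("#0085CA", "#0084CA"))
```

An exact string match would treat `#0085CA` and `#0084CA` as simply different, whereas a distance like this shows they are nearly identical colors.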
## Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general,
we are seeking cases where the model does not do a good job despite
being capable of generating a good response (note that there are some
things large language models cannot do, so those would not make good
evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically
consistent. We'd like to see a number of prompts all demonstrating some
particular failure mode. For example, we can create an eval on cases
where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4
or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means
either a correct answer for `Basic` evals or the `Fact` Model-graded
eval, or an exhaustive rubric for evaluating answers for the `Criteria`
Model-graded eval.
- [x] **Include at least 15 high quality examples.**
If there is anything else that makes your eval worth including, please
document it below.
### Unique eval value
> Insert what makes your eval high quality that was not mentioned above.
(Not required)
## Eval structure 🏗️
Your eval should
- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at
`evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing
eval classes. You may still write custom eval classes for your own
cases, and we may consider merging them in the future.)
## Final checklist 👀
### Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic
and data under the same MIT license as this repository. You must have
adequate rights to upload any data used in an Eval. OpenAI reserves the
right to use this data in future service improvements to our product.
Contributions to OpenAI Evals will be subject to our usual Usage
Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT
license and complies with OpenAI's usage policies.
### Email address validation
If your submission is accepted, we will be granting GPT-4 access to a
limited number of contributors. Access will be given to the email
address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if
applicable, to the email address used for my merged pull request.
### Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission,
help improve our models, and gain access to GPT-4. However, due to the
requirements mentioned above and high volume of submissions, we will not
be able to accept all submissions and thus not grant everyone who opens
a PR GPT-4 access. We know this is disappointing, but we hope to set the
right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements
above, does not guarantee the PR will be merged nor GPT-4 access
granted.
### Submit eval
- [x] I have filled out all required fields in the evals PR form
- [x] (Ignore if not submitting code) I have run `pip install
pre-commit; pre-commit install` and have verified that `black`, `isort`,
and `autoflake` are running when I commit and push
Failure to fill out all required fields will result in the PR being
closed.
### Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as
many Eval Samples (at least 5) from their contribution here:
<details>
<summary>View evals in JSON</summary>
### Eval
```jsonl
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Yellow C"}],"ideal":"#FEDD00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Yellow 012 C"}],"ideal":"#FFD700"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Orange 021 C"}],"ideal":"#FE5000"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Warm Red C"}],"ideal":"#F9423A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Red 032 C"}],"ideal":"#EF3340"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Rubine Red C"}],"ideal":"#CE0058"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Rhodamine Red C"}],"ideal":"#E10098"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Purple C"}],"ideal":"#BB29BB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Violet C"}],"ideal":"#440099"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Blue 072 C"}],"ideal":"#10069F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Reflex Blue C"}],"ideal":"#001489"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Process Blue C"}],"ideal":"#0085CA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Green C"}],"ideal":"#00AB84"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Black C"}],"ideal":"#2D2926"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Yellow 0131 C"}],"ideal":"#F2F0A1"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Red 0331 C"}],"ideal":"#FCAEBB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Magenta 0521 C"}],"ideal":"#F1B2DC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Violet 0631 C"}],"ideal":"#BF9BDE"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Blue 0821 C"}],"ideal":"#74D1EA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Green 0921 C"}],"ideal":"#9DE7D7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Black 0961 C"}],"ideal":"#9E978E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"801 C"}],"ideal":"#009ACE"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"802 C"}],"ideal":"#44D62C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"803 C"}],"ideal":"#FFE900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"804 C"}],"ideal":"#FFAA4D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"805 C"}],"ideal":"#FF7276"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"806 C"}],"ideal":"#FF3EB5"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"807 C"}],"ideal":"#EA27C2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"871 C"}],"ideal":"#84754E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"872 C"}],"ideal":"#85714D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"873 C"}],"ideal":"#866D4B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"874 C"}],"ideal":"#8B6F4E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"875 C"}],"ideal":"#87674F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"876 C"}],"ideal":"#8B634B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"877 C"}],"ideal":"#8A8D8F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Medium Yellow C"}],"ideal":"#FFD900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Bright Orange C"}],"ideal":"#FF5E00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Bright Red C"}],"ideal":"#F93822"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Strong Red C"}],"ideal":"#CE0056"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Pink C"}],"ideal":"#D62598"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Medium Purple C"}],"ideal":"#4E008E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Dark Blue C"}],"ideal":"#00239C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Medium Blue C"}],"ideal":"#0084CA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Bright Green C"}],"ideal":"#00B08B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"Neutral Black C"}],"ideal":"#222223"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"100 C"}],"ideal":"#F6EB61"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"101 C"}],"ideal":"#F7EA48"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"102 C"}],"ideal":"#FCE300"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"103 C"}],"ideal":"#C5A900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"104 C"}],"ideal":"#AF9800"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"105 C"}],"ideal":"#897A27"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7401 C"}],"ideal":"#F5E1A4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7402 C"}],"ideal":"#ECD898"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7403 C"}],"ideal":"#EED484"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7404 C"}],"ideal":"#F4DA40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7405 C"}],"ideal":"#F2CD00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7406 C"}],"ideal":"#F1C400"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7407 C"}],"ideal":"#CBA052"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"106 C"}],"ideal":"#F9E547"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"107 C"}],"ideal":"#FBE122"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"108 C"}],"ideal":"#FEDB00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"109 C"}],"ideal":"#FFD100"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"110 C"}],"ideal":"#DAAA00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"111 C"}],"ideal":"#AA8A00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"112 C"}],"ideal":"#9C8412"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"113 C"}],"ideal":"#FAE053"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"114 C"}],"ideal":"#FBDD40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"115 C"}],"ideal":"#FDDA24"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"116 C"}],"ideal":"#FFCD00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"117 C"}],"ideal":"#C99700"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"118 C"}],"ideal":"#AC8400"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"119 C"}],"ideal":"#897322"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"127 C"}],"ideal":"#F3DD6D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"128 C"}],"ideal":"#F3D54E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"129 C"}],"ideal":"#F3D03E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"130 C"}],"ideal":"#F2A900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"131 C"}],"ideal":"#CC8A00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"132 C"}],"ideal":"#A07400"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"133 C"}],"ideal":"#6C571B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1205 C"}],"ideal":"#F8E08E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1215 C"}],"ideal":"#FBD872"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1225 C"}],"ideal":"#FFC845"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1235 C"}],"ideal":"#FFB81C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1245 C"}],"ideal":"#C69214"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1255 C"}],"ideal":"#AD841F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1265 C"}],"ideal":"#886B25"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"120 C"}],"ideal":"#FBDB65"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"121 C"}],"ideal":"#FDD757"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"122 C"}],"ideal":"#FED141"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"123 C"}],"ideal":"#FFC72C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"124 C"}],"ideal":"#EAAA00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"125 C"}],"ideal":"#B58500"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"126 C"}],"ideal":"#9A7611"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7548 C"}],"ideal":"#FFC600"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7549 C"}],"ideal":"#FFB500"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7550 C"}],"ideal":"#D19000"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7551 C"}],"ideal":"#B47E00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7552 C"}],"ideal":"#73531D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7553 C"}],"ideal":"#5A4522"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7554 C"}],"ideal":"#4B3D2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7555 C"}],"ideal":"#D29F13"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7556 C"}],"ideal":"#B78B20"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7557 C"}],"ideal":"#9F7D23"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7558 C"}],"ideal":"#967126"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7559 C"}],"ideal":"#8F6A2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7560 C"}],"ideal":"#7D622E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7561 C"}],"ideal":"#6C5D34"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"134 C"}],"ideal":"#FDD26E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"135 C"}],"ideal":"#FFC658"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"136 C"}],"ideal":"#FFBF3F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"137 C"}],"ideal":"#FFA300"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"138 C"}],"ideal":"#DE7C00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"139 C"}],"ideal":"#AF6D04"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"140 C"}],"ideal":"#74531C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1345 C"}],"ideal":"#FDD086"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1355 C"}],"ideal":"#FFC56E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1365 C"}],"ideal":"#FFB549"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1375 C"}],"ideal":"#FF9E1B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1385 C"}],"ideal":"#D57800"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1395 C"}],"ideal":"#996017"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1405 C"}],"ideal":"#6E4C1E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"141 C"}],"ideal":"#F2C75C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"142 C"}],"ideal":"#F1BE48"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"143 C"}],"ideal":"#F1B434"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"144 C"}],"ideal":"#ED8B00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"145 C"}],"ideal":"#CF7F00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"146 C"}],"ideal":"#A76D11"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"147 C"}],"ideal":"#715C2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7408 C"}],"ideal":"#F6BE00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7409 C"}],"ideal":"#F0B323"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7410 C"}],"ideal":"#FEAD77"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7411 C"}],"ideal":"#E6A65D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7412 C"}],"ideal":"#D38235"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7413 C"}],"ideal":"#DC8633"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7414 C"}],"ideal":"#C16C18"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7562 C"}],"ideal":"#BD9B60"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7563 C"}],"ideal":"#D69A2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7564 C"}],"ideal":"#DB8A06"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7565 C"}],"ideal":"#CD7925"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7566 C"}],"ideal":"#AD6433"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7567 C"}],"ideal":"#89532F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7568 C"}],"ideal":"#775135"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7569 C"}],"ideal":"#D78825"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7570 C"}],"ideal":"#D3832B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7571 C"}],"ideal":"#C67D30"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7572 C"}],"ideal":"#B67233"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7573 C"}],"ideal":"#A7662B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7574 C"}],"ideal":"#9E6A38"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7575 C"}],"ideal":"#835D32"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"712 C"}],"ideal":"#FCC89B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"713 C"}],"ideal":"#FDBE87"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"714 C"}],"ideal":"#FDAA63"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"715 C"}],"ideal":"#F68D2E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"716 C"}],"ideal":"#EA7600"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"717 C"}],"ideal":"#D45D00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"718 C"}],"ideal":"#BE4D00"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"148 C"}],"ideal":"#FECB8B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"149 C"}],"ideal":"#FFC27B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"150 C"}],"ideal":"#FFB25B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"151 C"}],"ideal":"#FF8200"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"152 C"}],"ideal":"#E57200"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"153 C"}],"ideal":"#BE6A14"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"154 C"}],"ideal":"#9B5A1A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"155 C"}],"ideal":"#EFD19F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"156 C"}],"ideal":"#EFBE7D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"157 C"}],"ideal":"#ECA154"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"158 C"}],"ideal":"#E87722"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"159 C"}],"ideal":"#CB6015"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"160 C"}],"ideal":"#A1561C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"161 C"}],"ideal":"#603D20"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1485 C"}],"ideal":"#FFAE62"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1495 C"}],"ideal":"#FF8F1C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1505 C"}],"ideal":"#FF6900"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1525 C"}],"ideal":"#B94700"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1535 C"}],"ideal":"#94450B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1545 C"}],"ideal":"#653819"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1555 C"}],"ideal":"#FFB990"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1565 C"}],"ideal":"#FFA06A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1575 C"}],"ideal":"#FF7F32"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1585 C"}],"ideal":"#FF6A13"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1595 C"}],"ideal":"#D86018"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1605 C"}],"ideal":"#A65523"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1615 C"}],"ideal":"#8B4720"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"162 C"}],"ideal":"#FFBE9F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"163 C"}],"ideal":"#FF9D6E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"164 C"}],"ideal":"#FF7F41"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"165 C"}],"ideal":"#FF671F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"166 C"}],"ideal":"#E35205"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"167 C"}],"ideal":"#BE531C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"168 C"}],"ideal":"#73381D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7576 C"}],"ideal":"#DB864E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7577 C"}],"ideal":"#E07E3C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7578 C"}],"ideal":"#DC6B2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7579 C"}],"ideal":"#DC582A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7580 C"}],"ideal":"#C05131"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7581 C"}],"ideal":"#864A33"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7582 C"}],"ideal":"#674736"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1625 C"}],"ideal":"#FFA38B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1635 C"}],"ideal":"#FF8D6D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1645 C"}],"ideal":"#FF6A39"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1655 C"}],"ideal":"#FC4C02"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1665 C"}],"ideal":"#DC4405"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1675 C"}],"ideal":"#A9431E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1685 C"}],"ideal":"#833921"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"169 C"}],"ideal":"#FFB3AB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"170 C"}],"ideal":"#FF8674"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"171 C"}],"ideal":"#FF5C39"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"172 C"}],"ideal":"#FA4616"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"173 C"}],"ideal":"#CF4520"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"174 C"}],"ideal":"#963821"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"175 C"}],"ideal":"#6B3529"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7583 C"}],"ideal":"#C4622D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7584 C"}],"ideal":"#BA5826"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7585 C"}],"ideal":"#AF5C37"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7586 C"}],"ideal":"#9E5330"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7587 C"}],"ideal":"#924C2E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7588 C"}],"ideal":"#7B4D35"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7589 C"}],"ideal":"#5C4738"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7590 C"}],"ideal":"#D4B59E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7591 C"}],"ideal":"#C07D59"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7592 C"}],"ideal":"#B15533"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7593 C"}],"ideal":"#9D432C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7594 C"}],"ideal":"#7C3A2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7595 C"}],"ideal":"#6B3D2E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7596 C"}],"ideal":"#5C3D31"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7597 C"}],"ideal":"#D14124"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7598 C"}],"ideal":"#BD472A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7599 C"}],"ideal":"#B33D26"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7600 C"}],"ideal":"#8D3F2B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7601 C"}],"ideal":"#83412C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7602 C"}],"ideal":"#7B4931"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7603 C"}],"ideal":"#674230"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7604 C"}],"ideal":"#E4D5D3"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7605 C"}],"ideal":"#E1BBB4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7606 C"}],"ideal":"#D6938A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7607 C"}],"ideal":"#C26E60"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7608 C"}],"ideal":"#A4493D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7609 C"}],"ideal":"#823B34"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7610 C"}],"ideal":"#683431"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7611 C"}],"ideal":"#DDBCB0"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7612 C"}],"ideal":"#CA9A8E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7613 C"}],"ideal":"#BC8A7E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7614 C"}],"ideal":"#A37F74"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7615 C"}],"ideal":"#866761"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7616 C"}],"ideal":"#6B4C4C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7617 C"}],"ideal":"#583D3E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7520 C"}],"ideal":"#EABEB0"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7521 C"}],"ideal":"#C09C83"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7522 C"}],"ideal":"#B46A55"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7523 C"}],"ideal":"#AB5C57"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7524 C"}],"ideal":"#A45248"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7525 C"}],"ideal":"#9A6A4F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7526 C"}],"ideal":"#8A391B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"489 C"}],"ideal":"#ECC3B2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"488 C"}],"ideal":"#ECBAA8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"487 C"}],"ideal":"#EAA794"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"486 C"}],"ideal":"#E8927C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"485 C"}],"ideal":"#DA291C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"484 C"}],"ideal":"#9A3324"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"483 C"}],"ideal":"#653024"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"176 C"}],"ideal":"#FFB1BB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"177 C"}],"ideal":"#FF808B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"178 C"}],"ideal":"#FF585D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"179 C"}],"ideal":"#E03C31"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"180 C"}],"ideal":"#BE3A34"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"181 C"}],"ideal":"#81312F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1765 C"}],"ideal":"#FFA3B5"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1775 C"}],"ideal":"#FF8DA1"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1785 C"}],"ideal":"#F8485E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1788 C"}],"ideal":"#EE2737"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1795 C"}],"ideal":"#D22630"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1805 C"}],"ideal":"#AF272F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1815 C"}],"ideal":"#7C2529"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1767 C"}],"ideal":"#FCAFC0"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1777 C"}],"ideal":"#FB637E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1787 C"}],"ideal":"#F4364C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1797 C"}],"ideal":"#CB333B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1807 C"}],"ideal":"#A4343A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"1817 C"}],"ideal":"#643335"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7618 C"}],"ideal":"#C66E4E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7619 C"}],"ideal":"#C04C36"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7620 C"}],"ideal":"#B7312C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7621 C"}],"ideal":"#AB2328"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7622 C"}],"ideal":"#93272C"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7623 C"}],"ideal":"#8A2A2B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7624 C"}],"ideal":"#802F2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7625 C"}],"ideal":"#E1523D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7626 C"}],"ideal":"#C63527"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7627 C"}],"ideal":"#A72B2A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7628 C"}],"ideal":"#9E2A2B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7629 C"}],"ideal":"#6D3332"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7630 C"}],"ideal":"#633231"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7631 C"}],"ideal":"#572D2D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex
representation."},{"role":"user","content":"7415 C"}],"ideal":"#E6BAA8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7416 C"}],"ideal":"#E56A54"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7417 C"}],"ideal":"#E04E39"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7418 C"}],"ideal":"#CD545B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7419 C"}],"ideal":"#B04A5A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7420 C"}],"ideal":"#9B2242"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7421 C"}],"ideal":"#651D32"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"182 C"}],"ideal":"#FABBCB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"183 C"}],"ideal":"#FC9BB3"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"184 C"}],"ideal":"#F65275"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"185 C"}],"ideal":"#E4002B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"186 C"}],"ideal":"#C8102E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"187 C"}],"ideal":"#A6192E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"188 C"}],"ideal":"#76232F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"196 C"}],"ideal":"#ECC7CD"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"197 C"}],"ideal":"#E89CAE"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"198 C"}],"ideal":"#DF4661"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"199 C"}],"ideal":"#D50032"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"200 C"}],"ideal":"#BA0C2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"201 C"}],"ideal":"#9D2235"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"202 C"}],"ideal":"#862633"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"189 C"}],"ideal":"#F8A3BC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"190 C"}],"ideal":"#F67599"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"191 C"}],"ideal":"#EF426F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"192 C"}],"ideal":"#E40046"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"193 C"}],"ideal":"#BF0D3E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"194 C"}],"ideal":"#9B2743"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"195 C"}],"ideal":"#782F40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1895 C"}],"ideal":"#F5B6CD"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1905 C"}],"ideal":"#F59BBB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1915 C"}],"ideal":"#EF4A81"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1925 C"}],"ideal":"#E0004D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1935 C"}],"ideal":"#C5003E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1945 C"}],"ideal":"#A6093D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"1955 C"}],"ideal":"#8A1538"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"705 C"}],"ideal":"#F5DADF"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"706 C"}],"ideal":"#F7CED7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"707 C"}],"ideal":"#F9B5C4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"708 C"}],"ideal":"#F890A5"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"709 C"}],"ideal":"#EF6079"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"710 C"}],"ideal":"#E03E52"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"711 C"}],"ideal":"#CB2C30"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"698 C"}],"ideal":"#F2D4D7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"699 C"}],"ideal":"#F4C3CC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"700 C"}],"ideal":"#F2ACB9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"701 C"}],"ideal":"#E68699"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"702 C"}],"ideal":"#D25B73"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"703 C"}],"ideal":"#B83A4B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"704 C"}],"ideal":"#9E2A2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"203 C"}],"ideal":"#ECB3CB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"204 C"}],"ideal":"#E782A9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"205 C"}],"ideal":"#E0457B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"206 C"}],"ideal":"#CE0037"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"207 C"}],"ideal":"#A50034"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"208 C"}],"ideal":"#861F41"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"209 C"}],"ideal":"#6F263D"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"210 C"}],"ideal":"#F99FC9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"211 C"}],"ideal":"#F57EB6"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"212 C"}],"ideal":"#F04E98"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"213 C"}],"ideal":"#E31C79"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"214 C"}],"ideal":"#CE0F69"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"215 C"}],"ideal":"#AC145A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"216 C"}],"ideal":"#7D2248"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7422 C"}],"ideal":"#F4CDD4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7423 C"}],"ideal":"#E06287"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7424 C"}],"ideal":"#E24585"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7425 C"}],"ideal":"#B52555"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7426 C"}],"ideal":"#A4123F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7427 C"}],"ideal":"#971B2F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7428 C"}],"ideal":"#6A2C3E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7632 C"}],"ideal":"#D6C9CA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7633 C"}],"ideal":"#C4A4A7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7634 C"}],"ideal":"#C16784"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7635 C"}],"ideal":"#C63663"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7636 C"}],"ideal":"#BC204B"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7637 C"}],"ideal":"#912F46"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7638 C"}],"ideal":"#7E2D40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"217 C"}],"ideal":"#EABEDB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"218 C"}],"ideal":"#E56DB1"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"219 C"}],"ideal":"#DA1884"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"220 C"}],"ideal":"#A50050"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"221 C"}],"ideal":"#910048"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"222 C"}],"ideal":"#6C1D45"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7639 C"}],"ideal":"#936D73"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7640 C"}],"ideal":"#934054"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7641 C"}],"ideal":"#8E2C48"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7642 C"}],"ideal":"#732E4A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7643 C"}],"ideal":"#672E45"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7644 C"}],"ideal":"#582D40"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"7645 C"}],"ideal":"#502B3A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"223 C"}],"ideal":"#EF95CF"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"224 C"}],"ideal":"#EB6FBD"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"225 C"}],"ideal":"#DF1995"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"226 C"}],"ideal":"#D0006F"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"227 C"}],"ideal":"#AA0061"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"228 C"}],"ideal":"#890C58"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"229 C"}],"ideal":"#672146"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"230 C"}],"ideal":"#F4A6D7"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"231 C"}],"ideal":"#F277C6"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"232 C"}],"ideal":"#E93CAC"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"233 C"}],"ideal":"#C6007E"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"234 C"}],"ideal":"#A20067"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"235 C"}],"ideal":"#840B55"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"670 C"}],"ideal":"#EAD3E2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"671 C"}],"ideal":"#E6BCD8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"672 C"}],"ideal":"#DFA0C9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"673 C"}],"ideal":"#D986BA"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"674 C"}],"ideal":"#C6579A"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"675 C"}],"ideal":"#AE2573"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"676 C"}],"ideal":"#960051"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"677 C"}],"ideal":"#E5CEDB"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"678 C"}],"ideal":"#E3C8D8"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"679 C"}],"ideal":"#DEBED2"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"680 C"}],"ideal":"#C996B6"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"681 C"}],"ideal":"#B06C96"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"682 C"}],"ideal":"#994878"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"683 C"}],"ideal":"#7C2855"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"684 C"}],"ideal":"#E4C6D4"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"685 C"}],"ideal":"#DCB6C9"}
{"input":[{"role":"system","content":"Convert pantone color to its hex representation."},{"role":"user","content":"686 C"}],"ideal":"#D0A1…
# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

"Job Title to SOC Title Classifier"

### Eval description

This evaluation involves a machine learning model trained to classify job titles into their relevant Standard Occupational Classification (SOC) title from the Bureau of Labor Statistics (BLS). The model uses historical job title data and associated SOC titles to accurately predict the SOC title for any given job title.

### What makes this a useful eval?

This evaluation is incredibly valuable because it opens up a wealth of data possibilities tied to job titles. By accurately classifying job titles into their relevant SOC titles, we can access and leverage related data from resources like ONET, BLS, and census data.
This can be particularly useful in labor market analyses, economic research, HR analytics, and other fields. Moreover, with an accurate SOC title classification, we can study trends, make predictions, and generate insights about various occupations, which could be beneficial for job seekers, employers, policymakers, and researchers alike.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

### Submit eval

- [x] I have filled out all required fields of this form
- [x] I have used **Git LFS** for the Eval JSON data
- [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.
### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```jsonl
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Metal Worker'"}], "ideal": " Metal Workers and Plastic Workers, All Other"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Bread Baker'"}], "ideal": " Bakers"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Malt Liquors Sales Supervisor'"}], "ideal": " First-Line Supervisors of Non-Retail Sales Workers"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Duck Driver'"}], "ideal": " Tour Guides and Escorts"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Architect Specialist'"}], "ideal": " Marine Engineers and Naval Architects"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Golf Course Ranger'"}], "ideal": " Amusement and Recreation Attendants"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Sewing Supervisor'"}], "ideal": " First-Line Supervisors of Production and Operating Workers"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Screener and Blender'"}], "ideal": " Mixing and Blending Machine Setters, Operators, and Tenders"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Field Marketing Representative'"}], "ideal": " Sales Engineers"}
{"input": [{"role": "system", "content": "You are an expert in Standard Occupation Code (SOC) labor classifications issued by the Bureau of Labor Statistics. When give a job title to classify, respond with the correct BLS Title classification"}, {"role": "user", "content": "What is the SOC Code for the job title Surveyor'"}], "ideal": " Surveyors"}
```

</details>
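
The samples above all share one shape: a system message, a user message, and an `ideal` string. A minimal sketch of how such JSONL lines could be generated, assuming hypothetical helpers `make_sample` and `to_jsonl` (neither is part of the evals library) and a placeholder system prompt. Several `ideal` values above carry a leading space; the sketch trims it on the assumption that the grader compares strings directly.

```python
import json


def make_sample(system_prompt, job_title, soc_title):
    """Build one chat-format sample (hypothetical helper, not part of evals)."""
    return {
        "input": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": f"What is the SOC Code for the job title {job_title}",
            },
        ],
        # Assumption: trimming padding keeps a direct string comparison
        # against the model's answer from failing on whitespace alone.
        "ideal": soc_title.strip(),
    }


def to_jsonl(samples):
    """Serialize samples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(sample, ensure_ascii=False) for sample in samples)


# Placeholder prompt; the real system prompt is the one shown in the samples.
PROMPT = "Classify the job title into its BLS SOC title."
samples = [
    make_sample(PROMPT, "Bread Baker", " Bakers"),
    make_sample(PROMPT, "Surveyor", " Surveyors"),
]
jsonl_text = to_jsonl(samples)
```

Each line of `jsonl_text` parses on its own with `json.loads`, which is the property a JSONL file needs.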
# Thank you for contributing an eval!♥️

🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

korean-phonetics

### Eval description

The eval aims to assess the model's proficiency in identifying phonetic transcriptions of Korean words. To measure accuracy, the model is given [word, phonetic transcription] pairs and the test utilizes Match. The phonetic transcription was taken from the most commonly used online dictionaries by Naver: [Naver Korean Dictionary](https://ko.dict.naver.com)

### What makes this a useful eval?

Accurately representing and recognizing phonetic transcription of Korean words is important for several reasons:

1. Pronunciation Accuracy: Phonetic transcription helps in accurately representing the sounds of Korean words. By understanding and recognizing the correct phonetic transcription, learners can ensure they pronounce the words correctly, which is crucial for effective communication in Korean.
2. Language Standardization: Phonetic transcription plays a role in standardizing the pronunciation of Korean words. By adhering to a consistent system, it helps maintain clarity and avoids misinterpretation of words, especially in educational materials, dictionaries, and linguistic research.
3. Linguistic Analysis: For linguists and researchers studying the Korean language, phonetic transcription provides a precise way to analyze and compare different speech sounds. It aids in phonological studies, dialect research, and language documentation.

In summary, accurate representation and recognition of phonetic transcription in Korean contribute to improved pronunciation, effective language learning, better communication, standardization, and linguistic analysis of the language.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [X] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [X] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [X] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [X] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [X] Check that your data is in `evals/registry/data/{name}`
- [X] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [X] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).

- [X] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

- [X] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access.
We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [X] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

### Submit eval

- [X] I have filled out all required fields of this form
- [X] I have used **Git LFS** for the Eval JSON data
- [X] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```jsonl
{"input": [{"role": "system", "content": "Korean is written using a phonetic alphabet called hangul. You will be given a pair of Korean words. Is the second word the correct phonetic transcription of the first word? Answer with exactly one of the following: 'yes' or 'no'. Don't add anything else to the response."}, {"role": "user", "content": "여덟, 여덜"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a phonetic alphabet called hangul. You will be given a pair of Korean words. Is the second word the correct phonetic transcription of the first word? Answer with exactly one of the following: 'yes' or 'no'. Don't add anything else to the response."}, {"role": "user", "content": "값, 갑"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a phonetic alphabet called hangul. You will be given a pair of Korean words. Is the second word the correct phonetic transcription of the first word? Answer with exactly one of the following: 'yes' or 'no'. Don't add anything else to the response."}, {"role": "user", "content": "닭, 닥"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a phonetic alphabet called hangul. You will be given a pair of Korean words. Is the second word the correct phonetic transcription of the first word? Answer with exactly one of the following: 'yes' or 'no'. Don't add anything else to the response."}, {"role": "user", "content": "앉아, 안자"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a phonetic alphabet called hangul. You will be given a pair of Korean words. Is the second word the correct phonetic transcription of the first word? Answer with exactly one of the following: 'yes' or 'no'. Don't add anything else to the response."}, {"role": "user", "content": "젊어, 절머"}], "ideal": "yes"}
{"input": [{"role": "system", "content": "Korean is written using a phonetic alphabet called hangul. You will be given a pair of Korean words. Is the second word the correct phonetic transcription of the first word? Answer with exactly one of the following: 'yes' or 'no'. Don't add anything else to the response."}, {"role": "user", "content": "겉옷, 거돋"}], "ideal": "yes"}
```

</details>

Co-authored-by: Lena H <[email protected]>
Return bootstrap acc std
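The commit title above refers to reporting a bootstrap standard deviation for accuracy. As a rough illustration only, and not the repository's actual code, the idea can be sketched in Python as follows. The function name `bootstrap_accuracy_std`, its parameters, and the choice of resampling with replacement at full sample size are all assumptions made for this sketch.

```python
import random
import statistics


def bootstrap_accuracy_std(correct, num_resamples=1000, seed=0):
    """Estimate the standard deviation of accuracy by bootstrapping.

    `correct` is a list of booleans (or 0/1 values), one per eval sample.
    Hypothetical helper for illustration; not taken from the evals repo.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    accuracies = []
    for _ in range(num_resamples):
        # Draw n outcomes with replacement and record that resample's accuracy.
        resample = rng.choices(correct, k=n)
        accuracies.append(sum(resample) / n)
    # Spread of the resampled accuracies approximates the accuracy's std error.
    return statistics.pstdev(accuracies)


if __name__ == "__main__":
    outcomes = [True] * 50 + [False] * 50
    print(bootstrap_accuracy_std(outcomes))
```

With very small evals (for example the 15-sample minimum mentioned in the template), the resulting estimate is itself noisy, so it is probably best read as an order-of-magnitude indication.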
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name path_enclosed_area ### Eval description This eval tests the model's ability to calculate the total area enclosed by a path walked on a flat plane; the path moves only north, south, east, or west. These problems are extremely simple for any human, but the model has a lot of difficulty with them. The paths were hand constructed to test across a variety of scenarios including: - One path with multiple discrete enclosed areas - One path with zero enclosed areas - Irrelevant segmentation of an enclosed area into smaller enclosed areas - Going back and forth across the same segment - Different initial directions - Symmetrical and asymmetrical ### What makes this a useful eval? 
This kind of geometric reasoning and calculation is important for simple tasks across mathematics, game design, engineering, and various other fields. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. 
You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. 
### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "On a flat plane, I walk 5 miles north, 3 miles east, 2 miles south, 2 miles east, 4 miles south, 6 miles west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[0 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 3 miles east, 3 miles south, 2 miles east, 3 miles west, 1 mile east, 5 miles north, 4 miles west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[0 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 5 miles north, 3 miles east, 2 miles south, 1 mile east, 2 miles south, 5 miles west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. 
If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[14 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 2 miles south, 3 miles west, 2 miles north, 2 miles east, 2 miles south, 1 mile west, 2 miles north, 2 miles east. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[6 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 3 miles south, 2 miles east, 1 mile north, 3 miles west, 1 mile north, 2 miles east, 4 miles south. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. 
Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[4 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile west, 1 mile north, 1 mile east, 1 mile south, 1 mile east, 1 mile north, 1 mile west, 2 miles north. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[2 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile south, 1 mile east, 1 mile south, 1 mile east, 1 mile south, 1 mile east, 2 miles north, 1 mile east, 1 mile north, 4 miles west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. 
Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[7 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile west, 1 mile south, 1 mile north, 2 miles east, 1 mile north, 3 miles west, 3 miles south, 1 mile east, 1 mile south, 1 mile west, 2 miles north. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[1 square mile]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile north, 2 miles east, 1 mile north, 1 mile east, 3 miles south, 2 miles west, 2 miles north. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. 
Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[5 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile east, 1 mile north, 1 mile west, 1 mile north, 1 mile east, 1 mile north, 1 mile west, 1 mile north, 1 mile east, 5 miles south. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[2 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 5 miles south, 5 miles east, 4 miles north, 4 miles west, 3 miles south, 3 miles east, 2 miles north, 2 miles west, 1 mile south, 1 mile east, 3 miles north. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. 
Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[8 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile west, 1 mile south, 1 mile north, 1 mile south, 1 mile east, 1 mile south, 1 mile west, 1 mile south, 2 miles east, 3 miles north. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[0 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 4 miles north, 2 miles south, 2 miles west, 4 miles east, 1 mile west, 1 mile north, 2 miles west, 2 miles south, 1 mile east. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. 
Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[3 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 2 miles south, 1 mile east, 1 mile north, 1 mile south, 1 mile east, 1 mile north, 1 mile south, 1 mile east, 2 miles north, 1 mile east, 1 mile south, 3 miles west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[3 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 3 miles east, 1 mile south, 1 mile west, 1 mile south, 1 mile east, 1 mile south, 3 miles west, 1 mile north, 1 mile east, 1 mile north, 1 mile west, 2 miles north, 1 mile west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. 
Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[7 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile south, 1 mile east, 1 mile north, 2 miles west, 2 miles south, 3 miles east, 2 miles north, 1 mile west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[6 square miles]"} {"input": [{"role": "system", "content": "On a flat plane, I walk 1 mile south, 1 mile east, 1 mile north, 1 mile west. What is the area, in square miles, of the area that is completely enclosed by my path? If no area is completely enclosed, then there are 0 square miles enclosed. If multiple discrete areas are enclosed, then please sum their areas for the final answer. Explain your reasoning step-by-step, and then provide your final answer in the exact following format surrounded by brackets: \"[X square miles]\" where X is the integer number of total square miles enclosed by the path, and where miles should be written as mile if X=1."}], "ideal": "[1 square mile]"} ``` </details> Co-authored-by: Ahmed Allawi <[email protected]>
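For readers who want to check `ideal` values like the ones in the path_enclosed_area samples, one possible approach is a flood fill over unit grid cells: record every unit segment walked as a wall, fill from outside the bounding box, and count the cells the fill cannot reach. The sketch below is not part of the PR; `parse_path`, `enclosed_area`, and `format_answer` are hypothetical names, and it assumes whole-mile moves in the four compass directions, as in the prompts.

```python
import re
from collections import deque

STEP = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def parse_path(text):
    """Extract (distance, direction) pairs such as '5 miles north'."""
    pairs = re.findall(r"(\d+) miles? (north|south|east|west)", text)
    return [(int(dist), direction) for dist, direction in pairs]


def enclosed_area(moves):
    """Count unit cells that cannot be reached from outside the path."""
    x = y = 0
    points = {(0, 0)}
    walls = set()  # each wall is an unordered pair of adjacent grid points
    for dist, direction in moves:
        dx, dy = STEP[direction]
        for _ in range(dist):
            nxt = (x + dx, y + dy)
            walls.add(frozenset(((x, y), nxt)))
            points.add(nxt)
            x, y = nxt

    # Bounding box padded by one cell so the outer ring is always open.
    min_x = min(p[0] for p in points) - 1
    max_x = max(p[0] for p in points) + 1
    min_y = min(p[1] for p in points) - 1
    max_y = max(p[1] for p in points) + 1

    start = (min_x, min_y)  # a cell is named by its lower-left corner
    outside = {start}
    queue = deque([start])
    while queue:
        cx, cy = queue.popleft()
        # Each neighbour is paired with the unit edge crossed to reach it.
        neighbours = [
            ((cx + 1, cy), frozenset(((cx + 1, cy), (cx + 1, cy + 1)))),
            ((cx - 1, cy), frozenset(((cx, cy), (cx, cy + 1)))),
            ((cx, cy + 1), frozenset(((cx, cy + 1), (cx + 1, cy + 1)))),
            ((cx, cy - 1), frozenset(((cx, cy), (cx + 1, cy)))),
        ]
        for (nx, ny), edge in neighbours:
            if not (min_x <= nx < max_x and min_y <= ny < max_y):
                continue
            if (nx, ny) in outside or edge in walls:
                continue
            outside.add((nx, ny))
            queue.append((nx, ny))

    total_cells = (max_x - min_x) * (max_y - min_y)
    return total_cells - len(outside)


def format_answer(area):
    """Render the answer in the bracketed format the prompt asks for."""
    unit = "mile" if area == 1 else "miles"
    return f"[{area} square {unit}]"


if __name__ == "__main__":
    prompt = "I walk 5 miles north, 3 miles east, 2 miles south, 1 mile east, 2 miles south, 5 miles west."
    print(format_answer(enclosed_area(parse_path(prompt))))  # [14 square miles]
```

Because walked segments are stored as a set, going back and forth across the same segment adds no extra wall, and an area split into smaller enclosed pieces is still counted once per cell.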
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 __PLEASE READ THIS__: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name Consensus Summary ### Eval description Tests the model's ability to produce a scientific consensus in response to a scientific inquiry, using only the provided claims. ### What makes this a useful eval? This eval measures how well the model can synthesize a scientific consensus from a given set of claims. This is important because a scientific consensus emerges from multiple studies and data that may or may not support the same conclusion. A model that can accurately produce a scientific consensus can help in making informed decisions and policies based on scientific evidence. 
Hence, evaluating a model's ability to produce a scientific consensus using the Consensus Summary eval can be useful in assessing its reliability and potential for practical applications. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. 
You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgement We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. 
### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: Two doses of mRNA covid-19 vaccines were observed to be highly effective against symptomatic infection and severe outcomes.\nclaim: COVID-19 vaccines currently authorized in the United States are highly effective in preventing COVID-19-associated hospitalizations in older adults.\nclaim: In summary, vaccines are a powerful tool that can be used to control the COVID-19 pandemic, with high efficacy and tolerable ADRs.\nclaim: Conclusion Overall, we conclude that vaccination against COVID-19 in patients with active malignancies using activated and inactivated vaccines is a safe and tolerable procedure that is also accompanied by a high efficacy.\nclaim: COVID-19 vaccines provide good protection against COVID-19 presentation at primary care/outpatient level, particularly among fully vaccinated individuals.\nquestion: are covid-19 vaccines effective?"}], "ideal": "Summary: Covid-19 vaccines are highly effective at protecting against infection and hospitalization."} {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. 
If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: Lower zinc is a hallmark of depression, while increments in serum zinc and attenuation of the immune-inflammatory response during treatment appear to play a role in the clinical efficacy of sertraline.\nclaim: An increase in dietary zinc and higher plasma zinc levels may reduce the risk of depressive symptoms.\nclaim: Although decreased zinc levels have been implicated in the genesis of depression in animal models and in major depressive disorder in humans, this study provides the first evidence of a role for zinc in depression in people with dementia and highlights zinc metabolism as a therapeutic target.\nclaim: The results of this study show that long-term intake of zinc may modulate symptoms of depression.\nclaim: The reported results indicated that the serum zinc level might be a marker of depression as a state (state marker) in treatment responsive patients.\nquestion: can zinc help treat depression?"}], "ideal": "Summary: All of these studies suggest that low zinc levels are a marker of depression and that intake of zinc may have the ability to help reduce symptoms of depression"} {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. 
If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: The findings suggest that the following characteristics of the founder significantly influence the success potential of an incubated venture: entrepreneurial personality, motivation for starting the venture, managerial skills, and approach towards innovation.\nclaim: Using a sample of 384 entrepreneurs selected from the two leading business districts in Uganda, we observe that optimism is the component of psychological capital that significantly moderates the relationship between startup capital and entrepreneurial success.\nclaim: Both startup capital and psychological capital are significant predictors of entrepreneurial success; however, psychological capital is the better predictor.\nclaim: Entrepreneurially self\u2010efficacious founder/managers may help improve the performance of very young firms but such benefits dissipate over time.\nclaim: This finding indicates that the entrepreneurial team\u2019s startup experience plays stronger roles in venturing profitable startups when the amount of financial resources and initial firm size are small; however, the team\u2019s startup experience and intangible resources have positive interaction effects on new-born startups\u2019 profitability.\nquestion: what predicts success as a startup founder?"}], "ideal": "Summary: Things like entrepreneurial personality, motivation for starting the venture, managerial skills, previous start-up experience, startup and psychological capital and optimism all predict success as a startup founder"} {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. 
If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: While homelessness is ultimately the result of a severe and chronic shortage of affordable housing, creating accessible, safe, pet-friendly shelter and safe haven options and instituting a smoother, more transparent process for moving from the streets could substantially reduce street homelessness.\nclaim: - To prevent the revolving door to homelessness, it is necessary to remove the barriers that hinder access to normal health resources which are experienced by people suffering from social exclusion, while implementing ongoing support programmes for homeless people or those at risk of homelessness, which primarily deal with health issues.\nclaim: We conclude that overcoming homelessness requires policies and practices that give a greater focus to non-material aspects of homelessness through an emphasis on empowerment, self-respect and autonomy.\nclaim: This finding suggests that homelessness can be reduced by appropriate clinical interventions if housing is available.\nclaim: For homelessness prevention, systematic and outreach social medical care before and during homelessness should be provided.\nquestion: What are effective ways to prevent homelessness?"}], "ideal": "Summary: Ways to prevent homelessness include creating accessible, safe shelter and safe haven options, removing barriers to health resources, giving a greater focus to non-material aspects of homelessness, and providing systematic and outreach social medical care."} {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. 
If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: While homelessness is ultimately the result of a severe and chronic shortage of affordable housing, creating accessible, safe, pet-friendly shelter and safe haven options and instituting a smoother, more transparent process for moving from the streets could substantially reduce street homelessness.\nclaim: - To prevent the revolving door to homelessness, it is necessary to remove the barriers that hinder access to normal health resources which are experienced by people suffering from social exclusion, while implementing ongoing support programmes for homeless people or those at risk of homelessness, which primarily deal with health issues.\nclaim: We conclude that overcoming homelessness requires policies and practices that give a greater focus to non-material aspects of homelessness through an emphasis on empowerment, self-respect and autonomy.\nclaim: This finding suggests that homelessness can be reduced by appropriate clinical interventions if housing is available.\nclaim: For homelessness prevention, systematic and outreach social medical care before and during homelessness should be provided.\nquestion: How to prevent homelessness?"}], "ideal": "Summary: Ways to prevent homelessness include creating accessible, safe shelter and safe haven options, removing barriers to health resources, giving a greater focus to non-material aspects of homelessness, and providing systematic and outreach social medical care."} {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. 
If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: The findings revealed that the factor that contributes the most to entrepreneurship intention is Locus of control, followed by Need of Achievement and Subjective Norms.\nclaim: It was found that entrepreneurial skill, environmental factors and entrepreneurial orientation have a positive influence on entrepreneurial intention.\nclaim: The findings indicate that entrepreneurial motivation has a significant correlation with entrepreneurial intention and its three determinants, social valuation of entrepreneurship, having entrepreneurial role models, knowledge of entrepreneurial support and perceived barriers to starting a business.\nclaim: Research finding revealed that entrepreneurial intention is indirectly affected by entrepreneurship education, meaning that students\u2019 entrepreneurial motivation and attitude are two important mediating variables.\nclaim: Findings confirm the influence of individual and socio-cultural factors on entrepreneurial intention.\nquestion: What are the factors of entrepreneurship intention"}], "ideal": "Summary: Studies find that intrinsic factors, such as entrepreneurial skill and motivation, as well as extrinsic variables, such as the environmental support of entrepreneurship, mediate entrepreneurship intention."} {"input": [{"role": "system", "content": "Generate a brief answer using only the provided claims, with no personal opinions or outside knowledge. 
If there is no answer based on the claims, write 'N-A'."}, {"role": "user", "content": "claim: The results show that digital agriculture is able to help users to increase productivity in a sustainable way.\nclaim: Digital agriculture technologies continue the centralization of economic knowledge and power as they facilitate the transformation of vast territories into \u201coperational landscapes\u201d that provide the material, energy, and labor for a rapidly expanding urban system.\nclaim: The digital agriculture system is an effective tool for insurance industry to use to develop a dynamical business plan for the changing climate.\nclaim: The technical fitting-out of agriculture in the digital economy should be considered as a set of measures to prepare the industry for the production of high-quality products, which implies the use of digital technologies that minimize human participation in the production process.\nclaim: Consequently, the initial Mobile-based Information System evolved into a Digital Knowledge Ecosystem that can predict current production situation in near real enabling government agencies to dynamically adjust the incentives offered to farmers for growing different types of crops to achieve sustainable agriculture production through crop diversification.\nquestion: What is digital agriculture?"}], "ideal": "Summary: N-A"} ``` </details>
**What:** Adds support for `gpt-3.5-turbo-16k` to `n_ctx_from_model_name`. **Why:** Currently `n_ctx_from_model_name` returns 4096 for `gpt-3.5-turbo-16k`. Co-authored-by: Ian McKenzie <[email protected]>
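A minimal sketch of how a 16k variant can end up reported as 4096, assuming the lookup falls back to prefix matching. The table contents, the value 16384, and the name `n_ctx_from_model_name_sketch` are illustrative assumptions, not the repository's actual code.

```python
from typing import Optional

# Hypothetical lookup tables; the repository's real tables may differ.
N_CTX_BY_EXACT_NAME = {
    "gpt-3.5-turbo-16k": 16384,  # assumed value, for illustration only
}
N_CTX_BY_PREFIX = {
    "gpt-3.5-": 4096,
    "gpt-4-": 8192,
}


def n_ctx_from_model_name_sketch(model_name: str) -> Optional[int]:
    """Return the context window for a model name, or None if unknown."""
    # Exact names take priority so a generic prefix cannot shadow them.
    if model_name in N_CTX_BY_EXACT_NAME:
        return N_CTX_BY_EXACT_NAME[model_name]
    # Otherwise fall back to the longest matching prefix.
    matches = [prefix for prefix in N_CTX_BY_PREFIX if model_name.startswith(prefix)]
    if not matches:
        return None
    return N_CTX_BY_PREFIX[max(matches, key=len)]
```

Without the exact-name entry, `gpt-3.5-turbo-16k` would match the `gpt-3.5-` prefix in this sketch and come back as 4096.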
**What:** Adds a recorder for function calls made by models. **Why:** Currently function calls can be logged using `record_event`, but it'd be convenient for function calls to be logged consistently.
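The idea can be sketched as a thin wrapper that always emits the same event shape. `SketchRecorder`, its field names, and the `"function_call"` event type are assumptions made for illustration; only `record_event` is named in the commit message, and its real signature may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchRecorder:
    """Hypothetical in-memory recorder, not the repository's recorder class."""

    events: List[Dict[str, Any]] = field(default_factory=list)

    def record_event(
        self, event_type: str, data: Dict[str, Any], sample_id: Optional[str] = None
    ) -> None:
        # Generic entry point: callers choose the event type and payload.
        self.events.append({"type": event_type, "data": data, "sample_id": sample_id})

    def record_function_call(
        self,
        name: str,
        arguments: Dict[str, Any],
        return_value: Any,
        sample_id: Optional[str] = None,
    ) -> None:
        # Dedicated helper: every function call is logged with the same keys.
        data = {"name": name, "arguments": arguments, "return_value": return_value}
        self.record_event("function_call", data, sample_id=sample_id)
```

A dedicated helper keeps the payload keys fixed, so downstream analysis does not have to handle ad hoc shapes passed to the generic method.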
Simple change to fix openai#1394.
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name japanese_prime_minister ### Eval description I would like to know the calculation of the number of days in office of successive prime ministers and the ranking of the number of days in office. ### What makes this a useful eval? I'm almost done calculating tenure, but trying to rank it doesn't work. There seems to be a demand for ranking a lot of different things. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). 
Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. 
### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `mypy`, `black`, `isort`, `autoflake` and `ruff` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. 
### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"}, {"role": "user", "content": "通算在籍日数が1番目に長い総理大臣"}], "ideal": "安倍晋三"} {"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"}, {"role": "user", "content": "通算在籍日数が2番目に長い総理大臣"}], "ideal": "桂太郎"} {"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"}, {"role": "user", "content": "通算在籍日数が3番目に長い総理大臣"}], "ideal": "佐藤栄作"} {"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"}, {"role": "user", "content": "通算在籍日数が4番目に長い総理大臣"}], "ideal": "伊藤博文"} {"input": [{"role": "system", "content": "あなたは日本の歴代総理大臣の名前を回答します"}, {"role": "user", "content": "通算在籍日数が5番目に長い総理大臣"}], "ideal": "吉田茂"} ``` </details>
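For reference, the task the eval targets (summing days in office across terms, then ranking) can be sketched as below. The names and dates are fictional placeholders, not real prime ministers, and counting both the start and end date as days in office is an assumption.

```python
from datetime import date
from typing import Dict, List, Tuple

Term = Tuple[date, date]  # (first day in office, last day in office)


def total_days(terms: List[Term]) -> int:
    # Assumes both the start and the end date count as days in office.
    return sum((end - start).days + 1 for start, end in terms)


def rank_by_tenure(terms_by_name: Dict[str, List[Term]]) -> List[Tuple[str, int]]:
    # Sum all terms per person, then sort from longest to shortest tenure.
    totals = {name: total_days(terms) for name, terms in terms_by_name.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Fictional placeholder data, for illustration only.
example_terms = {
    "Leader A": [(date(2000, 1, 1), date(2000, 12, 31))],
    "Leader B": [
        (date(2001, 1, 1), date(2001, 6, 30)),
        (date(2003, 1, 1), date(2003, 12, 31)),
    ],
    "Leader C": [(date(2002, 1, 1), date(2002, 1, 31))],
}
ranking = rank_by_tenure(example_terms)
```

Non-consecutive terms are summed before ranking, which matches the "total days in office" wording of the sample prompts.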
With this improvement we now have a 0-shot performance of 59.6% (averaged over 3 eval runs) on the MMMU validation set, which beats the 56.8% reported in the [MMMU paper](https://arxiv.org/pdf/2311.16502.pdf).
In [the previous PR](openai#1405) adding the Theory of Mind eval, the `evals/registry/evals/theory_of_mind.yaml` was mistakenly not added, so the eval couldn't be run. This PR adds this file. Test with:
```
oaieval gpt-3.5-turbo theory_of_mind
```
# Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name icelandic-sentences-gec - Grammatical error correction for Icelandic sentences ### Eval description The eval contains Icelandic sentences with and without grammatical errors, spelling errors or other linguistic errors. There are a total of 200 sentences, 100 with errors and 100 where these same errors have been corrected. The model then predicts whether a particular sentence contains an error or not, and accuracy is measured. ### What makes this a useful eval? This is a good measure of the ability of a model to correct grammatical errors in the Icelandic language. The sentences contain errors which go against Icelandic language standards, and which a language expert of Icelandic would correct before a text is published. 
In addition, it can serve to measure the general linguistic competence of Icelandic. The sentences are sourced from the web, and the test set of the Icelandic Error Corpus (IceEC), which can be freely used. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) 
## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `mypy`, `black`, `isort`, `autoflake` and `ruff` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. 
### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Þú ert sérfræðingur í íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu, stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu 'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role": "user", "content": "Leiksviðið átti svo hug hans allann við heimkomuna, og árið 1961 hófst sjónvarpsferill hans."}], "ideal": ["rangt", "Rangt"]} {"input": [{"role": "system", "content": "Þú ert sérfræðingur í íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu, stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu 'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role": "user", "content": "Baráttusamtök frumbyggja í Hondúras, sem Caceres átti þátt í að stofna, fagnaði dómsúrskurðinum í gær og sagði hann sigur fyrir þjóðir Hondúras."}], "ideal": ["rangt", "Rangt"]} {"input": [{"role": "system", "content": "Þú ert sérfræðingur í íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu, stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu 'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role": "user", "content": "Sú var naumast býsperrt."}], "ideal": ["rangt", "Rangt"]} {"input": [{"role": "system", "content": "Þú ert sérfræðingur í íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur villu(r) eða ekki. 
Ef málsgreinin inniheldur málfræðivillu, stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu 'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role": "user", "content": "Fólk er beðið um að fylgjast vel með veðurspám þar sem breytingar gætu orðið þegar nær dregur."}], "ideal": ["rétt", "Rétt"]} {"input": [{"role": "system", "content": "Þú ert sérfræðingur í íslensku. Þú færð málsgrein á íslensku sem annaðhvort inniheldur villu(r) eða ekki. Ef málsgreinin inniheldur málfræðivillu, stafsetningarvillu eða aðra málfarsvillu sem ætti að leiðrétta, skrifaðu 'rangt'. Ef málsgreinin er villulaus skaltu skrifa 'rétt'. Þú verður að gefa svar og 'rangt' og 'rétt' eru einu valkostirnir."}, {"role": "user", "content": "Gjaldmiðlasamningunum var ætlað að tryggja að Exista gæti keypt gjaldeyri á fyrir fram ákveðnum dagsetningum á fyrir fram ákveðnu gengi svo að félagið gæti greitt af skuldum sínum í erlendri mynt með þeim hagnaði sem til varð í íslenskum krónum eins og segir í grein Lýðs."}], "ideal": ["rétt", "Rétt"]} ``` </details>
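A rough sketch of how accuracy could be computed when `ideal` is a list of accepted answers, as in the samples above. The prefix-match rule and the helper names are assumptions; the commit message does not state which eval class grades the responses.

```python
from typing import List, Sequence


def is_match(completion: str, ideal: Sequence[str]) -> bool:
    """True if the completion starts with any accepted answer (assumed rule)."""
    answer = completion.strip()
    return any(answer.startswith(option) for option in ideal)


def accuracy(completions: List[str], ideals: List[Sequence[str]]) -> float:
    # Fraction of samples whose completion matches one of its accepted answers.
    if not completions:
        return 0.0
    correct = sum(is_match(c, i) for c, i in zip(completions, ideals))
    return correct / len(completions)
```

Listing both `rangt` and `Rangt` as accepted answers keeps a capitalized response from being scored as wrong under a case-sensitive comparison.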
(Not an eval)
**One-line summary**: Pre-commit hooks were failing. I identified the
main cause, and then fixed all secondary pre-commit issues. I only
changed the logic in one place, `oaievalset.py`.
I was having issues with type-hinting and identified that the old
`typings` directory was causing the `from openai import OpenAI` import
to register as an error. I decided to go through and fix all the issues
that appeared in `pre-commit run --all-files`.
NOTE:
- I changed the logic in `oaievalset.py` by adding a `continue`
statement if an `eval` or `eval.key` was missing.
- As far as I can tell this should basically never happen, but skipping
the entry is the correct behavior if it does.
- Another option would be to assert that `eval` and `eval.key` are not
`None` but forcing an error here doesn't match what I interpret as
intended behavior.
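The guard described in the note can be sketched roughly as follows. `EvalSpecSketch` and `build_commands` are hypothetical stand-ins, not the actual code in `oaievalset.py`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class EvalSpecSketch:
    """Hypothetical stand-in for a registry entry with an optional key."""

    key: Optional[str]


def build_commands(
    evals: Iterable[Optional[EvalSpecSketch]], model: str
) -> List[List[str]]:
    commands: List[List[str]] = []
    for eval_spec in evals:
        if eval_spec is None or not eval_spec.key:
            # Skip invalid entries instead of raising, as the note describes.
            continue
        commands.append(["oaieval", model, eval_spec.key])
    return commands
```

Using `continue` also narrows the type for the lines that follow, which is what lets a checker such as `mypy` accept the later `eval_spec.key` access.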
The manual work involved was mainly:
1. Deleting the `typings` directory, which was interfering with `openai`
type-hints (such as `from openai import OpenAI`)
2. Fixing type issues in `oaievalset.py`.
3. Moving the `client =
OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))` line below all the
imports.
4. Breaking lines of length >767 into smaller chunks using line
continuation.
Thus this PR is broken into three parts:
1. Deleting `typings` (first commit)
2. Manually cleaning up issues (middle commits)
3. Applying autofixes from the pre-commit hooks (last commit)
* zhishu completion function
* trial implementation of table_extract tasks
* bugfixes and add retrieve_native completion_fn
* add fuzzy_compare for table content
* add fuzzy_normalize for table headers
* add uni-finder completion_fn and separated format tests (json/csv)
* basic mlops loggers
* bugfixes on example showcase
* add rag to openai native completion_fns
* add RAG for match, modelgraded_classify, table_extract evals
* add scipaper_tag2mol, scipaper_hasmol, scipaper_targets and markush2mol evals
* add Chemistry evalset
* bugfixes
* table comparison with self-defined index
* fix table extraction with detailed csv text processing and edit-distance comparison
* fix match_field compare logic to edit-distance
* fixes on data and details for good scipaper_affinity performance
* update uni_finder api with pdf_parse_mode
* update Zhishu completion_fn with common chat (no file_link) support
* split test sets into general_chemistry and drug_discovery
* fix Zhishu for mocked GPT-4
* move --mlops option into llmreport entrypoint
add various functions, including: same_triplets(), pick_same_turples_in_pred(), same_turples(), pick_same_turples_in_pred(), entity_match(), pick_most_similar_entity_in_pred(), macro_f1_score_2(), macro_f1_score_3()
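As a rough illustration of the kind of metric these helper names suggest, the sketch below scores predicted triplets against gold triplets per sample and averages the F1 values. It assumes exact triplet matching and per-sample averaging; the actual `macro_f1_score_2()` and `macro_f1_score_3()` implementations are not shown here and may work differently.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # e.g. (head entity, relation, tail entity)


def f1_for_sample(pred: Set[Triplet], gold: Set[Triplet]) -> float:
    # Two empty sets agree perfectly; otherwise no overlap means a score of 0.
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def averaged_f1(preds: List[Set[Triplet]], golds: List[Set[Triplet]]) -> float:
    # Unweighted mean of the per-sample F1 scores.
    scores = [f1_for_sample(p, g) for p, g in zip(preds, golds)]
    return sum(scores) / len(scores) if scores else 0.0
```

Exact set intersection is the strictest choice; a fuzzy entity matcher would replace the `pred & gold` step.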
add evals for GDAS task
…tion.yaml add eval task for GDAS
add eval task for B5CDR
add eval task for B5CDR
add eval task for B5CDR
add eval task for DDI
update biomedicine_comprehension tasks
Thank you for contributing an eval!♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.
Also, please note that we're using Git LFS for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available here (https://git-lfs.com).
Eval details 📑
Eval name
add evals:
evals:
Eval description
[Insert a short description of what your eval does here]
What makes this a useful eval?
[Insert why this eval is worth including and any additional context]
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
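- Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
- Include at least 15 high-quality examples.
If there is anything else that makes your eval worth including, please document it below.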
Unique eval value
Eval structure 🏗️
Your eval should
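- Check that your data is in evals/registry/data/{name}
- Check that your YAML is registered at evals/registry/evals/{name}.yaml
- Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)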
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.
Limited availability acknowledgment
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
Submit eval
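- I have filled out all required fields of this form
- I have used Git LFS for the Eval JSON data
- (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that mypy, black, isort, autoflake and ruff are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.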
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
INSERT_EVAL_HERE