668 commits
a736723
Allow string type for ideal in Match (#1109)
jwang47 Jun 6, 2023
0177536
add abc.abstractmethod at eval (#1111)
gklab Jun 7, 2023
345a545
Russian verse (#979)
Nazar1997 Jun 7, 2023
b9fbed8
Add cybersecurity-filepaths eval test (#957)
glmcdona Jun 7, 2023
c24abed
medication_dose (#1095)
mickaw2 Jun 7, 2023
2ccbfed
Context-free-grammars (#1097)
HorizonAuto Jun 7, 2023
e45d21d
[evals] Add 3D Globe Movement Eval (#1100)
thair116 Jun 7, 2023
609c58f
fix #616 faliekant (#1107)
laszlovandenhoek Jun 7, 2023
57253e3
Bahtiyar vl (#1112)
hbVLMedia Jun 7, 2023
21450f6
[Eval] Mapping arrays to matricies (#1115)
daniel-um Jun 7, 2023
336c3cf
[bugfix] Fix a bug in --registry_path handling (#1101)
pgarbacki Jun 7, 2023
86c7e40
Add Sindarin Fluency eval (#1116)
aaronsmithtv Jun 7, 2023
4e34f8e
[Eval] Identifying UI Elements and Extracting Resource IDs (#1117)
AndrewCEmil Jun 7, 2023
de184a2
[Evals] Add Chinese homonym evals (#1119)
yayachenyi Jun 7, 2023
5055233
Update PULL_REQUEST_TEMPLATE.md
andrew-openai Jun 7, 2023
1cd269b
Add download command for dataset (#23)
zfb132 Jun 8, 2023
4d2e37e
Fix bug in metrics.py (#863)
WingsDrafterwork Jun 8, 2023
d862d2b
for lack of a better name, "gpt protocol buffers" (#771)
unicomp21 Jun 8, 2023
ee15071
[Eval] Evaluation of abstract reasoning capabilities of language mode…
ggendro Jun 8, 2023
8a53f87
Contributing a custum eval to the repository. Updated Version. (#1124)
IGES-Institut Jun 8, 2023
1da9d51
Minor fix for api.py (#864)
WingsDrafterwork Jun 8, 2023
e687286
Add korean_dialects eval (#1127)
racheroni Jun 11, 2023
210b8a3
NFL Point Combinations Eval (#1129)
dougkwanna Jun 11, 2023
74b1ebc
[Evals] Add pantone hex eval (#1130)
bahar-shah Jun 11, 2023
9a37f28
eval(soc-code-classifier) (#1128)
lrperkins Jun 12, 2023
24134f6
[eval] Korean Phonetic Transcription (#1133)
jaiwonrhi2 Jun 12, 2023
b3dddcf
boostrap std (#1148)
andrew-openai Jun 12, 2023
ce8e144
Fix custom-eval doc (#1113)
jwang47 Jun 12, 2023
58a955c
[Eval] Calculate area enclosed by path (#1137)
AhmedA77 Jun 14, 2023
b4ec31b
Eval: Consensus Summary (#1140)
Ali-consensus Jun 14, 2023
c1af37c
[Evals] Add hard Chinese translations (#1141)
nicognaW Jun 14, 2023
4041303
Add Belarusian rhyme eval (#1143)
somerandomguyontheweb Jun 14, 2023
92c3c45
Add Hebrew Talmud eval (#1144)
ysrael-reflectiz Jun 14, 2023
fd1576f
[Eval] Math in polish (#1145)
krzycho1024 Jun 14, 2023
6bff19d
[Evals] Add eval Japanese decimal units. (#1146)
hirosyrup Jun 14, 2023
238c59c
[eval] Extracting text from SVGs (#1147)
AndrewCEmil Jun 14, 2023
b4808a9
[Updated Eval] math_logic_operations (#1154)
nathanstew7 Jun 14, 2023
f93045e
Utah Real Estate Knowledge (#1122)
TheChainCollective Jun 14, 2023
b4477ce
Logical Reasoning Letter Series Test (#1123)
hussnainghani Jun 14, 2023
c9f5458
Eval addition : Describe the meaning of Japanese onomatopoeia (#1033)
a-c-jltd Jun 14, 2023
2deebe1
EvalSet for 2D Maze Solving Performance Across Multiple Difficulties …
douglasmonsky Jun 14, 2023
3b37f6c
Unfamiliar Chinese Character Pronunciation and Meaning Retrieval Eval…
Marc518 Jun 14, 2023
3525ffe
[evals] add ascii-art-digit-recognition (#509)
dddraxxx Jun 14, 2023
2c99f93
Add Belarusian numerals eval (#1162)
somerandomguyontheweb Jun 15, 2023
ef52f0a
[Eval] portuguese-kinship-riddles (#1152)
zhaoxingbu Jun 15, 2023
3219486
Update Registry.make_completion_fn to support new OpenAI models (#1185)
jwang47 Jun 16, 2023
bc433f0
Add --registry_path option to oaievalset.py (#1180)
robatwilliams Jun 20, 2023
948cbfd
Add 2 backgammon evals (#573)
bakebrain Jun 22, 2023
e814616
[Evals] Add Chinese Marxist Philosophy Exam (#885)
cxjwin Jun 22, 2023
9353e60
recurrence-relation (#1134)
Omar-HeshamR Jun 22, 2023
6802fb2
[Eval] Identify Chinese Chu Ci title (#1135)
arvinxx Jun 22, 2023
30aaf1d
[Eval] Determine a gear rotation given a layout (#1136)
xh3xy Jun 22, 2023
1b52980
[eval] Chinese Idioms evulation (#1163)
robinzixuan Jun 22, 2023
52c42af
Ordering Randomised VersionList (#1164)
jjyuhub Jun 22, 2023
8f0ae05
Simple block puzzles (#1167)
birdsean Jun 22, 2023
3ef7a81
add benjamin moore hex eval (#1168)
bryanvaz Jun 22, 2023
aa8261e
[Eval] Add Chinese Homophonic (#1169)
hello-oscar Jun 22, 2023
e0812cd
Add korean_date_counting test (#1171)
jess-hwang Jun 22, 2023
3befe61
[Eval] Chinese lantern riddles (#1176)
ChenZhao44 Jun 22, 2023
ccae151
Add Blackfoot numerals (modern; roman-based) eval (#1179)
ylluminate Jun 22, 2023
7aca889
Add Korean honorific sentence classification eval (#1181)
greenmonn Jun 22, 2023
3e055d4
[Eval Set] : 4 evals on converting between number and Chinese number …
yayachenyi Jun 22, 2023
421ca8e
[Eval] Identify the dynasty of Chinese ancient masterpieces (#1183)
meganjohnson96 Jun 22, 2023
58374d4
Add Reasoning with Contradictory Statements Eval (#1184)
RishTheFish21 Jun 22, 2023
f2f3fb9
dutch rhymes eval (#1187)
verheesj Jun 22, 2023
1b4f7b0
Add Belarusian orthography eval (#1188)
somerandomguyontheweb Jun 22, 2023
a1d8704
[Eval] Evaluate knowledge of LINQ operators and deferred execution in…
joshdixon Jun 22, 2023
390b3e0
number series test (#1191)
Hammad-programmer Jun 22, 2023
548cd8c
Korean postposition 'jo-sa' particles (#1195)
lenahong Jun 22, 2023
8f77305
Singapore data protection decisions (#1196)
iamkaiwei Jun 22, 2023
e880bd3
Station numbering for Tokyo Metro and Tokyu Railways. (#1197)
torufuru Jun 22, 2023
001ceca
use abstract to generate title (#1198)
piupiupiuu Jun 22, 2023
f7efb8e
coq-editing: An eval for basic Coq proof handling (#1200)
JasonGross Jun 22, 2023
10edb5b
Urdu language lexicons knowledge (#1207)
usamanwar Jun 22, 2023
805d76f
Iqbal poetry translation (#1214)
usamanwar Jun 22, 2023
7cb7a99
Urdu transliteration eval (#1215)
usamanwar Jun 22, 2023
dd3f049
Gregorian date to Hebrew date conversion (#1217)
orelb Jun 22, 2023
a4d2f59
[Eval] Add RAL to hex eval (#1218)
Emperorlou Jun 22, 2023
ec77a94
Add automata-and-complexity eval (#1192)
pstakoun Jun 28, 2023
8896344
Add arithmetic-expression modelgraded eval (#1206)
gssantos Jun 28, 2023
55972f2
Add nepali numerals eval (#1211)
samyok Jun 28, 2023
ad10ad2
Add Japanese English numerals eval (#1212)
ki-suzuki Jun 28, 2023
67096c4
[Eval] reverse shell detection (#1220)
robinzixuan Jun 28, 2023
fa08d89
[Eval] Add Casanova's Numerical Cabbala evals (#1222)
giacomoran Jun 28, 2023
9933b28
[Eval] Add english words sharing the same pronunciation eval (#1223)
ki-suzuki Jun 28, 2023
341a1f2
[Eval] Add coding progress assessment eval (#1224)
danielstrizhevsky Jun 28, 2023
20ef89b
[Eval] Evaluation of Interlingual Homographs in Japanese and Simplifi…
y-nakamura-github Jun 28, 2023
a6a2a29
add japanese_approval (#1229)
omonao-public Jun 28, 2023
18dccdc
[Eval] Adds Korean foreign words evaluation (#1230)
Pringlers Jun 28, 2023
0ff2903
Singlestore vectorsearch (#1231)
pvgenflowai Jun 28, 2023
9ade25e
[Eval] Add Base64 Decode Eval (#1235)
AlessioGr Jun 28, 2023
de5b911
[Eval] Persian kinship riddles (#1236)
evalevalian Jun 28, 2023
7a7cec6
japanese-station (#1242)
pabst2009 Jun 28, 2023
cca5856
[Eval] Japanese Mahjong discard strategy eval (#1243)
1nformal Jun 28, 2023
4c88928
[eval] norwegian rhymes (#1248)
monocle-pastels Jun 28, 2023
c35d7ca
Proofreader (#1225)
ramiel Jun 28, 2023
8954034
Eval: Relative orientations (#1000)
dbautista1 Jul 4, 2023
a8e8661
Eval addition: AI vs Human Text Detector (#1021)
udaykumar1997 Jul 4, 2023
14892b0
[Eval] Viewport to grid size (#1083)
AaronGoldsmith Jul 4, 2023
5684ed7
add eval_confusing_korean (#1201)
ywkim Jul 4, 2023
f438a06
[Eval] Tricky word problems (#1227)
JackUrb Jul 4, 2023
f42b3e1
Word association eval (#1237)
douglasmonsky Jul 4, 2023
4196024
Added svg_alphabet eval (#1244)
tudoratlumiai Jul 4, 2023
2f48bfd
[Eval]Identify Chinese Shi Jing Title (#1245)
netsailer Jul 4, 2023
74bdc08
Hebrew plurals (#1247)
relvok Jul 4, 2023
cf520e1
[Eval] SMILES to molecular formula (#1252)
glichtner Jul 4, 2023
6d63e03
[Eval] Add NER for finance (#1255)
adimaggio2021 Jul 4, 2023
74a45ae
[Eval]Identify the author and title of Chinese modern poem (#1256)
netsailer Jul 4, 2023
d318d9b
[Eval] Add French homonym and homograph pronunciation distinction eva…
Yannl Jul 4, 2023
77aeccc
[Eval] Hebrew homophone mistakes (#1260)
relvok Jul 4, 2023
0a59d35
[Eval] Identify Dhammapada Pali reference (#1261)
NobleTruths Jul 4, 2023
a7ad0fc
[Eval] Adding CoSQL eval (#1268)
pybae Jul 4, 2023
1a5dce5
add evals for derivatives of functions (#710)
m0nhawk Jul 4, 2023
6c5a1d7
probabilities-word-problems (#941)
Omar-HeshamR Jul 4, 2023
dd55227
Bug fix: gpt-4-base runs with ChatCompletion (#1300)
jasony123123 Jul 6, 2023
9c3e16b
add chinese_guess_lantern_riddles (#1249)
Tesla2678 Jul 13, 2023
48344a1
[Eval] Basic data visualization (#1262)
ryandao Jul 13, 2023
aac73bd
[Eval] SEO keywords (#1263)
gerdemann Jul 13, 2023
a7c8d17
[Eval] Explain and solve math equations described in words (#1269)
raxityo Jul 13, 2023
b019b1d
[EVAL] Italian Big Math Expression (#1271)
danielepoterti Jul 13, 2023
de9c7ad
Chinese remainder theorem (#1273)
carloshellin Jul 13, 2023
def5072
[Eval] bias detection (Updated version of #1253) (#1276)
DomenicoMireles Jul 13, 2023
7705bde
Create latin-grammar.yaml (#1279)
d3287t328 Jul 13, 2023
53a6b97
Eval that checks ability to do logical problems involving jars with w…
osecen Jul 13, 2023
21b2536
[Eval] Add thirty six stratagems eval (#1281)
cookfish Jul 13, 2023
dd7ed4e
Add Belarusian inflectional morphology eval (#1287)
somerandomguyontheweb Jul 13, 2023
435421d
add chinese famous novel eval (#1288)
l1905 Jul 13, 2023
8e99b22
[Eval] Test inferring causation from correlation (#1289)
vasarmilan Jul 13, 2023
6113418
Css selectors (#1290)
ilanh Jul 13, 2023
835d026
[Eval] Add financial reasoning and calculation eval (#1291)
ChristopherGondek Jul 13, 2023
fe1c8ec
add eval of math_for_5th-grader (#1293)
mochisky Jul 13, 2023
f6f1cfe
[Eval] Korean Romanization eval (#1296)
kyeongsoosoo Jul 13, 2023
cde4137
[Eval] Irish Plural Nouns (#1297)
AaronBrennan1 Jul 13, 2023
66d42d4
Add my eval about premature conclusions (#1299)
natanaelwf Jul 13, 2023
0ac0f34
Add eval for finishing Polish proverbs (#1301)
KatKlo Jul 13, 2023
5340968
[Eval] Add eval for Romanian homonyms distinction (#1305)
AdrianApan Jul 13, 2023
082a244
[Eval] Finger Tracking (#1278)
chris-ccm Jul 13, 2023
192a00c
[Eval] Chinese ancient poetry (#1307)
sherdencooper Jul 13, 2023
804cdd3
Add astrological routes eval (#1309)
arbreton Jul 13, 2023
e6c9a94
Romanian mathematical, logical and grammatical evaluation (#1313)
mariuspatru Jul 13, 2023
1abaefc
[Eval] 3-dimensional object manipulation of generic irregular polygon…
spomichter Jul 13, 2023
684dac4
Added coq-proof-step-match.dev.v0 eval (#1317)
amit9oct Jul 13, 2023
419655b
add LangChain chat model completion fn (#1311)
agola11 Jul 13, 2023
58a2282
All other yaml files use 2 spaces (#1204)
CholoTook Jul 13, 2023
8e4d627
ADDED OS.PATH.JOIN() TO SCRIPTS (#1155)
nickabooch Jul 13, 2023
0da1917
Add pre-commit config for mypy (#1029)
pan93412 Jul 13, 2023
f4ac62a
adding logical-black-scholes (#1295)
dsims21 Jul 13, 2023
738e50b
Ignore spurious mypy error (#1320)
jwang47 Jul 19, 2023
53c253f
Update README.md
andrew-openai Jul 20, 2023
4639cf6
add eval of Japanese romantic context (#1314)
Missionteam Jul 20, 2023
181f9e5
Irrelevant negative diversion (#1318)
AndersWangRask Jul 20, 2023
63f29cf
Hebrew grammar (#1322)
idoyana Aug 2, 2023
7e74cc7
Add HTTP recorder for evals; introduce --http-run flag (#1312)
nstankov-bg Aug 3, 2023
e0ad3b1
Hard russian computer science tasks (#1323)
Nazar1997 Aug 10, 2023
eafe22a
Add eval : Product Information Extraction (#1251)
abrinkmann Aug 10, 2023
acbbd16
Update LocalRecorder to filter data fields before recording (#1330)
michaelAlvarino Aug 17, 2023
bb2e477
Fixes a bug in LocalRecorder (#1337)
michaelAlvarino Aug 18, 2023
44ef45e
Multilingual EXAMS and Arabic Literature Question Answers (By IIAI-G4…
samta-kamboj Aug 29, 2023
5202d72
Add eval : Research Question Extraction (#1334)
Aug 29, 2023
d289612
Update README.md
andrew-openai Aug 29, 2023
fb4649d
[For Issue #1284] Allow match_fn to be set in modelgraded eval .yaml …
sohenze Sep 18, 2023
e1a030c
add workflow_dispatch for manual triggering & add paths to target reg…
jonathanagustin Sep 18, 2023
2e4f5e7
fixing the wording for the modelgraded 'best' model (#1210)
CholoTook Sep 18, 2023
5d3bbd4
Update run-evals.md (#1339)
tinycrops Sep 18, 2023
b9af5d3
Changing accounting_audit filename extension... (#1106)
jorge-openai Sep 18, 2023
014ecaa
Fixes syntax of fewshot file (#1104)
jorge-openai Sep 18, 2023
ce38999
Dynamic Argument Integration for Registered Completion Functions (#1347)
douglasmonsky Sep 18, 2023
0ca655e
add Schelling Point eval (#1353)
ianmckenzie-oai Sep 19, 2023
6680790
Add 3rd party dataset licenses (#1357)
ianmckenzie-oai Sep 19, 2023
b95981b
add gujarati numerals eval (#1343)
rohoswagger Sep 19, 2023
c630f7f
[Eval] Add eval for fixing word spacing for Korean sentences (#1345)
woniesong92 Sep 19, 2023
992e746
add MakeMeSay eval (#1351)
ianmckenzie-oai Sep 19, 2023
7e5a177
add Ballot Proposal eval (#1352)
ianmckenzie-oai Sep 19, 2023
feb8f9b
Add MakeMePay eval (#1354)
ianmckenzie-oai Sep 19, 2023
732618f
add Steganography eval (#1355)
ianmckenzie-oai Sep 19, 2023
bfb609f
Add text compression eval (#1356)
ianmckenzie-oai Sep 19, 2023
a99a710
Add README for schelling point (#1358)
JunShern Sep 19, 2023
391baa1
Minor wording tweaks to makemesay readme (#1360)
RosieCampbell Sep 19, 2023
f2dcf30
Minor wording tweaks to schelling point readme (#1359)
RosieCampbell Sep 19, 2023
f6742f2
Amend contribution statement for make_me_say (#1361)
james-aung-aisi Sep 19, 2023
d012ef2
Remove setuptools_scm dependency (#1364)
jwang47 Sep 25, 2023
cfa97a2
Update README.md to link to W&B UI (#1365)
logankilpatrick Sep 25, 2023
28bbcf5
Check `--registry_path` for `samples_jsonl` data (#1277)
lukevs Sep 26, 2023
4d14963
Adding ruff, running pre-commit hooks, small fixes and documentation …
benomahony Sep 26, 2023
2f79783
add belarusian antonyms eval (#1368)
tanyashagova Oct 27, 2023
6138aa7
adding eval osm_mapping (#1349)
adrianmargin Oct 27, 2023
9407bae
Add A is B and B is A Eval (#1366)
mmtmn Oct 27, 2023
b3733fd
Added Icelandic inflection eval; JsonMatch eval function (#1387)
vthorsteinsson Oct 27, 2023
1f4f5f3
Add new Solvers framework (#1397)
JunShern Nov 9, 2023
4c58416
[Evals] Update the errors we except for retries (#1406)
andrew-openai Nov 13, 2023
b5414a1
Revert "[Evals] Update the errors we except for retries (#1406)"
andrew-openai Nov 15, 2023
b85a07b
Self-Prompting eval (#1401)
JunShern Nov 15, 2023
616320c
Add theory of mind eval (#1405)
inwaves Nov 15, 2023
85f6043
MMP v2 eval (#1403)
ojaffe Nov 15, 2023
9ebc23b
Sandbagging eval (#1409)
ojaffe Nov 15, 2023
796f222
Fix the OpenAI Version to <=0.28.1 (#1410)
andrew-openai Nov 15, 2023
52ce4e5
Bluff eval (#1402)
johny-b Nov 15, 2023
bec5500
Amend contribution statements for Bluff and ToM from PolRes team (#1413)
james-aung-aisi Nov 15, 2023
f4b12f2
Sandbagging readme (#1412)
ojaffe Nov 15, 2023
46b5af2
Upgrade openai to >=1.0.0 (#1420)
etr2460 Dec 5, 2023
e34e6ac
Fix commandline --help exception (#1381)
skyan Dec 5, 2023
c2312ef
[ci] Fix referencing API key for unit tests (#1425)
etr2460 Dec 8, 2023
a3be3ec
Docs typos (#1415)
krychu Dec 10, 2023
086c2eb
docs: documentation out of date/sync with inlined example code. (#1417)
tregoning Dec 10, 2023
c80ed2e
Update README.md (#1429)
logankilpatrick Dec 11, 2023
f4ef973
Update CODEOWNERS to new maintainers (#1431)
etr2460 Dec 11, 2023
0b8ffd6
Fix bluff for openai >= 1.0.0 and unbreak tests (#1427)
etr2460 Dec 11, 2023
fd5a3be
Schelling Point v2 (#1391)
james-aung-aisi Dec 15, 2023
cff96e6
Ballots v2 (#1390)
james-aung-aisi Dec 15, 2023
0c52bc2
Fix branch tests with empty API Key (#1440)
etr2460 Dec 20, 2023
b85b91c
Fix make decision prompt in ballots to send from system, not assistan…
james-aung-aisi Dec 20, 2023
80f0b99
Run tests on all commits to main (#1441)
etr2460 Dec 20, 2023
b266532
Add complete list of errors to MakeMeSay utils (#1436)
inwaves Dec 20, 2023
f0e0f1c
Use the API key for testing evals in CI (#1443)
etr2460 Dec 21, 2023
b4be822
Add MMMU evals and runner (#1442)
etr2460 Dec 21, 2023
34ba3bd
Release 2.0.0 (#1444)
etr2460 Dec 21, 2023
89392dd
Fix small typo in oaieval run function (#1438)
inwaves Dec 21, 2023
d05ae59
Fix Pydantic warning on data_test run (#1445)
inwaves Dec 21, 2023
959d02b
Change wrong kwargs name (#1435)
Dec 21, 2023
4e864cd
Randomly select MMMU answer when none is returned from the model (#1447)
etr2460 Dec 24, 2023
e9a01c1
Fixed parameter incorrect (#1378)
assert6 Jan 3, 2024
e1800a8
Add gpt-3.5-turbo-16k support to ctx len getter (#1388)
danesherbs Jan 3, 2024
5a45cd1
Add a recorder for function calls (#1389)
danesherbs Jan 3, 2024
b21e1f8
Solve #1394 (#1395)
Jan 3, 2024
4deb71e
Add eval japanese prime minister (#1422)
return-nil Jan 3, 2024
334ec44
Improve MMMU performance with prompt engineering (#1450)
etr2460 Jan 3, 2024
d628eb3
Add eval yaml for Theory of Mind eval (#1453)
ojaffe Jan 9, 2024
5c6c57d
icelandic gec eval (#1400)
svanhvitlilja Jan 10, 2024
f588280
Fix formatting/typing so pre-commit hooks pass (#1451)
ianmckenzie-oai Jan 10, 2024
5231b55
LLM eval for RAGTasks in Science topics (#1)
TablewareBox Jan 26, 2024
e3b587d
Merge remote-tracking branch 'upstream/main'
TablewareBox Jan 26, 2024
616a07f
update uni-finder pdf-parse-mode to v1.26 (#2)
TablewareBox Jan 26, 2024
72dc5f2
update evals for AGAC task in biomedical evals
Linmj-Judy Jan 26, 2024
fc58e2c
add evals and new metrics for biomedicine tasks
Naplessss Feb 27, 2024
8358134
update metrics and evals for biomedicine tasks
Naplessss Feb 29, 2024
9641a35
Update utils.py
Linmj-Judy Mar 3, 2024
ec12ff3
Create 31_GDAS_function.yaml
Linmj-Judy Mar 3, 2024
4fff417
Update and rename 31_semantic_role_recognition.yaml to 31_GDAS_regula…
Linmj-Judy Mar 3, 2024
75da117
Update 32_CDR.yaml
Linmj-Judy Mar 3, 2024
87610f3
Update 32_C_entities_recognition.yaml
Linmj-Judy Mar 3, 2024
2061128
Update 32_D_entities_recognition.yaml
Linmj-Judy Mar 3, 2024
50a6881
Create 33_DDI.yaml
Linmj-Judy Mar 3, 2024
f5a3eba
Update and rename AGAC_CHIP2022.yaml to biomedicine_comprehension.yaml
Linmj-Judy Mar 3, 2024
4811ca3
Update function.jsonl
Linmj-Judy Mar 3, 2024
7cf1d51
Create regulation.jsonl
Linmj-Judy Mar 3, 2024
d34487a
Update chemical_samples.jsonl
Linmj-Judy Mar 3, 2024
bd06804
Update disease_samples.jsonl
Linmj-Judy Mar 3, 2024
a280eaa
Update relations_samples.jsonl
Linmj-Judy Mar 3, 2024
b9075e7
Create samples.jsonl
Linmj-Judy Mar 3, 2024
b7dd91c
Update samples.jsonl
Linmj-Judy Mar 3, 2024
1 change: 1 addition & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
* @andrew-openai @rlbayes @jwang47 @logankilpatrick @etr2460 @katyhshi
File renamed without changes.
File renamed without changes.
47 changes: 36 additions & 11 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,9 +1,19 @@
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨

**PLEASE READ THIS**:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com).

## Eval details 📑

### Eval name

[Insert Eval name here]

### Eval description
@@ -20,10 +30,10 @@ Below are some of the criteria we look for in a good eval. In general, we are se

Your eval should be:

- [ ] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [ ] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [ ] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [ ] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [ ] Include at least 100 high quality examples
- [ ] **Include at least 15 high-quality examples.**

If there is anything else that makes your eval worth including, please document it below.

@@ -34,8 +44,9 @@ If there is anything else that makes your eval worth including, please document
## Eval structure 🏗️

Your eval should

- [ ] Check that your data is in `evals/registry/data/{name}`
- [ ] Check that your yaml is registered at `evals/registry/evals/{name}.jsonl`
- [ ] Check that your YAML is registered at `evals/registry/evals/{name}.yaml`
- [ ] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Expand All @@ -44,25 +55,39 @@ Your eval should

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>).

- [ ] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

- [ ] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement
### Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
- [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

### Submit eval

- [ ] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push
- [ ] I have filled out all required fields of this form
- [ ] I have used **Git LFS** for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `mypy`, `black`, `isort`, `autoflake` and `ruff` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval
```jsonl
INSERT_EVAL_HERE
```
</details>
15 changes: 15 additions & 0 deletions .github/workflows/parse_yaml.py
@@ -0,0 +1,15 @@
import sys

import yaml


def get_first_key(file_path):
with open(file_path, "r") as yaml_file:
content = yaml.safe_load(yaml_file)
first_key = next(iter(content))
return first_key


if __name__ == "__main__":
yaml_file_path = sys.argv[1]
print(get_first_key(yaml_file_path))
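The helper above can be exercised standalone; a minimal sketch follows (the sample registry entry and the temp-file handling are illustrative, not part of the PR — only the `get_first_key` logic mirrors the workflow script):

```python
import tempfile

import yaml  # PyYAML, same dependency the workflow installs


def get_first_key(file_path):
    # Same logic as the workflow helper: the first top-level key of a
    # registry YAML file is taken to be the eval's name.
    with open(file_path, "r") as yaml_file:
        content = yaml.safe_load(yaml_file)
        return next(iter(content))


# Illustrative registry entry (shape assumed from typical eval YAMLs)
sample = """\
my-eval:
  id: my-eval.dev.v0
my-eval.dev.v0:
  class: evals.elsuite.basic.match:Match
"""

with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write(sample)
    path = f.name

print(get_first_key(path))  # -> my-eval
```

This relies on `yaml.safe_load` preserving document order when building the mapping, which is why the first key is the eval name the workflow passes to `oaieval`.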
38 changes: 38 additions & 0 deletions .github/workflows/run_tests.yaml
@@ -0,0 +1,38 @@
name: Run unit tests

on:
pull_request:
branches:
- main
push:
branches:
- main

jobs:
check_files:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v2
with:
fetch-depth: 0
lfs: true

- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.9

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pyyaml
pip install pytest
pip install -e .

- name: Run unit tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
pytest
59 changes: 59 additions & 0 deletions .github/workflows/test_eval.yaml
@@ -0,0 +1,59 @@
name: Run new evals

on:
workflow_dispatch:
pull_request:
branches:
- main
paths:
- 'evals/registry/**'

jobs:
check_files:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v2
with:
fetch-depth: 0
lfs: true

- name: Install Git LFS
run: |
sudo apt-get install git-lfs
git lfs install

- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.9

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pyyaml
pip install -e .

- name: Get list of new YAML files in evals/registry/evals
id: get_files
run: |
# Use environment files to store the output
git diff --name-only --diff-filter=A ${{ github.event.pull_request.base.sha }} ${{ github.sha }} | grep '^evals/registry/evals/.*\.yaml$' | xargs > new_files
echo "new_files=$(cat new_files)" >> $GITHUB_ENV

- name: Run oaieval command for each new YAML file
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
files="${{ env.new_files }}"
if [ -n "$files" ]; then
for file in $files; do
echo "Processing $file"
first_key=$(python .github/workflows/parse_yaml.py $file)
echo "Eval Name: $first_key"
oaieval dummy $first_key --max_samples 10
done
else
echo "No new YAML files found in evals/registry/evals"
fi
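The file-selection step above amounts to a pattern filter over the output of `git diff --name-only --diff-filter=A`; a minimal Python sketch of the same match (the file names below are invented for illustration):

```python
import re

# Equivalent of the workflow's grep pattern for newly added eval YAMLs
EVAL_YAML = re.compile(r"^evals/registry/evals/.*\.yaml$")

# Hypothetical `git diff --name-only --diff-filter=A` output
changed_files = [
    "evals/registry/evals/my-eval.yaml",
    "evals/registry/data/my-eval/samples.jsonl",
    "README.md",
    "evals/registry/evals/nested/other.yaml",
]

new_evals = [f for f in changed_files if EVAL_YAML.match(f)]
print(new_evals)
# -> ['evals/registry/evals/my-eval.yaml', 'evals/registry/evals/nested/other.yaml']
```

Note that `.*` crosses directory separators, so YAMLs in subdirectories of `evals/registry/evals/` would also be picked up — matching the behavior of the grep in the workflow.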
28 changes: 28 additions & 0 deletions .gitignore
@@ -1,2 +1,30 @@
__pycache__/
evals.egg-info/

.venv/
venv/

# MacOS folder metadata
.DS_Store
.vscode/

# PyCharm folder metadata
.idea/

build

openai-key.txt
*.code-workspace

# Ignore run_experiments.sh results
evals/elsuite/**/logs/
evals/elsuite/**/outputs/

#Ignore the large model
scripts/model/GoogleNews_vectors_negative300.bin
scripts/model/GoogleNews-vectors-negative300.bin.gz
scripts/metrics/GoogleNews-vectors-negative300.bin.gz

raw_data

wandb/**/*
15 changes: 15 additions & 0 deletions .pre-commit-config.yaml
@@ -1,4 +1,10 @@
repos:
- repo: https://github.com/pre-commit/mirrors-mypy
rev: 'v1.3.0'
hooks:
- id: mypy
args: ["--config-file=mypy.ini", "--no-site-packages"]

- repo: https://github.com/psf/black
rev: 22.8.0
hooks:
@@ -27,3 +33,12 @@ repos:
- "--remove-unused-variables"
- "--remove-all-unused-imports"
exclude: "evals/__init__.py"

# This allows ruff to run and autofix the code
# The line length is so high because some of the evals are very long
# TODO: fix the evals and then reduce the line length here
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.0.277
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix, --line-length=767]
21 changes: 0 additions & 21 deletions LICENSE

This file was deleted.

95 changes: 95 additions & 0 deletions LICENSE.md
@@ -0,0 +1,95 @@
MIT License

Copyright (c) 2023 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

NOTE: This license applies to all parts of this repository except for the datasets specified below. See the respective datasets for their individual licenses.

### Dataset Licenses

#### Text Compression
- **Location**: evals/registry/data/text_compression
- **Components**:
- **c4**:
- **License**: Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0/
- **Source**: https://huggingface.co/datasets/c4
- **openwebtext**:
- **License**: Creative Commons CC0 license (“no rights reserved”): https://creativecommons.org/share-your-work/public-domain/cc0/
- **Source**: https://huggingface.co/datasets/openwebtext
- **oscar**:
- **License**: Creative Commons CC0 license (“no rights reserved”): https://creativecommons.org/share-your-work/public-domain/cc0/
- **Source**: https://huggingface.co/datasets/oscar
- **wikipedia**:
- **License**: Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA): https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License and the GNU Free Documentation License (GFDL): https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License
- **Source**: https://huggingface.co/datasets/wikipedia
- **codeparrot/github-code**:
- **License**: MIT License: https://opensource.org/license/mit/
- **Source**: https://huggingface.co/datasets/codeparrot/github-code
- **Abirate/english_quotes**:
- **License**: Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/legalcode.txt
- **Source**: https://huggingface.co/datasets/Abirate/english_quotes

#### Steganography
- **Location**: evals/registry/data/steganography
- **Components**:
- **Abirate/english_quotes**:
- **License**: Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/legalcode.txt
- **Source**: https://huggingface.co/datasets/Abirate/english_quotes
- **PiC/phrase_similarity**:
- **License**: Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/legalcode
- **Source**: https://huggingface.co/datasets/PiC/phrase_similarity
- **wikipedia**:
- **License**: Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA): https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License and the GNU Free Documentation License (GFDL): https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License
- **Source**: https://huggingface.co/datasets/wikipedia
- **c4**:
- **License**: Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0/
- **Source**: https://huggingface.co/datasets/c4
- **akoksal/LongForm**:
- **License**: MIT License: https://opensource.org/license/mit/
- **Source**: https://huggingface.co/datasets/akoksal/LongForm
- **alespalla/chatbot_instruction_prompts**:
- **License**: Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt
- **Source**: https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts
- **lighteval/mmlu**:
- **License**: MIT License: https://opensource.org/license/mit/
- **Source**: https://huggingface.co/datasets/lighteval/mmlu
- **vicgalle/alpaca-gpt4**:
- **License**: Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/legalcode
- **Source**: https://huggingface.co/datasets/vicgalle/alpaca-gpt4

#### Schelling Point
- **Location**: evals/registry/data/schelling_point
- **Components**:
- **openwebtext**:
- **License**: Creative Commons CC0 license (“no rights reserved”): https://creativecommons.org/share-your-work/public-domain/cc0/
- **Source**: https://huggingface.co/datasets/openwebtext
- **wikipedia**:
- **License**: Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA): https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License and the GNU Free Documentation License (GFDL): https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License
- **Source**: https://huggingface.co/datasets/wikipedia

#### Ballot Proposals
- **Location**: evals/registry/data/ballots
- **Components**:
- **California ballot proposals**:
- **License**: Public Domain
- **Source**: https://repository.uclawsf.edu/ca_ballot_props/


Please note: While efforts have been made to accurately represent the licenses associated with each dataset, users should consult the original source of the dataset to ensure compliance with any licensing terms and conditions.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,3 +1,4 @@
recursive-include evals *.py
recursive-include evals *.yaml
recursive-include evals *.sql
recursive-include evals/registry/data *.jsonl
1 change: 1 addition & 0 deletions Makefile
@@ -1,2 +1,3 @@
.PHONY: mypy
mypy:
mypy --config-file=mypy.ini --no-site-packages .