This project is a Julia version of HumanEval. Our goal is to gain a better understanding of how the latest LLMs perform with the Julia programming language.
| model | evalplus * | basic ** |
|---|---|---|
| gpt-4-0125-preview | 0.774 | 0.823 |
| gpt-4-turbo | 0.756 | 0.823 |
| mistral-large-instruct-2407 | 0.744 | 0.823 |
| gpt-4o | 0.738 | 0.817 |
| claude-3-5-sonnet-20240620 | 0.72 | 0.823 |
| gpt-4-1106-preview | 0.72 | 0.805 |
| DeepSeek-Coder-V2-Instruct | 0.695 | 0.774 |
| DeepSeek-V2-Chat | 0.689 | 0.756 |
| Llama-3.1-405B-Instruct | 0.628 | 0.744 |
| claude-3-opus-20240229 | 0.61 | 0.689 |
| Qwen2-72B-Instruct | 0.598 | 0.665 |
| Phind-CodeLlama-34B-v2 | 0.591 | 0.659 |
| gpt-3.5-turbo-0125 | 0.591 | 0.652 |
| mistral-large-latest | 0.573 | 0.659 |
| gpt-3.5-turbo-0613 | 0.567 | 0.64 |
| gpt-3.5-turbo-1106 | 0.555 | 0.628 |
| DeepSeek-Coder-33B-instruct | 0.543 | 0.598 |
| Magicoder-S-DS-6.7B | 0.543 | 0.616 |
| WizardCoder-33B-V1.1 | 0.543 | 0.604 |
| Qwen1.5-110B-Chat | 0.53 | 0.598 |
| yi-large | 0.524 | 0.652 |
| deepseek-coder-6.7b-instruct | 0.488 | 0.549 |
| CodeLlama-70b-Instruct-hf | 0.457 | 0.561 |
| code-millenials-34b | 0.439 | 0.5 |
| Magicoder-S-CL-7B | 0.402 | 0.463 |
| CodeLlama-34b-Instruct-hf | 0.311 | 0.366 |
| Starling-LM-7B-alpha | 0.299 | 0.354 |
| Yi-34B-Chat | 0.232 | 0.317 |
* evalplus: scores are calculated based on the extended test cases from EvalPlus.

** basic: scores are calculated based on test cases from HumanEval only.
By default, all results are calculated as pass@1 with greedy decoding. Models are deployed with vLLM, which uses the predefined chat template stored in the tokenizer. Feel free to create an issue if you'd like some other models to be evaluated.
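For reference, the standard unbiased pass@k estimator from the original HumanEval paper reduces to a plain pass rate under this setup (one greedy sample per problem). Below is a minimal Julia sketch of that estimator; it is illustrative and not taken from this project's source.

```julia
# Unbiased pass@k estimator from the original HumanEval paper:
# pass@k = E[1 - C(n - c, k) / C(n, k)] over problems, where n is the number of
# samples per problem and c the number of correct samples. With greedy decoding
# n = k = 1, so pass@1 is simply the fraction of problems solved.
function pass_at_k(n::Integer, c::Integer, k::Integer)
    n - c < k && return 1.0
    1.0 - prod((n - c - i) / (n - i) for i in 0:k-1)
end

# Hypothetical per-problem (n, c) outcomes with one greedy sample each:
outcomes = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = sum(pass_at_k(n, c, 1) for (n, c) in outcomes) / length(outcomes)  # 0.75
```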
To evaluate a model, first deploy it behind an OpenAI-compatible endpoint, for example with vLLM or Ollama. The OPENAI_API_KEY and OPENAI_BASE_URL environment variables are needed in the next step.
To test models from Anthropic, you should set ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL instead.
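Before launching the full evaluation, it can help to sanity-check that the endpoint responds. The snippet below is only an illustration and is not part of this project; HTTP.jl, JSON3.jl, and the prompt text are assumptions, while the request shape follows the standard OpenAI chat-completions API.

```julia
# Quick connectivity check against an OpenAI-compatible endpoint (illustrative only).
using HTTP, JSON3

base_url = get(ENV, "OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key  = get(ENV, "OPENAI_API_KEY", "debug")

resp = HTTP.post("$base_url/chat/completions",
    ["Authorization" => "Bearer $api_key", "Content-Type" => "application/json"],
    JSON3.write(Dict(
        "model" => get(ENV, "MODEL", "gpt-3.5-turbo-0613"),
        "messages" => [Dict("role" => "user", "content" => "Write a Julia hello world.")],
        "temperature" => 0,  # greedy decoding, matching the evaluation setup
    )))

println(JSON3.read(resp.body).choices[1].message.content)
```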
```bash
docker run -it --rm \
    -v /PATH/TO/SAVE/RESULTS/generations:/workspace/HumanEval.jl/generations \
    -e OPENAI_API_KEY=YOUR_SECRET \
    -e OPENAI_BASE_URL=http://localhost:8000/v1 \
    -e RETESTITEMS_NWORKERS=16 \
    -e RETESTITEMS_TESTITEM_TIMEOUT=15 \
    -e MODEL=gpt-3.5-turbo-0613 \
    ghcr.io/01-ai/humaneval.jl:latest
```

- `/PATH/TO/SAVE/RESULTS/generations`: this folder will contain the raw responses from the model, the extracted Julia code snippets, and the unit test results.
- `YOUR_SECRET`: should be the same as the key you provided when deploying the server.
- `RETESTITEMS_NWORKERS`: adjust it to the number of cores in your test environment; it specifies how many workers are used to run the tests (see the sketch after this list).
- `RETESTITEMS_TESTITEM_TIMEOUT`: the default of 15 seconds should be enough to pass all the test cases.
- `MODEL`: the model name you specified when deploying the model. If you use vLLM, it should be the same as the value of `--served-model-name`.
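The unit tests themselves are run with ReTestItems.jl (see the acknowledgements below). As a rough illustration of how the two `RETESTITEMS_*` variables come into play, here is a hedged sketch of a single test item for HumanEval/0; the file path and test-item name are hypothetical, and the project's real test files may look different.

```julia
using ReTestItems  # brings @testitem into scope; `Test` is available inside each item

# Each problem is checked in its own test item. RETESTITEMS_NWORKERS controls how
# many items run in parallel; RETESTITEMS_TESTITEM_TIMEOUT caps each item's runtime,
# so an infinite loop in generated code cannot stall the whole evaluation.
@testitem "HumanEval_0_has_close_elements" begin
    include("generations/HumanEval_0.jl")  # hypothetical path to the extracted snippet

    @test has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == true
    @test has_close_elements([1.0, 2.0, 3.0], 0.5) == false
end
```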
- Make sure you have the latest Julia installed.
- Clone and enter the root of this project.
- Start the Julia REPL with the following command:

  ```bash
  OPENAI_API_KEY=debug OPENAI_BASE_URL=http://localhost:8000/v1 RETESTITEMS_NWORKERS=16 RETESTITEMS_TESTITEM_TIMEOUT=15 MODEL=gpt-3.5-turbo-0613 julia --project
  ```

  The environment variables have the same meaning as above.
- Execute the following commands in the Julia REPL:

  ```julia
  julia> import Pkg; Pkg.instantiate();

  julia> include("src/evaluation.jl")

  julia> evaluate("YOUR_MODEL_NAME")
  ```

  Once finished, the results will be displayed. You may find more details under the generations directory.
- nuprl/MultiPL-E contains Julia prompts transformed from the original Python HumanEval. However, based on my limited Julia programming experience, the prompts are not always accurate or idiomatic.
- Julia-LLM-Leaderboard, which focuses on practicality and simplicity.
- EvalPlus Leaderboard
- Explore advanced techniques to improve LLMs' performance on code in general, especially how to iteratively refine code.
- Julia-specific LLM training/finetuning. We want to know the minimum requirements for training a code LLM.
- Improve the Yi series models' performance on code.
We're hiring! If you're interested in working on code LLMs at 01.ai, please contact [email protected].
- What are the differences compared to the original Python version?
- What are the limitations of this project?
- How do LLMs perform compared to humans?
- How difficult is each problem?
- Is GPT-4 good enough?
- How to make this evaluation higher quality?
- How should we measure hallucinations?
- Are there any other metrics we should care about beyond pass@k?
- Why does Yi-34B-Chat perform so poorly?
- This project heavily relies on many features provided by ReTestItems.jl. Many thanks to Nick Robinson for his help during development.