
Commit 60f82ca

Author: John (committed)
Merge branch 'master' of https://github.com/cmp-nct/ggllm.cpp
2 parents: f5dd2b4 + 60ea10a

File tree: 1 file changed (+15, -10 lines)


README.md

Lines changed: 15 additions & 10 deletions
@@ -1,27 +1,33 @@
-ggllm.cpp is a ggml-based tool to run quantized Falcon Models on CPU and GPU
+ggllm.cpp is a ggml-backed tool to run quantized Falcon 7B and 40B Models on CPU and GPU
 
-For detailed (growing) examples and help check the new Wiki:
+For growing examples and help check the new Wiki:
 https://github.com/cmp-nct/ggllm.cpp/wiki
 
 **Features that differentiate from llama.cpp for now:**
 - Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
-- Fully automated GPU offloading based on available and total VRAM
+- Fully automated CUDA-GPU offloading based on available and total VRAM
+- Run any Falcon Model at up to 16k context without losing sanity
+- Current Falcon inference speed on consumer GPU: up to 54+ tokens/sec for 7B and 18-25 tokens/sec for 40B 3-6 bit, roughly 38/sec and 16/sec at 1000 tokens generated
+- Supports running Falcon 40B on a single 4090/3090 (24 tk/sec, 15 tk/sec), even on a 3080 with a bit of quality sacrifice
+- Finetune auto-detection and integrated syntax support (just load OpenAssistant 7/40, add `-ins` for a chat or `-enc -p "Question"` and an optional `-sys "System prompt"`)
 - Higher efficiency in VRAM usage when using batched processing (more layers being offloaded)
 - 16 bit cuBLAS support (takes half the VRAM for those operations)
 - Improved loading screen and visualization
 - New tokenizer with regex emulation and BPE merge support
-- Finetune auto-detection and integrated syntax support (Just load OpenAssistant 7/40 add `-ins` for a chat or `-enc -p "Question"` and optional -sys "System prompt")
-- Stopwords support (-S)
-- Optimized RAM and VRAM calculation with batch processing support up to 8k
-- More command line parameter options (like disabling GPUs)
-- Current Falcon inference speed on consumer GPU: up to 54+ tokens/sec for 7B-4-5bit and 18-25 tokens/sec for 40B 3-6 bit, roughly 38/sec and 16/sec at at 1000 tokens generated
+
+- Optimized RAM and VRAM calculation with batch processing support
+- More command line selective features (like disabling GPUs, system prompt, stopwords)
+
 
 **What is missing/being worked on:**
+- priority: performance
+- web frontend example
 - Full GPU offloading of Falcon
 - Optimized quantization versions for Falcon
 - A new instruct mode
 - Large context support (4k-64k in the works)
 
+
 **Old model support**
 If you use GGML type models (file versions 1-4) you need to place tokenizer.json into the model directory! (example: https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226/blob/main/tokenizer.json)
 If you use updated model binaries they are file version 10+ and called "GGCC"; those do not need that json file to be loaded and converted
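
The finetune-syntax bullet moved up in this revision only names the flags. As a rough sketch of how such a call might look, assuming the main inference binary is `falcon_main`, that the usual `-m <model>` flag applies, and a placeholder model path (only `-ins`, `-enc -p` and `-sys` are taken from the README itself):

```sh
# Sketch only: binary name, -m flag and model path are assumptions;
# -ins, -enc, -p and -sys are the flags named in the README above.

# Interactive chat with an auto-detected OpenAssistant finetune:
./falcon_main -m models/falcon-40b-sft-mix-1226.ggccv10.bin -ins

# One-shot question through the integrated finetune syntax, with an optional system prompt:
./falcon_main -m models/falcon-40b-sft-mix-1226.ggccv10.bin \
  -enc -p "Question" -sys "System prompt"
```
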
@@ -49,10 +55,9 @@ https://huggingface.co/tiiuae/falcon-7b-instruct
 https://huggingface.co/OpenAssistant
 https://huggingface.co/OpenAssistant/falcon-7b-sft-mix-2000
 https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226
+_The sft-mix variants appear more capable than the top variants._
 _Download the 7B or 40B Falcon version, use falcon_convert.py (latest version) in 32 bit mode, then falcon_quantize to convert it to ggcc-v10_
 
-**Prompting finetuned models right:**
-https://github.com/cmp-nct/ggllm.cpp/discussions/36
 
 **Conversion of HF models and quantization:**
 1) use falcon_convert.py to produce a GGML v1 binary from HF - not recommended to be used directly
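
The italic download note above compresses the conversion pipeline into one sentence. A hedged sketch of the two steps follows; the argument order of both tools and the `q4_0` type are assumptions, and only the tool names plus the convert-in-32-bit-then-quantize order come from the README:

```sh
# Sketch only: argument order and the q4_0 quantization type are assumed,
# not verified against the tools' help output.

# Step 1: convert the HF checkpoint to a 32 bit GGML file (use whatever option the
#         current falcon_convert.py version provides to select 32 bit mode).
python falcon_convert.py ./falcon-40b-sft-mix-1226 ./models/falcon-40b-f32.bin

# Step 2: quantize the 32 bit file into the GGCC v10 format expected by current ggllm.cpp.
./falcon_quantize ./models/falcon-40b-f32.bin ./models/falcon-40b-ggcc-q4_0.bin q4_0
```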
