ggllm.cpp is a ggml-backed tool to run quantized Falcon 7B and 40B models on CPU and GPU

For a growing set of examples and help, check the new Wiki:
https://github.com/cmp-nct/ggllm.cpp/wiki

**Features that differentiate from llama.cpp for now:**
- Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
- Fully automated CUDA GPU offloading based on available and total VRAM
- Run any Falcon model at up to 16k context without losing sanity
- Current Falcon inference speed on consumer GPUs: up to 54+ tokens/sec for 7B and 18-25 tokens/sec for 40B at 3-6 bit, roughly 38/sec and 16/sec at 1000 tokens generated
- Supports running Falcon 40B on a single 4090/3090 (24 tk/sec, 15 tk/sec), even on a 3080 with a bit of quality sacrifice
- Finetune auto-detection and integrated syntax support (just load an OpenAssistant 7B/40B finetune and add `-ins` for a chat, or use `-enc -p "Question"` with an optional `-sys "System prompt"`); see the usage sketch after this list
- Higher efficiency in VRAM usage when using batched processing (more layers being offloaded)
- 16 bit cuBLAS support (takes half the VRAM for those operations)
- Improved loading screen and visualization
- New tokenizer with regex emulation and BPE merge support
- Optimized RAM and VRAM calculation with batch processing support
- More command line options, like disabling GPUs, setting a system prompt, and stopwords (`-S`)
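
To make the flag examples above concrete, here is a minimal usage sketch. The `falcon_main` binary name and the `-m`, `-t` and `-ngl` options are assumed to follow llama.cpp conventions and are not confirmed by this README; `-ins`, `-enc`, `-p` and `-sys` are the options named in the list, and the model path is only a placeholder.

```bash
# Hypothetical interactive chat with an OpenAssistant Falcon finetune (auto-detected).
# Model path is a placeholder; -m (model), -t (threads) and -ngl (GPU layers)
# are assumed llama.cpp-style flags - check --help in your build.
./falcon_main -m models/falcon-40b-sft-mix-1226-q5_1.bin -t 8 -ngl 100 -ins

# One-shot question using the detected finetune syntax (-enc) plus an optional system prompt.
./falcon_main -m models/falcon-40b-sft-mix-1226-q5_1.bin -ngl 100 \
  -enc -p "Explain quantization in one paragraph." -sys "You are a concise assistant."
```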

**What is missing/being worked on:**
- Priority: performance
- Web frontend example
- Full GPU offloading of Falcon
- Optimized quantization versions for Falcon
- A new instruct mode
- Large context support (4k-64k in the works)

**Old model support**
If you use GGML type models (file versions 1-4), you need to place tokenizer.json into the model directory (example: https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226/blob/main/tokenizer.json).
If you use updated model binaries (file version 10+, called "GGCC"), that json file does not need to be loaded and converted.
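
As a sketch of what "place tokenizer.json into the model directory" looks like in practice (the local directory name is a placeholder, and the download URL is simply the direct-download form of the link above):

```bash
# Old GGML (file version 1-4) binaries need the matching tokenizer.json next to
# the model file; GGCC (v10+) binaries do not. Directory name is a placeholder.
mkdir -p models/falcon-40b-sft-mix-1226
wget -P models/falcon-40b-sft-mix-1226 \
  https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226/resolve/main/tokenizer.json
```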

[…]

https://huggingface.co/tiiuae/falcon-7b-instruct
https://huggingface.co/OpenAssistant
https://huggingface.co/OpenAssistant/falcon-7b-sft-mix-2000
https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226
_The sft-mix variants appear more capable than the top variants._
_Download the 7B or 40B Falcon version, use falcon_convert.py (latest version) in 32 bit mode, then falcon_quantize to convert it to ggcc-v10_
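
A rough sketch of that download / convert / quantize pipeline. The argument order of falcon_convert.py and falcon_quantize, the q5_1 quantization type and all paths are assumptions for illustration; check each tool's help output for the real interface.

```bash
# 1) Convert the downloaded HF checkpoint to a 32 bit GGML file (arguments assumed).
python falcon_convert.py ~/models/falcon-40b-sft-mix-1226 ~/models/falcon-40b-sft-mix-1226-f32.bin

# 2) Quantize the 32 bit file down to a GGCC v10 binary (q5_1 chosen as an example type).
./falcon_quantize ~/models/falcon-40b-sft-mix-1226-f32.bin ~/models/falcon-40b-sft-mix-1226-q5_1.bin q5_1
```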

**Conversion of HF models and quantization:**
1) use falcon_convert.py to produce a GGML v1 binary from HF - not recommended to be used directly