For an operator to be eligible for fusion, it must meet the following conditions:
- It has only one input, excluding Constant and initializer type tensors.
- It has only one output.
- The first dimension of both input and output shapes is annotated with "batch_size".
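The checks above can be sketched in Python. The function below and its dict-based node format are illustrative only, not the project's actual API:

```python
# Illustrative sketch of the fusion eligibility conditions listed above.
# `is_fusible` and the plain-dict node representation are assumptions made
# for this example, not part of the project's code.

def is_fusible(node, initializer_names, shapes):
    """Return True if `node` meets all three fusion conditions.

    node: dict with "inputs" and "outputs" lists of tensor names
    initializer_names: set of Constant/initializer tensor names
    shapes: dict mapping tensor name -> list of dims (symbolic str or int)
    """
    # Condition 1: exactly one input, excluding Constant/initializer tensors.
    data_inputs = [i for i in node["inputs"] if i not in initializer_names]
    if len(data_inputs) != 1:
        return False
    # Condition 2: exactly one output.
    if len(node["outputs"]) != 1:
        return False
    # Condition 3: the first dimension of both the input and the output
    # shape must carry the symbolic annotation "batch_size".
    for name in (data_inputs[0], node["outputs"][0]):
        dims = shapes.get(name)
        if not dims or dims[0] != "batch_size":
            return False
    return True
```

For example, a MatMul whose weight is an initializer has one data input and one output, so only the "batch_size" annotation decides its eligibility.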
Therefore, we must first perform a more accurate shape inference, i.e., symbolic shape inference. Run the following command:
```shell
python ./tools/symbolic_shape_infer.py --input [input model path] --output [output model path]
```
Download the onnxruntime project from https://github.com/microsoft/onnxruntime and build it from source by executing the following commands:
```shell
git clone https://github.com/microsoft/onnxruntime.git
cd onnxruntime
git apply ./runtime/ort/changes.patches
```
Install the Python package:
```shell
pip install -e .
```
We have currently implemented two custom CPU ops, Merge and Route, for onnxruntime.
The ./example/micro directory contains a microbenchmark. Follow these instructions to run it:
```shell
cd example/micro
python generate.py
./convert.sh
python fuse.py --num 2
python fuse.py
python test_runtime.py
```

In the ./example/transformer directory, follow these instructions to test the functionality. We use two decoder layers of the LLaMA model and its LoRA variant as our test models:
```shell
cd example/transformer
python generate.py
./convert.sh
python fuse.py
python test_runtime.py
```

- Generalize input assumptions to handle multiple inputs.
- Refactor the single Route Op into multiple specialized Route Ops.
- Fix height = 256 and width = 256 to observe the effect.