Phi Family Quantization Using Generative AI Extensions for onnxruntime

What Are Generative AI Extensions for onnxruntime?

These extensions enable the use of generative AI with ONNX Runtime (https://github.com/microsoft/onnxruntime-genai). They provide a generative AI loop for ONNX models, encompassing inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Developers can leverage a high-level generate() method or execute each model iteration in a loop to generate tokens step by step, with the flexibility to update generation parameters within the loop. The extensions support greedy/beam search, TopP, TopK sampling for token sequence generation, and built-in logits processing such as repetition penalties. Additionally, custom scoring can be easily integrated.
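
As a rough illustration, the sketch below shows what this token-by-token loop can look like from Python with the onnxruntime_genai package. The exact method names vary between releases (for example, whether prompt tokens are passed via generator.append_tokens or params.input_ids), so treat the calls and the placeholder model path as assumptions to verify against the documentation for your installed version.

import onnxruntime_genai as og

# Load an ONNX model folder produced by Model Builder (the path is a placeholder).
model = og.Model("path_to_output_folder")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configure search and sampling options (greedy/beam search, TopP, TopK, repetition penalties, ...).
params = og.GeneratorParams(model)
params.set_search_options(max_length=256, top_p=0.9, temperature=0.7)

generator = og.Generator(model, params)
# Newer releases take the prompt tokens here; older releases set params.input_ids instead.
generator.append_tokens(tokenizer.encode("What does INT4 quantization do?"))

# Generate tokens step by step; generation parameters can also be adjusted inside the loop.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)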

At the application level, Generative AI extensions for onnxruntime allow developers to build applications in C++/C#/Python. At the model level, they facilitate merging fine-tuned models and performing related quantization and deployment tasks.

Quantizing Phi-3.5 with Generative AI Extensions for onnxruntime

Supported Models

Generative AI extensions for onnxruntime support quantization for Microsoft Phi, Google Gemma, Mistral, and Meta LLaMA.

Model Builder in Generative AI Extensions for onnxruntime

The Model Builder simplifies and accelerates the creation of optimized and quantized ONNX models compatible with the ONNX Runtime generate() API.

With Model Builder, you can quantize models to INT4, INT8, FP16, or FP32 and target various hardware acceleration providers such as CPU, CUDA, DirectML, and mobile.

To use Model Builder, you need to install:

pip install torch transformers onnx onnxruntime

pip install --pre onnxruntime-genai

After installation, you can execute the Model Builder script in the terminal to convert model formats and apply quantization.

python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files

Understanding Key Parameters

  1. model_name: The model's identifier on Hugging Face, such as microsoft/Phi-3.5-mini-instruct or microsoft/Phi-3.5-vision-instruct. Alternatively, this can be the local path where the model is stored.

  2. path_to_output_folder: The directory where the quantized model will be saved.

  3. precision: The target precision for the ONNX model, for example int4, fp16, or fp32.

  4. execution_provider: Specifies the hardware acceleration to be used, such as CPU, CUDA, or DirectML (an example combining these flags follows this list).

  5. cache_dir_to_save_hf_files: The directory where models downloaded from Hugging Face will be cached locally.
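
As an illustration of how these parameters combine, the command below sketches a hypothetical FP16 conversion targeting DirectML; the output and cache folder names are placeholders, and the exact precision and provider flag values should be confirmed against the Model Builder help for your installed version.

python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -o ./onnx-dml -p fp16 -e dml -c ./Phi-3.5-mini-instruct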


How to Use Model Builder to Quantize Phi-3.5

The Model Builder now supports ONNX model quantization for Phi-3.5 Instruct and Phi-3.5-Vision.

Phi-3.5-Instruct

CPU-Accelerated Quantization to INT4

python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct  -o ./onnx-cpu -p int4 -e cpu -c ./Phi-3.5-mini-instruct

CUDA-Accelerated Quantization to INT4

python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -o ./onnx-cuda -p int4 -e cuda -c ./Phi-3.5-mini-instruct
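
To run the resulting CUDA model, you typically also need the CUDA flavor of the runtime package; the package name below reflects current PyPI packaging and is an assumption to verify against the onnxruntime-genai documentation for your release.

pip install --pre onnxruntime-genai-cuda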

Phi-3.5-Vision

Phi-3.5-vision-instruct-onnx-cpu-fp32

  1. Set up the environment in the terminal:

mkdir models
cd models

  2. Download the microsoft/Phi-3.5-vision-instruct model to the models folder:
    https://huggingface.co/microsoft/Phi-3.5-vision-instruct

  3. Download the following files to your Phi-3.5-vision-instruct folder:

  4. Download this file to the models folder:
    https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/blob/main/onnx/build.py

  5. Open the terminal and run the conversion script to produce the FP32 ONNX model:

python build.py -i ".\<Your Phi-3.5-vision-instruct path>" -o .\vision-cpu-fp32 -p f32 -e cpu

Note:

  1. The Model Builder currently supports Phi-3.5-Instruct and Phi-3.5-Vision but does not support Phi-3.5-MoE.

  2. You can run the quantized ONNX models through the Generative AI extensions for onnxruntime SDK (see the prompt-formatting sketch after this list).

  3. Responsible AI considerations are crucial. After quantizing a model, it is recommended to conduct comprehensive testing to validate results.

  4. Quantizing to INT4 on CPU makes deployment on edge devices feasible and broadens the possible application scenarios, which is why Phi-3.5-Instruct has been prioritized for INT4 quantization.
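
When driving the quantized Phi-3.5-Instruct model through the SDK as in the earlier loop sketch, the prompt usually needs to follow the model's chat template. The helper below is a hypothetical illustration based on the published Phi-3.5-mini-instruct format; check the exact tags and newlines against the model card on Hugging Face.

# Hypothetical helper (not part of the SDK): wraps a question in the Phi-3.5-mini-instruct chat template.
def format_phi35_prompt(user_message: str, system_message: str = "You are a helpful assistant.") -> str:
    return (
        f"<|system|>\n{system_message}<|end|>\n"
        f"<|user|>\n{user_message}<|end|>\n"
        f"<|assistant|>\n"
    )

# The formatted prompt can then be encoded and passed to the generation loop shown earlier,
# e.g. generator.append_tokens(tokenizer.encode(prompt)).
prompt = format_phi35_prompt("Summarize what INT4 quantization does.")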

Resources

  1. Learn more about Generative AI extensions for onnxruntime:
    https://onnxruntime.ai/docs/genai/

  2. Explore the Generative AI extensions for onnxruntime GitHub repository:
    https://github.com/microsoft/onnxruntime-genai
