int4-quantization

Here are 2 public repositories matching this topic...

g023 / xinf

g023's TurboXInf 🚀: 2x+ faster inference for Qwen3-1.77B or Qwen3.5-2B on RTX 3060! Custom Triton INT8 GEMV kernels halve memory traffic by fusing dequantization, paired with torch.compile. Hits 113 tok/s (vs 56.4 baseline) with no quality loss with INT8 even better results for INT4. MIT License.

ai kernel gpu gpu-acceleration int8-inference int8-quantization ai-tools ai-inference ai-inference-server int4-quantization

Updated Apr 16, 2026
Python

gauranggupta0786 / Quantisation-on-reasoning-models

Star

Implemented post-training quantisation (PTQ) on transformer-based reasoning models using 8-bit and 4-bit weight quantisation (INT8, INT4) with frameworks like PyTorch and Hugging Face Transformers. Leveraged libraries such as bitsandbytes to reduce model size and accelerate inference, while evaluating performance degradation on reasoning tasks. Com

machine-learning deep-learning transformers pytorch quantization low-precision huggingface model-optimization post-training-quantization int8-quantization bitsandbytes reasoning-models int4-quantization

Updated Apr 21, 2026
Jupyter Notebook

Improve this page

Add a description, image, and links to the int4-quantization topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the int4-quantization topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly