A minimal, inference-only Python package for RAM++ (Recognize Anything Plus Model), an open-set image tagger that uses zero-shot generalization to recognize categories beyond its training vocabulary.
Based on the original recognize-anything by Xinyu Huang et al.
RAM++ is a vision-language model that generates semantic tags for images. It covers 4,585 common categories out of the box and generalizes to open-set categories it has never seen during training — significantly outperforming CLIP on tag recognition tasks.
This package strips out training, finetuning, and demo code — leaving a clean API for running inference.
pip install git+https://github.com/aka-vm/ram.git

With uv:
uv pip install git+https://github.com/aka-vm/ram.git

From source:
git clone https://github.com/aka-vm/ram
cd ram
uv sync  # or: pip install -e .

The model (~850 MB) is hosted on HuggingFace and can be downloaded automatically:
from ram_plus import download_model
model_path = download_model() # saves to ~/.cache/ram_plus/
model_path = download_model("./models")  # or a custom directory

Or manually from HuggingFace.
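
If you fetch the checkpoint manually, a small helper can check whether it is already on disk before anything is downloaded. The following is a minimal sketch and not part of `ram_plus`: `find_checkpoint` is a hypothetical name, and the default directory and filename are assumptions taken from the examples in this README (`~/.cache/ram_plus/` and `ram_plus_swin_large_14m.pth`). The package may organize its cache differently.

```python
from pathlib import Path
from typing import Optional, Union

# Assumed defaults, copied from the examples in this README.
# They are not confirmed to match what the package does internally.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "ram_plus"
DEFAULT_FILENAME = "ram_plus_swin_large_14m.pth"


def find_checkpoint(
    directory: Union[str, Path] = DEFAULT_CACHE_DIR,
    filename: str = DEFAULT_FILENAME,
) -> Optional[Path]:
    """Return the checkpoint path if a non-empty file exists there, else None."""
    candidate = Path(directory).expanduser() / filename
    # An empty file usually means an interrupted download, so treat it as missing.
    if candidate.is_file() and candidate.stat().st_size > 0:
        return candidate
    return None


# Hypothetical usage: only call download_model() when nothing is on disk yet.
# model_path = find_checkpoint("./models") or download_model("./models")
```

The size check is deliberately crude; it does not verify that the file is a complete or valid checkpoint.
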
import cv2
from ram_plus import RamTagGenerator
# Auto-downloads model if model_path is not provided
generator = RamTagGenerator(device="cuda")
# Or point to a local checkpoint
generator = RamTagGenerator(model_path="./models/ram_plus_swin_large_14m.pth", device="cuda")
# Run on a single image (numpy HWC BGR, as returned by cv2)
image = cv2.imread("photo.jpg")
tags = generator(image)
print(tags) # ['dog', 'grass', 'outdoors', ...]
# Batch inference (image_paths is any list of image file paths)
images = [cv2.imread(p) for p in image_paths]
batch_tags = generator(images)
# Sort tags by confidence
generator = RamTagGenerator(device="cuda", sort_tags=True)
# Pass a pre-normalized torch.Tensor directly (NCHW, float32, ImageNet-normalized)
tags = generator(tensor_batch)

| Input | Supported |
|---|---|
| np.ndarray (HWC BGR uint8) | Single image |
| List[np.ndarray] | Batch |
| torch.Tensor (NCHW float32, ImageNet-normalized) | Pre-processed batch |
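
For large image sets, passing every image to the generator in a single list can use more memory than is available. The sketch below splits the inputs into fixed-size batches and concatenates the results. `chunked`, `tag_in_batches`, and `batch_size` are hypothetical names that are not part of `ram_plus`, and the code assumes that calling the generator with a list returns one list of tags per image, in input order.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def tag_in_batches(
    generator: Callable[[List[T]], List[List[str]]],
    images: Sequence[T],
    batch_size: int = 8,
) -> List[List[str]]:
    """Run `generator` on `images` in batches and return one tag list per image."""
    results: List[List[str]] = []
    for batch in chunked(images, batch_size):
        # Assumption: the generator returns one list of tags per input image.
        results.extend(generator(list(batch)))
    return results


# Hypothetical usage with the generator from the examples above:
# all_tags = tag_in_batches(generator, images, batch_size=16)
```

A suitable `batch_size` depends on the device and image resolution, so treat the default of 8 as a placeholder.
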
- Original model and training: Xinyu Huang et al.
- Paper: Recognize Anything: A Strong Image Tagging Model (RAM++)