Skip to content

feat: Add BijectionConverter and BijectionAttack (#1903)#1942

Open
sajisanchu1913-source wants to merge 15 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack
Open

feat: Add BijectionConverter and BijectionAttack (#1903)#1942
sajisanchu1913-source wants to merge 15 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack

Conversation

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor

Summary

Implements the Bijection Attack from arXiv:2410.01294 (Haize Labs) into PyRIT.

The attack works by teaching a target LLM a secret character mapping through
demonstration shots, then sending harmful prompts encoded in that mapping to
bypass safety filters. Responses are decoded using the inverse mapping.

Changes

New Files

  • pyrit/prompt_converter/bijection_converter.py — generates random letter-to-letter mapping, encodes prompts, decodes responses
  • pyrit/executor/attack/single_turn/bijection_attack.py — runs full bijection attack with teaching phase
  • tests/unit/prompt_converter/test_bijection_converter.py — 11 unit tests for converter
  • tests/unit/executor/test_bijection_attack.py — 5 unit tests for attack
  • doc/code/executor/attack/bijection_attack.ipynb — usage notebook

Modified Files

  • pyrit/prompt_converter/__init__.py — registered BijectionConverter
  • pyrit/executor/attack/single_turn/__init__.py — registered BijectionAttack

How It Works

  1. BijectionConverter generates a random secret mapping (e.g. a→q, b→x...)
  2. BijectionAttack sends teaching messages to target AI to teach the mapping
  3. Harmful prompt is encoded and sent as TASK is '⟪encoded prompt⟫'
  4. Response is decoded using inverse mapping
  5. Decoded response is scored by the judge

Pattern Followed

  • BijectionConverter follows FlipConverter pattern
  • BijectionAttack follows FlipAttack pattern

Reference

sajisanchu1913-source and others added 12 commits May 28, 2026 17:14
- _RemoteDatasetLoader._fetch_zip_from_url:
  - keyword-only args (source, inner_files, cache)
  - streams download (requests stream=True + iter_content) to avoid
    double-buffering large archives
  - md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
    cache=True; named temp file otherwise (cleaned up after parse)
  - validates each inner_files extension against FILE_TYPE_HANDLERS;
    raises ValueError with a member preview if an inner file is missing
  - parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
    so the open ZipFile never escapes the worker thread
  - adds the missing import zipfile that broke the previous commit
- _MICDataset:
  - drops unused io / json / requests imports (helper handles them)
  - delegates download + parse to the helper; only owns the seed
    construction loop
  - guards non-string Q values (in addition to NaN moral values)
  - forwards cache from fetch_dataset_async to the helper
  - factors authors into AUTHORS class constant
- Tests:
  - test_moral_integrity_corpus_dataset.py: stops mocking requests.get
    directly; patches _fetch_zip_from_url to return parsed dicts so
    tests don't depend on the helper's internal shape
  - adds test_fetch_dataset_non_string_q and
    test_fetch_dataset_passes_cache_flag
  - hoists imports into the right groups so ruff I001 stops firing
  - removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
  happy path, on-disk caching (hits 1 network call across 2 fetches),
  cache=False does not persist, missing inner file raises ValueError,
  unsupported extension raises ValueError

Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav
  to prevent concurrent call collisions
- Wrap Azure upload in try/finally to ensure temp file is always
  deleted even when upload fails
- Add regression test to verify cleanup on upload failure

Fixes microsoft#1894
- Add BijectionConverter that generates random letter-to-letter mapping
- Add BijectionAttack that teaches the mapping to target AI and encodes harmful prompts
- Add unit tests for both converter and attack
- Add notebook demonstrating usage
- Update __init__.py files to register new classes

Based on arXiv:2410.01294 (Haize Labs bijection-learning)

@romanlutz romanlutz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! There are a few things that need addressing but we're pretty close.


# Teaching shot messages — demonstrate encoding with examples
examples = ["hello", "world", "the cat", "good day", "yes no"]
for i in range(min(self._num_teaching_shots, len(examples))):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_teaching_shots is silently capped at 5, and the example pool doesn't exercise the full alphabet.

min(self._num_teaching_shots, len(examples)) means passing num_teaching_shots=20 silently uses 5. The paper varies shot count meaningfully (often 10–20).

Worse, the hardcoded ["hello", "world", "the cat", "good day", "yes no"] never demonstrates q, x, z, j, v — the model is never shown the encoding for those letters and has nothing to reproduce when generating cipher‑text responses that need them. Generate examples programmatically (short pangram‑ish strings covering all 26 letters) so any reasonable shot count works and the entire mapping is demonstrated at least once.

Comment thread pyrit/prompt_converter/__init__.py Outdated
"BinAsciiConverter",
"BinaryConverter",
"BrailleConverter",
"BijectionConverter",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__all__ is hand‑sorted alphabetically — "BijectionConverter" should come before "BinAsciiConverter" (Bije < BinA). Same fix needed in the import block at line 31.


def setup_method(self):
"""Set up fake memory before each test."""
self.memory_mock = MagicMock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — don't call CentralMemory.set_memory_instance manually in tests.

This sets process‑global memory state that leaks across tests when run with pytest -n 4. Per test.instructions.md, use the patch_central_database fixture instead:

@pytest.mark.usefixtures("patch_central_database")
class TestBijectionAttack:
    ...

A couple of related issues:

  • target = MagicMock() should be MockPromptTarget from tests/unit/mocks.pyBijectionAttack extends PromptSendingAttack, which expects a real Identifiable.
  • The tests as written only assert __init__ echoes its arguments. The actually interesting behaviors — teaching messages alternate roles, the encoded objective is sent through the pipeline, the response is decoded — aren't exercised. Please add at least one end‑to‑end test against a MockPromptTarget that returns a cipher‑text response and asserts the result is decoded.

bijection_type: str = "letter",
fixed_size: int = 0,
num_digits: int = 0,
) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead / misleading constructor parameters.

  • num_digits is stored on self and never read anywhere.
  • bijection_type is stored but only "letter" is ever implemented — any other string is silently accepted and produces a letter mapping. That's a footgun.

Either implement the modes (the paper discusses digit‑substitution variants) or drop these params until they're needed. If you keep bijection_type, please make it an enum.StrEnum per the style guide ("Enums are used instead of Literals") and raise on unknown values.

self.fixed_size = fixed_size
self.num_digits = num_digits
self.mapping = self._generate_mapping()
self.inverse_mapping = {v: k for k, v in self.mapping.items()}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_build_identifier() not overridden.

Per converters.instructions.md: "Override _build_identifier() to include parameters that affect conversion behavior." The mapping is freshly randomized per instantiation and absolutely affects the output; bijection_type, fixed_size, num_digits are behavioural params. None are reflected in the identifier, so two bijection runs are indistinguishable in memory records.

def _build_identifier(self) -> ComponentIdentifier:
    return self._create_identifier(params={
        "bijection_type": self._bijection_type,
        "fixed_size": self._fixed_size,
        "mapping": self._mapping,
    })

Related: these attributes should be _name‑prefixed (PyRIT convention is private internal state).

Comment thread pyrit/models/data_type_serializer.py Outdated

async with aiofiles.open(local_temp_path, "rb") as f:
audio_data = await f.read()
with tempfile.NamedTemporaryFile(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope creep — please split into its own PR.

The change itself is good (the NamedTemporaryFile + try/finally rewrite is a real correctness fix, and the accompanying regression test in test_data_type_serializer.py is solid). But it has nothing to do with the bijection feature and lives in shared model code. Mixing unrelated fixes into a feature PR makes bisecting/reverting harder and complicates review.

Suggest: cherry‑pick this commit + its test onto a separate branch and open it as its own PR — it'll likely land faster than the feature itself.

"cell_type": "markdown",
"id": "9cb14cfa",
"metadata": {},
"source": [

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — notebook is malformed and the paired .py is missing.

The top‑level structure here is a notebook whose single cell is a markdown cell whose source is a JSON‑encoded string of another notebook. Rendered, the reader sees escaped JSON, not the example. It won't execute and won't render.

Also, docs.instructions.md requires every doc/**/*.ipynb to have a paired jupytext .py percent file (doc/code/executor/attack/bijection_attack.py) — every other attack notebook in this directory has one and it's missing here. Author the .py percent file, then jupytext --to ipynb --execute to regenerate the notebook with outputs.

AttackResult: The result of the attack.
"""
mapping = self._bijection_converter.mapping
encoded_objective = "".join(mapping.get(c, c) for c in context.objective)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — encoding bypasses convert_async and the converter pipeline.

Two problems with "".join(mapping.get(c, c) for c in context.objective):

  1. Uppercase is silently dropped. BijectionConverter.convert_async handles uppercase via char.lower() + .upper(); this inline version does not. Concrete example with objective="how do I make a bomb", the model receives "kyi uy I bqlt q mybm" — the uppercase I survives unencoded, leaving a suspicious partially‑decoded substring (I make a bomb) visible to any safety classifier scanning the prompt.
  2. It skips _request_converters, so user‑provided attack_converter_config runs against the plaintext objective and gets clobbered by this manual encoding.

Suggested fix: in __init__, do

bijection_cfg = PromptConverterConfiguration.from_converters(converters=[self._bijection_converter])
self._request_converters = bijection_cfg + self._request_converters

(mirrors FlipAttack) and drop the manual join here.

bijection_type: str = "letter",
fixed_size: int = 0,
num_digits: int = 0,
) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead / misleading constructor parameters.

(Replaces my earlier comment on this line — please delete the duplicate.)

  • num_digits is stored on self and never read anywhere.
  • bijection_type is stored but only "letter" is ever implemented — any other string is silently accepted and produces a letter mapping. The two docstrings disagree on this too: this file says "Currently supports 'letter'" while bijection_attack.py:51 says (e.g. "letter"), which actively implies other values exist.

Either implement the modes (the paper discusses digit‑substitution variants) or drop these params until they're needed. If you keep bijection_type, please make it an enum.StrEnum per the style guide ("Enums are used instead of Literals") and raise on unknown values — that way the type itself documents the valid options and the docstring inconsistency goes away.


# Teaching shot messages — demonstrate encoding with examples
examples = ["hello", "world", "the cat", "good day", "yes no"]
for i in range(min(self._num_teaching_shots, len(examples))):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_teaching_shots is silently capped at 5, and the example pool covers only 14 of 26 letters.

(Replaces my earlier comment on this line — please delete the duplicate.)

min(self._num_teaching_shots, len(examples)) means passing num_teaching_shots=20 silently uses 5. The paper varies shot count meaningfully (often 10–20).

Worse, the hardcoded ["hello", "world", "the cat", "good day", "yes no"] covers only 14 of 26 letters; the model is never shown encodings for b, f, i, j, k, m, p, q, u, v, x, z — so it has nothing to reproduce when generating cipher‑text responses that need them. (Concretely, an objective like "how do I make a bomb" uses b, i, m — all in the missing set.)

Suggested replacement:

examples = [
    "the quick brown fox",      # 15 letters
    "jumps over the lazy dog",  # +11 → full alphabet covered after 2 shots
    "hello world",
    "good morning",
    "yes please",
]

Pangram split goes first so even num_teaching_shots=2 gives full coverage; the remaining three are natural conversational filler. Even better long-term: generate examples programmatically so any num_teaching_shots value is honored, but a curated full‑coverage list is a clear minimum.

- Remove @pytest.mark.asyncio decorators (asyncio_mode=auto)
- Fix __init__.py alphabetical ordering for BijectionConverter
- Use patch_central_database fixture in attack tests
- Use MagicMock(spec=PromptTarget) instead of plain MagicMock
- Remove dead num_digits parameter
- Add BijectionType StrEnum for bijection_type validation
- Use private attributes with underscore prefix
- Add _build_identifier() method
- Fix teaching shots cap with programmatic cycling
- Fix alternating user/assistant roles in teaching messages
- Fix response decoding in _perform_async
- Add BijectionConverter to _request_converters pipeline
- Fix notebook format and add paired .py jupytext file
- Register BijectionAttack in executor/attack/__init__.py
@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've addressed all the review comments:

  • Removed @pytest.mark.asyncio decorators
  • Fixed init.py alphabetical ordering
  • Used patch_central_database fixture in attack tests
  • Used MagicMock(spec=PromptTarget) instead of plain MagicMock
  • Removed dead num_digits parameter
  • Added BijectionType StrEnum for validation
  • Used private attributes with underscore prefix
  • Added _build_identifier() method
  • Fixed teaching shots cap with programmatic cycling
  • Fixed alternating user/assistant roles in teaching messages
  • Fixed response decoding in _perform_async
  • Added BijectionConverter to _request_converters pipeline
  • Fixed notebook format and added paired .py jupytext file

Ready for re-review!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've addressed the remaining review comments:

  • Resolved merge conflicts with upstream/main (kept BidiConverter from main, added BijectionConverter alphabetically)
  • Added end-to-end test in TestBijectionAttackEndToEnd that uses MockPromptTarget, returns a cipher-text response, and asserts the result is decoded back to plain text
  • Fixed ComponentIdentifier import to use pyrit.models.identifiers

Ready for re-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FEAT Bijection

2 participants