@textfilters/profanity

Profanity filtering primitives for composable text moderation.

Installation

Add the GitHub Packages registry for the @textfilters scope to your .npmrc:

@textfilters:registry=https://npm.pkg.github.com

Install with GitHub npm authentication configured. GitHub Packages requires authentication for npm installs, including public packages.

npm install @textfilters/core @textfilters/profanity

Usage

Quick Start

import { createProfanityFilter, filter } from "@textfilters/profanity";

const safeText = filter.censor("message text");
const hasProfanity = filter.check("message text");
const matches = filter.analyze("message text");

const tenantFilter = createProfanityFilter(["strict-term"], ["loose-term"]);
const tenantSafeText = tenantFilter.censor("message text");

The default shared instance is exported as filter and uses the built-in strict and loose term lists. It is mutable through setStrict, setLoose, addStrict, and addLoose, so changes affect later calls that use the same shared instance.

Use createProfanityFilter(...) when per-request, per-tenant, or test-local dictionaries must be isolated from the shared mutable filter.

API

filter.analyze(text, options?): ProfanityMatchRange[]

Returns accepted match ranges as UTF-16 offsets into the original input. Each range is an array-like [start, end] value with mode and optional rule metadata:

const matches = filter.analyze("blocked text");

for (const match of matches) {
  console.log(match[0], match[1], match.mode);
  console.log(match.ruleId, match.category, match.severity);
}

ruleId, category, and severity are present when the matched rule has taxonomy metadata. Built-in Russian dictionary rules include semantic rule ids and taxonomy metadata. Runtime string terms remain unclassified and omit those fields unless callers provide structured runtime rules with metadata.

Taxonomy options can narrow matches to rules with specific metadata:

const vulgarMatches = filter.analyze("blocked text", {
  categories: ["VULGAR"],
});

const highSeverityMatches = filter.analyze("blocked text", {
  severities: ["high"],
});

const mediumOrHigherMatches = filter.analyze("blocked text", {
  minSeverity: "medium",
});

const hasHighSeverityMatch = filter.check("blocked text", {
  severities: ["high"],
});

const censoredVulgarText = filter.censor("blocked text", {
  categories: ["VULGAR"],
  minSeverity: "low",
});

Severity thresholds use this package-defined order: soft < low < medium < high. minSeverity matches rules whose severity is equal to or stronger than the requested threshold, and applies only to rules that carry taxonomy metadata. When both severities and minSeverity are provided, a match must satisfy both: its severity must appear in the severities set and must be at or above the threshold. When categories is combined with severity filters, a match must satisfy every requested taxonomy filter.

Taxonomy metadata-backed filters only match rules where the requested metadata is available. Omitting taxonomy options preserves the default matching behavior.

The taxonomy filtering contract is:

  • categories, severities, and minSeverity are exposed on ProfanityMatchOptions.
  • Calls without taxonomy options keep the same default analyze(), check(), and censor() behavior.
  • Taxonomy filters exclude metadata-less string-backed matches.
  • categories combined with severities is an intersection.
  • categories combined with minSeverity is an intersection.
  • severities combined with minSeverity is the intersection between the exact severity set and the threshold.
  • The severity order is soft < low < medium < high.
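The contract above can be sketched as a standalone predicate. This is a minimal illustration, not the package's implementation: RuleMetadata, TaxonomyOptions, SEVERITY_ORDER, and passesTaxonomyFilters are hypothetical names, and treating an empty filter array as "match nothing" is an assumption the contract does not address.

```typescript
// Hypothetical local types; the package exports its own ProfanitySeverity and
// ProfanityMatchOptions, which may differ in detail.
type Severity = "soft" | "low" | "medium" | "high";

interface RuleMetadata {
  category?: string;
  severity?: Severity;
}

interface TaxonomyOptions {
  categories?: string[];
  severities?: Severity[];
  minSeverity?: Severity;
}

// Package-defined order from the contract: soft < low < medium < high.
const SEVERITY_ORDER: readonly Severity[] = ["soft", "low", "medium", "high"];

function passesTaxonomyFilters(
  rule: RuleMetadata,
  options: TaxonomyOptions = {},
): boolean {
  const { categories, severities, minSeverity } = options;

  // Each requested filter must pass, so combined filters form an intersection.
  if (categories !== undefined) {
    if (rule.category === undefined || !categories.includes(rule.category)) {
      return false;
    }
  }
  if (severities !== undefined) {
    if (rule.severity === undefined || !severities.includes(rule.severity)) {
      return false;
    }
  }
  if (minSeverity !== undefined) {
    // Metadata-less rules never satisfy a threshold.
    if (rule.severity === undefined) return false;
    const actual = SEVERITY_ORDER.indexOf(rule.severity);
    const required = SEVERITY_ORDER.indexOf(minSeverity);
    if (actual < required) return false;
  }

  // With no taxonomy options, every rule passes (default behavior).
  return true;
}
```

Under this sketch, a rule with severity "low" fails { severities: ["low", "high"], minSeverity: "medium" } because it is in the exact set but below the threshold, while a rule with severity "high" passes both checks.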

For taxonomy-backed rules, runtime match output includes the available metadata:

const strict = createProfanityFilter(
  [{ source: "абв", category: "STRONG_INSULT", severity: "medium" }],
  [],
);

strict.analyze("абв ok");
// [Object.assign([0, 3], {
//   mode: "strict",
//   category: "STRONG_INSULT",
//   severity: "medium",
// })]
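Because each range is an array with extra properties, it destructures like a tuple but loses its metadata under JSON.stringify, which serializes only the indexed elements of an array. The sketch below builds a range of the same shape by hand to show both effects. It does not call the package: MatchRange, RangeMetadata, makeRange, and toPlainObject are hypothetical names, and the "loose" mode value is assumed by analogy with "strict".

```typescript
// Hypothetical local shape mirroring the documented array-like range.
type Mode = "strict" | "loose";

interface RangeMetadata {
  mode: Mode;
  ruleId?: string;
  category?: string;
  severity?: string;
}

type MatchRange = [number, number] & RangeMetadata;

// Builds a range the same way the documented output is written:
// an offsets array with metadata properties assigned onto it.
function makeRange(
  start: number,
  end: number,
  metadata: RangeMetadata,
): MatchRange {
  const offsets: [number, number] = [start, end];
  return Object.assign(offsets, metadata);
}

// Copies offsets and metadata into a plain object that survives JSON.
function toPlainObject(range: MatchRange) {
  const [start, end] = range;
  return {
    start,
    end,
    mode: range.mode,
    ruleId: range.ruleId,
    category: range.category,
    severity: range.severity,
  };
}

const exampleInput = "абв ok";
const exampleRange = makeRange(0, 3, {
  mode: "strict",
  category: "STRONG_INSULT",
  severity: "medium",
});

// Offsets are UTF-16 indices, so String.prototype.slice recovers the match.
const matchedText = exampleInput.slice(exampleRange[0], exampleRange[1]);

// Array serialization drops the non-index properties.
const serializedRange = JSON.stringify(exampleRange);

// A plain object keeps the metadata for logs or API responses.
const serializedPlain = JSON.stringify(toPlainObject(exampleRange));
```

If match results need to cross a JSON boundary, converting to a plain object first, as toPlainObject does here, is one way to keep mode, category, and severity.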

filter.censor(text, options?): string

Returns a censored copy of text. Matching is performed on a normalized same-length copy of the input, and mask ranges are applied back to the original UTF-16 string. Taxonomy options censor only matching metadata-backed ranges.
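The idea of matching on a same-length normalized copy and masking the original can be illustrated with a small standalone sketch. It is not the package's pipeline: normalizeSameLength, findLiteralRanges, maskRanges, and censorLiteral are hypothetical helpers, the "*" mask character is an assumption, and the sketch skips token-boundary checks entirely.

```typescript
// Lowercases one UTF-16 code unit at a time and keeps the original unit
// whenever lowercasing would change its length, so offsets stay aligned.
function normalizeSameLength(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const unit = text[i];
    const lower = unit.toLowerCase();
    out += lower.length === 1 ? lower : unit;
  }
  return out;
}

// Finds non-overlapping occurrences of a literal term in the normalized copy.
// The returned offsets are valid for the original string as well.
function findLiteralRanges(text: string, term: string): Array<[number, number]> {
  const haystack = normalizeSameLength(text);
  const needle = normalizeSameLength(term);
  const ranges: Array<[number, number]> = [];
  if (needle.length === 0) return ranges;

  let from = haystack.indexOf(needle);
  while (from !== -1) {
    ranges.push([from, from + needle.length]);
    from = haystack.indexOf(needle, from + needle.length);
  }
  return ranges;
}

// Replaces every UTF-16 code unit inside each range with a single-unit mask,
// which keeps the JavaScript string length unchanged.
function maskRanges(
  text: string,
  ranges: ReadonlyArray<readonly [number, number]>,
  maskChar = "*",
): string {
  const units = text.split(""); // one array entry per UTF-16 code unit
  for (const [start, end] of ranges) {
    for (let i = start; i < end && i < units.length; i++) {
      units[i] = maskChar;
    }
  }
  return units.join("");
}

function censorLiteral(text: string, term: string): string {
  return maskRanges(text, findLiteralRanges(text, term));
}
```

In this sketch an astral code point such as an emoji occupies two code units, so masking it produces two mask characters and the output length still equals the input length.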

filter.check(text, options?): boolean

Returns true when the current filter instance would censor at least one range. Use this when a boolean moderation decision is enough and the masked text is not needed. Taxonomy options apply the same match narrowing as analyze().

createProfanityFilter(strict?, loose?): ProfanityFilter

Creates a new mutable filter instance. Without arguments it uses compiled views of the built-in Russian dictionary. Passing arrays replaces that side with runtime dictionary terms:

const strictOnly = createProfanityFilter(["blocked"], []);
const looseOnly = createProfanityFilter([], ["banned"]);
const builtIn = createProfanityFilter();

All filter instances expose stable name: "profanity" plus check, censor, analyze, setStrict, setLoose, addStrict, and addLoose.

Language Dictionaries

The package exports a minimal language dictionary API for callers that need an isolated filter built from a maintained language dictionary:

import {
  createProfanityFilterFromDictionary,
  russianProfanityDictionary,
  validateProfanityLanguageDictionary,
  type ProfanityLanguageDictionary,
} from "@textfilters/profanity";

const dictionary: ProfanityLanguageDictionary = russianProfanityDictionary;

// Validate first so an invalid dictionary is rejected before a filter is built from it.
const issues = validateProfanityLanguageDictionary(dictionary);

if (issues.length > 0) {
  throw new Error(JSON.stringify(issues, null, 2));
}

const russianFilter = createProfanityFilterFromDictionary(dictionary);

russianFilter.analyze("message text");

createProfanityFilterFromDictionary(dictionary) compiles strict and loose views from the dictionary and returns a mutable ProfanityFilter instance. The instance is isolated from the shared filter export, so later calls to setStrict, setLoose, addStrict, or addLoose affect only that instance.

Dictionary-backed matches preserve semantic rule ids, categories, and severities in analyze() output, and taxonomy filters apply to those metadata fields. Runtime dictionary terms remain normalized literals; language dictionaries are the supported boundary for maintained language-specific rule data. This release intentionally keeps the public surface small and does not add new languages or separate packages.

validateProfanityLanguageDictionary(dictionary) checks the source dictionary contract and returns stable issues with path, code, and message fields. Valid dictionaries return []; ordinary validation errors are reported as issues instead of thrown exceptions. The validator does not judge moderation quality, false-positive behavior, language coverage, taxonomy choices, or whether a rule should exist.

The package also includes a small CLI for validating a JSON source dictionary:

profanity-validate-language-dictionary path/to/profanity.json

The command exits 0 for valid dictionaries, 1 when validation issues are found, and 2 for usage, file read, or JSON parse errors. Validation issue output includes the same stable path, code, and message fields as the programmatic validator.

Text output is the default:

Dictionary validation failed:
- rules[0].source source_not_trimmed: Rule source must not include leading or trailing whitespace.

Machine-readable JSON output is available for CI and authoring tools:

profanity-validate-language-dictionary --format json --pretty path/to/profanity.json

The JSON report always includes ok, file, issueCount, issues, and summary. Validation failures exit 1 and print the report to stdout with stable issue objects:

{
  "ok": false,
  "file": "path/to/profanity.json",
  "issueCount": 1,
  "issues": [
    {
      "path": "rules[0].source",
      "code": "source_not_trimmed",
      "message": "Rule source must not include leading or trailing whitespace."
    }
  ],
  "summary": {
    "status": "invalid",
    "message": "Dictionary validation failed with 1 issue."
  }
}
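A CI script that consumes this report might type it and map the documented exit codes as in the sketch below. The field names come from the sample report above. The interfaces, parseReport, formatIssueLines, and classifyExitCode are hypothetical helpers, not package exports, and the summary object may carry fields beyond the two shown.

```typescript
// Hypothetical types derived from the documented JSON report fields.
interface ValidationIssue {
  path: string;
  code: string;
  message: string;
}

interface ValidationReport {
  ok: boolean;
  file: string;
  issueCount: number;
  issues: ValidationIssue[];
  summary: { status: string; message: string };
}

function isValidationReport(value: unknown): value is ValidationReport {
  if (typeof value !== "object" || value === null) return false;
  const report = value as Record<string, unknown>;
  return (
    typeof report.ok === "boolean" &&
    typeof report.file === "string" &&
    typeof report.issueCount === "number" &&
    Array.isArray(report.issues) &&
    typeof report.summary === "object" &&
    report.summary !== null
  );
}

// Parses the CLI's stdout and rejects anything that lacks the documented fields.
function parseReport(stdout: string): ValidationReport {
  const value: unknown = JSON.parse(stdout);
  if (!isValidationReport(value)) {
    throw new Error("Unexpected validation report shape");
  }
  return value;
}

// Renders issues in the same "- path code: message" layout as the text output.
function formatIssueLines(report: ValidationReport): string[] {
  return report.issues.map(
    (issue) => `- ${issue.path} ${issue.code}: ${issue.message}`,
  );
}

type ExitKind = "valid" | "invalid" | "usage-or-io-error" | "unknown";

// Documented exit codes: 0 valid, 1 validation issues, 2 usage/read/parse errors.
function classifyExitCode(code: number): ExitKind {
  switch (code) {
    case 0:
      return "valid";
    case 1:
      return "invalid";
    case 2:
      return "usage-or-io-error";
    default:
      return "unknown";
  }
}
```

Separating exit code 1 from exit code 2 lets a pipeline distinguish "the dictionary has problems" from "the validator could not run", which usually call for different failure messages.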

For future external language pack guidance, see the language pack authoring guide. It covers source dictionary shape, stable ids, taxonomy metadata, strict and loose views, human-maintained JSON, and conformance expectations. The external language pack policy defines when the project is ready to create a real external package and keeps the built-in Russian dictionary in this package for now.

Taxonomy Metadata Types

The package also exports type-only taxonomy metadata names for callers that need to type local metadata alongside profanity filtering code:

import type {
  ProfanityCategory,
  ProfanityMatchRange,
  ProfanitySeverity,
  ProfanityTaxonomyMetadata,
} from "@textfilters/profanity";

const ranges: ProfanityMatchRange[] = filter.analyze("message text");
const category: ProfanityCategory = "VULGAR";
const severity: ProfanitySeverity = "high";

const metadata: ProfanityTaxonomyMetadata = {
  category,
  severity,
};

filter.analyze() exposes taxonomy metadata on match ranges when the matched rule carries it. Taxonomy options are optional, so check() results, censor() output, and mutable dictionary methods keep their existing behavior when those options are omitted.

Strict Vs Loose

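| Mode   | Runtime term example | Matches                        | Does not match         |
|--------|----------------------|--------------------------------|------------------------|
| Strict | bad                  | bad as a full normalized token | badminton, _bad, -bad  |
| Loose  | bad                  | bad, b-a-d, b a d              | prefixes inside words  |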

Strict matching is token-oriented. Loose matching allows separators between letters, then still applies token-boundary checks before masking.
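The difference between the two modes can be approximated with regular expressions. This is a rough sketch that reproduces only the table rows above, not the package's matcher: strictPattern, loosePattern, and escapeRegExp are hypothetical helpers, and the choices of boundary characters (letters, digits, underscore, hyphen) and separators (whitespace, hyphen) are guesses consistent with those rows.

```typescript
// Escapes regex syntax characters so a term is matched literally.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed token boundary: no letter, digit, underscore, or hyphen
// immediately before or after the match.
const BOUNDARY_BEFORE = "(?<![\\p{L}\\p{N}_-])";
const BOUNDARY_AFTER = "(?![\\p{L}\\p{N}_-])";

// Strict: the term must appear as a whole token.
function strictPattern(term: string): RegExp {
  return new RegExp(BOUNDARY_BEFORE + escapeRegExp(term) + BOUNDARY_AFTER, "u");
}

// Loose: optional separators are allowed between letters, and the same
// boundary checks still apply around the whole match.
function loosePattern(term: string): RegExp {
  const spaced = Array.from(term).map(escapeRegExp).join("[\\s-]*");
  return new RegExp(BOUNDARY_BEFORE + spaced + BOUNDARY_AFTER, "u");
}

const strictBad = strictPattern("bad");
const looseBad = loosePattern("bad");

const strictResults = {
  token: strictBad.test("so bad"),
  insideWord: strictBad.test("badminton"),
  underscorePrefix: strictBad.test("_bad"),
  hyphenPrefix: strictBad.test("-bad"),
};

const looseResults = {
  plain: looseBad.test("bad"),
  hyphenated: looseBad.test("b-a-d"),
  spaced: looseBad.test("b a d"),
  prefixInsideWord: looseBad.test("badminton"),
};
```

The boundary assertions wrap the whole loose pattern, which is why a spaced-out match is still rejected when it starts or ends inside a longer word.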

Runtime Dictionary Terms

Runtime dictionary terms are normalized literals, not regular expressions. A term such as foo|bar matches the literal text foo|bar, not foo or bar. Escaped punctuation from older literal spellings is accepted, so foo\\.bar matches the literal text foo.bar.
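To make the literal-versus-regex distinction concrete, the sketch below contrasts the two interpretations of foo|bar and shows one way escaped punctuation could be folded back to its literal form. It is an illustration only: unescapeLegacyPunctuation and containsLiteral are hypothetical helpers, the ASCII-only unescape rule is an assumption, and real matching also applies normalization and token boundaries that are omitted here.

```typescript
// Drops a backslash that precedes a punctuation character, so an older
// escaped spelling such as foo\.bar becomes the literal foo.bar.
// Assumption: only non-alphanumeric, non-whitespace ASCII characters are unescaped.
function unescapeLegacyPunctuation(term: string): string {
  return term.replace(/\\([^A-Za-z0-9\s])/g, "$1");
}

// Literal containment check: the term is never compiled as a pattern.
function containsLiteral(text: string, term: string): boolean {
  const literal = unescapeLegacyPunctuation(term).toLowerCase();
  return literal.length > 0 && text.toLowerCase().includes(literal);
}

// As a literal, foo|bar only matches the exact text foo|bar.
const literalMatchesExact = containsLiteral("x foo|bar y", "foo|bar");
const literalMatchesFooAlone = containsLiteral("just foo here", "foo|bar");

// As a regular expression, the same string would be an alternation.
const regexMatchesFooAlone = new RegExp("foo|bar").test("just foo here");

// In source code the term is written "foo\\.bar", which is the text foo\.bar.
const escapedDotMatches = containsLiteral("see foo.bar now", "foo\\.bar");
```

The contrast between literalMatchesFooAlone and regexMatchesFooAlone is the behavior difference described above: a literal term cannot accidentally widen into an alternation.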

The built-in Russian dictionary is different: package-owned data may use controlled internal rules to represent existing behavior compactly. The JSON dictionary is the human-maintained source of truth; strict and loose entries are compiled matcher views, not serialized matcher output. That internal rule syntax is not part of the public API and is not applied to runtime dictionaries.

Built-in internal rules can also carry compact, meaningful compiler metadata, such as loose stretch matching for repeated word-like atoms. Language-specific roots, aliases, guards, morphology, taxonomy, loose behavior, and false-positive protections belong in the Russian dictionary profile; generated rule ids and matcher ordering are owned by the generic compilation layer.

Generated built-in rule ids are diagnostic metadata, not stable policy or allowlist keys. They may change when the package-owned corpus is reorganized into different compiled matcher views.

Known Limitations And Behavior Notes

  • Censored output preserves JavaScript string length, including astral code points.
  • Ranges are UTF-16 offsets into the original source string.
  • Runtime dictionaries do not support caller-provided regular expressions.
  • Runtime string terms do not receive taxonomy metadata.
  • The shared filter instance is mutable; use createProfanityFilter() for isolated state.
  • Built-in corpus behavior is intentionally locked by compatibility tests.

Compatibility And Intentional Changes

This package keeps the built-in corpus behavior covered by compatibility tests.

Intentional public-package changes:

  • Runtime dictionary terms are treated as normalized literals, not arbitrary regular expressions.
  • Built-in package-owned rules use an internal rule compiler that is not exposed to callers.
  • The filter exposes stable name: "profanity".
  • The filter exposes analyze(text, options?): ProfanityMatchRange[] for accepted match ranges and optional taxonomy metadata.
  • The filter exposes check(text, options?): boolean for boolean-only detection.
  • createProfanityFilter() without arguments creates an instance with compiled views of the built-in Russian dictionary.
  • Masking preserves JavaScript string length for astral code points.

Architecture

See the architecture guide for the matching pipeline, Mermaid diagrams, and the rationale behind the strict separation between runtime literals and internal corpus rules.

See the invariants guide for a short maintenance checklist covering normalization, source ranges, boundaries, loose matching, false-positive locks, and hyphen-tail behavior.

Release

Releases are managed by Release Please from Conventional Commit history on main. When a Release Please release is created, the workflow runs npm run check and publishes the package to GitHub Packages. Release tags keep the v* pattern.

The package is prepared for publication to GitHub Packages, not the public npm registry.

Contributing

See CONTRIBUTING.md for pull request scope guidance.

License

MIT
