From 23ba94215bb69fac1cb783dbadc015e2255dbd24 Mon Sep 17 00:00:00 2001 From: Dachary Carey Date: Thu, 2 Jul 2026 10:22:04 -0400 Subject: [PATCH 1/2] feat(generate): add generate llms command for per-project llms.txt Add a `generate llms` command that produces an llms.txt file for each documentation project, supporting a progressive-disclosure setup where a master llms.txt links to each project's own llms.txt. For each project's current version (and non-versioned projects), it enumerates pages, extracts the page title and meta :description:, resolves the production URL (with .md appended), and writes content//llms.txt. After writing, it prints a per-project character-count summary (with and without descriptions) flagging files over the 50k llms.txt guideline. Details: - Root landing pages use the /index.md markdown form (no .md). - Snooty {+name+} substitutions in titles and descriptions are resolved from the project's snooty.toml [constants]. - Pages without a description omit the trailing ": description". - includes/ and code-examples/ dirs and the deprecated app-services and realm projects are excluded. - --no-descriptions flag omits descriptions from written files. Add internal/rst meta-description and page-title parsers, snooty constants parsing + ResolveSubstitutions, tests, and README documentation. 
--- README.md | 58 +++++ commands/generate/generate.go | 30 +++ commands/generate/llms/generator.go | 279 +++++++++++++++++++++++ commands/generate/llms/generator_test.go | 72 ++++++ commands/generate/llms/llms.go | 133 +++++++++++ internal/rst/meta_parser.go | 94 ++++++++ internal/rst/meta_parser_test.go | 91 ++++++++ internal/rst/page_title.go | 61 +++++ internal/rst/page_title_test.go | 72 ++++++ internal/snooty/snooty.go | 38 ++- internal/snooty/snooty_test.go | 61 +++++ main.go | 2 + 12 files changed, 988 insertions(+), 3 deletions(-) create mode 100644 commands/generate/generate.go create mode 100644 commands/generate/llms/generator.go create mode 100644 commands/generate/llms/generator_test.go create mode 100644 commands/generate/llms/llms.go create mode 100644 internal/rst/meta_parser.go create mode 100644 internal/rst/meta_parser_test.go create mode 100644 internal/rst/page_title.go create mode 100644 internal/rst/page_title_test.go diff --git a/README.md b/README.md index 5513f37..ce61b82 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,7 @@ A Go CLI tool for performing audit-related tasks in the MongoDB documentation mo - [Count Commands](#count-commands) - [Report Commands](#report-commands) - [Resolve Commands](#resolve-commands) + - [Generate Commands](#generate-commands) - [Development](#development) - [Project Structure](#project-structure) - [Adding New Commands](#adding-new-commands) @@ -1832,6 +1833,63 @@ The command supports all projects defined in the documentation monorepo's table- - Connectors (Kafka, Spark, BI Connector) - And many more +### Generate Commands + +#### `generate llms` + +Generate a per-project `llms.txt` file for every documentation project. + +This command supports an [`llms.txt`](https://llmstxt.org/) progressive-disclosure setup: a master `llms.txt` acts like a sitemap that links to each project's own `llms.txt`, which in turn lists that project's pages. 
For each project, this command enumerates the pages of its **current version** (and non-versioned projects), extracts each page's title and `meta` description, resolves its production URL (with `.md` appended), and writes the project's `llms.txt`. + +Each line follows the standard format: + +``` +- [Page Title](https://www.mongodb.com/docs/manual/core/document.md): Definition, structure, and limitations of documents in MongoDB. +``` + +**Behavior details:** + +- **Version scope:** Only the current version of each project is included, plus projects that are not versioned. Older versions and `upcoming` are skipped. +- **Root landing pages:** A project's root landing page has no `.md` markdown form; its markdown lives at `/index.md`, so that form is emitted. Nested section index pages resolve to the normal `
.md` form. +- **Missing descriptions:** Pages without a `meta` `:description:` are emitted without the trailing `: description`. +- **Substitutions:** Snooty constant references (`{+name+}`) in titles and descriptions are resolved from the project's `snooty.toml` `[constants]`. +- **Excluded content:** Partial (`includes/`) and `code-examples/` directories are skipped since they are not standalone pages. The deprecated `app-services` and `realm` projects, along with non-project directories (`404`, `docs-platform`, `meta`, `table-of-contents`), are excluded. +- **Character-count summary:** After writing the files, the command prints a per-project table showing the character count both **with** and **without** descriptions, flagging any file that exceeds the 50,000-character `llms.txt` guideline. This helps decide whether descriptions fit for larger projects. + +**Basic Usage:** + +```bash +# Generate llms.txt for all projects (uses the configured monorepo path) +./audit-cli generate llms +# Writes files to ./llms-output//llms.txt and prints a summary + +# Generate for a single project +./audit-cli generate llms --for-project atlas + +# Omit descriptions (useful for oversized projects or while iterating on docs) +./audit-cli generate llms --for-project cloud-docs --no-descriptions + +# Point at a specific monorepo and output directory +./audit-cli generate llms /path/to/docs-mongodb-internal --output-dir build/llms +``` + +**Flags:** + +- `--output-dir ` - Directory to write per-project `llms.txt` files into (default: `llms-output`) +- `--for-project ` - Limit generation to a single project (content directory name) +- `--no-descriptions` - Omit `meta` descriptions from the written files +- `--base-url ` - Base URL for production documentation (default: `https://www.mongodb.com/docs`) + +**Output layout:** + +``` +llms-output/ + atlas/llms.txt + manual/llms.txt + node/llms.txt + ... 
+``` + ## Development ### Project Structure diff --git a/commands/generate/generate.go b/commands/generate/generate.go new file mode 100644 index 0000000..455b520 --- /dev/null +++ b/commands/generate/generate.go @@ -0,0 +1,30 @@ +// Package generate provides the parent command for generating documentation artifacts. +// +// This package serves as the parent command for generation operations. +// Currently supports: +// - llms: Generate per-project llms.txt files +package generate + +import ( + "github.com/grove-platform/audit-cli/commands/generate/llms" + "github.com/spf13/cobra" +) + +// NewGenerateCommand creates the generate parent command. +// +// This command serves as a parent for various generation operations. +// It doesn't perform any operations itself but provides a namespace for subcommands. +func NewGenerateCommand() *cobra.Command { + cmd := &cobra.Command{ + Use: "generate", + Short: "Generate documentation artifacts", + Long: `Generate artifacts derived from the documentation monorepo. + +Currently supports: + - llms: Generate per-project llms.txt files for progressive disclosure`, + } + + cmd.AddCommand(llms.NewLLMSCommand()) + + return cmd +} diff --git a/commands/generate/llms/generator.go b/commands/generate/llms/generator.go new file mode 100644 index 0000000..166dfd8 --- /dev/null +++ b/commands/generate/llms/generator.go @@ -0,0 +1,279 @@ +// Package llms provides generation of per-project llms.txt files. +package llms + +import ( + "fmt" + "os" + "path/filepath" + "sort" + "strings" + "unicode/utf8" + + resolveurl "github.com/grove-platform/audit-cli/commands/resolve/url" + "github.com/grove-platform/audit-cli/internal/projectinfo" + "github.com/grove-platform/audit-cli/internal/rst" + "github.com/grove-platform/audit-cli/internal/snooty" +) + +// CharLimit is the maximum recommended size (in characters) for an llms.txt file. 
+const CharLimit = 50000 + +// defaultExclusions are content-directory children that are not real docs +// projects and should never produce an llms.txt file. +var defaultExclusions = map[string]bool{ + "404": true, + "docs-platform": true, + "meta": true, + "table-of-contents": true, + "code-examples": true, + // Deprecated projects: no useful content for agents. + "app-services": true, + "realm": true, +} + +// PageEntry holds the data needed to render one llms.txt line. +type PageEntry struct { + Title string + URL string // production URL with .md appended + Description string // meta description ("" if none) + SourcePath string +} + +// ProjectResult is the outcome of generating one project's llms.txt. +type ProjectResult struct { + Project string + Version string // "" for non-versioned projects + Pages []PageEntry + OutputPath string + CharsWith int // character count including descriptions + CharsNoDesc int // character count omitting descriptions + MissingDesc int // number of pages lacking a meta description +} + +// Options configures a generation run. +type Options struct { + MonorepoPath string + BaseURL string + OutputDir string + ForProject string // limit to a single content-dir name; "" for all + NoDescriptions bool // omit descriptions from the written files +} + +// Generate builds llms.txt files for the current + non-versioned pages of each +// documentation project and writes them under opts.OutputDir. It returns one +// ProjectResult per project processed. 
+func Generate(opts Options) ([]*ProjectResult, error) { + contentDir := filepath.Join(opts.MonorepoPath, "content") + if _, err := os.Stat(contentDir); err != nil { + return nil, fmt.Errorf("content directory not found: %s", contentDir) + } + + entries, err := os.ReadDir(contentDir) + if err != nil { + return nil, fmt.Errorf("failed to read content directory: %w", err) + } + + var results []*ProjectResult + for _, entry := range entries { + if !entry.IsDir() { + continue + } + project := entry.Name() + if defaultExclusions[project] { + continue + } + if opts.ForProject != "" && project != opts.ForProject { + continue + } + + projectDir := filepath.Join(contentDir, project) + sourceDir, version, err := currentSourceDir(projectDir) + if err != nil { + return nil, fmt.Errorf("project %s: %w", project, err) + } + if sourceDir == "" { + // No resolvable current source directory; skip. + continue + } + + result, err := generateProject(project, version, sourceDir, opts) + if err != nil { + return nil, fmt.Errorf("project %s: %w", project, err) + } + if result != nil { + results = append(results, result) + } + } + + sort.Slice(results, func(i, j int) bool { + return results[i].Project < results[j].Project + }) + return results, nil +} + +// currentSourceDir returns the source directory to use for a project along with +// its version label. Non-versioned projects (content//source) return +// an empty version. Versioned projects return the current version's source dir. +func currentSourceDir(projectDir string) (sourceDir string, version string, err error) { + // Non-versioned project. + directSource := filepath.Join(projectDir, "source") + if info, statErr := os.Stat(directSource); statErr == nil && info.IsDir() { + return directSource, "", nil + } + + // Versioned project: pick the current version. 
+ versions, err := projectinfo.DiscoverAllVersions(projectDir) + if err != nil || len(versions) == 0 { + return "", "", nil + } + for _, v := range versions { + if projectinfo.IsCurrentVersion(v) { + candidate := filepath.Join(projectDir, v, "source") + if info, statErr := os.Stat(candidate); statErr == nil && info.IsDir() { + return candidate, v, nil + } + } + } + return "", "", nil +} + +// generateProject collects pages for a single project and writes its llms.txt. +func generateProject(project, version, sourceDir string, opts Options) (*ProjectResult, error) { + var pages []PageEntry + + // Load the project's snooty constants once so {+name+} substitutions in + // page titles can be resolved. Absence of constants is not fatal. + constants := loadConstants(sourceDir) + + err := filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error { + if err != nil { + return err + } + if info.IsDir() { + // Skip partial/include and code-example directories: these are not + // standalone pages and don't have their own production URLs. + name := info.Name() + if name == "includes" || name == "code-examples" { + return filepath.SkipDir + } + return nil + } + if filepath.Ext(path) != ".txt" { + return nil + } + + url, err := resolveurl.ResolveFileToURL(path, opts.BaseURL) + if err != nil { + // A page we can't map to a URL isn't useful in llms.txt; skip it. + return nil + } + + title, err := rst.ExtractPageTitle(path) + if err != nil { + return err + } + if title == "" { + // Without a title there's nothing meaningful to link; skip. + return nil + } + title = snooty.ResolveSubstitutions(title, constants) + + description, err := rst.ExtractMetaDescription(path) + if err != nil { + return err + } + description = snooty.ResolveSubstitutions(description, constants) + + // The project's root landing page has no ".md" markdown form; its + // markdown lives at "/index.md" instead. Nested section index + // pages already resolve to a normal "
.md" URL. + isRootIndex := path == filepath.Join(sourceDir, "index.txt") + + pages = append(pages, PageEntry{ + Title: title, + URL: toMarkdownURL(url, isRootIndex), + Description: description, + SourcePath: path, + }) + return nil + }) + if err != nil { + return nil, err + } + + if len(pages) == 0 { + return nil, nil + } + + sort.Slice(pages, func(i, j int) bool { + return pages[i].URL < pages[j].URL + }) + + result := &ProjectResult{ + Project: project, + Version: version, + Pages: pages, + CharsWith: utf8.RuneCountInString(renderContent(project, pages, true)), + CharsNoDesc: utf8.RuneCountInString(renderContent(project, pages, false)), + } + for _, p := range pages { + if p.Description == "" { + result.MissingDesc++ + } + } + + // Write the file. + content := renderContent(project, pages, !opts.NoDescriptions) + outPath := filepath.Join(opts.OutputDir, project, "llms.txt") + if err := os.MkdirAll(filepath.Dir(outPath), 0o755); err != nil { + return nil, err + } + if err := os.WriteFile(outPath, []byte(content), 0o644); err != nil { + return nil, err + } + result.OutputPath = outPath + + return result, nil +} + +// loadConstants reads the substitution constants from the project's snooty.toml, +// which sits in the directory containing the source directory. Returns nil if +// the file is missing or cannot be parsed. +func loadConstants(sourceDir string) map[string]string { + snootyPath := filepath.Join(filepath.Dir(sourceDir), "snooty.toml") + if _, err := os.Stat(snootyPath); err != nil { + return nil + } + config, err := snooty.ParseFile(snootyPath) + if err != nil { + return nil + } + return config.Constants +} + +// renderContent builds the llms.txt body. When withDesc is true, descriptions +// are appended as ": " when present. 
+func renderContent(project string, pages []PageEntry, withDesc bool) string { + var b strings.Builder + b.WriteString(fmt.Sprintf("# %s\n\n", project)) + for _, p := range pages { + if withDesc && p.Description != "" { + b.WriteString(fmt.Sprintf("- [%s](%s): %s\n", p.Title, p.URL, p.Description)) + } else { + b.WriteString(fmt.Sprintf("- [%s](%s)\n", p.Title, p.URL)) + } + } + return b.String() +} + +// toMarkdownURL converts a production page URL to its Markdown (.md) form. +// +// Regular pages and nested section indexes use the ".md" form (the URL +// with its trailing slash replaced by ".md"). The project's root landing page +// has no ".md" form and instead uses "/index.md". +func toMarkdownURL(url string, isRootIndex bool) string { + if isRootIndex { + return strings.TrimSuffix(url, "/") + "/index.md" + } + return strings.TrimSuffix(url, "/") + ".md" +} diff --git a/commands/generate/llms/generator_test.go b/commands/generate/llms/generator_test.go new file mode 100644 index 0000000..8294531 --- /dev/null +++ b/commands/generate/llms/generator_test.go @@ -0,0 +1,72 @@ +package llms + +import ( + "strings" + "testing" +) + +func TestToMarkdownURL(t *testing.T) { + tests := []struct { + name string + url string + isRootIndex bool + want string + }{ + { + name: "regular page", + url: "https://www.mongodb.com/docs/manual/core/document/", + want: "https://www.mongodb.com/docs/manual/core/document.md", + }, + { + name: "nested section index", + url: "https://www.mongodb.com/docs/manual/crud/", + want: "https://www.mongodb.com/docs/manual/crud.md", + }, + { + name: "root landing page uses index.md", + url: "https://www.mongodb.com/docs/manual/", + isRootIndex: true, + want: "https://www.mongodb.com/docs/manual/index.md", + }, + { + name: "versioned root landing page", + url: "https://www.mongodb.com/docs/atlas/cli/current/", + isRootIndex: true, + want: "https://www.mongodb.com/docs/atlas/cli/current/index.md", + }, + } + + for _, tt := range tests { + 
t.Run(tt.name, func(t *testing.T) { + got := toMarkdownURL(tt.url, tt.isRootIndex) + if got != tt.want { + t.Errorf("toMarkdownURL(%q, %v) = %q, want %q", tt.url, tt.isRootIndex, got, tt.want) + } + }) + } +} + +func TestRenderContent(t *testing.T) { + pages := []PageEntry{ + {Title: "Documents", URL: "https://ex.com/a.md", Description: "About documents."}, + {Title: "No Desc", URL: "https://ex.com/b.md", Description: ""}, + } + + withDesc := renderContent("manual", pages, true) + if !strings.Contains(withDesc, "- [Documents](https://ex.com/a.md): About documents.") { + t.Errorf("expected description line, got:\n%s", withDesc) + } + // A page without a description must not emit a trailing ": ". + if !strings.Contains(withDesc, "- [No Desc](https://ex.com/b.md)\n") || + strings.Contains(withDesc, "- [No Desc](https://ex.com/b.md):") { + t.Errorf("page without description should have no trailing colon, got:\n%s", withDesc) + } + + noDesc := renderContent("manual", pages, false) + if strings.Contains(noDesc, "About documents.") { + t.Errorf("descriptions should be omitted, got:\n%s", noDesc) + } + if !strings.HasPrefix(noDesc, "# manual\n\n") { + t.Errorf("expected project header, got:\n%s", noDesc) + } +} diff --git a/commands/generate/llms/llms.go b/commands/generate/llms/llms.go new file mode 100644 index 0000000..ae984f2 --- /dev/null +++ b/commands/generate/llms/llms.go @@ -0,0 +1,133 @@ +package llms + +import ( + "fmt" + "os" + "text/tabwriter" + + "github.com/grove-platform/audit-cli/internal/config" + "github.com/spf13/cobra" +) + +// NewLLMSCommand creates the "generate llms" subcommand. +// +// Usage: +// +// generate llms [monorepo-path] [flags] +// +// It generates one llms.txt per documentation project (current + non-versioned +// pages), then prints a summary of each file's character count both with and +// without meta descriptions, flagging any that exceed the 50k character limit. 
+func NewLLMSCommand() *cobra.Command { + var baseURL string + var outputDir string + var forProject string + var noDescriptions bool + + cmd := &cobra.Command{ + Use: "llms [monorepo-path]", + Short: "Generate per-project llms.txt files", + Long: `Generate an llms.txt file for each documentation project. + +For every project under the monorepo's content/ directory, this command +enumerates the pages of its current version (and non-versioned projects), +extracts each page's title and meta description, resolves its production URL +(with .md appended), and writes a project llms.txt in the format: + + - [Page Title](https://www.mongodb.com/docs/manual/core/document.md): Description. + +After generating the files it prints a summary showing each project's character +count both WITH and WITHOUT descriptions, flagging any file over 50,000 +characters so you can decide whether descriptions fit for larger projects. + +Pages without a meta description are emitted without the trailing ": description". 
+ +Examples: + # Generate for all projects using the configured monorepo path + generate llms + + # Generate for a single project + generate llms --for-project atlas + + # Omit descriptions (useful for oversized projects) + generate llms --for-project cloud-docs --no-descriptions`, + Args: cobra.MaximumNArgs(1), + RunE: func(cmd *cobra.Command, args []string) error { + cmdLineArg := "" + if len(args) == 1 { + cmdLineArg = args[0] + } + monorepoPath, err := config.GetMonorepoPath(cmdLineArg) + if err != nil { + return err + } + + results, err := Generate(Options{ + MonorepoPath: monorepoPath, + BaseURL: baseURL, + OutputDir: outputDir, + ForProject: forProject, + NoDescriptions: noDescriptions, + }) + if err != nil { + return err + } + return printSummary(results, outputDir, noDescriptions) + }, + } + + cmd.Flags().StringVar(&baseURL, "base-url", "https://www.mongodb.com/docs", "Base URL for production documentation") + cmd.Flags().StringVar(&outputDir, "output-dir", "llms-output", "Directory to write per-project llms.txt files into") + cmd.Flags().StringVar(&forProject, "for-project", "", "Limit generation to a single project (content directory name)") + cmd.Flags().BoolVar(&noDescriptions, "no-descriptions", false, "Omit meta descriptions from the written files") + + return cmd +} + +// printSummary writes the per-project character-count report to stdout. 
+func printSummary(results []*ProjectResult, outputDir string, noDescriptions bool) error { + if len(results) == 0 { + fmt.Println("No projects generated (no matching content found).") + return nil + } + + fmt.Printf("Generated %d llms.txt file(s) in %s/\n\n", len(results), outputDir) + + w := tabwriter.NewWriter(os.Stdout, 0, 4, 2, ' ', 0) + fmt.Fprintln(w, "PROJECT\tVERSION\tPAGES\tNO_DESC\tCHARS(w/ desc)\tCHARS(no desc)\tOVER 50k?") + + var over []string + for _, r := range results { + version := r.Version + if version == "" { + version = "-" + } + flag := "" + // Which count applies to the file we actually wrote? + written := r.CharsWith + if noDescriptions { + written = r.CharsNoDesc + } + if written > CharLimit { + flag = "YES" + over = append(over, r.Project) + } + fmt.Fprintf(w, "%s\t%s\t%d\t%d\t%d\t%d\t%s\n", + r.Project, version, len(r.Pages), r.MissingDesc, r.CharsWith, r.CharsNoDesc, flag) + } + if err := w.Flush(); err != nil { + return err + } + + if len(over) > 0 { + fmt.Printf("\n%d project(s) exceed the %d-character limit for the written files: %v\n", + len(over), CharLimit, over) + if !noDescriptions { + fmt.Println("Consider re-running these with --no-descriptions, or compare the CHARS(no desc) column above.") + } + } else { + fmt.Printf("\nAll files are within the %d-character limit.\n", CharLimit) + } + + return nil +} diff --git a/internal/rst/meta_parser.go b/internal/rst/meta_parser.go new file mode 100644 index 0000000..548c911 --- /dev/null +++ b/internal/rst/meta_parser.go @@ -0,0 +1,94 @@ +// Package rst provides utilities for parsing reStructuredText documentation files. +package rst + +import ( + "bufio" + "os" + "strings" +) + +// ExtractMetaDescription reads the value of the :description: field from the +// first ".. meta::" directive in an RST file. +// +// The meta directive looks like: +// +// .. meta:: +// :robots: noindex, nosnippet +// :description: A short summary of the page. 
+// +// The description value may wrap across multiple indented continuation lines, +// which are joined with single spaces. +// +// Parameters: +// - filePath: Path to the source .txt file +// +// Returns: +// - string: The description text, or an empty string if there is no meta +// directive or no :description: field +// - error: Error if the file cannot be read +func ExtractMetaDescription(filePath string) (string, error) { + file, err := os.Open(filePath) + if err != nil { + return "", err + } + defer file.Close() + + scanner := bufio.NewScanner(file) + // Allow long lines (default token size can be too small for long descriptions). + scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) + + inMeta := false + collecting := false + var parts []string + + for scanner.Scan() { + line := scanner.Text() + trimmed := strings.TrimSpace(line) + + if !inMeta { + if strings.HasPrefix(trimmed, ".. meta::") { + inMeta = true + } + continue + } + + // Inside the meta directive. + indented := line != "" && (line[0] == ' ' || line[0] == '\t') + + // A blank line does not end the directive on its own, but it does end a + // multi-line description value. + if trimmed == "" { + if collecting { + break + } + continue + } + + // A non-indented, non-blank line ends the meta directive block. + if !indented { + break + } + + if collecting { + // Continuation lines are indented more deeply and are not new options. + if strings.HasPrefix(trimmed, ":") { + break + } + parts = append(parts, trimmed) + continue + } + + // Look for the :description: option. 
+ if strings.HasPrefix(trimmed, ":description:") { + value := strings.TrimSpace(strings.TrimPrefix(trimmed, ":description:")) + parts = append(parts, value) + collecting = true + } + } + + if err := scanner.Err(); err != nil { + return "", err + } + + return strings.TrimSpace(strings.Join(parts, " ")), nil +} diff --git a/internal/rst/meta_parser_test.go b/internal/rst/meta_parser_test.go new file mode 100644 index 0000000..8c983d5 --- /dev/null +++ b/internal/rst/meta_parser_test.go @@ -0,0 +1,91 @@ +package rst + +import ( + "os" + "path/filepath" + "testing" +) + +func writeTempFile(t *testing.T, content string) string { + t.Helper() + dir := t.TempDir() + path := filepath.Join(dir, "page.txt") + if err := os.WriteFile(path, []byte(content), 0o644); err != nil { + t.Fatalf("failed to write temp file: %v", err) + } + return path +} + +func TestExtractMetaDescription(t *testing.T) { + tests := []struct { + name string + content string + want string + }{ + { + name: "description present with other options", + content: `.. meta:: + :robots: noindex, nosnippet + :description: Definition and structure of documents. + +==== +Docs +==== +`, + want: "Definition and structure of documents.", + }, + { + name: "description only", + content: `.. meta:: + :description: A short summary. + +Title +===== +`, + want: "A short summary.", + }, + { + name: "multi-line description is joined", + content: `.. meta:: + :description: This description wraps + across multiple lines. + +Title +===== +`, + want: "This description wraps across multiple lines.", + }, + { + name: "no meta directive", + content: `Title +===== + +Some content. +`, + want: "", + }, + { + name: "meta without description", + content: `.. 
meta:: + :robots: noindex + +Title +===== +`, + want: "", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + path := writeTempFile(t, tt.content) + got, err := ExtractMetaDescription(path) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if got != tt.want { + t.Errorf("ExtractMetaDescription() = %q, want %q", got, tt.want) + } + }) + } +} diff --git a/internal/rst/page_title.go b/internal/rst/page_title.go new file mode 100644 index 0000000..a1a1bef --- /dev/null +++ b/internal/rst/page_title.go @@ -0,0 +1,61 @@ +package rst + +import ( + "os" + "strings" +) + +// ExtractPageTitle returns the page's H1 title from an RST file. +// +// It finds the first section heading, supporting both underline-only headings: +// +// Page Title +// ========== +// +// and overline+underline headings: +// +// ========== +// Page Title +// ========== +// +// Directive lines (starting with "..") and RST field/option lines (starting +// with ":") are not considered valid titles. +// +// Parameters: +// - filePath: Path to the source .txt file +// +// Returns: +// - string: The title text, or an empty string if no heading is found +// - error: Error if the file cannot be read +func ExtractPageTitle(filePath string) (string, error) { + content, err := os.ReadFile(filePath) + if err != nil { + return "", err + } + + lines := strings.Split(string(content), "\n") + for i, line := range lines { + if !isHeadingUnderline(strings.TrimSpace(line)) { + continue + } + + // The title is the immediately preceding non-empty text line. + if i == 0 { + continue + } + candidate := strings.TrimSpace(lines[i-1]) + if candidate == "" { + continue + } + // Skip directives, field lists, and overline rows. 
+ if strings.HasPrefix(candidate, "..") || strings.HasPrefix(candidate, ":") { + continue + } + if isHeadingUnderline(candidate) { + continue + } + return candidate, nil + } + + return "", nil +} diff --git a/internal/rst/page_title_test.go b/internal/rst/page_title_test.go new file mode 100644 index 0000000..ab04d46 --- /dev/null +++ b/internal/rst/page_title_test.go @@ -0,0 +1,72 @@ +package rst + +import "testing" + +func TestExtractPageTitle(t *testing.T) { + tests := []struct { + name string + content string + want string + }{ + { + name: "underline-only heading", + content: `Documents +========= + +Body text. +`, + want: "Documents", + }, + { + name: "overline and underline heading", + content: `========= +Documents +========= + +Body text. +`, + want: "Documents", + }, + { + name: "title after meta directive", + content: `.. meta:: + :description: A summary. + +================================ +Rotate Keys for Sharded Clusters +================================ +`, + want: "Rotate Keys for Sharded Clusters", + }, + { + name: "skips directives and field lists", + content: `.. default-domain:: mongodb + +My Page Title +============= +`, + want: "My Page Title", + }, + { + name: "no heading", + content: `.. include:: /includes/foo.rst + +Just a paragraph with no heading. 
+`, + want: "", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + path := writeTempFile(t, tt.content) + got, err := ExtractPageTitle(path) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if got != tt.want { + t.Errorf("ExtractPageTitle() = %q, want %q", got, tt.want) + } + }) + } +} diff --git a/internal/snooty/snooty.go b/internal/snooty/snooty.go index 7adc0b8..a3b3122 100644 --- a/internal/snooty/snooty.go +++ b/internal/snooty/snooty.go @@ -11,12 +11,43 @@ import ( "fmt" "os" "path/filepath" + "regexp" "strings" "github.com/BurntSushi/toml" "github.com/grove-platform/audit-cli/internal/projectinfo" ) +// substitutionPattern matches snooty constant references of the form {+name+}. +var substitutionPattern = regexp.MustCompile(`\{\+\s*([^+}]+?)\s*\+\}`) + +// ResolveSubstitutions replaces snooty constant references ({+name+}) in text +// with their values from the given constants map. References to unknown +// constants are left unchanged. Nested references (a constant whose value +// contains another {+...+}) are resolved up to a small fixed depth. +func ResolveSubstitutions(text string, constants map[string]string) string { + if len(constants) == 0 || !strings.Contains(text, "{+") { + return text + } + for i := 0; i < 5; i++ { + if !strings.Contains(text, "{+") { + break + } + replaced := substitutionPattern.ReplaceAllStringFunc(text, func(match string) string { + name := strings.TrimSpace(substitutionPattern.FindStringSubmatch(match)[1]) + if value, ok := constants[name]; ok { + return value + } + return match + }) + if replaced == text { + break + } + text = replaced + } + return text +} + // Composable represents a composable definition from a snooty.toml file. type Composable struct { ID string `toml:"id"` @@ -34,9 +65,10 @@ type ComposableOption struct { // Config represents the structure of a snooty.toml file. 
type Config struct { - Name string `toml:"name"` - Title string `toml:"title"` - Composables []Composable `toml:"composables"` + Name string `toml:"name"` + Title string `toml:"title"` + Composables []Composable `toml:"composables"` + Constants map[string]string `toml:"constants"` } // ParseFile parses a snooty.toml file and returns its configuration. diff --git a/internal/snooty/snooty_test.go b/internal/snooty/snooty_test.go index 9f78969..5d0aa40 100644 --- a/internal/snooty/snooty_test.go +++ b/internal/snooty/snooty_test.go @@ -314,3 +314,64 @@ func TestIsCurrentVersion(t *testing.T) { } } + +func TestResolveSubstitutions(t *testing.T) { + constants := map[string]string{ + "atlas-cli": "Atlas CLI", + "atlas-admin-api": "Atlas Administration API", + "nested": "prefix {+atlas-cli+}", + } + + tests := []struct { + name string + text string + constants map[string]string + want string + }{ + { + name: "single substitution", + text: "What is the {+atlas-cli+}?", + constants: constants, + want: "What is the Atlas CLI?", + }, + { + name: "multiple substitutions", + text: "Use the {+atlas-admin-api+} from the {+atlas-cli+}", + constants: constants, + want: "Use the Atlas Administration API from the Atlas CLI", + }, + { + name: "unknown constant left unchanged", + text: "Value of {+unknown+} here", + constants: constants, + want: "Value of {+unknown+} here", + }, + { + name: "nested substitution resolved", + text: "{+nested+}", + constants: constants, + want: "prefix Atlas CLI", + }, + { + name: "no substitutions", + text: "Plain title", + constants: constants, + want: "Plain title", + }, + { + name: "nil constants", + text: "What is the {+atlas-cli+}?", + constants: nil, + want: "What is the {+atlas-cli+}?", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got := ResolveSubstitutions(tt.text, tt.constants) + if got != tt.want { + t.Errorf("ResolveSubstitutions() = %q, want %q", got, tt.want) + } + }) + } +} diff --git a/main.go b/main.go 
index f374df0..adf8942 100644
--- a/main.go
+++ b/main.go
@@ -20,6 +20,7 @@ import (
 	"github.com/grove-platform/audit-cli/commands/compare"
 	"github.com/grove-platform/audit-cli/commands/count"
 	"github.com/grove-platform/audit-cli/commands/extract"
+	"github.com/grove-platform/audit-cli/commands/generate"
 	"github.com/grove-platform/audit-cli/commands/report"
 	"github.com/grove-platform/audit-cli/commands/resolve"
 	"github.com/grove-platform/audit-cli/commands/search"
@@ -58,6 +59,7 @@ Designed for maintenance tasks, scoping work, and reporting to stakeholders.`,
 	rootCmd.AddCommand(count.NewCountCommand())
 	rootCmd.AddCommand(report.NewReportCommand())
 	rootCmd.AddCommand(resolve.NewResolveCommand())
+	rootCmd.AddCommand(generate.NewGenerateCommand())
 
 	err := rootCmd.Execute()
 	if err != nil {

From 713b387c3d069d9b7ac2a6db5c7e962a3272ad3d Mon Sep 17 00:00:00 2001
From: Dachary Carey
Date: Thu, 2 Jul 2026 10:27:23 -0400
Subject: [PATCH 2/2] chore(release): bump version to 0.4.0 and update CHANGELOG

Document the generate llms command under a new 0.4.0 release (which also
folds in the previously-unreleased resolve url command) and bump the
version constant in main.go. Also fix the 0.3.0 release date (2026-01-07,
was incorrectly 2025-01-07).
---
 CHANGELOG.md | 30 +++++++++++++++++++++++++++++-
 main.go      |  2 +-
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index dd43514..ffb8691 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,8 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.4.0] - 2026-07-02
+
 ### Added
 
+#### Generate Commands
+
+- `generate llms` - Generate a per-project `llms.txt` file for progressive disclosure
+  - Enumerates each project's current-version (and non-versioned) pages
+  - Extracts the page title (H1) and `meta` `:description:` for each page
+  - Resolves the production URL with `.md` appended
+  - Writes `//llms.txt` and prints a per-project
+    character-count summary (with and without descriptions), flagging files
+    over the 50,000-character `llms.txt` guideline
+  - Root landing pages use the `/index.md` markdown form
+  - Resolves snooty `{+name+}` substitutions in titles and descriptions from
+    the project's `snooty.toml` `[constants]`
+  - Excludes `includes/` and `code-examples/` directories and the deprecated
+    `app-services` and `realm` projects
+  - Flags:
+    - `--output-dir` - Directory to write files into (default: `llms-output`)
+    - `--for-project` - Limit generation to a single project
+    - `--no-descriptions` - Omit `meta` descriptions from the written files
+    - `--base-url` - Override the default base URL (default: `https://www.mongodb.com/docs`)
+
 #### Resolve Commands
 
 - `resolve url` - Resolve documentation source files to production URLs
@@ -21,7 +43,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Flags:
     - `--base-url` - Override the default base URL (default: `https://www.mongodb.com/docs`)
 
-## [0.3.0] - 2025-01-07
+#### Internal Packages
+
+- `internal/rst/meta_parser.go` - Extract the `:description:` field from a page's `.. meta::` directive
+- `internal/rst/page_title.go` - Extract a page's H1 title (underline-only and overline+underline styles)
+- `internal/snooty` - Parse `[constants]` and resolve `{+name+}` substitutions (`ResolveSubstitutions`)
+
+## [0.3.0] - 2026-01-07
 
 ### Added
 
diff --git a/main.go b/main.go
index adf8942..f0eda48 100644
--- a/main.go
+++ b/main.go
@@ -29,7 +29,7 @@ import (
 
 // version is the current version of audit-cli.
 // Update this when releasing new versions following semantic versioning.
-const version = "0.3.0"
+const version = "0.4.0"
 
 func main() {
 	var rootCmd = &cobra.Command{