Summary
Preprocess already-scraped llvm/llvm-project issue data (including comments) into document objects (content + metadata) ready for Pinecone upsert. No scraping or GitHub API calls.
Scope
- Input: scraped issue data for llvm/llvm-project (with comments).
- Output: documents with content and metadata (e.g. repo, number, state, author, created_at, labels, url).
- In scope: parse/validate, normalize text, include comments in content/chunking, document schema.
- Out of scope: GitHub fetch; Pinecone API.
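As a rough illustration of the in-scope steps (validate, normalize text, fold comments into content, emit the document schema), here is a minimal Python sketch. The input key names (title, body, comments, author, and so on) and the helper names normalize_text and issue_to_document are assumptions made for illustration; the actual scraped payload layout is not specified here.

```python
import re
from typing import Any, Dict, Optional


def normalize_text(text: Optional[str]) -> str:
    """Unify line endings, strip trailing whitespace, collapse blank-line runs."""
    if not text:
        return ""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def issue_to_document(issue: Dict[str, Any], repo: str = "llvm/llvm-project") -> Dict[str, Any]:
    """Turn one scraped issue (hypothetical layout) into {content, metadata}."""
    # Minimal validation: these two keys are assumed to be mandatory.
    if "number" not in issue or "title" not in issue:
        raise ValueError("issue payload must include 'number' and 'title'")

    parts = [normalize_text(issue["title"]), normalize_text(issue.get("body"))]

    # Comments are appended to the content so they are embedded with the issue.
    for comment in issue.get("comments", []):
        body = normalize_text(comment.get("body"))
        if body:
            parts.append(f"Comment by {comment.get('author', 'unknown')}:\n{body}")

    metadata = {
        "repo": repo,
        "number": int(issue["number"]),
        "state": issue.get("state", ""),
        "author": issue.get("author", ""),
        "created_at": issue.get("created_at", ""),
        "labels": list(issue.get("labels", [])),
        "url": issue.get("url", ""),
    }
    return {"content": "\n\n".join(p for p in parts if p), "metadata": metadata}
```

Chunking of long content is left out of this sketch; it would apply to the content string before upsert.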
Result
Library or CLI: scraped payload(s) → list of { content, metadata }. Config for field mapping/truncation. Code, tests, and doc schema README.
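One possible shape for the field mapping/truncation config is sketched below in Python. PreprocessConfig, build_documents, the default mapping (e.g. reading author from a "user" key), and the 8000-character limit are all hypothetical choices for illustration, not part of the described deliverable.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PreprocessConfig:
    """Hypothetical config: metadata key -> key in the scraped payload."""
    repo: str = "llvm/llvm-project"
    field_map: Dict[str, str] = field(default_factory=lambda: {
        "number": "number",
        "state": "state",
        "author": "user",        # assumed payload key
        "created_at": "created_at",
        "labels": "labels",
        "url": "html_url",       # assumed payload key
    })
    max_content_chars: int = 8000  # assumed truncation limit


def truncate(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters, dropping trailing whitespace."""
    return text if len(text) <= limit else text[:limit].rstrip()


def build_documents(payloads: List[Dict[str, Any]], config: PreprocessConfig) -> List[Dict[str, Any]]:
    """Map scraped payloads to a list of {content, metadata} documents."""
    documents = []
    for payload in payloads:
        # Content here is title plus body; empty or missing parts are skipped.
        text = "\n\n".join(s for s in (payload.get("title"), payload.get("body")) if s)
        metadata: Dict[str, Any] = {"repo": config.repo}
        for meta_key, source_key in config.field_map.items():
            if source_key in payload:  # fields absent from the payload are omitted
                metadata[meta_key] = payload[source_key]
        documents.append({
            "content": truncate(text, config.max_content_chars),
            "metadata": metadata,
        })
    return documents
```

Keeping the mapping in config rather than in code means a change in the scraper's key names only requires a config update.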
Acceptance criteria
content + metadata.