Terraphim Knowledge Graph System

Overview

The Terraphim Knowledge Graph (KG) system provides semantic search by building thesauri from markdown files and ranking results with a graph-based algorithm. It converts synonym relationships into graph structures, so a query that matches any synonym surfaces the documents linked to the underlying concept, which improves search relevance and discoverability.

Architecture Components

Core Components

  1. Logseq Builder - Extracts synonyms from markdown files using synonyms:: syntax
  2. Thesaurus - Maps synonyms to normalized concept terms with unique IDs
  3. RoleGraph - Graph structure with nodes, edges, and documents for ranking
  4. TerraphimGraph Relevance Function - Graph-based scoring algorithm
  5. Knowledge Graph Local - Local markdown file processing for KG construction

Knowledge Graph Construction

Source Files

Knowledge graphs are built from markdown files in docs/src/kg/:

docs/src/kg/
├── terraphim-graph.md    # Graph architecture concepts
├── service.md            # Service definitions
├── haystack.md           # Haystack integration
├── bug-reporting.md      # Bug reporting terminology and structured analysis
├── issue-tracking.md     # Domain-specific issue tracking terminology
└── [additional KG files]

Synonym Syntax

Markdown files use the synonyms:: syntax to define concept relationships:

# Terraphim-graph

## Terraphim Graph scorer

Terraphim Graph (scorer) is using unique graph embeddings.

[example] synonyms:: graph embeddings, graph, knowledge graph based embeddings

Now we will have a concept "Terraphim Graph Scorer" with synonyms.
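To make the syntax concrete, here is a minimal sketch of how a synonyms:: line could be split into individual terms. The function name parse_synonyms and the lowercasing step are assumptions made for illustration; the actual Logseq builder may parse and normalize terms differently.

```rust
/// Extract the terms from a line containing the `synonyms::` marker.
/// Returns None when the marker is absent.
/// Hypothetical helper for illustration, not part of the Terraphim API.
fn parse_synonyms(line: &str) -> Option<Vec<String>> {
    // Everything after the marker is treated as a comma-separated list.
    let (_, rest) = line.split_once("synonyms::")?;
    let terms = rest
        .split(',')
        .map(|term| term.trim().to_lowercase()) // lowercasing is an assumption
        .filter(|term| !term.is_empty())
        .collect();
    Some(terms)
}

fn main() {
    let line = "synonyms:: graph embeddings, graph, knowledge graph based embeddings";
    for term in parse_synonyms(line).unwrap_or_default() {
        println!("{term}");
    }
}
```

Text before the marker is ignored, so a prefixed line such as the example above is handled the same way as a bare synonyms:: line.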

Thesaurus Construction

The Logseq builder processes markdown files to create thesaurus mappings:

let logseq_builder = Logseq::default();
let thesaurus = logseq_builder
    .build("Terraphim Engineer".to_string(), kg_path)
    .await?;

Example thesaurus output:

'terraphim-graph' -> 'terraphim-graph' (ID: 3)
'graph embeddings' -> 'terraphim-graph' (ID: 3)
'graph' -> 'terraphim-graph' (ID: 3)
'knowledge graph based embeddings' -> 'terraphim-graph' (ID: 3)
'haystack' -> 'haystack' (ID: 1)
'service' -> 'service' (ID: 2)
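The mapping above can be pictured as a plain lookup table in which every synonym, and the concept name itself, points at one shared concept record. The sketch below uses only the standard library; the Concept struct, the build_thesaurus function, and the sequential ID assignment are assumptions for illustration and do not reflect the real Thesaurus type.

```rust
use std::collections::HashMap;

/// A normalized concept: canonical name plus numeric ID (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Concept {
    id: u64,
    value: String,
}

/// Map each concept name and each of its synonyms to the same Concept.
/// IDs are assigned sequentially from 1 in input order (an assumption).
fn build_thesaurus(entries: &[(&str, Vec<&str>)]) -> HashMap<String, Concept> {
    let mut thesaurus = HashMap::new();
    for (index, (name, synonyms)) in entries.iter().enumerate() {
        let concept = Concept {
            id: index as u64 + 1,
            value: name.to_string(),
        };
        thesaurus.insert(name.to_string(), concept.clone());
        for synonym in synonyms.iter() {
            thesaurus.insert(synonym.to_string(), concept.clone());
        }
    }
    thesaurus
}

fn main() {
    let entries = vec![
        ("haystack", vec![]),
        ("service", vec![]),
        (
            "terraphim-graph",
            vec!["graph embeddings", "graph", "knowledge graph based embeddings"],
        ),
    ];
    let thesaurus = build_thesaurus(&entries);

    // Sort for stable output, since HashMap iteration order is unspecified.
    let mut terms: Vec<_> = thesaurus.iter().collect();
    terms.sort_by(|a, b| a.0.cmp(b.0));
    for (term, concept) in terms {
        println!("'{}' -> '{}' (ID: {})", term, concept.value, concept.id);
    }
}
```

Because all synonyms resolve to the same ID, a later lookup only needs the matched term to find the concept node.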

Graph Structure

RoleGraph Components

The RoleGraph converts thesaurus data into searchable graph structures:

pub struct RoleGraph {
    pub role: RoleName,
    nodes: AHashMap<u64, Node>,           // Concept nodes
    edges: AHashMap<u64, Edge>,           // Connections between concepts
    documents: AHashMap<String, IndexedDocument>, // Indexed content
    pub thesaurus: Thesaurus,             // Synonym mappings
    pub ac: AhoCorasick,                  // Fast pattern matching
}

Node Structure

Each node represents a concept with connections:

pub struct Node {
    pub id: u64,                    // Unique concept ID
    pub rank: u64,                  // Importance score
    pub connected_with: HashSet<u64>, // Edge IDs connecting this node
}

Edge Structure

Edges connect concepts and track document associations:

pub struct Edge {
    pub id: u64,                           // Unique edge ID
    pub rank: u64,                         // Connection strength
    pub doc_hash: HashMap<String, u64>,    // Documents referencing this edge
}

Search and Ranking Algorithm

TerraphimGraph Relevance Function

The TerraphimGraph relevance function uses graph structure for ranking:

  1. Pattern Matching - Find synonym matches in query text using Aho-Corasick
  2. Node Discovery - Map matched terms to concept nodes via thesaurus
  3. Edge Traversal - Follow connections between related concepts
  4. Rank Calculation - Combine node rank + edge rank + document rank
  5. Result Aggregation - Sort by total rank and return top results

Ranking Formula

let total_rank = node.rank + edge.rank + document_rank;

The ranking rewards:

  • Concept Importance (node.rank) - How central the concept is
  • Connection Strength (edge.rank) - How strongly concepts are related
  • Document Relevance (document_rank) - How relevant the document is

Query Processing

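pub fn query_graph(
    &self,
    query_string: &str,
    offset: Option<usize>,
    limit: Option<usize>,
) -> Result<Vec<(String, IndexedDocument)>> {
    // Simplified excerpt: error handling and entry construction are abbreviated.

    // 1. Find matching node IDs using Aho-Corasick
    let node_ids = self.find_matching_node_ids(query_string);

    // Accumulates one IndexedDocument per document ID
    let mut results: AHashMap<String, IndexedDocument> = AHashMap::new();

    // 2. Traverse graph structure for each matched node
    for node_id in node_ids {
        // Unknown IDs are skipped here for brevity
        let Some(node) = self.nodes.get(&node_id) else { continue };

        // 3. Follow edges to find connected documents
        for edge_id in &node.connected_with {
            let Some(edge) = self.edges.get(edge_id) else { continue };

            // 4. Calculate combined ranking
            for (document_id, document_rank) in &edge.doc_hash {
                let total_rank = node.rank + edge.rank + document_rank;
                // Aggregate total_rank into results[document_id]...
            }
        }
    }

    // 5. Sort by rank, then apply offset and limit
    let mut ranked_documents: Vec<(String, IndexedDocument)> = results.into_iter().collect();
    ranked_documents.sort_by_key(|(_, doc)| std::cmp::Reverse(doc.rank));
    let documents = ranked_documents
        .into_iter()
        .skip(offset.unwrap_or(0))
        .take(limit.unwrap_or(usize::MAX))
        .collect();
    Ok(documents)
}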
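The traversal can be tried end to end with a miniature, standard-library-only model. This is a sketch under stated assumptions, not the RoleGraph implementation: MiniGraph, sample_graph, and all rank values are invented for illustration, a naive substring scan stands in for Aho-Corasick, matched node IDs are deduplicated, and ranks for the same document are summed across edges.

```rust
use std::collections::HashMap;

/// Concept node: importance score plus the IDs of the edges that touch it.
struct Node {
    rank: u64,
    connected_with: Vec<u64>,
}

/// Edge: connection strength plus a rank for each document referencing it.
struct Edge {
    rank: u64,
    doc_hash: HashMap<String, u64>,
}

/// Std-only stand-in for a role graph (hypothetical type).
struct MiniGraph {
    terms: HashMap<String, u64>, // term -> node ID, replaces thesaurus + matcher
    nodes: HashMap<u64, Node>,
    edges: HashMap<u64, Edge>,
}

impl MiniGraph {
    /// Return (document ID, accumulated rank) pairs, highest rank first.
    fn query(&self, query: &str, offset: usize, limit: usize) -> Vec<(String, u64)> {
        let query = query.to_lowercase();
        // 1. Naive substring matching (assumption: replaces Aho-Corasick).
        let mut node_ids: Vec<u64> = self
            .terms
            .iter()
            .filter(|(term, _)| query.contains(term.as_str()))
            .map(|(_, id)| *id)
            .collect();
        node_ids.sort_unstable();
        node_ids.dedup();

        // 2-4. Walk node -> edge -> document and sum node + edge + document rank.
        let mut totals: HashMap<String, u64> = HashMap::new();
        for node_id in node_ids {
            let Some(node) = self.nodes.get(&node_id) else { continue };
            for edge_id in &node.connected_with {
                let Some(edge) = self.edges.get(edge_id) else { continue };
                for (doc_id, doc_rank) in &edge.doc_hash {
                    *totals.entry(doc_id.clone()).or_insert(0) +=
                        node.rank + edge.rank + doc_rank;
                }
            }
        }

        // 5. Sort by rank (descending), break ties by ID, then paginate.
        let mut ranked: Vec<(String, u64)> = totals.into_iter().collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        ranked.into_iter().skip(offset).take(limit).collect()
    }
}

/// Tiny fixture with made-up ranks.
fn sample_graph() -> MiniGraph {
    MiniGraph {
        terms: HashMap::from([
            ("terraphim-graph".to_string(), 3),
            ("graph".to_string(), 3),
            ("haystack".to_string(), 1),
        ]),
        nodes: HashMap::from([
            (3, Node { rank: 10, connected_with: vec![100, 101] }),
            (1, Node { rank: 2, connected_with: vec![101] }),
        ]),
        edges: HashMap::from([
            (100, Edge { rank: 5, doc_hash: HashMap::from([("doc-a".to_string(), 1)]) }),
            (101, Edge {
                rank: 3,
                doc_hash: HashMap::from([("doc-a".to_string(), 2), ("doc-b".to_string(), 1)]),
            }),
        ]),
    }
}

fn main() {
    // Prints "doc-a: 31" then "doc-b: 14".
    for (doc_id, rank) in sample_graph().query("graph", 0, 10) {
        println!("{doc_id}: {rank}");
    }
}
```

In this fixture doc-a is reachable through two edges of the matched node, so it collects (10 + 5 + 1) + (10 + 3 + 2) = 31, while doc-b collects 10 + 3 + 1 = 14. That is the sense in which well-connected concepts lift the documents attached to them.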

Performance Characteristics

Search Performance

Based on comprehensive testing:

  • Initial KG State: 10 terms, 3 nodes, 5 edges
  • Query Response: Consistent rank 34 for "terraphim-graph"
  • Search Speed: Fast pattern matching with Aho-Corasick
  • Memory Efficiency: Compact graph representation

Ranking Improvement

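In the test reported here, expanding the thesaurus from 10 to 16 terms more than quadrupled the rank of "terraphim-graph":

| Metric                 | Before | After | Improvement |
|------------------------|--------|-------|-------------|
| Thesaurus Terms        | 10     | 16    | +60%        |
| Graph Nodes            | 3      | 4     | +33%        |
| Graph Edges            | 5      | 8     | +60%        |
| "terraphim-graph" Rank | 28     | 117   | +318%       |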

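The Improvement figures are relative changes from the Before value, rounded to whole percentages. A small helper makes the arithmetic explicit; percent_change is a hypothetical name used only for this check.

```rust
/// Percentage change from `before` to `after`, rounded to the nearest whole percent.
/// Returns None when `before` is zero, because the relative change is undefined.
fn percent_change(before: u64, after: u64) -> Option<i64> {
    if before == 0 {
        return None;
    }
    let delta = after as f64 - before as f64;
    Some((delta / before as f64 * 100.0).round() as i64)
}

fn main() {
    let metrics = [
        ("Thesaurus Terms", 10, 16),
        ("Graph Nodes", 3, 4),
        ("Graph Edges", 5, 8),
        ("\"terraphim-graph\" Rank", 28, 117),
    ];
    for (name, before, after) in metrics {
        match percent_change(before, after) {
            Some(pct) => println!("{name}: {before} -> {after} ({pct:+}%)"),
            None => println!("{name}: undefined (baseline is zero)"),
        }
    }
}
```

For the rank row, (117 - 28) / 28 is about 3.18, which gives the +318% reported above.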
Role Configuration

Terraphim Engineer Role

The Terraphim Engineer role uses a local KG with the TerraphimGraph relevance function:

{
  "name": "Terraphim Engineer",
  "relevance_function": "terraphim-graph",
  "kg": {
    "knowledge_graph_local": {
      "input_type": "markdown",
      "path": "docs/src/kg"
    }
  }
}

Local vs Remote Thesaurus

Local KG (Recommended):

  • Built from docs/src/kg markdown files
  • 10-16 terms from local content
  • Domain-specific, highly relevant
  • Fast building (~10 seconds)

Remote Thesaurus:

  • Downloaded from external URL
  • 1,725+ terms from general content
  • May miss local domain terms
  • Network dependency

Implementation Examples

Building Knowledge Graph

use terraphim_middleware::thesaurus::{Logseq, ThesaurusBuilder};
use terraphim_rolegraph::RoleGraph;
use terraphim_types::RoleName;

// `kg_path` (the path to docs/src/kg) and `document` are assumed to be in scope.

// 1. Build thesaurus from local KG files
let logseq_builder = Logseq::default();
let thesaurus = logseq_builder
    .build("Terraphim Engineer".to_string(), kg_path)
    .await?;

// 2. Create rolegraph with thesaurus
let role_name = RoleName::new("Terraphim Engineer");
let mut rolegraph = RoleGraph::new(role_name, thesaurus).await?;

// 3. Index documents into rolegraph
// Clone the ID first so that `document` can be moved into the call.
let document_id = document.id.clone();
rolegraph.insert_document(&document_id, document);

// 4. Search with graph-based ranking
let results = rolegraph.query_graph("terraphim-graph", Some(0), Some(10))?;

Adding New Knowledge

// Create new KG file with synonyms
let new_kg_content = r#"
# Graph Analysis

## Advanced Graph Processing

Graph Analysis provides deep insights into data relationships.

[example] synonyms:: data analysis, network analysis, graph processing,
          relationship mapping, connectivity analysis,
          terraphim-graph, graph embeddings

This enhances graph-based system capabilities.
"#;

// Write to KG directory (`fs` is assumed to be an async filesystem module such as `tokio::fs`)
fs::write(kg_path.join("graph-analysis.md"), new_kg_content).await?;

// Rebuild thesaurus to include new terms
let expanded_thesaurus = logseq_builder
    .build("Terraphim Engineer".to_string(), &kg_path)
    .await?;

Measuring Graph Growth

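// Measure initial state
let initial_nodes = rolegraph.nodes_map().len();
let initial_edges = rolegraph.edges_map().len();
let initial_terms = thesaurus.len();

// ... add new content, then rebuild the thesaurus and rolegraph ...
// `expanded_rolegraph` is assumed to be a RoleGraph rebuilt from `expanded_thesaurus`.
let expanded_nodes = expanded_rolegraph.nodes_map().len();
let expanded_edges = expanded_rolegraph.edges_map().len();
let expanded_terms = expanded_thesaurus.len();

// Measure growth
let node_growth = expanded_nodes - initial_nodes;
let edge_growth = expanded_edges - initial_edges;
let term_growth = expanded_terms - initial_terms;

println!("Growth: +{} terms, +{} nodes, +{} edges",
    term_growth, node_growth, edge_growth);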

Best Practices

Content Strategy

  1. Domain-Specific Terms - Use terminology relevant to your domain
  2. Synonym Research - Include terms users actually search for
  3. Concept Mapping - Group related terms under common concepts
  4. Strategic Placement - Add important synonyms to boost key terms

Performance Optimization

  1. Local KG Preferred - Use local markdown files for domain relevance
  2. Measured Growth - Track thesaurus and graph expansion metrics
  3. Test-Driven - Validate ranking improvements with tests
  4. Incremental Building - Add synonyms gradually and measure impact

Testing and Validation

  1. Isolated Testing - Use temporary directories for safe testing
  2. Baseline Measurement - Record initial state before changes
  3. Impact Validation - Verify ranking improvements after additions
  4. Regression Testing - Ensure changes don't break existing functionality
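As an illustration of isolated testing and baseline measurement, the sketch below creates a scratch KG directory under the system temp directory, writes one KG file, records a simple baseline, and cleans up. It uses only the standard library; the helper names and the "kg-test-" directory prefix are assumptions, and a real test would pass the scratch path to the Logseq builder instead of counting files.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create a scratch KG directory under the system temp dir.
/// The process ID keeps concurrent runs from sharing a directory.
fn create_temp_kg_dir(label: &str) -> io::Result<PathBuf> {
    let dir = std::env::temp_dir().join(format!("kg-test-{}-{}", label, std::process::id()));
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Count the `.md` files in `dir` that contain a `synonyms::` marker.
fn count_kg_files_with_synonyms(dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_markdown = path.extension().and_then(|ext| ext.to_str()) == Some("md");
        if is_markdown && fs::read_to_string(&path)?.contains("synonyms::") {
            count += 1;
        }
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let dir = create_temp_kg_dir("baseline")?;
    fs::write(
        dir.join("graph-analysis.md"),
        "# Graph Analysis\n\nsynonyms:: data analysis, network analysis\n",
    )?;

    // Record the baseline before making further changes.
    let baseline = count_kg_files_with_synonyms(&dir)?;
    println!("KG files with synonyms: {baseline}");

    // Clean up so the real docs/src/kg directory is never touched.
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```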

Troubleshooting

Common Issues

No Search Results:

  • Check if thesaurus contains expected terms
  • Verify role uses TerraphimGraph relevance function
  • Ensure KG path points to correct directory

Low Search Rankings:

  • Add more relevant synonyms to target concepts
  • Check synonym syntax in markdown files
  • Verify graph structure has sufficient connections

Build Failures:

  • Validate markdown file syntax
  • Check file permissions in KG directory
  • Ensure Logseq builder has access to files
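When checking synonym syntax by hand becomes tedious, a small lint can flag lines that look like a mistyped marker. The sketch below is hypothetical and assumes the marker must appear exactly as lowercase synonyms:: with two colons; whether the real parser is this strict is not confirmed here.

```rust
/// Return the 1-based numbers of lines that mention "synonym:" or "synonyms:"
/// but do not contain the exact `synonyms::` marker.
/// Hypothetical lint; the strictness of the real parser is an assumption.
fn suspicious_synonym_lines(content: &str) -> Vec<usize> {
    content
        .lines()
        .enumerate()
        .filter(|(_, line)| {
            let lower = line.to_lowercase();
            let mentions_marker = lower.contains("synonym:") || lower.contains("synonyms:");
            mentions_marker && !line.contains("synonyms::")
        })
        .map(|(index, _)| index + 1)
        .collect()
}

fn main() {
    let content = "# Service\nsynonyms: provider, middleware\n";
    for line_number in suspicious_synonym_lines(content) {
        println!("line {line_number}: possible malformed synonyms marker");
    }
}
```

Running it over each file in the KG directory before a rebuild gives a quick first check when expected terms are missing from the thesaurus.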

Debug Information

// Print thesaurus contents
for (term, normalized_term) in &thesaurus {
    println!("'{}' -> '{}' (ID: {})",
        term.as_str(),
        normalized_term.value.as_str(),
        normalized_term.id);
}

// Check graph structure
println!("Nodes: {}, Edges: {}, Documents: {}",
    rolegraph.nodes_map().len(),
    rolegraph.edges_map().len(),
    rolegraph.get_all_documents().count());

// Test search functionality
let results = rolegraph.query_graph("test-term", Some(0), Some(5))?;
println!("Search results: {} found", results.len());

Bug Reporting and Issue Tracking Enhancement (2025-01-31)

Domain-Specific Knowledge Graph Files

The Terraphim KG system has been enhanced with comprehensive bug reporting and issue tracking terminology:

bug-reporting.md - Core bug reporting concepts:

  • Steps to Reproduce - Comprehensive synonyms for reproduction procedures
  • Expected Behaviour - Terminology for intended system behavior
  • Actual Behaviour - Variations for describing observed problems
  • Impact Analysis - Business and operational impact terminology
  • Bug Classification - Issue categorization and severity terms
  • Quality Assurance - QA processes and testing terminology

issue-tracking.md - Domain-specific terminology:

  • Payroll System Issues - Salary calculation and compensation problems
  • Data Consistency Problems - Synchronization and integrity issues
  • HR System Integration - Human resources system connectivity
  • System Integration Failures - Cross-system communication problems
  • Performance Degradation - System slowdown and bottleneck terminology
  • User Experience Issues - UI/UX problem descriptions

MCP Integration Testing

A comprehensive test suite validates the bug reporting functionality:

test_bug_report_extraction.rs - Core functionality testing:

  • Extracts 2,615 paragraphs from comprehensive bug reports
  • Extracts 165 paragraphs from short content scenarios
  • Tests all four bug report sections systematically
  • Validates connectivity analysis across related terms

test_kg_term_verification.rs - Knowledge graph validation:

  • Payroll terms: 3 suggestions (provider, service, middleware)
  • Data consistency terms: 9 suggestions (data analysis, network analysis, etc.)
  • Quality assurance terms: 9 suggestions (connectivity analysis, graph processing, etc.)

Performance Improvements

The enhanced knowledge graph demonstrates significant improvements in structured document analysis:

  • Semantic Understanding: Processes structured bug reports by matching concepts and their synonyms rather than exact keywords
  • Domain Coverage: Comprehensive terminology coverage for technical documentation and issue tracking
  • Extraction Performance: Robust paragraph extraction across different content types and sizes
  • Term Recognition: Effective autocomplete functionality with expanded terminology

Future Enhancements

Planned Features

  1. Dynamic KG Updates - Hot-reload KG changes without restart
  2. Graph Visualization - Visual representation of concept relationships
  3. Advanced Ranking - Machine learning-enhanced relevance scoring
  4. Multi-Language Support - Synonym support for multiple languages
  5. Performance Optimization - Caching and incremental updates
  6. Domain Expansion - Additional specialized terminology for specific industries and use cases

Integration Opportunities

  1. External Ontologies - Import from RDF/OWL knowledge bases
  2. Collaborative Editing - Multi-user KG development workflows
  3. Analytics Dashboard - Search analytics and KG health monitoring
  4. API Extensions - RESTful APIs for KG management

Conclusion

The Terraphim Knowledge Graph system provides semantic search through graph-based ranking. By converting synonym relationships into graph structures, it improves search relevance (a 318% rank increase in the test reported above) and provides a framework for continuous improvement through strategic content additions.

Key Benefits:

  • 🔍 Semantic Search - Find content by meaning, not just keywords
  • 📈 Ranking Improvement - Up to 318% ranking boost from synonyms
  • 🎯 Domain Relevance - Local KG ensures domain-specific accuracy
  • 🔧 Easy Expansion - Simple markdown syntax for adding knowledge
  • 📊 Measurable Impact - Comprehensive testing framework for validation

The knowledge graph system forms the foundation for intelligent, context-aware search in the Terraphim AI platform.