Content Indexing

Content indexing is the process of organizing and storing data in a way that enables efficient search and retrieval. This article covers practical implementation approaches using TypeScript.

For advanced search capabilities, indexed content can be combined with semantic understanding and vector search techniques, which match documents by meaning rather than by exact keywords.

Core Concepts

Content indexing typically involves:

  • Document processing
  • Text normalization
  • Metadata extraction
  • Storage optimization
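
As a minimal sketch of the metadata extraction step listed above, the function below derives a few simple fields from raw text. The names extractMetadata and BasicMetadata, the choice of fields, and the rule that the first non-empty line is the title are all illustrative assumptions, not part of any particular library.

```typescript
// Hypothetical shape for extracted metadata; the fields are illustrative.
interface BasicMetadata {
  title: string;
  wordCount: number;
  charCount: number;
}

// Derive simple metadata from raw text.
// Assumption: the first non-empty line serves as the title.
function extractMetadata(content: string): BasicMetadata {
  const lines = content.split(/\r?\n/).map(line => line.trim());
  const title = lines.find(line => line.length > 0) ?? '';
  const words = content.trim().split(/\s+/).filter(w => w.length > 0);
  return { title, wordCount: words.length, charCount: content.length };
}

const meta = extractMetadata('Getting Started\n\nIndex your documents first.');
console.log(meta.title, meta.wordCount); // prints: Getting Started 6
```

The result can be passed along as the metadata field of a document before it is indexed.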

Basic Implementation

Document Processor

interface ProcessingStep {
  name: string;
  process: (content: string) => string | Promise<string>;
}

class DocumentProcessor {
  private steps: ProcessingStep[] = [];

  addStep(step: ProcessingStep) {
    this.steps.push(step);
  }

  async process(content: string): Promise<string> {
    let processed = content;
    for (const step of this.steps) {
      processed = await step.process(processed);
    }
    return processed;
  }
}

Processing Pipeline Example

interface Document {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

const processor = new DocumentProcessor();

// Add common processing steps
processor.addStep({
  name: 'normalize',
  process: (content: string) => content.toLowerCase()
});

processor.addStep({
  name: 'cleanWhitespace',
  process: (content: string) => content.replace(/\s+/g, ' ').trim()
});
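
To illustrate what such a pipeline produces, here is a self-contained sketch that applies the two steps above plus one assumed extra step that strips diacritics. The stripDiacritics step, the TextStep type, and the runSteps helper are hypothetical additions; a synchronous step type is used only to keep the example short.

```typescript
// Simplified, synchronous variant of a processing step (assumption for brevity).
interface TextStep {
  name: string;
  process: (content: string) => string;
}

const steps: TextStep[] = [
  { name: 'normalize', process: c => c.toLowerCase() },
  // Hypothetical extra step: decompose accented characters, then drop the combining marks.
  { name: 'stripDiacritics', process: c => c.normalize('NFD').replace(/[\u0300-\u036f]/g, '') },
  { name: 'cleanWhitespace', process: c => c.replace(/\s+/g, ' ').trim() },
];

// Apply each step in order, feeding the output of one into the next.
function runSteps(content: string, pipeline: TextStep[]): string {
  return pipeline.reduce((text, step) => step.process(text), content);
}

console.log(runSteps('  Caf\u00e9   MENU\n', steps)); // prints "cafe menu"
```

Step order matters: collapsing whitespace last ensures that any gaps left by earlier steps are cleaned up as well.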

Practical Implementation

1. Document Source

interface DocumentSource {
  id: string;
  type: 'database' | 'filesystem' | 'api';
  lastIndexed?: Date;
}

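class ContentIndexer {
  private sources: Map<string, DocumentSource> = new Map();

  // Dependencies are injected so the indexer is not tied to one backend.
  // The parameter names and the embedding function signature are assumptions;
  // adapt them to your processor, vector database, and embedding model.
  constructor(
    private processor: DocumentProcessor,
    private vectorDb: VectorStore,
    private generateEmbedding: (text: string) => Promise<number[]>
  ) {}

  // Record a source so its indexing state can be tracked
  registerSource(source: DocumentSource) {
    this.sources.set(source.id, source);
  }

  async indexDocument(doc: Document) {
    try {
      const processed = await this.processor.process(doc.content);
      await this.store(doc.id, processed, doc.metadata);
    } catch (error) {
      console.error(`Failed to index document ${doc.id}:`, error);
      // Handle error appropriately
    }
  }

  private async store(id: string, content: string, metadata: Record<string, unknown>) {
    // Implementation depends on your storage solution
    // Example using a vector database:
    const embedding = await this.generateEmbedding(content);
    await this.vectorDb.insert({
      id,
      embedding,
      metadata,
      content
    });
  }
}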
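
The indexer above depends on an embedding function that this article does not define. In practice embeddings usually come from a model or an external API. For local experiments, a toy stand-in such as the one below can fill that role. It is a sketch only: toyEmbedding is a hypothetical name, the 64-dimension default is arbitrary, and hashed token counts capture word overlap rather than real semantic similarity.

```typescript
// Toy stand-in for a real embedding model (assumption: for local testing only).
// Hashes each token into one of `dimensions` buckets and L2-normalizes the counts.
function toyEmbedding(text: string, dimensions: number = 64): number[] {
  const vector: number[] = new Array(dimensions).fill(0);
  const tokens = text.toLowerCase().split(/\W+/).filter(t => t.length > 0);

  for (const token of tokens) {
    // FNV-1a 32-bit string hash over UTF-16 code units.
    let hash = 0x811c9dc5;
    for (let i = 0; i < token.length; i++) {
      hash ^= token.charCodeAt(i);
      hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    vector[hash % dimensions] += 1;
  }

  // Normalize to unit length so cosine similarity reduces to a dot product.
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vector : vector.map(v => v / norm);
}

console.log(toyEmbedding('hello world').length); // prints 64
```

Because the function is synchronous, it can be wrapped as async (text) => toyEmbedding(text) wherever a promise-returning embedding function is expected.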

2. Vector Storage Integration

interface VectorEntry {
  id: string;
  embedding: number[];
  metadata: Record<string, unknown>;
  content: string;
}

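class VectorStore {
  // The embedding function is injected; its name and signature are placeholders
  // for whichever embedding model or API you use.
  constructor(private generateEmbedding: (text: string) => Promise<number[]>) {}

  async insert(entry: VectorEntry): Promise<void> {
    // Implementation with your chosen vector database,
    // for example Pinecone or Milvus
  }

  async search(query: string, limit: number = 10): Promise<VectorEntry[]> {
    const queryEmbedding = await this.generateEmbedding(query);
    // Perform a vector similarity search with queryEmbedding
    // and return at most `limit` entries
    return [];
  }
}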
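
The search method above leaves the similarity search itself open. As a rough illustration of what a vector database does conceptually, the sketch below keeps entries in memory and ranks them by cosine similarity with a brute-force scan. InMemoryVectorStore, StoredVector, and cosineSimilarity are hypothetical names, and the search takes an embedding directly instead of a query string; real vector databases use approximate indexes rather than a full scan.

```typescript
interface StoredVector {
  id: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical in-memory store that scores every entry on each search.
class InMemoryVectorStore {
  private entries: StoredVector[] = [];

  insert(entry: StoredVector): void {
    this.entries.push(entry);
  }

  // Return up to `limit` entries, most similar first.
  search(queryEmbedding: number[], limit: number = 10): StoredVector[] {
    return this.entries
      .map(entry => ({ entry, score: cosineSimilarity(queryEmbedding, entry.embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, limit)
      .map(scored => scored.entry);
  }
}

const store = new InMemoryVectorStore();
store.insert({ id: 'a', embedding: [1, 0] });
store.insert({ id: 'b', embedding: [0, 1] });
store.insert({ id: 'c', embedding: [0.9, 0.1] });
console.log(store.search([1, 0], 2).map(e => e.id)); // ids in order: a, c
```

A linear scan like this is workable for small collections and for tests, but its cost grows with every stored entry.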

Error Handling and Monitoring

interface IndexingError {
  documentId: string;
  error: Error;
  timestamp: Date;
}

class IndexingMonitor {
  private errors: IndexingError[] = [];

  logError(docId: string, error: Error) {
    this.errors.push({
      documentId: docId,
      error,
      timestamp: new Date()
    });
  }

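  // Returns the number of errors per millisecond over the last `timeWindow` milliseconds
  getErrorRate(timeWindow: number): number {
    if (timeWindow <= 0) return 0;
    const cutoff = Date.now() - timeWindow;
    const recent = this.errors.filter(e => e.timestamp.getTime() > cutoff);
    return recent.length / timeWindow;
  }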
}
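
Logging errors is one half of error handling; the other is deciding what to do with a failed operation. One common option for transient failures, such as a temporarily unavailable database, is to retry with exponential backoff. The helper below is a minimal sketch of that idea. The name withRetry and its default values are assumptions, and it retries on every error, which may not suit failures that are permanent.

```typescript
// Hypothetical helper: retry an async operation with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 100
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
        const delay = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>(resolve => setTimeout(resolve, delay));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}

// Usage: an operation that fails twice and then succeeds.
let attempts = 0;
const retried = withRetry(async () => {
  attempts++;
  if (attempts < 3) throw new Error('transient failure');
  return 'indexed';
}, 3, 1);
retried.then(result => console.log(result, attempts)); // prints: indexed 3
```

Once the final attempt fails, the error can be passed to a monitor such as the one above so that it counts toward the error rate.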

Performance Considerations

  1. Batch Processing
// Index documents in fixed-size batches; each batch runs concurrently
async function batchProcess(indexer: ContentIndexer, docs: Document[], batchSize: number = 100) {
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    await Promise.all(batch.map(doc => indexer.indexDocument(doc)));
  }
}
  2. Caching
class EmbeddingCache {
  private cache = new Map<string, number[]>();
  private maxSize: number;

  constructor(maxSize: number = 1000) {
    this.maxSize = maxSize;
  }

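  set(key: string, embedding: number[]) {
    // Evict the oldest inserted entry (FIFO) when adding a new key to a full cache
    if (!this.cache.has(key) && this.cache.size >= this.maxSize) {
      const firstKey = this.cache.keys().next().value;
      if (firstKey !== undefined) {
        this.cache.delete(firstKey);
      }
    }
    this.cache.set(key, embedding);
  }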

  get(key: string): number[] | undefined {
    return this.cache.get(key);
  }
}
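
The cache above evicts entries in insertion order, regardless of how often they are read. If frequently requested embeddings should stay cached, a least-recently-used (LRU) policy is one alternative. The sketch below relies on Map preserving insertion order and re-inserts a key on every read. LruEmbeddingCache is a hypothetical name, and the has method is added only for the demonstration.

```typescript
// Hypothetical LRU variant of an embedding cache.
// Map iterates in insertion order, so the first key is the least recently used.
class LruEmbeddingCache {
  private cache = new Map<string, number[]>();
  private maxSize: number;

  constructor(maxSize: number = 1000) {
    this.maxSize = maxSize;
  }

  get(key: string): number[] | undefined {
    const value = this.cache.get(key);
    if (value !== undefined) {
      // Re-insert to mark this key as most recently used.
      this.cache.delete(key);
      this.cache.set(key, value);
    }
    return value;
  }

  set(key: string, embedding: number[]): void {
    if (this.cache.has(key)) {
      // Updating an existing key also refreshes its position.
      this.cache.delete(key);
    } else if (this.cache.size >= this.maxSize) {
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    this.cache.set(key, embedding);
  }

  has(key: string): boolean {
    return this.cache.has(key);
  }
}

const lru = new LruEmbeddingCache(2);
lru.set('a', [1]);
lru.set('b', [2]);
lru.get('a');      // 'a' becomes the most recently used key
lru.set('c', [3]); // cache is full, so 'b' is evicted
console.log(lru.has('a'), lru.has('b'), lru.has('c')); // prints: true false true
```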

Common Challenges

  1. Incremental Updates
    • Track document versions
    • Implement change detection
    • Handle deletions
  2. Resource Management
    • Monitor memory usage
    • Implement connection pooling
    • Handle rate limiting
  3. Data Consistency
    • Implement transaction handling
    • Ensure atomic updates
    • Handle failed operations
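
For the incremental updates challenge, one possible approach to change detection is to store a fingerprint of each document's content and compare it on the next run. The sketch below assumes this approach. ChangeTracker, fingerprint, and findDeleted are hypothetical names, and the FNV-1a hash is used only to keep the example dependency-free; a cryptographic hash such as SHA-256 would make collisions far less likely.

```typescript
// Non-cryptographic FNV-1a 32-bit hash, used here only as a cheap content fingerprint.
function fingerprint(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

type ChangeKind = 'new' | 'changed' | 'unchanged';

// Hypothetical tracker: remembers the last fingerprint seen per document id.
class ChangeTracker {
  private seen = new Map<string, string>();

  // Classify a document and record its current fingerprint.
  check(id: string, content: string): ChangeKind {
    const current = fingerprint(content);
    const previous = this.seen.get(id);
    this.seen.set(id, current);
    if (previous === undefined) return 'new';
    return previous === current ? 'unchanged' : 'changed';
  }

  // Ids seen earlier but absent from the latest run are treated as deleted.
  findDeleted(currentIds: Set<string>): string[] {
    const deleted = Array.from(this.seen.keys()).filter(id => !currentIds.has(id));
    deleted.forEach(id => this.seen.delete(id));
    return deleted;
  }
}

const tracker = new ChangeTracker();
console.log(tracker.check('doc-1', 'hello'));  // prints "new"
console.log(tracker.check('doc-1', 'hello'));  // prints "unchanged"
console.log(tracker.check('doc-1', 'hello!')); // prints "changed"
```

Only documents reported as new or changed need to be reprocessed and re-embedded, and ids returned by findDeleted can be removed from the index.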

Related Topics

  • Full-text search engines
  • Vector similarity search
  • Document preprocessing
  • Search relevance optimization

This implementation focuses on practical aspects of content indexing while maintaining flexibility for different use cases and storage solutions.