Document Preprocessing

Document preprocessing prepares your content for AI training. Different document types need different splitting strategies so that each chunk keeps enough surrounding context to be useful on its own. The system provides six specialized formatters, each designed for a specific content type.

Overview

When you upload documents, they need to be split into smaller chunks that the AI can process effectively. The choice of formatter depends on your document type and structure. Each formatter has configurable parameters to fine-tune the splitting behavior.

Formatter Types

Simple Formatter

The Simple formatter is ideal for plain text documents with consistent structure. It splits documents using a single separator pattern.

Configuration Options

Title

  • Type: String
  • Max Length: 255 characters
  • Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

  • Type: String
  • Max Length: 255 characters
  • Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.
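As a rough illustration of how a title and description might travel with each chunk, here is a minimal sketch. The `DocumentMeta` class, the `attach_metadata` function, and the header layout are hypothetical. This page only states that the title is added to each chunk, so attaching the description the same way is an assumption.

```python
from dataclasses import dataclass

MAX_META_LENGTH = 255  # documented limit for Title and Description


@dataclass
class DocumentMeta:
    title: str
    description: str = ""


def attach_metadata(chunks: list[str], meta: DocumentMeta) -> list[str]:
    """Prefix every chunk with the document title and, if set, the description."""
    for name, value in (("title", meta.title), ("description", meta.description)):
        if len(value) > MAX_META_LENGTH:
            raise ValueError(f"{name} exceeds {MAX_META_LENGTH} characters")
    header = f"Title: {meta.title}\n"
    if meta.description:
        header += f"Description: {meta.description}\n"
    # Hypothetical layout: header, blank line, then the chunk text
    return [f"{header}\n{chunk}" for chunk in chunks]
```

Because the header is repeated in every chunk, a short title and description leave more room for the chunk text itself.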

Separator

  • Type: String
  • Default: "\n\n" (double newline)
  • Description: The character or string used to split the document. Common options include:
    • "\n\n" - Split on paragraph breaks
    • "\n" - Split on line breaks
    • "." - Split on periods (approximate sentence boundaries; also splits inside abbreviations and decimals)
    • Custom separators like "---" or "###"
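To see how the choice of separator changes the result, the following sketch splits the same text three ways. It uses plain `str.split` and a hypothetical helper named `split_on`; it illustrates the idea and is not the formatter's actual implementation.

```python
def split_on(text: str, separator: str) -> list[str]:
    """Split text on a literal separator and drop empty pieces."""
    return [piece.strip() for piece in text.split(separator) if piece.strip()]


text = "First paragraph.\nStill first.\n\nSecond paragraph."

paragraphs = split_on(text, "\n\n")  # 2 pieces
lines = split_on(text, "\n")         # 3 pieces
sentences = split_on(text, ".")      # 3 pieces, periods removed
```

Coarser separators such as "\n\n" give fewer, larger pieces; finer ones such as "." give more pieces that each carry less context.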

Chunk Size

  • Type: Number
  • Range: 10-3000 characters
  • Default: 1024
  • Description: Maximum size of each text chunk in characters. Larger chunks preserve more context but may exceed model limits.

Chunk Overlap

  • Type: Number
  • Range: 0-1500 characters
  • Default: 512
  • Description: Number of characters to overlap between adjacent chunks. Helps maintain context across chunk boundaries.
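The interaction between chunk size and overlap can be sketched with fixed-size character windows. The `sliding_chunks` function is hypothetical, and it assumes overlap is applied as a raw character window; the real formatter may align overlap with separator boundaries instead.

```python
def sliding_chunks(text: str, chunk_size: int = 1024, chunk_overlap: int = 512) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults (1024 and 512) each window advances by 512 characters, so most of the text is stored twice. Lower overlap reduces that duplication at the cost of less shared context.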

Keep Separator

  • Type: Boolean
  • Default: true
  • Description: Whether to include the separator in the resulting chunks. Useful for maintaining formatting cues.
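The effect of Keep Separator can be shown with a small sketch. The `split_keep` helper is hypothetical, and it assumes a kept separator is attached to the start of the piece that follows it; this page does not say which side the formatter uses.

```python
import re


def split_keep(text: str, separator: str, keep_separator: bool = True) -> list[str]:
    """Split on a literal separator, optionally keeping it on the following piece."""
    if not keep_separator:
        return [piece for piece in text.split(separator) if piece]
    # A lookahead matches the position just before each separator without consuming it
    pieces = re.split(f"(?={re.escape(separator)})", text)
    return [piece for piece in pieces if piece]


with_sep = split_keep("intro---part one---part two", "---")
without_sep = split_keep("intro---part one---part two", "---", keep_separator=False)
```

Here `with_sep` keeps the "---" markers on the second and third pieces, while `without_sep` contains only the text between them.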

Best Used For

  • Plain text documents
  • Simple reports
  • Basic documentation
  • Content with consistent paragraph structure

Recursive Formatter

The Recursive formatter attempts to split documents using multiple separators in order of preference, falling back to less ideal separators when necessary.

Configuration Options

Title

  • Type: String
  • Max Length: 255 characters
  • Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

  • Type: String
  • Max Length: 255 characters
  • Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Separators

  • Type: Array of strings
  • Default: ["\n\n", "\n", " ", ""]
  • Description: List of separators tried in order. The formatter will:
    1. Try to split on "\n\n" (paragraphs)
    2. Fall back to "\n" (lines)
    3. Then try " " (words)
    4. Finally, split between any two characters ("") if no other separator produces small enough chunks
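The fallback order above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical name, overlap and Keep Separator are ignored, and adjacent pieces are merged greedily up to `chunk_size`.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text so each chunk fits chunk_size, preferring earlier separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut between arbitrary characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is too large on its own: fall back to the next separator
            chunks.extend(recursive_split(part, remaining, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


example = recursive_split(
    "one two three\n\nfour five six seven eight",
    ["\n\n", "\n", " ", ""],
    chunk_size=15,
)
```

In this example the first paragraph fits as a whole, while the second is too long and is split again on spaces.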

Chunk Size

  • Type: Number
  • Range: 10-3000 characters
  • Default: 1024
  • Description: Target size for each chunk. The formatter will try to stay close to this size while respecting separator boundaries.

Chunk Overlap

  • Type: Number
  • Range: 0-1500 characters
  • Default: 512
  • Description: Overlap between chunks to maintain context continuity.

Keep Separator

  • Type: Boolean
  • Default: true
  • Description: Preserves separators in the final chunks.

Best Used For

  • Mixed content documents
  • Books and articles
  • Documentation with varying structure
  • Content where natural breaks are important

Code Formatter

The Code formatter is specifically designed for source code files, understanding programming language syntax and structure.

Configuration Options

Title

  • Type: String
  • Max Length: 255 characters
  • Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

  • Type: String
  • Max Length: 255 characters
  • Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Language

  • Type: String (optional)
  • Options: cpp, go, java, kotlin, js, ts, php, proto, python, rst, ruby, rust, scala, swift, markdown, latex, html, sol, csharp, cobol, c, lua, perl, haskell, elixir, powershell
  • Default: None (auto-detect)
  • Description: Programming language of the source code. Helps the formatter understand syntax patterns and make smarter splits.
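One way a language setting could guide splitting is by choosing boundary patterns per language. The sketch below is an assumption, not the formatter's documented behavior: the `BOUNDARIES` table and `split_code` function are hypothetical, and only top-level definitions are treated as boundaries.

```python
import re

# Hypothetical boundary patterns: a zero-width match at the start of a
# line that begins a top-level definition in the given language.
BOUNDARIES = {
    "python": r"^(?=(?:def|class)\s)",
    "js": r"^(?=(?:function|class)\s)",
    "go": r"^(?=(?:func|type)\s)",
}


def split_code(source: str, language: str) -> list[str]:
    """Split source code before each top-level definition."""
    pattern = BOUNDARIES.get(language)
    if pattern is None:
        return [source]  # unknown language: leave the source in one piece
    blocks = re.split(pattern, source, flags=re.MULTILINE)
    return [block for block in blocks if block.strip()]


source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        pass\n"
blocks = split_code(source, "python")
```

Indented definitions such as the method `m` stay inside their enclosing block because the pattern only matches at the start of a line.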

Chunk Size

  • Type: Number
  • Range: 10-3000 characters
  • Default: 60
  • Description: Maximum size of each chunk in characters. The default is smaller than for prose formatters so that splits stay close to function and method boundaries.

Chunk Overlap

  • Type: Number
  • Range: 0-1500 characters
  • Default: 0
  • Description: Usually set to 0 for code to avoid duplicating logic across chunks.

Best Used For

  • Source code files
  • Programming documentation
  • Technical specifications
  • Any structured code content

FAQ Formatter

The FAQ formatter is optimized for question-and-answer content, splitting on question indicators.

Configuration Options

Title

  • Type: String
  • Max Length: 255 characters
  • Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

  • Type: String
  • Max Length: 255 characters
  • Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Separators

  • Type: Array of strings
  • Default: ["question:", "Question:", "QUESTION:", "Q:", "q:"]
  • Description: Question indicators that trigger splits. Common patterns include:
    • "Q:" - Simple Q&A format
    • "Question:" - Formal question format
    • "FAQ:" - FAQ-style format (not in the default list; add it explicitly if your documents use it)
    • Custom patterns like "**Q:**" or "### Q:"
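Question-based splitting can be sketched with a regular expression built from the separator list. The `split_faq` function is hypothetical, and it assumes indicators only count at the start of a line, which this page does not confirm. Separators are kept, matching the Keep Separator default.

```python
import re

DEFAULT_SEPARATORS = ("question:", "Question:", "QUESTION:", "Q:", "q:")


def split_faq(text: str, separators: tuple[str, ...] = DEFAULT_SEPARATORS) -> list[str]:
    """Split text before each line that starts with a question indicator."""
    # Longest first, so a longer indicator wins over a shorter one it contains
    ordered = sorted(separators, key=len, reverse=True)
    alternatives = "|".join(re.escape(sep) for sep in ordered)
    pattern = rf"^(?=(?:{alternatives}))"
    items = re.split(pattern, text, flags=re.MULTILINE)
    return [item.strip() for item in items if item.strip()]


faq = "Q: What is a chunk?\nA: A piece of a document.\nQ: Why overlap?\nA: To keep context.\n"
items = split_faq(faq)
```

Each resulting item holds one question together with its answer, which is why overlap is usually unnecessary for this formatter.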

Chunk Size

  • Type: Number
  • Range: 10-3000 characters
  • Default: 20
  • Description: Small default size since FAQ items are typically short and self-contained.

Chunk Overlap

  • Type: Number
  • Range: 0-1500 characters
  • Default: 0
  • Description: No overlap needed since each FAQ item is independent.

Keep Separator

  • Type: Boolean
  • Default: true
  • Description: Preserves question formatting for clarity.

Best Used For

  • FAQ documents
  • Q&A content
  • Interview transcripts
  • Knowledge base articles

Web Formatter

The Web formatter is designed for web-scraped content and HTML documents, handling web-specific formatting.

Configuration Options

Title

  • Type: String
  • Max Length: 255 characters
  • Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

  • Type: String
  • Max Length: 255 characters
  • Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Chunk Size

  • Type: Number
  • Range: 10-3000 characters
  • Default: 1024
  • Description: Balanced size for web content that may include headers, paragraphs, and lists.

Chunk Overlap

  • Type: Number
  • Range: 0-1500 characters
  • Default: 512
  • Description: Maintains context across web page sections.
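This page does not describe how the Web formatter handles markup, so the following is only a sketch of one plausible preprocessing step: reducing HTML to visible text before chunking. It uses the standard library `html.parser` module; the `html_to_text` helper and the tag sets are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # content that is never visible text
    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The resulting text can then be chunked with the configured Chunk Size and Chunk Overlap like any other plain text.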

Best Used For

  • Web-scraped content
  • HTML documents
  • Blog posts
  • Web articles and pages

Markdown Formatter

The Markdown formatter understands Markdown syntax and splits content based on headers and structure.

Configuration Options

Title

  • Type: String
  • Max Length: 255 characters
  • Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

  • Type: String
  • Max Length: 255 characters
  • Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Separators

  • Type: Array of strings
  • Default: ["\n\n", "\n", " ", ""]
  • Description: Fallback separators when header-based splitting creates chunks that are too large.
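Header-based splitting, the first stage before these fallback separators apply, can be sketched as follows. The `split_markdown` function is hypothetical and deliberately simple: it recognizes only ATX headers ("#" through "######") and does not account for "#" lines inside fenced code blocks.

```python
import re

# Zero-width match at the start of any line that begins with an ATX header
HEADER = re.compile(r"^(?=#{1,6} )", flags=re.MULTILINE)


def split_markdown(text: str) -> list[str]:
    """Split Markdown into sections, each starting at a header."""
    sections = HEADER.split(text)
    return [section.strip() for section in sections if section.strip()]


doc = "# Title\nIntro\n\n## Setup\nSteps\n\n## Usage\nRun it\n"
sections = split_markdown(doc)
```

A section that is still larger than the chunk size would then be split further using the fallback separators.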

Keep Separator

  • Type: Boolean
  • Default: true
  • Description: Preserves header formatting and structure.

Chunk Size

  • Type: Number
  • Range: 10-3000 characters
  • Default: 1000
  • Description: Size target for each section after header-based splitting.

Chunk Overlap

  • Type: Number
  • Range: 0-1500 characters
  • Default: 200
  • Description: Overlap to maintain context between sections.

Best Used For

  • Markdown documentation
  • README files
  • Technical documentation
  • Structured text with headers

General Guidelines

Choosing the Right Formatter

  1. Code files: Use Code formatter with appropriate language setting
  2. Markdown documents: Use Markdown formatter for header-aware splitting
  3. FAQ content: Use FAQ formatter for question-based splitting
  4. Web content: Use Web formatter for scraped or HTML content
  5. Mixed content: Use Recursive formatter for documents with varying structure
  6. Simple text: Use Simple formatter for basic documents
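The selection rules above could be automated along these lines. Everything in this sketch is hypothetical: the extension sets, the returned formatter names, and the two boolean hints are assumptions for illustration and not part of the product's API.

```python
from pathlib import Path

# Hypothetical extension sets; extend them to match the files you upload.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".java", ".rs", ".rb", ".php", ".c", ".cpp"}
MARKDOWN_EXTENSIONS = {".md", ".markdown"}
WEB_EXTENSIONS = {".html", ".htm"}


def choose_formatter(filename: str, is_faq: bool = False,
                     has_uniform_paragraphs: bool = False) -> str:
    """Pick a formatter name following the guideline order above."""
    suffix = Path(filename).suffix.lower()
    if suffix in CODE_EXTENSIONS:
        return "code"
    if suffix in MARKDOWN_EXTENSIONS:
        return "markdown"
    if is_faq:
        return "faq"
    if suffix in WEB_EXTENSIONS:
        return "web"
    # Plain text: Simple for consistent paragraphs, Recursive otherwise
    return "simple" if has_uniform_paragraphs else "recursive"
```

Falling back to the Recursive formatter for unknown content is a conservative choice, since it copes with documents whose structure varies.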

Parameter Tuning Tips

Chunk Size

  • Larger chunks (1500-3000): Better context retention, may hit model limits
  • Medium chunks (500-1500): Balanced approach, good for most use cases
  • Smaller chunks (100-500): Better for precise retrieval, may lose context

Chunk Overlap

  • High overlap (30-50% of chunk size): Maximum context preservation
  • Medium overlap (15-30%): Balanced approach
  • Low/No overlap (0-15%): Minimal duplication, faster processing
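To turn an overlap percentage into a character count, multiply it by the chunk size and keep the result inside the documented ranges. The `overlap_for` helper below is a hypothetical sketch of that arithmetic.

```python
CHUNK_SIZE_RANGE = (10, 3000)    # documented Chunk Size range, in characters
CHUNK_OVERLAP_RANGE = (0, 1500)  # documented Chunk Overlap range, in characters


def overlap_for(chunk_size: int, ratio: float) -> int:
    """Convert an overlap ratio (0.0 to 0.5) into a character count."""
    low, high = CHUNK_SIZE_RANGE
    if not low <= chunk_size <= high:
        raise ValueError(f"chunk_size must be between {low} and {high}")
    if not 0.0 <= ratio <= 0.5:
        raise ValueError("ratio must be between 0.0 and 0.5")
    overlap = round(chunk_size * ratio)
    # Clamp to the documented overlap range
    return max(CHUNK_OVERLAP_RANGE[0], min(overlap, CHUNK_OVERLAP_RANGE[1]))
```

For example, the default pair of 1024 and 512 corresponds to a 50% overlap, and the Markdown defaults of 1000 and 200 correspond to 20%.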

Keep Separator

  • Enable for formatted content where separators provide meaning
  • Disable for raw text processing where separators are just splitting tools