Document Preprocessing

Document preprocessing is a crucial step in preparing your content for AI training. Different document types need different splitting strategies so that each chunk keeps enough context to be useful on its own. Our system provides six specialized formatters, each designed for a specific content type.

Overview

When you upload documents, they need to be split into smaller chunks that the AI can process effectively. The choice of formatter depends on your document type and structure. Each formatter has configurable parameters to fine-tune the splitting behavior.

Formatter Types

Simple Formatter

The Simple formatter is ideal for plain text documents with consistent structure. It splits documents using a single separator pattern.

Configuration Options

Title

Type: String
Max Length: 255 characters
Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

Type: String
Max Length: 255 characters
Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.
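As a rough illustration of how the title travels with each chunk, the sketch below prefixes every chunk with it. The helper name `add_context`, the blank-line layout, and the choice to attach the description as well are assumptions made for this example, not documented behavior.

```python
MAX_FIELD_LENGTH = 255  # documented limit for Title and Description


def add_context(chunks, title, description=""):
    """Prefix every chunk with the title and, if given, the description.

    Hypothetical helper: the real layout of title and chunk is not documented.
    """
    for name, value in (("title", title), ("description", description)):
        if len(value) > MAX_FIELD_LENGTH:
            raise ValueError(f"{name} exceeds {MAX_FIELD_LENGTH} characters")
    # Join only the non-empty fields so a missing description leaves no gap.
    header = "\n".join(part for part in (title, description) if part)
    return [f"{header}\n\n{chunk}" for chunk in chunks]


labeled = add_context(["First chunk.", "Second chunk."], title="Employee Handbook")
```

Each entry in `labeled` now starts with the title, so a chunk retrieved on its own still identifies its source document.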

Separator

Type: String
Default: "\n\n" (double newline)
Description: The character or string used to split the document. Common options include:
- "\n\n" - Split on paragraph breaks
- "\n" - Split on line breaks
- "." - Split on periods (approximate sentence breaks)
- Custom separators like "---" or "###"

Chunk Size

Type: Number
Range: 10-3000 characters
Default: 1024
Description: Maximum size of each text chunk in characters. Larger chunks preserve more context but may exceed model limits.

Chunk Overlap

Type: Number
Range: 0-1500 characters
Default: 512
Description: Number of characters to overlap between adjacent chunks. Helps maintain context across chunk boundaries.
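To make the relationship between chunk size and chunk overlap concrete, here is a minimal sketch that cuts text into fixed-size windows. It ignores separators entirely in order to isolate the overlap behavior, and the function name `sliding_chunks` is an assumption for this example rather than part of the product.

```python
def sliding_chunks(text, chunk_size=1024, chunk_overlap=512):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


windows = sliding_chunks("abcdefghijklmnopqrst", chunk_size=10, chunk_overlap=5)
```

With a chunk size of 10 and an overlap of 5, each window repeats the last five characters of the one before it.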

Keep Separator

Type: Boolean
Default: true
Description: Whether to include the separator in the resulting chunks. Useful for maintaining formatting cues.
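The effect of Keep Separator can be sketched as follows. This is an illustration under assumptions: the name `split_on_separator` is hypothetical, and attaching the separator to the start of the following piece (rather than the end of the previous one) is a guess about placement.

```python
def split_on_separator(text, separator="\n\n", keep_separator=True):
    """Split text on a literal separator, optionally keeping it on later pieces."""
    pieces = []
    for index, part in enumerate(text.split(separator)):
        if not part.strip():
            continue  # skip empty pieces produced by repeated separators
        if keep_separator and index > 0:
            part = separator + part  # assumed placement: front of the next piece
        pieces.append(part)
    return pieces


kept = split_on_separator("Intro---Details---Summary", "---")
dropped = split_on_separator("Intro---Details---Summary", "---", keep_separator=False)
```

Here `kept` retains the "---" markers on the second and third pieces, while `dropped` contains only the text between them.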

Best Used For

Plain text documents
Simple reports
Basic documentation
Content with consistent paragraph structure

Recursive Formatter

The Recursive formatter attempts to split documents using multiple separators in order of preference, falling back to less ideal separators when necessary.

Configuration Options

Title

Type: String
Max Length: 255 characters
Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

Type: String
Max Length: 255 characters
Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Separators

Type: Array of strings
Default: ["\n\n", "\n", " ", ""]
Description: List of separators tried in order. The formatter will:
1. Try to split on "\n\n" (paragraphs)
2. Fall back to "\n" (lines)
3. Then try " " (words)
4. Finally, split between any two characters ("") if needed
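The fallback order above can be sketched as a short recursive function. This is a simplified illustration, not the formatter's actual implementation: it drops separators, does not merge small pieces back together, and applies no overlap. The name `recursive_split` is an assumption.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=1024):
    """Split text so no piece exceeds chunk_size, preferring earlier separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    separator, remaining = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separator):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            # Still too long: retry this part with the next separator in line.
            pieces.extend(recursive_split(part, remaining or ("",), chunk_size))
    return pieces


pieces = recursive_split("alpha beta\n\ngamma delta epsilon", chunk_size=12)
```

The first paragraph fits within 12 characters and is kept whole; the second does not, so it falls through to the space separator and is split into words.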

Chunk Size

Type: Number
Range: 10-3000 characters
Default: 1024
Description: Target size for each chunk. The formatter will try to stay close to this size while respecting separator boundaries.

Chunk Overlap

Type: Number
Range: 0-1500 characters
Default: 512
Description: Overlap between chunks to maintain context continuity.

Keep Separator

Type: Boolean
Default: true
Description: Preserves separators in the final chunks.

Best Used For

Mixed content documents
Books and articles
Documentation with varying structure
Content where natural breaks are important

Code Formatter

The Code formatter is specifically designed for source code files, understanding programming language syntax and structure.

Configuration Options

Title

Type: String
Max Length: 255 characters
Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

Type: String
Max Length: 255 characters
Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Language

Type: String (optional)
Options: cpp, go, java, kotlin, js, ts, php, proto, python, rst, ruby, rust, scala, swift, markdown, latex, html, sol, csharp, cobol, c, lua, perl, haskell, elixir, powershell
Default: None (auto-detect)
Description: Programming language of the source code. Helps the formatter understand syntax patterns and make smarter splits.
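One way a language setting could guide splitting is through per-language separator lists that start with syntax keywords. The tables and function names below are hypothetical and only illustrate the idea; the formatter's real rules and its auto-detection are not described in this document, so the sketch simply falls back to generic separators when no language is given.

```python
# Hypothetical separator tables, ordered from most to least preferred.
LANGUAGE_SEPARATORS = {
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
    "js": ["\nfunction ", "\nclass ", "\n\n", "\n", " ", ""],
}
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def separators_for(language=None):
    """Return the separator list for a language, or the generic list if unknown."""
    return LANGUAGE_SEPARATORS.get(language, DEFAULT_SEPARATORS)


def split_top_level(source, language=None):
    """Split source at the first preferred separator that actually occurs in it."""
    for separator in separators_for(language):
        if separator and separator in source:
            head, *tail = source.split(separator)
            keyword = separator.lstrip("\n")  # keep "def " / "class " on each piece
            pieces = [head] + [keyword + part for part in tail]
            return [piece for piece in pieces if piece.strip()]
    return [source]


source = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
parts = split_top_level(source, language="python")
```

With the language set to python, the split lands in front of each `def`, so every function stays in one piece.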

Chunk Size

Type: Number
Range: 10-3000 characters
Default: 60
Description: Size of each chunk in characters. The default is smaller than for prose so that chunks stay aligned with function and method boundaries.

Chunk Overlap

Type: Number
Range: 0-1500 characters
Default: 0
Description: Usually set to 0 for code to avoid duplicating logic across chunks.

Best Used For

Source code files
Programming documentation
Technical specifications
Any structured code content

FAQ Formatter

The FAQ formatter is optimized for question-and-answer content, splitting on question indicators.

Configuration Options

Title

Type: String
Max Length: 255 characters
Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

Type: String
Max Length: 255 characters
Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Separators

Type: Array of strings
Default: ["question:", "Question:", "QUESTION:", "Q:", "q:"]
Description: Question indicators that trigger splits. Common patterns include:
- "Q:" - Simple Q&A format
- "Question:" - Formal question format
- "FAQ:" - FAQ-style format
- Custom patterns like "**Q:**" or "### Q:"
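Splitting on question indicators can be sketched with a regular expression built from the separator list. This is an assumption-laden illustration: the name `split_faq` is hypothetical, and the sketch matches an indicator anywhere in the text, which may differ from how the formatter treats indicators that appear mid-sentence.

```python
import re

DEFAULT_INDICATORS = ("question:", "Question:", "QUESTION:", "Q:", "q:")


def split_faq(text, separators=DEFAULT_INDICATORS, keep_separator=True):
    """Start a new chunk at every question indicator found in the text."""
    # Longest first, so a custom pattern like "**Q:**" wins over a bare "Q:".
    ordered = sorted(separators, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    if keep_separator:
        # Zero-width lookahead: split in front of the indicator and keep it.
        pieces = re.split(f"(?=(?:{alternatives}))", text)
    else:
        pieces = re.split(f"(?:{alternatives})", text)
    return [piece.strip() for piece in pieces if piece.strip()]


faq = (
    "Q: What is chunking?\nSplitting text into pieces.\n"
    "Q: Why overlap?\nTo keep context.\n"
)
items = split_faq(faq)
bare = split_faq(faq, keep_separator=False)
```

Each entry in `items` holds one question together with its answer, which is why no overlap is needed between FAQ chunks.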

Chunk Size

Type: Number
Range: 10-3000 characters
Default: 20
Description: Small default size since FAQ items are typically short and self-contained.

Chunk Overlap

Type: Number
Range: 0-1500 characters
Default: 0
Description: No overlap needed since each FAQ item is independent.

Keep Separator

Type: Boolean
Default: true
Description: Preserves question formatting for clarity.

Best Used For

FAQ documents
Q&A content
Interview transcripts
Knowledge base articles

Web Formatter

The Web formatter is designed for web-scraped content and HTML documents, handling web-specific formatting.

Configuration Options

Title

Type: String
Max Length: 255 characters
Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

Type: String
Max Length: 255 characters
Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Chunk Size

Type: Number
Range: 10-3000 characters
Default: 1024
Description: Balanced size for web content that may include headers, paragraphs, and lists.

Chunk Overlap

Type: Number
Range: 0-1500 characters
Default: 512
Description: Maintains context across web page sections.
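This document does not spell out what web-specific handling involves. As one plausible reading, the sketch below reduces an HTML page to visible text before chunking, using Python's standard `html.parser` module. The class and function names, and the list of block-level tags, are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block elements start on a new line

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(page)
```

The resulting text could then be chunked using the Chunk Size and Chunk Overlap settings described above.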

Best Used For

Web-scraped content
HTML documents
Blog posts
Web articles and pages

Markdown Formatter

The Markdown formatter understands Markdown syntax and splits content based on headers and structure.

Configuration Options

Title

Type: String
Max Length: 255 characters
Description: Document title that will be added to each chunk of data. This helps identify the source document and provides context for the AI when retrieving information.

Description

Type: String
Max Length: 255 characters
Description: Document description that provides additional context about the content. This helps the AI understand the nature and purpose of the document.

Separators

Type: Array of strings
Default: ["\n\n", "\n", " ", ""]
Description: Fallback separators when header-based splitting creates chunks that are too large.
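Header-based splitting can be sketched as starting a new section at every header line. This minimal example assumes ATX-style headers (`#` through `######`) and does not handle headers inside fenced code blocks; the name `split_on_headers` is hypothetical. Sections that remain larger than the chunk size would then go through the fallback separators.

```python
import re


def split_on_headers(markdown):
    """Start a new section at every line that begins with an ATX header."""
    # Zero-width split in front of "# " through "###### " at the start of a line.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [section.strip() for section in sections if section.strip()]


document = (
    "# Guide\nIntro text.\n\n"
    "## Install\nRun the installer.\n\n"
    "## Usage\nCall the API.\n"
)
sections = split_on_headers(document)
oversized = [s for s in sections if len(s) > 1000]  # candidates for fallback splitting
```

Each section keeps its header line, which matches the purpose of the Keep Separator option for this formatter.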

Keep Separator

Type: Boolean
Default: true
Description: Preserves header formatting and structure.

Chunk Size

Type: Number
Range: 10-3000 characters
Default: 1000
Description: Size target for each section after header-based splitting.

Chunk Overlap

Type: Number
Range: 0-1500 characters
Default: 200
Description: Overlap to maintain context between sections.

Best Used For

Markdown documentation
README files
Technical documentation
Structured text with headers

General Guidelines

Choosing the Right Formatter

Code files: Use Code formatter with appropriate language setting
Markdown documents: Use Markdown formatter for header-aware splitting
FAQ content: Use FAQ formatter for question-based splitting
Web content: Use Web formatter for scraped or HTML content
Mixed content: Use Recursive formatter for documents with varying structure
Simple text: Use Simple formatter for basic documents
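If you process many files, the guidance above can be turned into a small lookup by file extension. The mapping and the lowercase formatter names below are hypothetical and only meant as a starting point. FAQ content cannot be recognized from an extension, so it is left for manual selection.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to formatter name.
FORMATTER_BY_SUFFIX = {
    ".py": "code",
    ".js": "code",
    ".ts": "code",
    ".go": "code",
    ".java": "code",
    ".md": "markdown",
    ".html": "web",
    ".htm": "web",
    ".txt": "simple",
}


def suggest_formatter(filename, default="recursive"):
    """Suggest a formatter from the file suffix, falling back to Recursive."""
    return FORMATTER_BY_SUFFIX.get(Path(filename).suffix.lower(), default)


suggestions = [suggest_formatter(name) for name in ("README.md", "main.PY", "notes.docx")]
```

Falling back to the Recursive formatter follows the guideline that it suits mixed content with varying structure.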

Parameter Tuning Tips

Chunk Size

Larger chunks (1500-3000): Better context retention, may hit model limits
Medium chunks (500-1500): Balanced approach, good for most use cases
Smaller chunks (100-500): Better for precise retrieval, may lose context

Chunk Overlap

High overlap (30-50% of chunk size): Maximum context preservation
Medium overlap (15-30%): Balanced approach
Low/No overlap (0-15%): Minimal duplication, faster processing
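The percentages above translate to a character count as shown in this sketch. The helper name `overlap_for` is an assumption, and capping the ratio at 50% follows the tuning tips rather than a documented hard limit.

```python
def overlap_for(chunk_size, ratio):
    """Return the chunk overlap in characters for a fraction of chunk_size."""
    if not 10 <= chunk_size <= 3000:
        raise ValueError("chunk_size must be between 10 and 3000 characters")
    if not 0 <= ratio <= 0.5:
        raise ValueError("ratio must be between 0 and 0.5")
    return round(chunk_size * ratio)


high = overlap_for(1024, 0.5)    # matches the Simple formatter defaults
medium = overlap_for(1000, 0.2)  # matches the Markdown formatter defaults
none = overlap_for(60, 0)        # matches the Code formatter defaults
```

Because the ratio never exceeds 50% and the chunk size never exceeds 3000, the result always stays within the documented 0-1500 overlap range.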

Keep Separator

Enable for formatted content where separators provide meaning
Disable for raw text processing where separators are just splitting tools
