DocSplitter Help
DocSplitter analyses scanned multi-document PDF batches and splits them into individual files using a vision AI model. This guide covers every feature, setting, and workflow.
How it works
Each file passes through a five-stage pipeline:
Ingest — A file arrives either by being dropped into a watched folder or uploaded via the API / Upload tab.
Render — Every page is rendered to an image at the configured DPI (150 by default). Higher DPI means better accuracy but slower processing and larger API payloads.
Extract text — Text is extracted from the PDF's text layer. If a page has fewer than 80 characters of usable text, Tesseract OCR is used as a fallback. The extracted text is included in the AI prompt so local models do not need to rely on vision alone to read fine print.
Analyse — The AI model examines each page in a sliding window (previous page + current page + next page) and decides: does this page start a new document? It returns a document type label and a confidence score (0–1) for each decision.
Split or queue — If the lowest confidence across all boundary decisions meets the channel's threshold, the PDF is split automatically (auto split). If not, the job enters the review queue (review) where a human can inspect and adjust the split points before approving.
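The sliding window used in the Analyse stage can be pictured with a short Python sketch. It is an illustration only, not DocSplitter's own code: `sliding_windows` is a hypothetical helper, and representing the missing neighbour of the first and last page as `None` is an assumption.

```python
from typing import Iterator, Optional, Sequence, Tuple, TypeVar

Page = TypeVar("Page")


def sliding_windows(
    pages: Sequence[Page],
) -> Iterator[Tuple[Optional[Page], Page, Optional[Page]]]:
    """Yield (previous, current, next) for every page, in order."""
    for i, current in enumerate(pages):
        # Assumption: the first page has no previous page and the last
        # page has no next page, so those slots are None.
        previous = pages[i - 1] if i > 0 else None
        following = pages[i + 1] if i + 1 < len(pages) else None
        yield previous, current, following


for window in sliding_windows(["page-1", "page-2", "page-3"]):
    print(window)
# (None, 'page-1', 'page-2')
# ('page-1', 'page-2', 'page-3')
# ('page-2', 'page-3', None)
```

Each window corresponds to one boundary decision, so a batch of N pages yields N decisions, each with its own confidence score.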
Key concepts
Channels
A channel is a named processing pipeline with its own settings — confidence threshold, document type hints, output folder, and split rules. You can have multiple channels for different document types or departments. There are two channel types:
- API channel — accepts file uploads via the Upload tab or HTTP POST. Best for on-demand or ad-hoc processing.
- Folder watcher — monitors a directory and processes files automatically when they stop changing. Best for automated workflows where files are dropped by another system.
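One common way to detect that a file has "stopped changing" is to poll its size and modification time until they stay the same for several checks in a row. The sketch below shows that idea in Python. It is an assumption about the general technique, not a description of how DocSplitter's watcher is implemented; the function name, the polling interval, and the number of checks are all hypothetical.

```python
import os
import time
from typing import Tuple


def wait_until_stable(path: str, interval: float = 1.0, checks: int = 3) -> Tuple[int, int]:
    """Block until size and mtime are unchanged for `checks` consecutive polls.

    Returns the final (size_in_bytes, mtime_ns) pair.
    """
    if checks < 1:
        raise ValueError("checks must be at least 1")
    last = None
    stable = 0
    while True:
        st = os.stat(path)
        current = (st.st_size, st.st_mtime_ns)
        # Count consecutive polls that match the previous observation.
        stable = stable + 1 if current == last else 0
        last = current
        if stable >= checks:
            return current
        time.sleep(interval)
```

A system that drops files into the watched folder can make this check trivial by writing the file somewhere else first and moving it into place only when it is complete.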
Confidence score
Every boundary decision made by the AI comes with a confidence score from 0.0 (completely uncertain) to 1.0 (certain). The channel's confidence threshold determines whether the job is auto-split or sent to review. A threshold of 0.85 means every boundary must be at least 85% confident for auto-split to proceed.
Review queue
When a job falls below the confidence threshold it enters the review queue. A reviewer uses the visual split editor to inspect the detected boundaries, adjust them if needed, and then approve or reject the job. Output files are only written after approval.
Dashboard tab
The dashboard gives you an at-a-glance view of system activity. It refreshes automatically every 10 seconds.
| Card | What it shows |
|---|---|
| Pending Review | Items waiting for human review across all channels. Also shown as a badge on the Review Queue nav item. |
| Jobs Today | Count of jobs created since midnight (local browser time). |
| Active Channels | Number of enabled channels. |
| AI Model | The model currently configured in the AI backend. |
The Recent Jobs table shows the last 10 jobs across all channels. Click any row in the Jobs tab for full detail. The AI status indicator in the bottom-left of the sidebar shows whether the configured AI backend is reachable (green) or not (red).
Upload tab
The Upload tab shows all enabled API channels as cards. You can drag a PDF (or TIFF) onto a channel card, or click the drop zone to browse for a file.
Upload workflow
Drag a file onto a channel card (the border turns blue) or click the drop zone labelled Drop PDF here or click to browse.
The file uploads and a progress entry appears on the card. Status updates automatically every 2–3 seconds.
If auto-split: Download links appear for each split PDF. If there are multiple files a Download all as ZIP button also appears.
If review required: A review required badge appears with a Review → button. Clicking it opens the full-screen split editor (see Review Queue). After you approve, the card immediately updates to show download links.
Supported formats: PDF (.pdf), TIFF (.tif, .tiff). Only API-type channels are shown on this tab; folder watcher channels do not accept manual uploads.
Managing upload entries
Each entry on a card has a × button to dismiss it. Dismissing does not cancel or delete the job — it only removes the entry from the card. The job continues processing and can be found in the Jobs tab.
Channels tab
Create, edit, and delete channels here. Click + New Channel to open the channel editor.
Channel fields
- Name: a short identifier for the channel (e.g. invoices, hr-documents). Cannot be changed after creation.
- Type: API Upload (file is submitted via HTTP) or Folder Watcher (file is dropped into a directory). Cannot be changed after creation.
- Watch Path: the directory to monitor (e.g. /app/watch/invoices). Must correspond to a mounted host directory in your docker-compose.yml.
- Output subdirectory: the subdirectory under /app/output/ where split PDFs are written. Use different subdirectories for different channels to keep output organised. Defaults to default.
- Type Hints: the document type labels you expect (e.g. invoice, purchase_order, credit_note). The model is instructed to prefer these labels when classifying pages. Leave empty for fully dynamic classification. See Type hints.
- File patterns: the patterns a filename must match (e.g. *.pdf, *.PDF). Files not matching any pattern are ignored. Both *.pdf and *.PDF are included by default because Linux filesystems are case-sensitive.
Confidence threshold
The threshold controls the balance between automation and accuracy:
| Threshold | Behaviour | Best for |
|---|---|---|
| 0% (0.0) | Every job auto-splits regardless of confidence, because any score meets the threshold | Testing; no human review |
| 50–70% | Only very low-confidence jobs go to review | Varied or complex document batches |
| 75–85% | Good balance — most clean batches auto-split | General use (recommended starting point) |
| 90–95% | Only high-confidence jobs auto-split; more go to review | High-accuracy requirements |
| 100% (1.0) | Nothing auto-splits — every job goes to review | Full manual review of all jobs |
The confidence score for a job is the lowest confidence across all boundary decisions in that job. A 10-page batch with 9 confident decisions and 1 ambiguous one will still go to review if the ambiguous decision falls below the threshold.
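The routing rule can be written out in a few lines of Python. This is a minimal sketch of the rule as described above, not DocSplitter's source: `route_job` is a hypothetical name, and treating "meets the threshold" as greater than or equal to is an assumption.

```python
from typing import Sequence


def route_job(boundary_confidences: Sequence[float], threshold: float) -> str:
    """Return 'auto split' or 'review' for one job."""
    # The job's confidence is the weakest boundary decision in the batch.
    job_confidence = min(boundary_confidences)
    # Assumption: a score exactly equal to the threshold counts as meeting it.
    return "auto split" if job_confidence >= threshold else "review"


# Ten decisions: nine confident, one ambiguous. The single weak decision
# pulls the whole job below an 85% threshold.
decisions = [0.97] * 9 + [0.62]
print(route_job(decisions, threshold=0.85))  # review
print(route_job(decisions, threshold=0.60))  # auto split
```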
Document type hints
Type hints are short snake_case labels that tell the AI what document types to expect in this channel. When hints are provided the model is instructed to prefer them when classifying a page.
Examples
- invoice,purchase_order,remittance_advice,credit_note (accounts payable batch)
- application_form,supporting_document,identification (HR or admissions batch)
- lab_report,referral_letter,consent_form (medical records batch)
When to use type hints
- When you know what document types will appear in the batch
- When you want consistent, predictable type labels in output filenames
- When you are using split trigger types — triggers must match the hint labels exactly
If you leave type hints empty the model classifies freely using whatever label it considers most appropriate. This is fine for general-purpose channels but produces less consistent output filenames.
Split trigger types
Normally DocSplitter creates a new output file for every boundary the AI detects. Split trigger types change this behaviour: a new output file is only started when the AI detects a page whose document type is in the trigger list. All other pages — even if the AI considers them a new document — are appended to the output file that started with the most recent trigger page.
The invoice + attachments scenario
A common batch pattern is: Invoice → Delivery note → Proof of delivery → Invoice → Purchase order → Invoice…
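Without trigger types, DocSplitter would split at every boundary, producing 6 separate files. With invoice as the only trigger type, it produces 3 files:
- File 1: Invoice + Delivery note + Proof of delivery
- File 2: Invoice + Purchase order
- File 3: Invoice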
Multiple trigger types
You can add multiple triggers. For example, invoice + credit_note would start a new file for either type, while still appending supporting documents to the preceding trigger document.
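The grouping behaviour can be sketched in Python as follows. This is an illustration under stated assumptions, not DocSplitter's implementation: `group_by_trigger` is a hypothetical helper, it works on one type label per detected document, and it assumes that documents appearing before the first trigger start their own file.

```python
from typing import Collection, List, Sequence


def group_by_trigger(doc_types: Sequence[str], triggers: Collection[str]) -> List[List[str]]:
    """Group detected documents into output files using trigger types."""
    if not triggers:
        # No triggers configured: every detected document is its own file.
        return [[doc_type] for doc_type in doc_types]
    groups: List[List[str]] = []
    for doc_type in doc_types:
        # Assumption: a non-trigger document at the very start of the
        # batch still opens a file, since there is nothing to append to.
        if doc_type in triggers or not groups:
            groups.append([doc_type])
        else:
            groups[-1].append(doc_type)
    return groups


batch = ["invoice", "delivery_note", "proof_of_delivery",
         "invoice", "purchase_order", "invoice"]
print(len(group_by_trigger(batch, triggers=[])))           # 6
print(group_by_trigger(batch, triggers={"invoice"}))
# [['invoice', 'delivery_note', 'proof_of_delivery'],
#  ['invoice', 'purchase_order'], ['invoice']]
```

With `triggers={"invoice", "credit_note"}`, a credit note would open a new file in the same way an invoice does, and any supporting documents after it would be appended to that file.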
Review Queue tab
The review queue lists all jobs that fell below the channel confidence threshold and are waiting for a human decision. Each entry shows the filename, channel, number of detected documents, minimum confidence score, and submission time.
Click Review to open the full-screen split editor for that job.
Split editor
The split editor has four zones:
Approving
Click Approve & Split to write the output PDFs. The split editor's current boundary configuration (including any manual adjustments) is used, not the AI's original proposal. The job status changes to approved.
Rejecting
Click Reject to discard the job. No output files are written. The job status changes to rejected. This cannot be undone.
Jobs tab
A complete history of every job processed by DocSplitter. Use the status filter dropdown to narrow the list. Click any row to expand it and see the full job ID, any error message, and the output file paths on disk.
Jobs are retained indefinitely in the database. The output files are written to the output directory and are not managed by DocSplitter after creation — you are responsible for archiving or deleting them.
System tab
Shows the currently active configuration values for the AI backend and output settings. API keys are masked. This view reflects the live configuration including any environment variable overrides — useful for verifying that your .env or local.yaml settings have been applied correctly.
The AI status indicator in the sidebar footer shows whether the AI backend is reachable. If it shows red, check the base URL and API key in your configuration and confirm the AI service is running.
AI providers
DocSplitter works with any OpenAI-compatible vision API. Configure the provider in your .env file or config/local.yaml.
OpenAI
DOCSPLITTER_AI__API_KEY=sk-your-key-here
DOCSPLITTER_AI__BASE_URL=https://api.openai.com/v1
DOCSPLITTER_AI__MODEL=gpt-4o
Azure OpenAI
DOCSPLITTER_AI__API_KEY=your-azure-api-key
DOCSPLITTER_AI__BASE_URL=https://your-resource.cognitiveservices.azure.com
DOCSPLITTER_AI__API_VERSION=2025-01-01-preview
DOCSPLITTER_AI__MODEL=gpt-4o
The API_VERSION variable is what switches DocSplitter into Azure mode. Leave it unset for standard OpenAI-compatible endpoints.
Local LLM (LM Studio / Ollama)
DOCSPLITTER_AI__API_KEY=no-key
DOCSPLITTER_AI__BASE_URL=http://host.docker.internal:1234/v1
DOCSPLITTER_AI__MODEL=qwen/qwen3-vl-8b
Use host.docker.internal to reach a model running on the host machine from inside the Docker container. The model must support vision (image inputs). Tested models include Qwen3-VL-8B — smaller models may struggle with fine text. Enable text extraction (on by default) to help smaller models.
Image detail
The DOCSPLITTER_AI__IMAGE_DETAIL setting controls how much resolution the AI model sees:
| Value | Behaviour | Notes |
|---|---|---|
| auto | Model decides based on image size | Best accuracy with GPT-4o; recommended |
| low | Fixed 512×512 preview | Faster and cheaper but may miss fine text such as invoice numbers |
| high | Always full resolution tiling | Maximum accuracy; higher cost and latency |
Scenarios
You have clean scanned batches and want everything processed without human intervention.
- Set Confidence Threshold to 75–85%
- Add Type Hints for the document types you expect
- Leave Split Trigger Types empty
- Monitor the Jobs tab for any review or failed jobs
If the review queue stays empty, consider raising the threshold to catch edge cases.
You need a human to verify every split before output files are written.
- Set Confidence Threshold to 100% (slider all the way right)
- Every job will enter the review queue regardless of AI confidence
- Reviewers open the Review Queue tab, inspect each job, adjust splits as needed, and approve
For a hybrid approach set the threshold to 90% — high-confidence jobs auto-split while ambiguous ones go to review.
Each batch contains invoices followed by their supporting documents (delivery notes, proof of delivery, etc.). You want one output file per invoice that includes its attachments.
- Add type hints: invoice,delivery_note,proof_of_delivery,remittance_advice,credit_note
- Add split trigger type: invoice
- DocSplitter will only start a new output file when it detects an invoice. Supporting documents are appended to the preceding invoice file.
- Set confidence threshold to 75–85%
A scanner or another system drops files into a network folder. DocSplitter should pick them up automatically.
- Create a Folder Watcher channel in the Channels tab
- Set the Watch Path to the container path (e.g. /app/watch/invoices)
- In docker-compose.yml, mount the host folder to the same container path:
  volumes:
    - /mnt/scanner/invoices:/app/watch/invoices
- Restart the container. Drop a PDF into the host folder and it will be processed automatically.
You want to run fully on-premises without sending data to a cloud API.
- Install LM Studio or Ollama on the host machine and load a vision-capable model (e.g. Qwen3-VL-8B)
- Start the model server and note its port (LM Studio default: 1234)
- In .env:
  DOCSPLITTER_AI__BASE_URL=http://host.docker.internal:1234/v1
  DOCSPLITTER_AI__API_KEY=no-key
  DOCSPLITTER_AI__MODEL=qwen/qwen3-vl-8b
  DOCSPLITTER_AI__IMAGE_DETAIL=auto
- Rebuild and restart: docker compose up --build -d
Tips for local models: Text extraction (on by default) supplies extracted page text to the prompt, which significantly helps smaller models that struggle to read fine print from images. Use image_detail: auto. If accuracy is still poor, try a larger model (14B+) or switch to a cloud API for complex batches.
Job statuses
| Status | Meaning |
|---|---|
| pending | Job created, not yet started. |
| processing | File is being rendered, analysed, or split. |
| auto split | Confidence threshold met; output files written automatically. |
| review | Below threshold; waiting for human review in the Review Queue. |
| approved | Reviewer approved; output files written. |
| rejected | Reviewer rejected; no output written. |
| failed | Unrecoverable processing error. Expand the row in the Jobs tab to see the error message. |
Output files
Split PDFs are written to /app/output/<output_subdir>/ inside the container, which maps to ./output/<output_subdir>/ on the host.
Filename template
The default template is {date}_{doc_type}_{doc_index:03d}.pdf, producing filenames like 2026-04-01_invoice_001.pdf.
| Variable | Value |
|---|---|
| {channel} | Channel name |
| {date} | Processing date in YYYY-MM-DD format |
| {doc_type} | Document type as classified by the AI (e.g. invoice) |
| {doc_index} | 1-based index of the document within the batch. Supports Python format spec: {doc_index:03d} → 001 |
| {job_id} | Full job UUID |
Override the template in your .env or config/local.yaml:
DOCSPLITTER_OUTPUT__FILENAME_TEMPLATE={channel}_{date}_{doc_type}_{doc_index:03d}.pdf
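To preview what a template will produce before deploying it, you can expand it with Python's built-in `str.format`, which uses the same format spec syntax as `{doc_index:03d}`. The helper below is a hypothetical sketch for experimenting with templates; whether DocSplitter expands templates in exactly this way is an assumption, and the channel name and job ID are made-up sample values.

```python
from datetime import date


def render_filename(template: str, *, channel: str, processed_on: date,
                    doc_type: str, doc_index: int, job_id: str) -> str:
    """Expand a filename template using the documented variables."""
    return template.format(
        channel=channel,
        date=processed_on.isoformat(),  # YYYY-MM-DD
        doc_type=doc_type,
        doc_index=doc_index,            # 1-based; {doc_index:03d} pads to 3 digits
        job_id=job_id,
    )


sample = dict(channel="invoices", processed_on=date(2026, 4, 1),
              doc_type="invoice", doc_index=1, job_id="sample-job-id")

print(render_filename("{date}_{doc_type}_{doc_index:03d}.pdf", **sample))
# 2026-04-01_invoice_001.pdf
print(render_filename("{channel}_{date}_{doc_type}_{doc_index:03d}.pdf", **sample))
# invoices_2026-04-01_invoice_001.pdf
```

Including {channel} or {job_id} in the template is a simple way to avoid filename collisions when several channels share an output subdirectory.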
Metadata sidecar
When write_metadata_json is enabled (default: on), a .json file is written alongside each split PDF. It records the source filename, page range, confidence score, document type, channel, job ID, and model used. Useful for auditing and downstream processing.
API endpoints
Interactive API documentation (Swagger UI) is available at /docs. A summary of the most useful endpoints:
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/ingest/upload | Upload a PDF. Query param: ?channel=name. Returns job_id. |
| GET | /api/v1/jobs/{job_id} | Poll job status and get output paths. |
| GET | /api/v1/jobs/{job_id}/outputs/{index} | Download a split output PDF by index. |
| GET | /api/v1/jobs/{job_id}/download-zip | Download all outputs for a job as a ZIP. |
| GET | /api/v1/jobs/{job_id}/review | Get the review item for a job. |
| GET | /api/v1/review | List review items. Filter by ?status=pending. |
| PUT | /api/v1/review/{review_id}/boundaries | Update split boundaries programmatically. |
| POST | /api/v1/review/{review_id}/approve | Approve and write output files. |
| POST | /api/v1/review/{review_id}/reject | Reject; no output written. |
| GET | /api/v1/channels | List all channels. |
| POST | /api/v1/channels | Create a channel. |
| PUT | /api/v1/channels/{name} | Update a channel. |
| DELETE | /api/v1/channels/{name} | Delete a channel. |
| GET | /api/v1/health | Health check including AI backend reachability. |
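A typical client uploads a file, keeps the returned job_id, and then polls the job endpoint until the job leaves the pending and processing states. The Python sketch below covers the polling half using only the standard library. Treat it as a starting point under assumptions: the base URL is a placeholder for your own deployment, and the `status` field name in the JSON response is assumed, so check the Swagger UI at /docs for the actual response schema.

```python
import json
import time
import urllib.parse
import urllib.request

# Statuses that mean the job is still running (see Job statuses).
IN_PROGRESS = {"pending", "processing"}


def job_url(base_url: str, job_id: str) -> str:
    """Build the URL for GET /api/v1/jobs/{job_id}."""
    return f"{base_url.rstrip('/')}/api/v1/jobs/{urllib.parse.quote(job_id)}"


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_job(base_url: str, job_id: str, fetch=fetch_json,
                 interval: float = 2.0, max_polls: int = 150) -> dict:
    """Poll until the job is no longer pending or processing."""
    url = job_url(base_url, job_id)
    for _ in range(max_polls):
        job = fetch(url)
        # Assumption: the response carries the job state in a "status" field.
        if job.get("status") not in IN_PROGRESS:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still in progress after {max_polls} polls")


# Example (placeholder base URL and job ID):
# job = wait_for_job("http://localhost:8000", "your-job-id")
# print(job)
```

The `fetch` parameter exists so the polling loop can be exercised without a running server, for example by passing a function that returns canned responses.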