DocSplitter Help

DocSplitter analyses scanned multi-document PDF batches and splits them into individual files using a vision AI model. This guide covers every feature, setting, and workflow.

How it works

Each file passes through a five-stage pipeline:

  1. Ingest: A file arrives either by being dropped into a watched folder or uploaded via the API / Upload tab.
  2. Render: Every page is rendered to an image at the configured DPI (150 by default). Higher DPI means better accuracy but slower processing and larger API payloads.
  3. Extract text: Text is extracted from the PDF's text layer. If a page has fewer than 80 characters of usable text, Tesseract OCR is used as a fallback. The extracted text is included in the AI prompt so local models do not need to rely on vision alone to read fine print.
  4. Analyse: The AI model examines each page in a sliding window (previous page + current page + next page) and decides: does this page start a new document? It returns a document type label and a confidence score (0–1) for each decision.
  5. Split or queue: If the lowest confidence across all boundary decisions meets the channel's threshold, the PDF is split automatically (status: auto split). If not, the job enters the review queue (status: review), where a human can inspect and adjust the split points before approving.
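
The text-extraction fallback (stage 3) and the sliding window (stage 4) can be sketched roughly as follows. This is a minimal illustration, not DocSplitter's actual code: the names needs_ocr and sliding_windows are hypothetical, and "usable text" is approximated here as the non-whitespace characters on the page.

```python
from typing import Iterator, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")

MIN_TEXT_CHARS = 80  # pages with fewer usable characters fall back to OCR


def needs_ocr(text_layer: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when a page's text layer is too thin to be useful.

    Assumption: "usable text" means non-whitespace characters.
    """
    usable = "".join(text_layer.split())
    return len(usable) < min_chars


def sliding_windows(
    pages: Sequence[T],
) -> Iterator[Tuple[Optional[T], T, Optional[T]]]:
    """Yield (previous, current, next) for every page; batch edges get None."""
    for i, current in enumerate(pages):
        previous = pages[i - 1] if i > 0 else None
        following = pages[i + 1] if i + 1 < len(pages) else None
        yield previous, current, following
```

For a three-page batch the windows are (None, 1, 2), (1, 2, 3) and (2, 3, None), so the first and last pages are judged with only one neighbour.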

Key concepts

Channels

A channel is a named processing pipeline with its own settings — confidence threshold, document type hints, output folder, and split rules. You can have multiple channels for different document types or departments. There are two channel types:

  • API channel — accepts file uploads via the Upload tab or HTTP POST. Best for on-demand or ad-hoc processing.
  • Folder watcher — monitors a directory and processes files automatically when they stop changing. Best for automated workflows where files are dropped by another system.

Confidence score

Every boundary decision made by the AI comes with a confidence score from 0.0 (completely uncertain) to 1.0 (certain). The channel's confidence threshold determines whether the job is auto-split or sent to review. A threshold of 0.85 means every boundary must be at least 85% confident for auto-split to proceed.

Review queue

When a job falls below the confidence threshold it enters the review queue. A reviewer uses the visual split editor to inspect the detected boundaries, adjust them if needed, and then approve or reject the job. Output files are only written after approval.

Dashboard tab

The dashboard gives you an at-a-glance view of system activity. It refreshes automatically every 10 seconds.

Summary cards

| Card | What it shows |
|---|---|
| Pending Review | Items waiting for human review across all channels. Also shown as a badge on the Review Queue nav item. |
| Jobs Today | Count of jobs created since midnight (local browser time). |
| Active Channels | Number of enabled channels. |
| AI Model | The model currently configured in the AI backend. |

The Recent Jobs table shows the last 10 jobs across all channels. Click any row in the Jobs tab for full detail. The AI status indicator in the bottom-left of the sidebar shows whether the configured AI backend is reachable (green) or not (red).

Upload tab

The Upload tab shows all enabled API channels as cards. You can drag a PDF (or TIFF) onto a channel card, or click the drop zone to browse for a file.

Upload workflow

  1. Drag a file onto a channel card (the border turns blue) or click "Drop PDF here or click to browse".
  2. The file uploads and a progress entry appears on the card. Status updates automatically every 2–3 seconds.
  3. If auto-split: Download links appear for each split PDF. If there are multiple files, a Download all as ZIP button also appears.
  4. If review required: A review required badge appears with a Review → button. Clicking it opens the full-screen split editor (see Review Queue). After you approve, the card immediately updates to show download links.

Supported formats: PDF (.pdf), TIFF (.tif, .tiff). Only API-type channels are shown on this tab — folder watcher channels do not accept manual uploads.

Managing upload entries

Each entry on a card has a × button to dismiss it. Dismissing does not cancel or delete the job — it only removes the entry from the card. The job continues processing and can be found in the Jobs tab.

Channels tab

Create, edit, and delete channels here. Click + New Channel to open the channel editor.

Restart required for watcher channels: Changes to a folder watcher channel only take effect after the container is restarted. The Status column will show Restart Required until then. API channels apply immediately.

Channel fields

Name
Unique identifier for the channel. Lowercase letters, numbers, hyphens, and underscores only (e.g. invoices, hr-documents). Cannot be changed after creation.
Description
Optional free-text description shown in the channels table and on the Upload tab channel cards. Helps users identify which channel to use.
Type
API Upload — file is submitted via HTTP. Folder Watcher — file is dropped into a directory. Cannot be changed after creation.
Watch Path (watcher only)
Absolute path inside the container that will be monitored (e.g. /app/watch/invoices). Must correspond to a mounted host directory in your docker-compose.yml.
Output Subdirectory
Sub-folder under /app/output/ where split PDFs are written. Use different subdirectories for different channels to keep output organised. Defaults to default.
Confidence Threshold
Minimum confidence required across all boundary decisions for the job to be auto-split (0–100%). Below this value the job enters the review queue. See Confidence threshold for guidance on choosing a value.
Document Type Hints
Short labels that tell the AI what types of documents to expect (e.g. invoice, purchase_order, credit_note). The model is instructed to prefer these labels when classifying pages. Leave empty for fully dynamic classification. See Type hints.
Split Trigger Types
When set, a new document boundary is only created when the AI detects a page of one of these types. Pages detected as a new document but with a different type are appended to the previous document rather than starting a new one. Leave empty to split on every detected boundary. See Split trigger types.
Stable Seconds (watcher only)
How long a file must remain unchanged before it is considered fully written and safe to process. Increase this if large files are being picked up before they finish copying. Default: 2 seconds.
File Patterns (watcher only)
Glob patterns controlling which files are processed (e.g. *.pdf, *.PDF). Files not matching any pattern are ignored. Both *.pdf and *.PDF are included by default because Linux filesystems are case-sensitive.
Enabled
Disabled channels are not shown on the Upload tab and their folder watchers are not started. Use this to temporarily pause a channel without deleting it.
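
The two watcher-only checks, File Patterns and Stable Seconds, can be illustrated with a short sketch. It is not the watcher's real implementation: the helper names are hypothetical, and "unchanged" is assumed here to mean that the file's modification time is at least Stable Seconds old.

```python
import fnmatch
import os
import time
from typing import Iterable, Optional

DEFAULT_PATTERNS = ("*.pdf", "*.PDF")  # defaults described above
DEFAULT_STABLE_SECONDS = 2.0


def matches_patterns(filename: str, patterns: Iterable[str] = DEFAULT_PATTERNS) -> bool:
    """Case-sensitive glob match, which is why both *.pdf and *.PDF are listed."""
    return any(fnmatch.fnmatchcase(filename, pattern) for pattern in patterns)


def is_stable(
    path: str,
    stable_seconds: float = DEFAULT_STABLE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Assumption: a file counts as unchanged once its mtime is old enough."""
    current_time = time.time() if now is None else now
    return current_time - os.path.getmtime(path) >= stable_seconds
```

With the default patterns, scan.pdf and scan.PDF match but scan.Pdf does not, so add further patterns if your scanner produces mixed-case extensions.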

Confidence threshold

The threshold controls the balance between automation and accuracy:

| Threshold | Behaviour | Best for |
|---|---|---|
| 0% (0.0) | Every job auto-splits regardless of confidence | Testing; no manual review |
| 50–70% | Only very low-confidence jobs go to review | Varied or complex document batches |
| 75–85% | Good balance: most clean batches auto-split | General use (recommended starting point) |
| 90–95% | Only high-confidence jobs auto-split; more go to review | High-accuracy requirements |
| 100% (1.0) | Nothing auto-splits; every job goes to review | Full manual review of all jobs |

Start with 75–80% and monitor the review queue. If too many correct splits are being sent to review, lower the threshold. If too many incorrect auto-splits are happening, raise it so that more jobs pass through the review queue.

The confidence score for a job is the lowest confidence across all boundary decisions in that job. A 10-page batch with 9 confident decisions and 1 ambiguous one will still go to review if the ambiguous decision falls below the threshold.
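
The rule above amounts to taking the minimum and comparing it with the threshold. A minimal sketch, with hypothetical function names and return values, and with an empty list treated as an error by assumption:

```python
from typing import Sequence


def job_confidence(boundary_confidences: Sequence[float]) -> float:
    """The job-level score is the lowest score among its boundary decisions."""
    if not boundary_confidences:
        raise ValueError("a job needs at least one boundary decision")
    return min(boundary_confidences)


def route_job(boundary_confidences: Sequence[float], threshold: float) -> str:
    """Return 'auto_split' when the job score meets the threshold, else 'review'."""
    if job_confidence(boundary_confidences) >= threshold:
        return "auto_split"
    return "review"


# Nine confident decisions and one ambiguous one: the weakest decision decides.
scores = [0.97] * 9 + [0.62]
print(route_job(scores, threshold=0.85))         # review
print(route_job([0.97] * 10, threshold=0.85))    # auto_split
```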

Document type hints

Type hints are short snake_case labels that tell the AI what document types to expect in this channel. When hints are provided the model is instructed to prefer them when classifying a page.

Examples

  • invoice, purchase_order, remittance_advice, credit_note — accounts payable batch
  • application_form, supporting_document, identification — HR or admissions batch
  • lab_report, referral_letter, consent_form — medical records batch

When to use type hints

  • When you know what document types will appear in the batch
  • When you want consistent, predictable type labels in output filenames
  • When you are using split trigger types — triggers must match the hint labels exactly

If you leave type hints empty the model classifies freely using whatever label it considers most appropriate. This is fine for general-purpose channels but produces less consistent output filenames.

Split trigger types

Normally DocSplitter creates a new output file for every boundary the AI detects. Split trigger types change this behaviour: a new output file is only started when the AI detects a page whose document type is in the trigger list. All other pages — even if the AI considers them a new document — are appended to the output file that started with the most recent trigger page.

The invoice + attachments scenario

A common batch pattern is: Invoice → Delivery note → Proof of delivery → Invoice → Purchase order → Invoice…

Without trigger types, DocSplitter would split every boundary, producing 6 separate files. With invoice as the only trigger type, it produces 3 files:

invoice → starts output file 1
delivery note → appended to file 1 (not a trigger)
proof of delivery → appended to file 1 (not a trigger)
invoice → starts output file 2
purchase order → appended to file 2 (not a trigger)
invoice → starts output file 3

Important: Trigger type values must exactly match the document type labels the AI will produce. If you use type hints, the trigger values should match your hint labels. If you use free classification, check the review UI or job output to see what labels the model is using.

Multiple trigger types

You can add multiple triggers. For example, invoice + credit_note would start a new file for either type, while still appending supporting documents to the preceding trigger document.
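
The grouping behaviour can be sketched as below. The function name is hypothetical, and the handling of documents that appear before the first trigger page (they start their own file) is an assumption, since the text above does not cover that case.

```python
from typing import List, Sequence, Set


def group_by_triggers(doc_types: Sequence[str], triggers: Set[str]) -> List[List[str]]:
    """Group detected documents into output files.

    doc_types: type label of each detected document, in batch order.
    triggers:  labels that start a new output file; empty means split everywhere.
    Assumption: documents before the first trigger form their own file.
    """
    files: List[List[str]] = []
    for doc_type in doc_types:
        starts_new_file = not triggers or doc_type in triggers or not files
        if starts_new_file:
            files.append([doc_type])
        else:
            files[-1].append(doc_type)
    return files


batch = ["invoice", "delivery_note", "proof_of_delivery",
         "invoice", "purchase_order", "invoice"]
print(len(group_by_triggers(batch, set())))         # 6
print(len(group_by_triggers(batch, {"invoice"})))   # 3
```

Because the comparison is an exact string match, a trigger of invoice would not match a model label such as Invoice or tax_invoice.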

Review Queue tab

The review queue lists all jobs that fell below the channel confidence threshold and are waiting for a human decision. Each entry shows the filename, channel, number of detected documents, minimum confidence score, and submission time.

Click Review to open the full-screen split editor for that job.

Split editor

The split editor has four zones:

Page strip (top)
All pages of the document shown as thumbnails in order. Coloured borders group pages into detected documents — each document gets its own colour. The document type label is shown above the first page of each section. Click any thumbnail to view it full-size in the preview area below.
Split dividers
Between every pair of adjacent pages there is a small handle. A red handle with an × indicates an existing split boundary — click it to remove the split (merging the two documents into one). A dashed handle with a + indicates no split — click it to add a split (starting a new document at that page).
Types bar (below strip)
Shows one entry per detected document section with its page range and an editable type field. Edit the type label directly here to correct misclassified documents before approving.
Page preview (main area)
Full-size view of the selected page. Click any thumbnail in the strip to switch pages.

Approving

Click Approve & Split to write the output PDFs. The split editor's current boundary configuration (including any manual adjustments) is used, not the AI's original proposal. The job status changes to approved.

Rejecting

Click Reject to discard the job. No output files are written. The job status changes to rejected. This cannot be undone.

After approving from the Upload tab's Review → button, the upload card on the Upload tab automatically updates to show the download links for the newly created files.

Jobs tab

A complete history of every job processed by DocSplitter. Use the status filter dropdown to narrow the list. Click any row to expand it and see the full job ID, any error message, and the output file paths on disk.

Jobs are retained indefinitely in the database. The output files are written to the output directory and are not managed by DocSplitter after creation — you are responsible for archiving or deleting them.

System tab

Shows the currently active configuration values for the AI backend and output settings. API keys are masked. This view reflects the live configuration including any environment variable overrides — useful for verifying that your .env or local.yaml settings have been applied correctly.

The AI status indicator in the sidebar footer shows whether the AI backend is reachable. If it shows red, check the base URL and API key in your configuration and confirm the AI service is running.

AI providers

DocSplitter works with any OpenAI-compatible vision API. Configure the provider in your .env file or config/local.yaml.

OpenAI

DOCSPLITTER_AI__API_KEY=sk-your-key-here
DOCSPLITTER_AI__BASE_URL=https://api.openai.com/v1
DOCSPLITTER_AI__MODEL=gpt-4o

Azure OpenAI

DOCSPLITTER_AI__API_KEY=your-azure-api-key
DOCSPLITTER_AI__BASE_URL=https://your-resource.cognitiveservices.azure.com
DOCSPLITTER_AI__API_VERSION=2025-01-01-preview
DOCSPLITTER_AI__MODEL=gpt-4o

The API_VERSION variable is what switches DocSplitter into Azure mode. Leave it unset for standard OpenAI-compatible endpoints.

Local LLM (LM Studio / Ollama)

DOCSPLITTER_AI__API_KEY=no-key
DOCSPLITTER_AI__BASE_URL=http://host.docker.internal:1234/v1
DOCSPLITTER_AI__MODEL=qwen/qwen3-vl-8b

Use host.docker.internal to reach a model running on the host machine from inside the Docker container. The model must support vision (image inputs). Tested models include Qwen3-VL-8B — smaller models may struggle with fine text. Enable text extraction (on by default) to help smaller models.

Image detail

The DOCSPLITTER_AI__IMAGE_DETAIL setting controls how much resolution the AI model sees:

| Value | Behaviour | Notes |
|---|---|---|
| auto | Model decides based on image size | Best accuracy with GPT-4o; recommended |
| low | Fixed 512×512 preview | Faster and cheaper but may miss fine text such as invoice numbers |
| high | Always full resolution tiling | Maximum accuracy; higher cost and latency |
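
For orientation, this is roughly where such a setting ends up in an OpenAI-compatible request: each page image is sent as an image_url content part with a detail field. The sketch assumes DocSplitter passes the setting through unchanged and sends pages as PNG data URLs; both are assumptions, and image_part is a hypothetical helper.

```python
import base64
from typing import Any, Dict

ALLOWED_DETAIL = {"auto", "low", "high"}


def image_part(png_bytes: bytes, detail: str = "auto") -> Dict[str, Any]:
    """Build one image content part in the OpenAI chat-completions format."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{encoded}",
            "detail": detail,
        },
    }
```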

Scenarios

Fully automated splitting (auto-split)

You have clean scanned batches and want everything processed without human intervention.

  • Set Confidence Threshold to 75–85%
  • Add Type Hints for the document types you expect
  • Leave Split Trigger Types empty
  • Monitor the Jobs tab for any review or failed jobs

If the review queue stays empty, consider raising the threshold to catch edge cases.

Human-in-the-loop review (review queue)

You need a human to verify every split before output files are written.

  • Set Confidence Threshold to 100% (slider all the way right)
  • Every job will enter the review queue regardless of AI confidence
  • Reviewers open the Review Queue tab, inspect each job, adjust splits as needed, and approve

For a hybrid approach set the threshold to 90% — high-confidence jobs auto-split while ambiguous ones go to review.

Invoices with supporting attachments (split triggers)

Each batch contains invoices followed by their supporting documents (delivery notes, proof of delivery, etc.). You want one output file per invoice that includes its attachments.

  • Add type hints: invoice, delivery_note, proof_of_delivery, remittance_advice, credit_note
  • Add split trigger type: invoice
  • DocSplitter will only start a new output file when it detects an invoice. Supporting documents are appended to the preceding invoice file.
  • Set confidence threshold to 75–85%

Automated folder watcher (watcher)

A scanner or another system drops files into a network folder. DocSplitter should pick them up automatically.

  1. Create a Folder Watcher channel in the Channels tab
  2. Set the Watch Path to the container path (e.g. /app/watch/invoices)
  3. In docker-compose.yml, mount the host folder to the same container path:
    volumes:
      - /mnt/scanner/invoices:/app/watch/invoices
  4. Restart the container. Drop a PDF into the host folder and it will be processed automatically.

Changes to watcher channels require a container restart to take effect. The Status column shows Restart Required as a reminder.

Running with a local AI model (local LLM)

You want to run fully on-premises without sending data to a cloud API.

  1. Install LM Studio or Ollama on the host machine and load a vision-capable model (e.g. Qwen3-VL-8B)
  2. Start the model server and note its port (LM Studio default: 1234)
  3. In .env:
    DOCSPLITTER_AI__BASE_URL=http://host.docker.internal:1234/v1
    DOCSPLITTER_AI__API_KEY=no-key
    DOCSPLITTER_AI__MODEL=qwen/qwen3-vl-8b
    DOCSPLITTER_AI__IMAGE_DETAIL=auto
  4. Rebuild and restart: docker compose up --build -d

Tips for local models: Text extraction (on by default) supplies extracted page text to the prompt, which significantly helps smaller models that struggle to read fine print from images. Use image_detail: auto. If accuracy is still poor, try a larger model (14B+) or switch to a cloud API for complex batches.

Job statuses

| Status | Meaning |
|---|---|
| pending | Job created, not yet started. |
| processing | File is being rendered, analysed, or split. |
| auto split | Confidence threshold met; output files written automatically. |
| review | Below threshold; waiting for human review in the Review Queue. |
| approved | Reviewer approved; output files written. |
| rejected | Reviewer rejected; no output written. |
| failed | Unrecoverable processing error. Expand the row in the Jobs tab to see the error message. |

Output files

Split PDFs are written to /app/output/<output_subdir>/ inside the container, which maps to ./output/<output_subdir>/ on the host.

Filename template

The default template is {date}_{doc_type}_{doc_index:03d}.pdf, producing filenames like 2026-04-01_invoice_001.pdf.

| Variable | Value |
|---|---|
| {channel} | Channel name |
| {date} | Processing date in YYYY-MM-DD format |
| {doc_type} | Document type as classified by the AI (e.g. invoice) |
| {doc_index} | 1-based index of the document within the batch. Supports Python format spec: {doc_index:03d} → 001 |
| {job_id} | Full job UUID |

Override the template in your .env or config/local.yaml:

DOCSPLITTER_OUTPUT__FILENAME_TEMPLATE={channel}_{date}_{doc_type}_{doc_index:03d}.pdf
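
Since the template uses Python format-spec syntax, its expansion can be previewed with str.format. The helper below is a hypothetical sketch for checking a template before deploying it, not DocSplitter's own rendering code.

```python
import datetime

DEFAULT_TEMPLATE = "{date}_{doc_type}_{doc_index:03d}.pdf"


def render_filename(
    template: str,
    *,
    channel: str,
    date: datetime.date,
    doc_type: str,
    doc_index: int,
    job_id: str,
) -> str:
    """Expand a filename template; doc_index stays an int so :03d works."""
    return template.format(
        channel=channel,
        date=date.isoformat(),  # YYYY-MM-DD
        doc_type=doc_type,
        doc_index=doc_index,
        job_id=job_id,
    )


print(render_filename(
    DEFAULT_TEMPLATE,
    channel="invoices",
    date=datetime.date(2026, 4, 1),
    doc_type="invoice",
    doc_index=1,
    job_id="example-job-id",
))  # 2026-04-01_invoice_001.pdf
```

Variables that the template does not mention are simply ignored, so the same call works for both the default and the overridden template.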

Metadata sidecar

When write_metadata_json is enabled (default: on), a .json file is written alongside each split PDF. It records the source filename, page range, confidence score, document type, channel, job ID, and model used. Useful for auditing and downstream processing.

API endpoints

Interactive API documentation (Swagger UI) is available at /docs. A summary of the most useful endpoints:

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/ingest/upload | Upload a PDF. Query param: ?channel=name. Returns job_id. |
| GET | /api/v1/jobs/{job_id} | Poll job status and get output paths. |
| GET | /api/v1/jobs/{job_id}/outputs/{index} | Download a split output PDF by index. |
| GET | /api/v1/jobs/{job_id}/download-zip | Download all outputs for a job as a ZIP. |
| GET | /api/v1/jobs/{job_id}/review | Get the review item for a job. |
| GET | /api/v1/review | List review items. Filter by ?status=pending. |
| PUT | /api/v1/review/{review_id}/boundaries | Update split boundaries programmatically. |
| POST | /api/v1/review/{review_id}/approve | Approve and write output files. |
| POST | /api/v1/review/{review_id}/reject | Reject; no output written. |
| GET | /api/v1/channels | List all channels. |
| POST | /api/v1/channels | Create a channel. |
| PUT | /api/v1/channels/{name} | Update a channel. |
| DELETE | /api/v1/channels/{name} | Delete a channel. |
| GET | /api/v1/health | Health check including AI backend reachability. |
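
A client that polls GET /api/v1/jobs/{job_id} until the job leaves the in-progress states might look like the sketch below. Several details are assumptions to verify against /docs: that the response is a JSON object with a status field, that the in-progress values are spelled pending and processing, and whatever base URL your deployment uses. The helper names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict

# Assumption: status strings as listed under "Job statuses".
IN_PROGRESS = {"pending", "processing"}


def job_url(base_url: str, job_id: str) -> str:
    """Build the URL for GET /api/v1/jobs/{job_id}."""
    return f"{base_url.rstrip('/')}/api/v1/jobs/{urllib.parse.quote(job_id, safe='')}"


def fetch_json(url: str) -> Dict[str, Any]:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_job(
    base_url: str,
    job_id: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job is no longer pending or processing, then return it."""
    url = job_url(base_url, job_id)
    for _ in range(max_polls):
        job = fetch(url)
        if job.get("status") not in IN_PROGRESS:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} still in progress after {max_polls} polls")
```

The fetch and sleep parameters are injectable so the loop can be exercised without a running server; in normal use the defaults are sufficient, for example wait_for_job("http://localhost:8000", job_id) if the service is exposed on that (assumed) address.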


Fybre