Introducing the Mathpix Files API

Over the past few years, we’ve helped teams convert PDFs and other documents into structured Markdown at scale. Today we’re excited to introduce the Mathpix Files API, built specifically for high-throughput, asynchronous document processing at a lower cost per page.

What’s new

The Files API lets you submit and process documents — one at a time or in large batches — from a variety of sources:

POST /files/v1 — upload a single file directly
POST /files/v1/uri — process a single document from a source URI, routed automatically by scheme:
- s3://bucket/key (Amazon S3)
- gs://bucket/object (Google Cloud Storage)
- https://… (a public or presigned URL)
- https://{account}.blob.core.windows.net/{container}/{blob} (Azure Blob Storage)
POST /files/v1/jobs — submit many documents in one request. Pass a list of sources (any mix of the URI types above), each with an optional custom_id you choose and an optional per-file destination_uri to write that file’s results to, plus job-level conversion formats applied to the whole batch. Returns a job_id.
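
The scheme-based routing above happens on our side, but a client can mirror it to reject unsupported URIs before submitting. Here is a minimal sketch of that idea; the function name and the returned labels are illustrative only and are not part of the API.

```python
from urllib.parse import urlparse


def classify_source_uri(uri: str) -> str:
    """Return an illustrative label for the backend a source URI points at."""
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    host = (parsed.hostname or "").lower()
    if scheme == "s3":
        return "s3"
    if scheme == "gs":
        return "gcs"
    if scheme == "https":
        # Azure Blob URLs are ordinary https URLs on a *.blob.core.windows.net host.
        if host.endswith(".blob.core.windows.net"):
            return "azure_blob"
        return "https"
    raise ValueError(f"unsupported source URI: {uri!r}")


print(classify_source_uri("s3://my-bucket/a.pdf"))       # s3
print(classify_source_uri("https://example.com/b.pdf"))  # https
```

A check like this only catches malformed or unsupported URIs early; whether a given bucket is actually readable is still decided when the file is processed.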

Retrieve and track your results:

GET /files/v1/{file_id} — poll a file’s processing status
GET /files/v1/{file_id}.{ext} — download a converted result (docx, md, tex.zip, mmd, …) when you aren’t writing to your own bucket
GET /files/v1/jobs/{job_id} — poll a job’s status and counters
GET /files/v1/jobs/{job_id}/files?status=error — list a job’s files, filterable by status (e.g. to find which ones failed and retry them by custom_id)
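
As a rough sketch of the retry flow, the helper below takes the entries you originally submitted plus the entries returned by the status=error listing and builds a follow-up job request. It assumes each listed entry carries a "custom_id" key (the response shape is not shown in this post), and it uses a fresh job_id for the retry because resubmitting the same (job_id, custom_id) pair is described as returning the existing file. Check both assumptions against the reference before relying on them.

```python
def build_retry_job(retry_job_id, submitted, failed_files, conversion_formats):
    """Build a jobs request containing only the entries that failed.

    submitted:    the file entries sent in the original job request.
    failed_files: entries from the status=error listing; each is ASSUMED
                  to include a "custom_id" key.
    """
    failed_ids = {f["custom_id"] for f in failed_files if f.get("custom_id")}
    retry_files = [e for e in submitted if e.get("custom_id") in failed_ids]
    return {
        "job_id": retry_job_id,  # hypothetical new id chosen by the caller
        "files": retry_files,
        "conversion_formats": conversion_formats,
    }


submitted = [
    {"source_uri": "s3://my-bucket/a.pdf", "custom_id": "doc-1"},
    {"source_uri": "https://example.com/b.pdf", "custom_id": "doc-2"},
]
failed = [{"custom_id": "doc-2", "status": "error"}]
retry = build_retry_job("my-batch-2026-06-retry", submitted, failed, {"docx": True})
```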

All submission endpoints (the POST routes above) accept the same OCR options as the v3/pdf endpoint.

Examples

POST /files/v1/uri

// Request
{ "source_uri": "s3://my-bucket/a.pdf", "conversion_formats": { "docx": true } }

// Response
{ "file_id": "b1c9c3a8-55e4-4a09-b7d0-218ba5de4c4d" }
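
For reference, the same request can be assembled with Python's standard library. This is a sketch under two assumptions that this post does not confirm: that the Files API lives on the same host as v3/pdf (https://api.mathpix.com) and that it accepts the same app_id / app_key headers. The helper name is illustrative.

```python
import json
import urllib.request

BASE_URL = "https://api.mathpix.com"  # assumption: same host as v3/pdf


def build_uri_request(source_uri, conversion_formats, app_id, app_key):
    """Build (but do not send) a POST /files/v1/uri request."""
    body = json.dumps(
        {"source_uri": source_uri, "conversion_formats": conversion_formats}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/files/v1/uri",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "app_id": app_id,    # assumption: same auth headers as v3
            "app_key": app_key,
        },
    )


req = build_uri_request("s3://my-bucket/a.pdf", {"docx": True}, "APP_ID", "APP_KEY")
# To send it:
#   with urllib.request.urlopen(req) as resp:
#       file_id = json.load(resp)["file_id"]
```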

…or submit many at once:

POST /files/v1/jobs

// Request
{
  "job_id": "my-batch-2026-06",
  "files": [
    { "source_uri": "s3://my-bucket/a.pdf", "custom_id": "doc-1" },
    { "source_uri": "https://example.com/b.pdf", "custom_id": "doc-2" }
  ],
  "conversion_formats": { "docx": true }
}

// Response
{ "job_id": "my-batch-2026-06", "file_count": 2 }

custom_id (optional, per file) is your own identifier, echoed back in results so you don’t have to track our file_ids. It requires a job_id that you supply. Together, the pair (job_id, custom_id) makes submission idempotent: retrying with the same pair returns the existing file_id instead of creating a duplicate, so network blips and client timeouts can’t double-bill you.
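
Because idempotency is keyed on (job_id, custom_id), it helps to build the request deterministically and to catch duplicate custom_ids on the client. The sketch below does that. The uniqueness check is our own precaution, inferred from the pairing described above rather than from a documented rule, and the function name is illustrative.

```python
def build_job_payload(job_id, sources, conversion_formats):
    """Build a POST /files/v1/jobs body from (source_uri, custom_id) pairs."""
    if not job_id:
        raise ValueError("custom_id requires a caller-supplied job_id")
    seen = set()
    files = []
    for source_uri, custom_id in sources:
        if custom_id in seen:
            # Two files sharing one custom_id would map to the same idempotency key.
            raise ValueError(f"duplicate custom_id: {custom_id!r}")
        seen.add(custom_id)
        files.append({"source_uri": source_uri, "custom_id": custom_id})
    return {"job_id": job_id, "files": files, "conversion_formats": conversion_formats}


payload = build_job_payload(
    "my-batch-2026-06",
    [("s3://my-bucket/a.pdf", "doc-1"), ("https://example.com/b.pdf", "doc-2")],
    {"docx": True},
)
```

Rebuilding the payload from the same inputs yields an identical body, so a client that times out can simply send it again.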

Track a job with GET /files/v1/jobs/{job_id}. Full reference and all options: View Documentation.
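
Tracking a job usually means polling with a growing delay. Below is a generic sketch of such a loop. It takes the actual HTTP call as a function argument, so it makes no claim about the response format; the "processing" and "completed" status values in the usage example are placeholders, not documented values.

```python
import time


def poll_until(fetch_status, is_terminal, interval_s=5.0, max_interval_s=60.0,
               timeout_s=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() with exponential backoff until is_terminal(result) is true."""
    deadline = clock() + timeout_s
    delay = interval_s
    while True:
        status = fetch_status()
        if is_terminal(status):
            return status
        if clock() + delay > deadline:
            raise TimeoutError("job did not reach a terminal state in time")
        sleep(delay)
        delay = min(delay * 2, max_interval_s)  # back off, capped at max_interval_s


# Usage with a stubbed fetcher; replace it with a GET /files/v1/jobs/{job_id} call.
responses = iter([{"status": "processing"}, {"status": "processing"}, {"status": "completed"}])
waits = []
final = poll_until(
    lambda: next(responses),
    lambda s: s["status"] in {"completed", "error"},  # placeholder status names
    interval_s=1.0,
    sleep=waits.append,  # record the delays instead of actually sleeping
)
```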

Why use the Files API?

Built for large-scale workloads — process thousands or millions of documents without building your own queueing system.
Lower cost for high-volume processing — $1.50 / 1,000 pages, dropping to $1.00 / 1,000 pages above 30M pages/month. Optimized for large document pipelines and batch workloads.
Higher throughput — designed to process large batches in parallel, improving total job completion time.
Better workflow visibility — track job status and build reliable async pipelines.
Cloud-native integrations — read from, and write results directly to, your own storage (S3, GCS, Azure Blob) via the Data Source API.
Built for the data wall — convert document archives (papers, books, filings) into structured Markdown with aligned figure and equation crops, ready for VLM and LLM pretraining.
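
To get a feel for the pricing, here is a small estimator. It assumes graduated tiers, meaning the first 30M pages in a month are billed at $1.50 per 1,000 and only the pages beyond that at $1.00 per 1,000. That reading is an assumption; if the lower rate applies to the whole volume once you cross the threshold, the totals will differ.

```python
TIER_THRESHOLD_PAGES = 30_000_000
BASE_CENTS_PER_1000 = 150      # $1.50 per 1,000 pages
DISCOUNT_CENTS_PER_1000 = 100  # $1.00 per 1,000 pages above the threshold


def monthly_cost_usd(pages: int) -> float:
    """Estimate a month's cost in USD, ASSUMING graduated (marginal) tiers."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    base_pages = min(pages, TIER_THRESHOLD_PAGES)
    extra_pages = max(pages - TIER_THRESHOLD_PAGES, 0)
    cents = (base_pages * BASE_CENTS_PER_1000
             + extra_pages * DISCOUNT_CENTS_PER_1000) / 1000
    return cents / 100


print(monthly_cost_usd(1_000_000))   # 1500.0
print(monthly_cost_usd(40_000_000))  # 55000.0
```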

Bucket integrations are keyless: you grant a Mathpix identity access to your bucket through your provider’s IAM or role mechanism, and we use short-lived, per-customer-scoped credentials. One pattern, three providers:

AWS — create an IAM role Mathpix can assume with an ExternalId we issue.
Azure — grant our multi-tenant AD app Storage Blob Data Contributor on your container.
GCS — grant our service account roles/iam.serviceAccountTokenCreator on your service account (we use it only to mint short-lived per-job tokens).

Two API calls to onboard:

GET  /files/v1/onboarding/identities    # returns the Mathpix identities + your ExternalId
POST /files/v1/data-sources             # register your bucket after you've set up the grant

After that, reference your bucket as source_uri for reads or as destination_uri to have results (and cropped images) written directly back. Results land as relative-linked files in your bucket — portable, self-hosted, and out of our retention.
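
As an illustration of per-file destinations, the sketch below adds a destination_uri to each job entry, one folder per custom_id. The folder-per-file layout, the trailing slash, and the bucket name are all assumptions made for the example; the exact form a destination_uri must take is covered in the docs.

```python
def with_destinations(files, dest_prefix):
    """Return copies of job file entries with a per-file destination_uri added."""
    prefix = dest_prefix.rstrip("/")
    out = []
    for entry in files:
        custom_id = entry["custom_id"]
        # Assumed layout: one folder per custom_id under the given prefix.
        out.append({**entry, "destination_uri": f"{prefix}/{custom_id}/"})
    return out


files = [{"source_uri": "s3://my-bucket/a.pdf", "custom_id": "doc-1"}]
routed = with_destinations(files, "s3://my-results/converted/")  # hypothetical bucket
```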

When should you use it?

The Files API is ideal for:

Processing existing document archives
One-time conversion of large PDF collections
Continuous, high-volume document pipelines

What it doesn’t replace

The Files API is not a replacement for low-latency APIs like v3/pdf:

Use v3/pdf for real-time, per-document processing
Use the Files API for throughput and scale

A single document may take longer to process — but large workloads complete significantly faster overall.

Getting started

You can start using the Files API right away with your existing API key for direct uploads and presigned/public URLs. To read from or write results to your own S3 / GCS / Azure bucket, follow the keyless bucket integration steps above and the per-provider grant steps in the docs.

Data lifecycle

Source documents and page images are retained for up to 30 days, and default text outputs (mmd, lines.json, lines.mmd.json) for up to 90 days. Both are then deleted automatically.

Delete sooner — useful for GDPR / data-residency compliance — with DELETE /files/v1/{file_id}.
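
The delete call can be sketched the same way as a submission. As before, the host and the app_id / app_key headers are assumed to match the v3 API, and the helper name is illustrative.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.mathpix.com"  # assumption: same host as v3/pdf


def build_delete_request(file_id, app_id, app_key):
    """Build (but do not send) a DELETE /files/v1/{file_id} request."""
    return urllib.request.Request(
        f"{BASE_URL}/files/v1/{quote(file_id, safe='')}",
        method="DELETE",
        headers={"app_id": app_id, "app_key": app_key},  # assumed auth headers
    )


delete_req = build_delete_request(
    "b1c9c3a8-55e4-4a09-b7d0-218ba5de4c4d", "APP_ID", "APP_KEY"
)
# To send it: urllib.request.urlopen(delete_req)
```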

If you write results to your own bucket via destination_uri, those copies live under your retention, not ours.

Migrating from SCS Classic?

Your existing processing pipeline, output formats, and crop-delivery model carry over to the Files API. See the migration guide for the mapping from SCS publish-folder operations to POST /files/v1/jobs + Data Sources.

Full documentation and examples: View Documentation.