Over the past few years, we’ve helped teams convert PDFs and other documents into structured Markdown at scale. Today we’re excited to introduce the Mathpix Files API, built specifically for high-throughput, asynchronous document processing, with lower cost per page.
What’s new
The Files API lets you submit and process documents — one at a time or in large batches — from a variety of sources:
- POST /files/v1: upload a single file directly
- POST /files/v1/uri: process a single document from a source URI, routed automatically by scheme:
  - s3://bucket/key (Amazon S3)
  - gs://bucket/object (Google Cloud Storage)
  - https://… (a public or presigned URL)
  - https://{account}.blob.core.windows.net/{container}/{blob} (Azure Blob Storage)
- POST /files/v1/jobs: submit many documents in one request. Pass a list of sources (any mix of the URI types above), each with an optional custom_id you choose and an optional per-file destination_uri to write that file’s results to, plus job-level conversion formats applied to the whole batch. Returns a job_id.
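Routing by scheme happens on the server, but a client can catch obviously unsupported sources before submitting. The following is a minimal sketch of such a pre-check; the helper name and the returned labels are hypothetical and not part of the Files API.

```python
from urllib.parse import urlparse


def classify_source_uri(uri: str) -> str:
    """Return an illustrative provider label for a source URI.

    The labels ("s3", "gcs", "azure_blob", "https") are made up for this
    sketch; the API itself only needs the URI string.
    """
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    host = (parsed.hostname or "").lower()

    if scheme == "s3":
        return "s3"
    if scheme == "gs":
        return "gcs"
    if scheme == "https":
        # Azure Blob URLs are ordinary https URLs on a known host suffix.
        if host.endswith(".blob.core.windows.net"):
            return "azure_blob"
        return "https"
    raise ValueError(f"unsupported source URI: {uri!r}")
```

A pre-check like this only saves a round trip; the server's own validation remains the source of truth.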
Retrieve and track your results:
- GET /files/v1/{file_id}: poll a file’s processing status
- GET /files/v1/{file_id}.{ext}: download a converted result (docx, md, tex.zip, mmd, …) when you aren’t writing to your own bucket
- GET /files/v1/jobs/{job_id}: poll a job’s status and counters
- GET /files/v1/jobs/{job_id}/files?status=error: list a job’s files, filterable by status (e.g. to find which ones failed and retry them by custom_id)
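A polling client built on these endpoints might look like the sketch below. It only constructs the paths listed above and loops until a caller-supplied predicate says the work is finished. The response schema is not shown in this post, so the fetch function and the completion check are passed in rather than hard-coded; all helper names here are hypothetical.

```python
import time
from typing import Callable, Optional
from urllib.parse import quote, urlencode


def file_result_path(file_id: str, ext: str) -> str:
    """Path for downloading one converted result, e.g. ext='docx'."""
    return f"/files/v1/{quote(file_id, safe='')}.{ext}"


def job_files_path(job_id: str, status: Optional[str] = None) -> str:
    """Path for listing a job's files, optionally filtered by status."""
    path = f"/files/v1/jobs/{quote(job_id, safe='')}/files"
    if status is not None:
        path += "?" + urlencode({"status": status})
    return path


def poll_until_done(
    fetch_status: Callable[[], dict],
    is_done: Callable[[dict], bool],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until is_done(body) is true or max_polls is reached."""
    for attempt in range(max_polls):
        body = fetch_status()
        if is_done(body):
            return body
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"not finished after {max_polls} polls")
```

In use, `fetch_status` would issue the GET and decode the JSON body, and `is_done` might be something like `lambda body: body.get("status") in {"completed", "error"}`. The `status` field name and the `"completed"` value are assumptions; only `error` appears in this post, as a filter value.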
All endpoints accept the same OCR options as the v3/pdf endpoint.
Examples
POST /files/v1/uri
// Request
{ "source_uri": "s3://my-bucket/a.pdf", "conversion_formats": { "docx": true } }
// Response
{ "file_id": "b1c9c3a8-55e4-4a09-b7d0-218ba5de4c4d" }
…or submit many at once:
POST /files/v1/jobs
// Request
{
  "job_id": "my-batch-2026-06",
  "files": [
    { "source_uri": "s3://my-bucket/a.pdf", "custom_id": "doc-1" },
    { "source_uri": "https://example.com/b.pdf", "custom_id": "doc-2" }
  ],
  "conversion_formats": { "docx": true }
}
// Response
{ "job_id": "my-batch-2026-06", "file_count": 2 }
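The batch request above can be assembled and sent from Python roughly as follows. This is a sketch under stated assumptions: the host in `BASE_URL` is assumed to match the existing v3 API and should be confirmed in the docs, authentication headers are supplied by the caller, and the helper names are hypothetical. The retry wrapper resends the identical request, which relies on submissions being idempotent for a given (job_id, custom_id) pair.

```python
import json
import urllib.request
from typing import Callable, Iterable, Tuple

BASE_URL = "https://api.mathpix.com"  # assumption: confirm the host in the docs


def build_job_payload(
    job_id: str,
    sources: Iterable[Tuple[str, str]],
    conversion_formats: dict,
) -> dict:
    """Build the POST /files/v1/jobs body from (source_uri, custom_id) pairs."""
    seen = set()
    files = []
    for source_uri, custom_id in sources:
        # Client-side precaution (not a documented rule): keep custom_ids
        # unique within a job so each pair identifies exactly one file.
        if custom_id in seen:
            raise ValueError(f"duplicate custom_id: {custom_id!r}")
        seen.add(custom_id)
        files.append({"source_uri": source_uri, "custom_id": custom_id})
    return {"job_id": job_id, "files": files, "conversion_formats": conversion_formats}


def build_post(path: str, payload: dict, auth_headers: dict) -> urllib.request.Request:
    """Create a JSON POST request; auth_headers come from the caller."""
    headers = {"Content-Type": "application/json", **auth_headers}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method="POST")


def send_with_retry(send: Callable, request, attempts: int = 3):
    """Resend the same request after transient failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return send(request)
        except (TimeoutError, ConnectionError) as exc:
            # A real client should map its HTTP library's transient
            # errors onto these exception types.
            last_error = exc
    raise last_error
```

Here `send` could be as simple as `lambda r: json.load(urllib.request.urlopen(r, timeout=30))`. The same `build_post` helper works for POST /files/v1/uri with a single-document body.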
custom_id (optional, per file) is your own identifier, echoed back in results so you don’t have to track our file_ids. It requires a job_id that you supply, and together the pair (job_id, custom_id) makes submission idempotent: retrying with the same pair returns the existing file_id instead of creating a duplicate, so network blips and client timeouts can’t double-bill you.
Track a job with GET /files/v1/jobs/{job_id}. Full reference and all options: View Documentation.
Why use the Files API?
- Built for large-scale workloads — process thousands or millions of documents without building your own queueing system.
- Lower cost for high-volume processing — $1.50 / 1,000 pages, dropping to $1.00 / 1,000 pages above 30M pages/month. Optimized for large document pipelines and batch workloads.
- Higher throughput — designed to process large batches in parallel, improving total job completion time.
- Better workflow visibility — track job status and build reliable async pipelines.
- Cloud-native integrations — read from, and write results directly to, your own storage (S3, GCS, Azure Blob) via the Data Source API.
- Built for the data wall — convert document archives (papers, books, filings) into structured Markdown with aligned figure and equation crops, ready for VLM and LLM pretraining.
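For budgeting, the pricing above can be turned into a quick estimate. The sketch below assumes graduated tiers, meaning the $1.00 rate applies only to pages beyond the first 30M in a month. That reading is an assumption; if the lower rate applies to the whole volume once the threshold is crossed, the numbers will differ, so check the pricing page before relying on it.

```python
TIER_LIMIT_PAGES = 30_000_000   # pages per month billed at the base rate
BASE_RATE_USD = 1.50            # per 1,000 pages
DISCOUNT_RATE_USD = 1.00        # per 1,000 pages above the limit


def estimate_monthly_cost_usd(pages: int) -> float:
    """Estimate monthly cost, assuming graduated (marginal) tiers."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    base_pages = min(pages, TIER_LIMIT_PAGES)
    extra_pages = max(pages - TIER_LIMIT_PAGES, 0)
    return base_pages / 1000 * BASE_RATE_USD + extra_pages / 1000 * DISCOUNT_RATE_USD


print(estimate_monthly_cost_usd(10_000_000))  # 15000.0
print(estimate_monthly_cost_usd(40_000_000))  # 55000.0 under the graduated assumption
```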
Connect your storage (no keys to share)
Bucket integrations are keyless — you grant a Mathpix identity bucket access via IAM/role; we use short-lived, per-customer-scoped credentials. One pattern, three providers:
- AWS: create an IAM role Mathpix can assume with an ExternalId we issue.
- Azure: grant our multi-tenant AD app Storage Blob Data Contributor on your container.
- GCS: grant our service account roles/iam.serviceAccountTokenCreator on your service account (we use it only to mint short-lived per-job tokens).
Two API calls to onboard:
GET /files/v1/onboarding/identities # returns the Mathpix identities + your ExternalId
POST /files/v1/data-sources # register your bucket after you've set up the grant
After that, reference your bucket as source_uri for reads or as destination_uri to have results (and cropped images) written directly back. Results land as relative-linked files in your bucket: portable, self-hosted, and out of our retention.
When should you use it?
The Files API is ideal for:
- Processing existing document archives
- One-time conversion of large PDF collections
- Continuous, high-volume document pipelines
What it doesn’t replace
The Files API is not a replacement for low-latency APIs like v3/pdf:
- Use v3/pdf for real-time, per-document processing
- Use the Files API for throughput and scale
A single document may take longer to process — but large workloads complete significantly faster overall.
Getting started
You can start using the Files API right away with your existing API key for direct uploads and presigned/public URLs. To read from or write results to your own S3 / GCS / Azure bucket, see “Connect your storage” above and the per-provider grant steps in the docs.
Data lifecycle
Source documents and page images are retained for up to 30 days; default text outputs (mmd, lines.json, lines.mmd.json) are retained for up to 90 days. Both are then auto-deleted.
Delete sooner (useful for GDPR / data-residency compliance) with DELETE /files/v1/{file_id}.
If you write results to your own bucket via destination_uri, those copies live under your retention, not ours.
Migrating from SCS Classic?
Your existing processing pipeline, output formats, and crop-delivery model carry over to the Files API. See the migration guide for the mapping from SCS publish-folder operations to POST /files/v1/jobs + Data Sources.
Full documentation and examples: View Documentation.