- High-volume processing: If you need to process tens of millions of PDF pages or more in a short period of time, SCS is designed for large-scale batch jobs of that size and handles them efficiently.
- Asynchronous workflows: When real-time results aren’t necessary, SCS processes documents in the background, making it ideal for big jobs.
- Advanced workflow needs: While our API is highly secure, SCS is tailored for workflows that require additional customization and direct integration with storage providers like AWS S3, GCP GCS, Alibaba OSS, and Baidu BOS, ensuring seamless and secure data handling at scale.
Secure Conversion Service
Accurately convert large PDF and image libraries into machine-readable text files in hours, not months.
Cost efficiency
The batch API offers cost savings over the interactive API: processing many files together is more efficient, which allows us to offer customers lower rates.
Fastest processing
Secure Conversion Service delivers rapid results without compromising quality, processing up to 100 million PDF pages per day.
SOC2 compliant
Ensures document protection with robust encryption and compliance with industry-standard security protocols.
Batch processing
Handles high-throughput document processing for large volumes, working through entire storage bucket directories in a single job.
Frequently Asked Questions
Secure Conversion Service (SCS) is ideal for:
- Training and fine-tuning large language models (LLMs): Preparing massive datasets from PDFs or images. SCS’s scalability and structured outputs make it well suited to producing high-quality training data.
- Enterprise document processing: Converting large volumes of legal, financial, or technical documents into structured data for internal systems, analysis, or archival purposes.
- Large-scale academic archives: Universities and research institutions digitizing and processing massive collections of research papers, lecture notes, or archives into accessible formats.
- Publishing and content digitization: Publishers processing books, journals, or articles with complex layouts, including math, tables, and images, for online or print use.
- Custom workflows for sensitive data: Organizations with strict privacy and data security requirements that need direct integration with storage providers such as AWS S3, Microsoft Azure, GCP GCS, Alibaba OSS, or Baidu BOS. SCS enables secure management of input and output data within designated storage buckets.
- High-volume projects with flexible timelines: Handling tens of millions of documents asynchronously for projects where scalability and efficiency are key, but immediate, real-time results aren’t necessary.
SCS is particularly well-suited for industries leveraging LLMs and AI, as well as organizations requiring secure, efficient, and large-scale batch processing.
Using SCS involves these steps:
- Set up access to your storage provider: Grant Mathpix access to your storage bucket (e.g., AWS S3, Microsoft Azure, GCP GCS, Alibaba OSS, Baidu BOS) via access tokens.
- Upload input files to your bucket: Place your input files (PDFs or images) in a designated folder within your storage bucket.
- Configure SCS processing: Mathpix processes the documents asynchronously, pulling input files from your bucket, running OCR or conversion, and writing the results back to an output folder in the same bucket.
- Retrieve processed results: Processed outputs, such as structured data (e.g., Mathpix Markdown), are saved in your designated output folder for easy retrieval.
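Because processing is asynchronous, it can help to track which input files already have a result in the output folder. The following is a minimal Python sketch of that bookkeeping, not part of SCS itself. The folder names `scs/input` and `scs/output`, the `.mmd` extension, the list of accepted input suffixes, and both function names are assumptions made for illustration; the actual layout is agreed with Mathpix during setup. The key lists would come from your storage provider's own listing call.

```python
from pathlib import PurePosixPath

# Hypothetical layout: these prefixes and suffixes are illustrative
# assumptions, not documented SCS settings.
INPUT_PREFIX = "scs/input"
OUTPUT_PREFIX = "scs/output"
INPUT_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def plan_output_key(input_key: str, extension: str = "mmd") -> str:
    """Map an input object key to the output key where its result is expected."""
    path = PurePosixPath(input_key)
    if path.suffix.lower() not in INPUT_SUFFIXES:
        raise ValueError(f"unsupported input type: {input_key}")
    # relative_to raises ValueError if the key is outside the input folder.
    relative = path.relative_to(INPUT_PREFIX)
    return str(PurePosixPath(OUTPUT_PREFIX) / relative.with_suffix("." + extension))


def pending_inputs(input_keys, output_keys, extension: str = "mmd"):
    """Return the input keys whose expected output key is not present yet."""
    done = set(output_keys)
    return [k for k in input_keys if plan_output_key(k, extension) not in done]


if __name__ == "__main__":
    inputs = ["scs/input/papers/a.pdf", "scs/input/papers/b.pdf"]
    outputs = ["scs/output/papers/a.mmd"]
    print(pending_inputs(inputs, outputs))
```

Keeping the input folder structure mirrored under the output folder, as assumed here, makes it straightforward to match each result to its source file.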
For more details, or to get started with SCS, contact support@mathpix.com.
SCS is designed for large-scale, high-speed processing. It can handle hundreds of millions of pages per day and scale to process several billion pages in just a few weeks.
This speed makes it ideal for organizations managing massive workloads, like converting large archives or running extensive data extraction projects. The exact processing time depends on document complexity and file size, but SCS is built to maximize efficiency and throughput.
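For rough planning, the time a job takes can be estimated by dividing the total page count by the daily throughput. The sketch below shows that arithmetic in Python. The function name and the example figures (3 billion pages at 200 million pages per day) are illustrative assumptions, not quoted rates; actual throughput depends on document complexity and file size.

```python
def estimated_days(total_pages: int, pages_per_day: int) -> int:
    """Estimate whole days needed to process a job at a steady daily rate."""
    if total_pages < 0 or pages_per_day <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_day > 0")
    # Integer ceiling division: a partial day still counts as a day.
    return -(-total_pages // pages_per_day)


if __name__ == "__main__":
    # Assumed figures for illustration only.
    print(estimated_days(3_000_000_000, 200_000_000))  # prints 15
```

Under these assumed figures the job takes about two weeks, which matches the "several billion pages in just a few weeks" scale described above.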
If you’re working with tight timelines, feel free to reach out to discuss your specific requirements, and we can help optimize the process for your needs.
SCS can generate outputs in the following formats:
- Markdown
- Mathpix Markdown
- LaTeX
- DOCX
- HTML
- lines.json
You can select one or multiple formats based on your requirements.