The Kensho Extract API allows users to transform PDF documents into structured JSON files. The JSON files have a few key characteristics:
- They contain core document data types including:
  - Headers & Titles
  - Tables & Table Titles
  - Figures & Figure Titles
  - Miscellaneous Text
- The JSON is structured hierarchically, down to the second header level, to mirror the intended hierarchy of the document.
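
As an illustration only, the sketch below shows how a hierarchical output of this kind might be walked. The field names (type, text, children) and the sample document are assumptions made for this example, not the actual Kensho Extract schema.

```python
# Hypothetical output shape: the keys "type", "text", and "children" are
# assumptions for illustration, not the documented Kensho Extract schema.
sample_output = {
    "type": "title",
    "text": "Annual Report",
    "children": [
        {
            "type": "header",
            "text": "1. Overview",
            "children": [
                {"type": "text", "text": "Revenue grew this year.", "children": []},
                {
                    "type": "header",
                    "text": "1.1 Segments",
                    "children": [
                        {
                            "type": "table_title",
                            "text": "Table 1: Revenue by segment",
                            "children": [],
                        },
                        {"type": "table", "text": "", "children": []},
                    ],
                },
            ],
        },
    ],
}


def outline(node, depth=0):
    """Return one indented line per node, visiting the tree depth-first."""
    label = f"{'  ' * depth}{node['type']}: {node['text']}".rstrip()
    lines = [label]
    for child in node.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(sample_output)))
```

Nesting content under its nearest header, as assumed here, lets a consumer pull out a whole section by selecting a single node.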
The API behaves in the following fashion:
After authenticating, the user can submit PDF documents to the API, each with a priority code. By default, the API processes documents first in, first out. The exception is that any document marked as low priority is handled only after high-priority documents are completed, regardless of when it was submitted.
The low-priority queue is intended for bulk document processing, so that bulk jobs do not delay high-urgency documents that need a fast turnaround.
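
The ordering described above can be modeled locally with a small priority queue. This is a minimal sketch of the stated behavior, not the service's actual scheduler. The names DocumentQueue, HIGH, and LOW are hypothetical, and so is the choice to treat unmarked documents as high priority.

```python
import heapq
import itertools

# Hypothetical priority codes; the lower number is served first.
HIGH, LOW = 0, 1


class DocumentQueue:
    """First in, first out within a priority; all HIGH items before any LOW item."""

    def __init__(self):
        self._heap = []
        # A monotonically increasing counter preserves submission order
        # among documents that share the same priority.
        self._counter = itertools.count()

    def submit(self, name, priority=HIGH):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_document(self):
        _priority, _order, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = DocumentQueue()
    queue.submit("bulk-1", LOW)
    queue.submit("urgent-1", HIGH)
    queue.submit("bulk-2", LOW)
    queue.submit("urgent-2")
    while len(queue):
        print(queue.next_document())
```

In this model the two urgent documents come out first, in the order they were submitted, followed by the two bulk documents in their own submission order.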
The user may also optionally run OCR on their documents using the do_ocr field. Note that this can add an additional 3-9 seconds of processing per page. OCR is only performed on areas where text appears as an image with no underlying textual information (such as screenshots and figures).
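
To gauge whether OCR is worth enabling for a large document, the per-page figure above can be turned into a rough range. The helper below is a hypothetical convenience for that arithmetic, and it assumes the overhead scales linearly with page count.

```python
# Per-page OCR overhead in seconds, taken from the range quoted above.
OCR_SECONDS_PER_PAGE = (3, 9)


def ocr_overhead_seconds(page_count):
    """Return the (minimum, maximum) extra seconds OCR may add for a document."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    low, high = OCR_SECONDS_PER_PAGE
    return page_count * low, page_count * high


if __name__ == "__main__":
    minimum, maximum = ocr_overhead_seconds(40)
    print(f"A 40-page document may take {minimum}-{maximum} extra seconds with OCR.")
```

For a 40-page document this gives 120 to 360 extra seconds, that is, two to six minutes.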
After document submission, the API returns a unique request_id, which can be used in a subsequent query to retrieve the document output.
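
The submit-then-retrieve flow suggests a polling loop on the client side. The sketch below shows one way to write it under stated assumptions: the response shape ({"status": ..., "output": ...}), the status values, and the function name wait_for_output are all hypothetical. The HTTP call is passed in as a callable, so no endpoint or authentication detail is assumed.

```python
import time


def wait_for_output(fetch_status, request_id, poll_seconds=5.0,
                    max_attempts=60, sleep=time.sleep):
    """Poll until the document output is ready or the attempts run out.

    fetch_status is any callable that takes a request_id and returns a dict.
    The shapes {"status": "pending"} and {"status": "done", "output": {...}}
    are assumptions for this sketch, not the documented response format.
    """
    for attempt in range(max_attempts):
        response = fetch_status(request_id)
        if response.get("status") == "done":
            return response["output"]
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(poll_seconds)
    raise TimeoutError(
        f"request {request_id} not ready after {max_attempts} attempts"
    )


if __name__ == "__main__":
    # Stand-in for the real API call: pending once, then done.
    replies = iter([
        {"status": "pending"},
        {"status": "done", "output": {"type": "title", "text": "Annual Report"}},
    ])
    result = wait_for_output(lambda rid: next(replies), "abc123", poll_seconds=0)
    print(result)
```

Injecting fetch_status keeps the loop testable without network access. In real use it would wrap whatever authenticated request the API profile provides.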
You can begin using Kensho Extract in seconds via our REST API.
To sign up, please email email@example.com to set up your API profile.