Getting Started
Ragextract is AI document search for messy PDF, DOCX, and PPTX files.
Sign up and Register your Organisation
You'll need to sign up for an account with your organisation (if not invited to one) to gain access to Subworkflow. An organisation represents the company you work for and is where you'll manage your workspaces, datasets, and team, as well as access controls and billing.
Choose a Plan
Starter Plan
This plan is designed for companies with low usage requirements to enjoy the benefits of Ragextract at a very low cost. It is best suited for low-frequency internal document operations and/or smaller documents. This plan does not include guaranteed support.
Standard Plan
This is our recommended plan for business and production use. It allows documents up to 300 MB or 500 pages, and job concurrency is increased from 3 to 20, enabling much higher document throughput.
Note: We use Stripe™ to handle our subscriptions. We require a card to sign up for the trial; this is to prevent abuse and help focus our support priorities.
Generate a Workspace API Key
Workspaces help organise and scope access to documents (or rather, Datasets) for your team, clients, or projects. API keys are also workspace-scoped, meaning they are only valid for the workspace they were generated for. In the Workspace > Settings > Keys page, click the "Create a new API key" button to create your key. Once the new API key is created, keep it secret and copy it only when you're ready to use it!
- All files uploaded go to the workspace the key belongs to.
- All queries and searches will be scoped to the same workspace the key belongs to.
- Keep a note of which workspace a key belongs to!
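Since keys are workspace-scoped secrets, it's best to read them from the environment rather than hard-coding them. A minimal sketch, assuming you store the key in an environment variable (the name RAGEXTRACT_API_KEY is our own convention, not one the platform requires):

```javascript
// Read the workspace-scoped API key from the environment instead of
// committing it to source control. RAGEXTRACT_API_KEY is an assumed name.
function getApiKey(env = process.env) {
  const key = env.RAGEXTRACT_API_KEY;
  if (!key) {
    throw new Error("RAGEXTRACT_API_KEY is not set");
  }
  return key;
}
```

You can then pass `getApiKey()` wherever an `apiKey` or `x-api-key` value is expected.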
You're now ready to upload your first Document!
The intended way to use Subworkflow is through the REST API, which is how you'll start uploading your documents. We recommend using our SDKs where you can, as they simplify using Subworkflow in your application.
- Curl
- JS/TS
curl -X POST https://api.ragextract.com/v1/extract \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --form "file=@/path/to/file"
If you plan to use the /search functionality, use the following instead:
curl -X POST https://api.ragextract.com/v1/vectorize \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --form "file=@/path/to/file"
import fs from "node:fs";
import { Ragextract } from "@subworkflow/ragextract";
const ragextract = new Ragextract({ apiKey: "<MY-API-KEY>" });
const fileInput = fs.readFileSync("annual_report.pdf");
const dataset = await ragextract.extract(fileInput, { filename: "annual_report.pdf" });
If you plan to use the /search functionality, use the following instead:
const dataset = await ragextract.vectorize(fileInput, { filename: "annual_report.pdf" });
Retrieve Your Dataset and Its Items
Once a document is uploaded to Ragextract, it becomes a Dataset, and its pages are referred to as Dataset Items. We'll be using this terminology a lot throughout the documentation.
Retrieving the Dataset (Document)
Requesting the dataset provides a link to a PDF version of the original document and tells you how many pages it contained (itemCount). Typically, you'll fetch the dataset record only for the metadata needed to query over its Dataset Items.
- Curl
- JS/TS
curl https://api.ragextract.com/v1/datasets/<datasetId> \
  --header 'x-api-key: <YOUR-API-KEY>'
{
"success": true,
"total": 1,
"data": {
"id": "ds_VV08ECeQBQgDoVn6",
"workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
"type": "doc",
"itemCount": 1,
"fileName": "file_AIpNsoTx4OkRNY3H",
"fileExt": "pdf",
"mimeType": "application/pdf",
"fileSize": 136056,
"createdAt": 1761910646651,
"updatedAt": 1761910646651,
"expiresAt": 1761910646651,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
"token": "VkVBNh",
"expiresAt": 1761910891643
}
}
}
import { Ragextract } from "@subworkflow/ragextract";
const ragextract = new Ragextract({ apiKey: "<MY-API-KEY>" });
const dataset = await ragextract.datasets.get('<datasetId>');
{
"id": "ds_VV08ECeQBQgDoVn6",
"workspaceId": "wks_Gg9Bzi7sx8fbCfWI",
"type": "doc",
"itemCount": 1,
"fileName": "file_AIpNsoTx4OkRNY3H",
"fileExt": "pdf",
"mimeType": "application/pdf",
"fileSize": 136056,
"createdAt": 1761910646651,
"updatedAt": 1761910646651,
"expiresAt": 1761910646651,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_DdTXOgxPh0PLSPhb?token=VkVBNh",
"token": "VkVBNh",
"expiresAt": 1761910891643
}
}
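The dataset record reports fileSize in bytes and timestamps (createdAt, updatedAt, expiresAt) as millisecond Unix timestamps. If you surface this metadata in a UI, a small formatting helper can make the size readable; this helper is our own illustration, not part of the SDK:

```javascript
// fileSize in the dataset record is reported in bytes.
// Convert it to a human-readable string for display.
function formatFileSize(bytes) {
  const units = ["B", "KB", "MB", "GB"];
  let size = bytes;
  let i = 0;
  while (size >= 1024 && i < units.length - 1) {
    size /= 1024;
    i += 1;
  }
  return `${size.toFixed(1)} ${units[i]}`;
}

// formatFileSize(136056) === "132.9 KB"
```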
Retrieving the Dataset Items (Pages)
In this example, we retrieve the equivalent of the 1st, 3rd, and 5th pages of our document in JPEG format. Notice that this is particularly powerful when handling large documents (1000+ pages): you don't need to receive the full dataset as you do with other services; pick out only the pages you want! Other cols patterns include the range modifier, where cols=50:100 returns pages 50 through 100. For full details on querying options, please refer to the API reference documentation.
- Curl
- JS/TS
curl "https://api.ragextract.com/v1/datasets/<datasetId>/items?row=jpg&cols=1,3,5" \
  --header 'x-api-key: <YOUR-API-KEY>'
{
"sort": ["-createdAt"],
"offset": 0,
"limit": 10,
"total": 3,
"data": [
{
"id": "dsx_B5bsOBDzsXsqfmLo",
"col": 1,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
"token": "kCThsH",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_1muCWQXZ58r5PsjC",
"col": 3,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_1muCWQXZ58r5PsjC?token=Qqkk7U",
"token": "Qqkk7U",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_0yIaKxZjiZIXc1G3",
"col": 5,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_0yIaKxZjiZIXc1G3?token=7GQKco",
"token": "7GQKco",
"expiresAt": 1761911418809
}
}
]
}
import { Ragextract } from "@subworkflow/ragextract";
const ragextract = new Ragextract({ apiKey: "<MY-API-KEY>" });
const datasetItems = await ragextract.datasets.getItems('<datasetId>', {
row: 'jpg',
cols: [1,3,5],
});
// console.log(datasetItems)
[
{
"id": "dsx_B5bsOBDzsXsqfmLo",
"col": 1,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
"token": "kCThsH",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_1muCWQXZ58r5PsjC",
"col": 3,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_1muCWQXZ58r5PsjC?token=Qqkk7U",
"token": "Qqkk7U",
"expiresAt": 1761911418809
}
},
{
"id": "dsx_0yIaKxZjiZIXc1G3",
"col": 5,
"row": "jpg",
"createdAt": 1761910649511,
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_0yIaKxZjiZIXc1G3?token=7GQKco",
"token": "7GQKco",
"expiresAt": 1761911418809
}
}
]
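To make the cols selector concrete, here is a sketch of how a selector like "1,3,5" or "50:100" maps to page numbers. The expandCols helper is purely illustrative (the API accepts the selector string directly; you never need to expand it yourself):

```javascript
// Expand a cols selector ("1,3,5" or "50:100") into explicit page numbers.
// Illustrative only -- the API consumes the selector string as-is.
function expandCols(spec) {
  return spec.split(",").flatMap((part) => {
    if (part.includes(":")) {
      const [start, end] = part.split(":").map(Number);
      return Array.from({ length: end - start + 1 }, (_, i) => start + i);
    }
    return [Number(part)];
  });
}

// expandCols("1,3,5")  -> [1, 3, 5]
// expandCols("50:52")  -> [50, 51, 52]
```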
Use Retrieved Files as Context for LLMs
With each retrieved Dataset Item, Subworkflow will give you a "share URL". This is a URL to a page of your document that is protected by an expiring token. We find this a more secure approach: even if the URL is cached or leaked, the file becomes inaccessible after the expiry.
"share": {
"url": "https://api.ragextract.com/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
"token": "kCThsH",
"expiresAt": 1761911418809
}
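The expiresAt field on a share is a millisecond Unix timestamp. If you cache share URLs, a quick staleness check before reuse might look like the following (the helper name is our own; when it reports expiry, fetch the Dataset Item again to get a fresh share URL):

```javascript
// share.expiresAt is a millisecond Unix timestamp.
// Returns true once the share URL can no longer be used.
function isShareExpired(share, now = Date.now()) {
  return now >= share.expiresAt;
}
```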
Pass these URLs to your LLM along with your prompt to generate answers or to extract structured output.
- Curl
- JS/TS
curl https://api.openai.com/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-4.1",
"input": [
{
"role": "user",
"content": [
{"type": "input_text", "text": "what is in this image?"},
{
"type": "input_image",
"image_url": "https://api.ragextract.com/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH"
}
]
}
]
}'
import OpenAI from "openai";
const openai = new OpenAI();
const response = await openai.responses.create({
model: "gpt-4.1",
input: [
{
role: "user",
content: [
{ type: "input_text", text: "what is in this image?" },
{
type: "input_image",
image_url: "https://api.ragextract.com/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
}
]
}
]
});
console.log(response);
Next steps
Congratulations! 🎉 If you've made it this far, you should have a good idea of how Ragextract can help you build powerful, more durable RAG and/or Structured Output applications.
Head on over to the next section on ways to use the Subworkflow APIs in your application and workflows.