Ingest files with OCR

Use case

This guide shows how to store binary files as first-class objects, transcribe their contents with the OCR pipeline step, and expose the extracted text for search or downstream processing.

Prerequisites

A Clinia workspace and API key with access to object collections.
A registry source dedicated to document ingestion (created in Step 1 if you do not already have one).
A local PDF file to upload.

Step 1. Create a registry source for documents

Provision a registry data source that will own the object collection. See the Create a data source API Reference.

curl -X PUT "https://$CLINIA_WORKSPACE/catalog/v1/sources/document-ocr" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "type": "registry"
}'

Step 2. Define an object collection schema

Create an object collection that captures high-level metadata about the uploaded file. The OCR pipeline will enrich content with the extracted transcript. For the payload structure, refer to Create a collection in the MDM.

curl -X PUT "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/collections/scanned-documents" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "type": "objects",
  "objectDefinition": {
    "properties": {
      "content": {
        "type": "markdown",
        "description": "Text content extracted from the document."
      },
      "tags": {
        "type": "array",
        "items": {
          "type": "symbol"
        }
      }
    }
  }
}'

Existing objects are not backfilled when you add or modify a pipeline, so attach the OCR pipeline (Step 3) before you upload files. Re-upload files if you need OCR output on historical assets.

Step 3. Attach the OCR pipeline

Upsert a collection pipeline that adds the OCR processor. The propertyKey names the property that receives the transcript (content in this example). See Processor details.

OCR steps can only target a root/top-level property.

curl -X PUT "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/collections/scanned-documents/pipeline" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "steps": [
    {
      "type": "OPTICAL_CHARACTER_RECOGNITION",
      "opticalCharacterRecognition": {
        "propertyKey": "content",
        "propertyDescription": "Machine generated transcript"
      }
    }
  ]
}'

The OCR processor is a mutating processor: it writes the transcript into the target property (content in this example) in place instead of creating a new enriched property.

Step 4. Upload a file object

Send a multipart request with the JSON object payload in the data part and the binary file in the file part. Setting the id to @rootId lets the registry assign an identifier automatically. Consult Create an object.

curl -X POST "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/objects/scanned-documents" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  -F 'data={"id":"@rootId","data":{"tags":["intake","2024"]}}' \
  -F "file=@$HOME/documents/intake-form.pdf;type=application/pdf"

A successful submission returns an accepted task:

{
  "status": "ACCEPTED",
  "taskId": "bk_4w2Fn0ZyxZVuiG1nS8Qi7v9UBc"
}
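
The taskId in this response is the value Step 5 polls. A minimal sketch of capturing it in a script, assuming a hypothetical helper named extract_task_id and a flat response like the one above. It uses sed so that it runs without extra tools; if jq is installed, jq -r .taskId is the more robust choice.

```shell
# Hypothetical helper (not part of any Clinia tooling): read a task response
# on stdin and print the value of its "taskId" field.
# This is a sketch for the small, flat response shown above, not a JSON parser.
extract_task_id() {
  # Join lines first so the pattern also works on pretty-printed JSON.
  tr -d '\n' | sed -n 's/.*"taskId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

To use it, pipe the output of the upload command (with -sS added to silence curl's progress meter) into extract_task_id and store the result in a shell variable such as task_id.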

Step 5. Monitor the task and read the transcript

Track ingestion progress with the Get a task from the registry endpoint. When status reaches SUCCESS, the receipts include the newly created object (note the targetType: "OBJECT").

curl -X GET "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/tasks/bk_4w2Fn0ZyxZVuiG1nS8Qi7v9UBc?withReceipts=true" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN"

{
  "status": "SUCCESS",
  "taskId": "bk_4w2Fn0ZyxZVuiG1nS8Qi7v9UBc",
  "receipts": [
    {
      "status": "SUCCESS",
      "target": {
        "targetType": "OBJECT",
        "type": "scanned-documents",
        "id": "scanned-documents/0f6c31ec"
      }
    }
  ]
}
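
Because ingestion is asynchronous, a script usually has to repeat this request until the task finishes. The loop below is a sketch under stated assumptions: wait_for_task is a hypothetical helper, the task-level status is assumed to be the first "status" field in the response (as in the sample above), and only SUCCESS is treated as terminal because this guide does not list the failure statuses.

```shell
# Hypothetical helper: poll the task endpoint until the task reports SUCCESS.
# $1 = task id, $2 = max attempts (default 30), $3 = seconds between attempts (default 2).
# Returns 0 and prints the final response on SUCCESS, 1 when the attempt
# budget runs out, 2 when curl itself fails.
wait_for_task() {
  task_id=$1
  max_attempts=${2:-30}
  delay=${3:-2}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    body=$(curl -sS -X GET \
      "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/tasks/$task_id?withReceipts=true" \
      -H "X-Clinia-API-Key: $CLINIA_TOKEN") || return 2
    # Assumption: the first "status" field is the task-level status.
    status=$(printf '%s' "$body" | tr -d '\n' |
      grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' |
      head -n 1 |
      sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/')
    if [ "$status" = "SUCCESS" ]; then
      printf '%s\n' "$body"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "task $task_id did not reach SUCCESS after $max_attempts attempts" >&2
  return 1
}
```

For example, wait_for_task "$task_id" 30 2 polls up to 30 times, two seconds apart. A bounded loop is used on purpose so that a stalled pipeline surfaces as an error instead of hanging the script.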

Fetch the object once the task completes to verify the OCR output. The transcript is stored under data.content. See Get an object from the registry.

curl -X GET "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/objects/scanned-documents/scanned-documents%2F0f6c31ec" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN"

{
  "id": "scanned-documents/0f6c31ec",
  "type": "scanned-documents",
  "data": {
    "content": "Patient Name: Jane Doe\nDate of Birth: 1984-03-17\nPrimary Care Physician: ...",
    "tags": ["intake", "2024"]
  },
  "meta": {
    "source": "document-ocr",
    "createdAt": "2024-05-10T18:42:11.120Z",
    "updatedAt": "2024-05-10T18:42:12.071Z"
  }
}
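
Note that the object id returned in the receipt contains a slash (scanned-documents/0f6c31ec), while the request path above carries it as scanned-documents%2F0f6c31ec. A minimal sketch of that encoding step, assuming a hypothetical helper named encode_object_id that handles only the slash:

```shell
# Hypothetical helper: percent-encode "/" in an object id so the id can be
# used as a single path segment. Only "/" is handled, which matches the ids
# shown in this guide; other reserved characters would need a full encoder.
encode_object_id() {
  printf '%s' "$1" | sed 's|/|%2F|g'
}
```

The encoded value can then be appended to the objects URL, for example "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/objects/scanned-documents/$(encode_object_id "$object_id")", where object_id is a variable holding the id from the receipt.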

To download the original binary, call Get an object file from the registry:

curl -X GET "https://$CLINIA_WORKSPACE/sources/document-ocr/v1/objects/scanned-documents/scanned-documents%2F0f6c31ec/file/content" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  --output intake-form.pdf

Troubleshooting

multipart/form-data is expected: use -F fields and do not set -H "Content-Type: application/json" when creating objects.
Pipeline execution never finishes: confirm that the collection pipeline exists and inspect its executions with Get the pipeline execution of a collection.
File exceeds upload limit: object uploads honor the ObjectCollections.maxUploadSize workspace setting. Split or compress files above the limit before retrying, or contact Clinia Support to increase the limit.
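
To catch oversized files before sending them, a script can compare the file size with the limit locally. This is a sketch only: within_upload_limit is a hypothetical helper, and the byte limit is a value you supply after checking your workspace's ObjectCollections.maxUploadSize setting; this guide does not state its default.

```shell
# Hypothetical helper: succeed when the file at $1 is no larger than $2 bytes.
# Returns 0 when within the limit, 1 when over it, 2 when the file cannot be read.
within_upload_limit() {
  file_path=$1
  max_bytes=$2
  [ -r "$file_path" ] || return 2
  # wc pads its output with spaces on some systems, so strip whitespace.
  size=$(wc -c < "$file_path" | tr -d '[:space:]')
  [ -n "$size" ] || return 2
  [ "$size" -le "$max_bytes" ]
}
```

For example, within_upload_limit "$HOME/documents/intake-form.pdf" "$max_bytes" || echo "file too large" >&2 skips the upload when the check fails, with max_bytes set to your workspace limit.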

Search the OCR transcript

With the transcript stored in content, you can expose the object collection through a data partition and run searches against it. For partition setup, see Create a data partition. Use the Search resources API to query the transcripts directly.

Link the collection to a partition

curl -X PUT "https://$CLINIA_WORKSPACE/catalog/v1/partitions/document-ocr-search" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "modules": {
    "search": "STANDARD"
  },
  "source": {
    "type": "DATA_SOURCE",
    "key": "document-ocr",
    "collections": [
      {
        "key": "scanned-documents"
      }
    ]
  }
}'

Search transcripts for keywords

curl -X POST "https://$CLINIA_WORKSPACE/partitions/document-ocr-search/v1/objects/scanned-documents/query" \
  -H "X-Clinia-API-Key: $CLINIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "query": {
    "match": {
      "objects.scanned-documents.content": {
        "value": "Jane Doe"
      }
    }
  },
  "properties": {
    "include": [
      "objects.scanned-documents.content"
    ]
  }
}'

The response includes matching object records with the enriched content field so you can display or post-process the OCR transcript.
