Built-in processors let you enrich or validate data without maintaining custom code. Combine them to prepare content for semantic search, normalize addresses, or orchestrate human review.

Vectorizer

Transforms text into numerical embeddings so you can run hybrid search. Configuration:
  • inputProperty — Path to the text field or the output of a previous processor.
  • modelId — Embedding model to use (for example mte-base.1 or mte-base-knowledge.1).
  • propertyKey — Destination sub-property that stores the resulting vector.
{
  "steps": [
    {
      "type": "VECTORIZER",
      "vectorizer": {
        "inputProperty": "title",
        "modelId": "mte-base.1",
        "propertyKey": "vector"
      }
    }
  ]
}
Use a Schema Validator ahead of the vectorizer when you want to avoid expensive work on malformed data.
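As a sketch of that ordering, the pipeline below places a SCHEMA_VALIDATOR step in front of the VECTORIZER step, reusing only the step shapes shown on this page. It assumes that steps run in the order listed and that a failed validation stops the pipeline before the vectorizer is invoked; check pipeline basics for the exact execution semantics.

```json
{
  "steps": [
    {
      "type": "SCHEMA_VALIDATOR",
      "schemaValidator": {}
    },
    {
      "type": "VECTORIZER",
      "vectorizer": {
        "inputProperty": "title",
        "modelId": "mte-base.1",
        "propertyKey": "vector"
      }
    }
  ]
}
```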

Segmenter

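Splits long text into passages before vectorization so semantic queries stay targeted. In the example below, the segmenter stores its passages under the passages key of abstract, and the vectorizer step then reads them through the path abstract.passages.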
{
  "steps": [
    {
      "type": "SEGMENTER",
      "segmenter": {
        "inputProperty": "abstract",
        "modelId": "clinia-chunk.1",
        "propertyKey": "passages"
      }
    },
    {
      "type": "VECTORIZER",
      "vectorizer": {
        "inputProperty": "abstract.passages",
        "modelId": "mte-base-knowledge.1",
        "propertyKey": "vector"
      }
    }
  ]
}
Segmenting reduces noise for RAG workloads and lets you highlight the precise passage that matched the user’s query.

Optical Character Recognition

Extracts text from PDFs or images and stores it in a markdown field alongside the original binary.
Currently available on object collections. The processor reads the uploaded file attached to the object.
{
  "steps": [
    {
      "type": "OPTICAL_CHARACTER_RECOGNITION",
      "opticalCharacterRecognition": {
        "propertyKey": "content",
        "propertyDescription": "Machine generated transcript"
      }
    }
  ]
}
Expose the generated transcript in partitions or search responses as shown in the OCR ingestion how-to.

Address Augmenter

Enriches an address object with standardized formatting, coordinates, and time zone metadata. Because it mutates the input property, make sure your schema includes all required sub-fields before enabling the processor.
The Address Augmenter relies on Clinia’s geocoding service and is rolling out in stages.

Actionable

Pauses the pipeline and routes the payload to human reviewers. Use actionable steps for scenarios where automated processors cannot make the final decision.
Additional documentation will follow.

Schema Validator

Adds an explicit validation checkpoint mid-pipeline. It reuses the rules defined in your profiles and complements the implicit validation that occurs at the end of the pipeline. Use it to:
  • Stop the pipeline before an expensive processor when data is incomplete.
  • Re-validate after a mutation step to ensure enriched data stays compliant.
{
  "steps": [
    {
      "type": "SCHEMA_VALIDATOR",
      "schemaValidator": {}
    }
  ]
}

Mutating vs. enriching processors

  • Mutating processors (Address Augmenter, Clinia Function) replace the input property with enriched data. Update your schema first so the new shape passes validation.
  • Enriching processors (Segmenter, Vectorizer, OCR) add derived properties under enrichedProperties. They keep the original field intact while making extra data available to partitions.
Plan processor order accordingly and consult pipeline basics for execution semantics.