Core Concepts

The fundamental building blocks of Veculo: vertices, edges, vector embeddings, multi-modal content, cell-level security, and cluster scaling.

What is a graph database?

A graph database stores data as vertices (nodes) and edges (relationships between nodes). Unlike relational databases, where relationships are resolved with JOINs that become expensive as a query spans more hops, graph databases make relationship traversal a first-class operation. This makes them well suited to:

  • Knowledge graphs — modeling entities and their relationships (people, organizations, documents, concepts)
  • RAG pipelines — combining semantic search with structural context for better retrieval
  • Recommendation engines — traversing user-item-tag graphs for personalized suggestions
  • Fraud detection — identifying suspicious patterns across interconnected entities

Veculo adds vector embeddings to the graph model, enabling hybrid queries that combine similarity search with graph traversal. Traditional graph databases generally do not support this natively.

Vertices

A vertex represents an entity in your graph. Every vertex has:

Field | Type | Description
--- | --- | ---
id | string | Unique identifier. You choose the format; we recommend prefixed IDs such as doc:arxiv-2401.001 or user:u_7f3a.
label | string | The type of entity (e.g., "document", "person", "concept"). Used for filtering and schema validation.
properties | object | Arbitrary key-value pairs. Values can be strings, numbers, booleans, or arrays.
visibility | string | A visibility expression controlling who can read this vertex. Optional; defaults to the cluster's default visibility.
embedding | float[] | Optional vector embedding for similarity search. See Vector embeddings.

Vertex IDs are unique within a cluster. Inserting a vertex with an existing ID will update the existing vertex (upsert behavior).
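
To make the upsert behavior concrete, here is a minimal in-memory sketch. It is not a Veculo client: the Vertex and VertexStore names are hypothetical, and the sketch assumes an update replaces the whole stored vertex, which the text above does not specify.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Vertex:
    """Mirrors the vertex fields listed in the table above."""
    id: str
    label: str
    properties: dict[str, Any] = field(default_factory=dict)
    visibility: Optional[str] = None          # None stands for "cluster default"
    embedding: Optional[list[float]] = None


class VertexStore:
    """Toy stand-in for a cluster's vertex table (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._by_id: dict[str, Vertex] = {}

    def upsert(self, vertex: Vertex) -> bool:
        """Insert the vertex, or overwrite the one with the same ID.

        Returns True if a new vertex was created, False if one was updated.
        Assumption: an update replaces the stored vertex wholesale.
        """
        created = vertex.id not in self._by_id
        self._by_id[vertex.id] = vertex
        return created

    def get(self, vertex_id: str) -> Vertex:
        return self._by_id[vertex_id]

    def __len__(self) -> int:
        return len(self._by_id)
```

Calling upsert twice with the ID user:u_7f3a leaves a single vertex in the store, holding the values from the second call.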

Edges

An edge represents a directed relationship between two vertices. Edges connect a source vertex to a target vertex and have a type that describes the relationship.

Field | Type | Description
--- | --- | ---
source | string | The ID of the source vertex.
target | string | The ID of the target vertex.
edge_type | string | The kind of relationship (e.g., "cites", "authored_by", "related_to").
properties | object | Arbitrary key-value pairs on the edge itself (e.g., weight, timestamp, context).
visibility | string | Visibility expression for the edge. Can differ from the vertices it connects.

Edges are directed: an edge from A to B is distinct from an edge from B to A. If you need a bidirectional relationship, create two edges.

Edge uniqueness

The combination of (source, target, edge_type) is unique. Inserting a duplicate updates the existing edge properties.
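
The directedness and uniqueness rules can be illustrated with a small in-memory model keyed by the (source, target, edge_type) triple. EdgeStore and its methods are hypothetical names for this sketch, and it assumes a duplicate insert replaces the stored properties rather than merging them.

```python
from typing import Any, Optional

EdgeKey = tuple[str, str, str]  # (source, target, edge_type)


class EdgeStore:
    """Toy in-memory edge table (hypothetical, not a Veculo client)."""

    def __init__(self) -> None:
        self._edges: dict[EdgeKey, dict[str, Any]] = {}

    def upsert(self, source: str, target: str, edge_type: str,
               properties: Optional[dict[str, Any]] = None) -> None:
        # The triple is the identity of the edge, so a duplicate insert
        # lands on the same key. Assumption: properties are replaced.
        self._edges[(source, target, edge_type)] = dict(properties or {})

    def upsert_both_ways(self, a: str, b: str, edge_type: str,
                         properties: Optional[dict[str, Any]] = None) -> None:
        # A bidirectional relationship is modeled as two directed edges.
        self.upsert(a, b, edge_type, properties)
        self.upsert(b, a, edge_type, properties)

    def get(self, source: str, target: str, edge_type: str) -> Optional[dict[str, Any]]:
        return self._edges.get((source, target, edge_type))

    def __len__(self) -> int:
        return len(self._edges)
```

Because the key is ordered, looking up (B, A, "cites") after inserting (A, B, "cites") finds nothing.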

Properties

Both vertices and edges can carry arbitrary key-value properties. These are stored as a JSON object and can contain:

  • Strings: "title": "Attention Is All You Need"
  • Numbers: "year": 2017
  • Booleans: "peer_reviewed": true
  • Arrays: "tags": ["nlp", "transformers"]

Properties are stored alongside the graph structure in Apache Accumulo's sorted key-value store, so they are always read together with the vertex or edge and no secondary lookup is required.
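
If you want to catch unsupported values before sending data, a client-side check along these lines may help. This is a sketch only: validate_properties is a hypothetical helper, and it assumes arrays hold scalar values, which the text above does not state.

```python
from typing import Any

# Python types that map onto the four value kinds listed above.
SCALARS = (str, int, float, bool)


def validate_properties(properties: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the object looks acceptable."""
    problems: list[str] = []
    for key, value in properties.items():
        if not isinstance(key, str):
            problems.append(f"{key!r}: keys must be strings")
        elif isinstance(value, SCALARS):
            continue
        elif isinstance(value, list):
            # Assumption: array items must themselves be scalars.
            if not all(isinstance(item, SCALARS) for item in value):
                problems.append(f"{key}: array items must be strings, numbers, or booleans")
        else:
            problems.append(f"{key}: unsupported type {type(value).__name__}")
    return problems
```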

Vector embeddings

A vector embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of an entity. Embeddings are typically generated by an embedding model (such as OpenAI's text-embedding-3-small or Cohere's embed-v3).

When you attach an embedding to a vertex, Veculo indexes it for approximate nearest-neighbor (ANN) search. You can then query for vertices whose embeddings are closest to a given query vector.

Vertex with embedding (JSON):
{
  "id": "doc:arxiv-2401.001",
  "label": "document",
  "properties": {
    "title": "Attention Is All You Need"
  },
  "embedding": [0.023, -0.114, 0.891, 0.445, -0.067, ...],
  "visibility": "public"
}

Key characteristics of vector search in Veculo:

  • Dimension flexibility — Veculo supports embeddings of any dimension. All embeddings within a cluster must have the same dimension.
  • Cosine similarity — Results are ranked by cosine similarity, returned as a score between 0 and 1.
  • Graph-aware — Vector search results can be enriched with graph context by specifying an edge type and traversal depth.
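
For intuition about the ranking, the sketch below computes cosine similarity and an exact top-k by brute force. It is not Veculo's ANN index, and the function names are hypothetical. Raw cosine similarity ranges from -1 to 1; how Veculo maps it onto the 0 to 1 score is not described here, so the sketch returns the raw value.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Raw cosine similarity in [-1, 1] for two vectors of equal dimension."""
    if len(a) != len(b):
        # Mirrors the rule that all embeddings in a cluster share one dimension.
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def top_k(query: list[float], embeddings: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Exact nearest neighbors by brute force, best match first."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

An ANN index trades a small amount of recall for speed, so its results can differ from this exact ranking on large collections.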

Hybrid queries

The real power of Veculo is combining vector search with graph traversal. For example: find the 10 documents most semantically similar to a query, then for each result walk the "cites" edges 2 hops deep to discover related papers. This produces far richer context for RAG than vector search alone.
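
The example above can be sketched in memory as a similarity ranking followed by a bounded breadth-first walk. This illustrates the idea and is not Veculo's query API: hybrid_query, expand, and the adjacency format are assumptions made for the sketch.

```python
import math
from collections import deque


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def expand(adjacency: dict[str, list[tuple[str, str]]], start: str,
           edge_type: str, max_hops: int) -> list[str]:
    """Breadth-first walk along outgoing edges of one type, up to max_hops from start."""
    seen = {start}
    found: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not walk past the hop limit
        for etype, target in adjacency.get(vertex, []):
            if etype == edge_type and target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, hops + 1))
    return found


def hybrid_query(query: list[float], embeddings: dict[str, list[float]],
                 adjacency: dict[str, list[tuple[str, str]]],
                 k: int = 10, edge_type: str = "cites", max_hops: int = 2) -> dict[str, list[str]]:
    """Top-k vertices by similarity, each mapped to its graph neighborhood."""
    ranked = sorted(embeddings, key=lambda vid: cosine(query, embeddings[vid]), reverse=True)[:k]
    return {vid: expand(adjacency, vid, edge_type, max_hops) for vid in ranked}
```

Here adjacency maps a vertex ID to a list of (edge_type, target) pairs, so edges of other types are skipped during the walk.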

Multi-modal content

Veculo supports uploading rich media files and automatically extracting structured knowledge from them. This turns unstructured content into queryable graph data with embeddings, entities, and cross-modal relationships.

Supported file types

Modality | File types | What Veculo extracts
--- | --- | ---
Documents | PDF, TXT, Markdown, CSV | Full text, page-level embeddings, named entities (people, orgs, locations, concepts), section structure
Images | JPEG, PNG, GIF, WebP | Image embeddings via CLIP, OCR text extraction, object detection labels, visual similarity edges
Audio | MP3, WAV, OGG | Speech-to-text transcription, speaker diarization, transcript embeddings, mentioned entities
Video | MP4, WebM | Keyframe extraction, audio transcription, scene embeddings, combined entity extraction
Code | Python, JavaScript, JSON, HTML | Code embeddings, function/class extraction, dependency edges, documentation extraction

Cross-modal edge discovery

When entities appear across different modalities, Veculo automatically creates edges between them. For example, if a person mentioned in a PDF also appears in an audio transcript, Veculo creates mentioned_in edges connecting the person entity to each of the two source files. This enables cross-modal queries such as "find all content related to this person across documents, images, and recordings."
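
As a rough model of the result, the sketch below turns per-file entity mentions into mentioned_in edges and lists the entities that span more than one file. It assumes entity resolution has already happened (the same person has the same ID in every file) and that edges point from the entity to the file. Both function names are hypothetical.

```python
def mentioned_in_edges(mentions_by_file: dict[str, set[str]]) -> list[tuple[str, str, str]]:
    """One (entity_id, 'mentioned_in', file_id) edge per entity per file."""
    edges = []
    for file_id in sorted(mentions_by_file):
        for entity_id in sorted(mentions_by_file[file_id]):
            edges.append((entity_id, "mentioned_in", file_id))
    return edges


def cross_modal_entities(mentions_by_file: dict[str, set[str]]) -> dict[str, set[str]]:
    """Entities that appear in more than one file, mapped to those files."""
    files_by_entity: dict[str, set[str]] = {}
    for file_id, entities in mentions_by_file.items():
        for entity_id in entities:
            files_by_entity.setdefault(entity_id, set()).add(file_id)
    return {e: files for e, files in files_by_entity.items() if len(files) > 1}
```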

Extraction pipeline

When you upload a file via the /vertices/file endpoint, Veculo runs an asynchronous extraction pipeline:

  1. Upload & store: The file is stored in Google Cloud Storage (GCS) and a file vertex is created immediately with the file metadata.
  2. Content extraction: A worker processes the file based on its content type: OCR for images, speech-to-text for audio, text extraction for PDFs, AST parsing for code.
  3. Entity extraction: Named entities (people, organizations, locations, concepts) are identified and created as vertices with edges back to the source file.
  4. Embedding generation: Text and content embeddings are generated and attached to the file vertex, enabling vector similarity search across all modalities.
  5. Edge discovery: Cross-references to existing vertices are detected and edges are created, building a connected knowledge graph automatically.
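
The steps above can be sketched as a lookup from file to modality to content-extraction step. Note the assumptions: the pipeline is described as dispatching on content type, whereas this sketch uses the filename suffix for simplicity, and the suffix table and plan_pipeline function are hypothetical.

```python
from pathlib import PurePosixPath

# Suffixes derived from the supported file types table (assumed mapping).
MODALITY_BY_SUFFIX = {
    ".pdf": "document", ".txt": "document", ".md": "document", ".csv": "document",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".gif": "image", ".webp": "image",
    ".mp3": "audio", ".wav": "audio", ".ogg": "audio",
    ".mp4": "video", ".webm": "video",
    ".py": "code", ".js": "code", ".json": "code", ".html": "code",
}

# Content-extraction step per modality, as described in step 2 and the table.
CONTENT_STEP = {
    "document": "text extraction",
    "image": "OCR",
    "audio": "speech-to-text",
    "video": "keyframe extraction and audio transcription",
    "code": "AST parsing",
}


def plan_pipeline(filename: str) -> list[str]:
    """Return the ordered pipeline stages for a file, or raise for unsupported types."""
    suffix = PurePosixPath(filename).suffix.lower()
    modality = MODALITY_BY_SUFFIX.get(suffix)
    if modality is None:
        raise ValueError(f"unsupported file type: {suffix or filename}")
    return [
        "upload and store",
        f"content extraction ({CONTENT_STEP[modality]})",
        "entity extraction",
        "embedding generation",
        "edge discovery",
    ]
```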

Processing status

You can monitor the extraction pipeline via the GET /v1/{cluster_id}/jobs endpoint or the Jobs tab in the Explorer UI. Each job reports its status, whether extracted text is available, whether embeddings have been generated, and the page count.
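
Because extraction is asynchronous, a client typically polls until the job settles. The sketch below shows one way to structure that loop. Only the path /v1/{cluster_id}/jobs comes from this page; the response shape (a list of objects with "id" and "status"), the status values, and the injected fetch_jobs callable are all assumptions.

```python
import time
from typing import Any, Callable


def jobs_url(base_url: str, cluster_id: str) -> str:
    """Build the jobs endpoint URL; base_url is whatever host your cluster uses."""
    return f"{base_url.rstrip('/')}/v1/{cluster_id}/jobs"


def wait_for_job(fetch_jobs: Callable[[], list[dict[str, Any]]], job_id: str,
                 done_statuses: tuple[str, ...] = ("completed", "failed"),  # assumed values
                 interval_s: float = 2.0, max_polls: int = 30,
                 sleep: Callable[[float], None] = time.sleep) -> dict[str, Any]:
    """Poll until the job reaches a terminal status, then return its record.

    fetch_jobs is expected to perform the GET request and return the parsed
    job list; it is injected so the HTTP client and auth stay up to you.
    """
    for _ in range(max_polls):
        for job in fetch_jobs():
            if job.get("id") == job_id and job.get("status") in done_statuses:
                return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Injecting fetch_jobs and sleep also makes the loop easy to test without a network.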

Cell-level security (ABAC)

Every vertex and edge in Veculo carries an optional visibility expression — a boolean expression that determines which users can see that piece of data. This is attribute-based access control (ABAC) enforced at the storage layer.

For details on visibility syntax and how to use it, see the Security & ABAC page.
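
To give a feel for how a boolean visibility expression gates a read, here is a small evaluator. It assumes an Accumulo-style syntax (labels joined by & and |, grouped with parentheses). The actual Veculo syntax is defined on the Security & ABAC page and may differ, and is_visible is a hypothetical name.

```python
import re

_TOKEN = re.compile(r"\s*([A-Za-z0-9_.:\-]+|[&|()])")


def _tokenize(expression: str) -> list[str]:
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression: str, attributes: set[str]) -> bool:
    """True if a user holding `attributes` satisfies the expression.

    Assumed grammar: or := and ('|' and)* ; and := term ('&' term)* ;
    term := label | '(' or ')'. Here & binds tighter than |.
    """
    tokens = _tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()      # always parse, then combine
            value = value or rhs
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            value = value and rhs
        return value

    def parse_term() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in attributes

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

Under these assumptions, a vertex tagged (analyst & finance) | admin is readable by a user holding both analyst and finance, or by any user holding admin.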

Veculo Units (VUs)

A Veculo Unit (VU) is the unit of compute and storage capacity for your cluster. Each VU provides a fixed amount of:

  • CPU and memory for query processing
  • Tablet server capacity for read/write throughput
  • Storage bandwidth for scan operations

You choose how many VUs your cluster runs. Adding VUs increases throughput, the number of concurrent connections, and scan performance. Veculo distributes your data across tablet servers proportionally to your VU count.

Tier | VUs | Best for
--- | --- | ---
Starter | 2 | Development, prototyping, low-traffic applications
Growth | 4–8 | Production applications with moderate throughput
Scale | 12–32 | High-throughput workloads, large knowledge graphs
Enterprise | Custom | Dedicated infrastructure, compliance requirements

Scaling is live — you can add or remove VUs without downtime. Veculo rebalances tablets across the new tablet server count automatically.

Clusters

A cluster is a dedicated, isolated Veculo deployment. Each cluster runs its own:

  • Accumulo instance (manager, tablet servers, garbage collector)
  • ZooKeeper ensemble for coordination
  • Dedicated GCS storage prefix for data isolation

Clusters are fully isolated from each other — there is no shared infrastructure between tenants beyond the underlying cloud platform. This ensures strong security boundaries and predictable performance.

Each cluster is identified by a unique ID (e.g., cls_abc123) and runs in the region you select at creation time. Clusters can be paused and resumed to save costs during periods of inactivity.