Core Concepts

The fundamental building blocks of Veculo: vertices, edges, vector embeddings, multi-modal content, cell-level security, and cluster scaling.

What is a graph database?

A graph database stores data as vertices (nodes) and edges (relationships between nodes). Unlike relational databases, where relationships require expensive JOINs, graph databases make relationship traversal a first-class operation. This makes them ideal for:

Knowledge graphs — modeling entities and their relationships (people, organizations, documents, concepts)
RAG pipelines — combining semantic search with structural context for better retrieval
Recommendation engines — traversing user-item-tag graphs for personalized suggestions
Fraud detection — identifying suspicious patterns across interconnected entities

Veculo adds vector embeddings to the graph model, enabling hybrid queries that combine similarity search with graph traversal. Most traditional graph databases do not support this combination natively.

Vertices

A vertex represents an entity in your graph. Every vertex has:

Field	Type	Description
`id`	string	Unique identifier. You choose the format — we recommend prefixed IDs like `doc:arxiv-2401.001` or `user:u_7f3a`.
`label`	string	The type of entity (e.g., "document", "person", "concept"). Used for filtering and schema validation.
`properties`	object	Arbitrary key-value pairs. Values can be strings, numbers, booleans, or arrays.
`visibility`	string	A visibility expression controlling who can read this vertex. Optional — defaults to the cluster's default visibility.
`embedding`	float[]	Optional vector embedding for similarity search. See Vector embeddings.

Vertex IDs are unique within a cluster. Inserting a vertex with an existing ID will update the existing vertex (upsert behavior).
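
To make the upsert rule concrete, here is a minimal in-memory sketch. The `Vertex` dataclass and `VertexStore` class are hypothetical illustrations, not part of any Veculo client, and the sketch assumes an upsert replaces the whole vertex (this page does not say whether properties are merged or replaced).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Vertex:
    """Hypothetical client-side mirror of the vertex fields listed above."""
    id: str
    label: str
    properties: dict[str, Any] = field(default_factory=dict)
    visibility: Optional[str] = None        # None stands for "use the cluster default"
    embedding: Optional[list[float]] = None


class VertexStore:
    """Toy in-memory store that models upsert-by-ID; not the Veculo API."""

    def __init__(self) -> None:
        self._by_id: dict[str, Vertex] = {}

    def upsert(self, vertex: Vertex) -> bool:
        """Insert or overwrite a vertex; return True if the ID already existed."""
        existed = vertex.id in self._by_id
        self._by_id[vertex.id] = vertex     # assumption: full replacement, no merge
        return existed

    def get(self, vertex_id: str) -> Vertex:
        return self._by_id[vertex_id]

    def __len__(self) -> int:
        return len(self._by_id)


store = VertexStore()
first = store.upsert(Vertex("doc:arxiv-2401.001", "document", {"title": "Draft"}))
second = store.upsert(
    Vertex("doc:arxiv-2401.001", "document", {"title": "Attention Is All You Need"})
)
```

After both calls the store still holds a single vertex, and the second call reports that the ID already existed.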

Edges

An edge represents a directed relationship between two vertices. Edges connect a source vertex to a target vertex and have a type that describes the relationship.

Field	Type	Description
`source`	string	The ID of the source vertex.
`target`	string	The ID of the target vertex.
`edge_type`	string	The kind of relationship (e.g., "cites", "authored_by", "related_to").
`properties`	object	Arbitrary key-value pairs on the edge itself (e.g., weight, timestamp, context).
`visibility`	string	Visibility expression for the edge. Can differ from the vertices it connects.

Edges are directed: an edge from A to B is distinct from an edge from B to A. If you need a bidirectional relationship, create two edges.

Edge uniqueness

The combination of (source, target, edge_type) is unique. Inserting a duplicate updates the existing edge properties.
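
The uniqueness and direction rules can be sketched with a dictionary keyed by the (source, target, edge_type) triple. The function names below are hypothetical, and the sketch assumes a duplicate insert replaces the stored properties outright.

```python
from typing import Any, Dict, Optional, Tuple

EdgeKey = Tuple[str, str, str]  # (source, target, edge_type)

edges: Dict[EdgeKey, Dict[str, Any]] = {}


def upsert_edge(source: str, target: str, edge_type: str,
                properties: Optional[Dict[str, Any]] = None) -> bool:
    """Store an edge; return True if it was new, False if it updated an existing one."""
    key = (source, target, edge_type)
    is_new = key not in edges
    edges[key] = dict(properties or {})  # assumption: properties are replaced, not merged
    return is_new


def link_both_ways(a: str, b: str, edge_type: str,
                   properties: Optional[Dict[str, Any]] = None) -> None:
    """Model a bidirectional relationship as two directed edges."""
    upsert_edge(a, b, edge_type, properties)
    upsert_edge(b, a, edge_type, properties)


created = upsert_edge("doc:a", "doc:b", "cites", {"weight": 0.4})
updated = upsert_edge("doc:a", "doc:b", "cites", {"weight": 0.9})
link_both_ways("doc:a", "doc:c", "related_to")
```

The second `upsert_edge` call updates the existing edge rather than creating a new one, while `link_both_ways` produces two distinct entries because direction is part of the key.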

Properties

Both vertices and edges can carry arbitrary key-value properties. These are stored as a JSON object and can contain:

Strings — "title": "Attention Is All You Need"
Numbers — "year": 2017
Booleans — "peer_reviewed": true
Arrays — "tags": ["nlp", "transformers"]

Properties are stored alongside the graph structure in Accumulo's sorted key-value store, ensuring they are always read together with the vertex or edge — no secondary lookups required.
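
A client-side check before sending properties might look like the sketch below. It assumes the four listed value types are exhaustive and that arrays hold only scalar values; neither assumption is confirmed on this page, and the helper names are hypothetical.

```python
from typing import Any, Dict, List

SCALAR_TYPES = (str, int, float, bool)


def is_valid_property_value(value: Any) -> bool:
    """Accept strings, numbers, booleans, and flat arrays of those."""
    if isinstance(value, SCALAR_TYPES):
        return True
    if isinstance(value, list):
        # Assumption: arrays may not nest other arrays or objects.
        return all(isinstance(item, SCALAR_TYPES) for item in value)
    return False


def invalid_property_keys(properties: Dict[str, Any]) -> List[str]:
    """Return the keys whose values fall outside the supported types."""
    return sorted(k for k, v in properties.items() if not is_valid_property_value(v))


paper = {
    "title": "Attention Is All You Need",
    "year": 2017,
    "peer_reviewed": True,
    "tags": ["nlp", "transformers"],
    "venue": {"name": "NeurIPS"},  # nested object: flagged by this sketch
}
problems = invalid_property_keys(paper)
```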

Vector embeddings

A vector embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of an entity. Embeddings are typically generated by an embedding model (such as OpenAI's text-embedding-3-small or Cohere's embed-v3).

When you attach an embedding to a vertex, Veculo indexes it for approximate nearest-neighbor (ANN) search. You can then query for vertices whose embeddings are closest to a given query vector.

Vertex with embedding (JSON)

{
  "id": "doc:arxiv-2401.001",
  "label": "document",
  "properties": {
    "title": "Attention Is All You Need"
  },
  "embedding": [0.023, -0.114, 0.891, 0.445, -0.067, ...],
  "visibility": "public"
}

Key characteristics of vector search in Veculo:

Dimension flexibility — Veculo supports embeddings of any dimension. All embeddings within a cluster must have the same dimension.
Cosine similarity — Results are ranked by cosine similarity, returned as a score between 0 and 1.
Graph-aware — Vector search results can be enriched with graph context by specifying an edge type and traversal depth.
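
For intuition, the ranking can be reproduced with an exact, brute-force search. This is only a sketch of the scoring idea, not Veculo's ANN index, and the function names are hypothetical. Raw cosine similarity ranges from -1 to 1; how Veculo maps it onto the 0 to 1 score mentioned above is not described here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          candidates: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every candidate against the query and return the k best, highest first."""
    scored = [(vid, cosine_similarity(query, emb)) for vid, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


candidates = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.6, 0.8],
    "doc:c": [0.0, 1.0],
}
ranking = top_k([1.0, 0.0], candidates, k=2)
```

The dimension check mirrors the rule that all embeddings in a cluster share one dimension: comparing vectors of different lengths raises an error instead of returning a score.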

Hybrid queries

Hybrid queries combine vector search with graph traversal. For example: find the 10 documents most semantically similar to a query, then for each result walk the "cites" edges 2 hops deep to discover related papers. For RAG, this supplies both the closest semantic matches and the papers structurally linked to them, which vector search alone would not return.
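
The two-phase pattern in that example can be sketched over plain Python data structures. Everything here is illustrative: `hybrid_query` is a hypothetical name, the similarity phase is exact rather than approximate, and the real query API may look quite different.

```python
import math
from collections import deque
from typing import Dict, Iterable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hybrid_query(query: Sequence[float],
                 embeddings: Dict[str, Sequence[float]],
                 edges: Iterable[Tuple[str, str, str]],
                 edge_type: str, k: int, depth: int) -> Dict[str, List[str]]:
    """Return the k most similar vertices, each with the vertices reachable
    within `depth` hops along outgoing edges of `edge_type`."""
    adjacency: Dict[str, List[str]] = {}
    for source, target, etype in edges:
        if etype == edge_type:
            adjacency.setdefault(source, []).append(target)

    # Phase 1: rank by similarity and keep the top k as traversal seeds.
    seeds = sorted(embeddings, key=lambda v: cosine(query, embeddings[v]), reverse=True)[:k]

    # Phase 2: breadth-first walk from each seed, bounded by depth.
    results: Dict[str, List[str]] = {}
    for seed in seeds:
        seen, reached = {seed}, []
        frontier = deque([(seed, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if hops == depth:
                continue
            for neighbor in adjacency.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    reached.append(neighbor)
                    frontier.append((neighbor, hops + 1))
        results[seed] = reached
    return results


embeddings = {"doc:a": [1.0, 0.0], "doc:b": [0.9, 0.1], "doc:c": [0.0, 1.0]}
edges = [
    ("doc:a", "doc:x", "cites"), ("doc:x", "doc:y", "cites"),
    ("doc:y", "doc:z", "cites"), ("doc:b", "doc:x", "cites"),
    ("doc:a", "user:u_7f3a", "authored_by"),
]
context = hybrid_query([1.0, 0.0], embeddings, edges, "cites", k=2, depth=2)
```

With `depth=2`, `doc:z` is left out because it sits three hops from `doc:a`, and the `authored_by` edge is ignored because only `cites` edges are followed.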

Multi-modal content

Veculo supports uploading rich media files and automatically extracting structured knowledge from them. This turns unstructured content into queryable graph data with embeddings, entities, and cross-modal relationships.

Supported file types

Modality	File Types	What Veculo Extracts
Documents	PDF, TXT, Markdown, CSV	Full text, page-level embeddings, named entities (people, orgs, locations, concepts), section structure
Images	JPEG, PNG, GIF, WebP	Image embeddings via CLIP, OCR text extraction, object detection labels, visual similarity edges
Audio	MP3, WAV, OGG	Speech-to-text transcription, speaker diarization, transcript embeddings, mentioned entities
Video	MP4, WebM	Keyframe extraction, audio transcription, scene embeddings, combined entity extraction
Code	Python, JavaScript, JSON, HTML	Code embeddings, function/class extraction, dependency edges, documentation extraction
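
Before uploading, a client could check whether a file falls into one of these modalities. The sketch below is a hypothetical pre-flight helper: the extension list is an assumption derived from the file types in the table, and the extraction pipeline itself is described as working from the content type, not the file name.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed extensions for the file types listed in the table above.
MODALITY_BY_EXTENSION = {
    ".pdf": "documents", ".txt": "documents", ".md": "documents", ".csv": "documents",
    ".jpg": "images", ".jpeg": "images", ".png": "images", ".gif": "images", ".webp": "images",
    ".mp3": "audio", ".wav": "audio", ".ogg": "audio",
    ".mp4": "video", ".webm": "video",
    ".py": "code", ".js": "code", ".json": "code", ".html": "code",
}


def guess_modality(filename: str) -> Optional[str]:
    """Return the modality for a file name, or None if the type is not listed."""
    extension = PurePosixPath(filename).suffix.lower()
    return MODALITY_BY_EXTENSION.get(extension)


modalities = {name: guess_modality(name)
              for name in ["paper.PDF", "talk.mp3", "diagram.webp", "archive.zip"]}
```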

Cross-modal edge discovery

When entities appear across different modalities, Veculo automatically creates edges between them. For example, if a person mentioned in a PDF also appears in an audio transcript, Veculo creates a mentioned_in edge connecting the person entity to both source files. This enables cross-modal queries like "find all content related to this person across documents, images, and recordings."
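
As a rough model of this behavior, the sketch below turns per-file entity mentions into `mentioned_in` edges and then answers the cross-modal question for one entity. The function names and the input shape are hypothetical, and the edge direction (entity to file) is inferred from the description above.

```python
from typing import Dict, Iterable, List


def build_mentioned_in_edges(mentions: Dict[str, Iterable[str]]) -> List[dict]:
    """mentions maps a file vertex ID to the entity IDs extracted from that file."""
    edges = []
    for file_id, entity_ids in mentions.items():
        for entity_id in sorted(set(entity_ids)):  # one edge per (entity, file) pair
            edges.append({"source": entity_id, "target": file_id,
                          "edge_type": "mentioned_in"})
    return edges


def files_mentioning(edges: List[dict], entity_id: str) -> List[str]:
    """All files linked to an entity, regardless of modality."""
    return sorted(edge["target"] for edge in edges
                  if edge["source"] == entity_id and edge["edge_type"] == "mentioned_in")


mentions = {
    "file:report.pdf": ["person:ada", "org:acme", "person:ada"],
    "file:call.mp3": ["person:ada"],
}
edges = build_mentioned_in_edges(mentions)
ada_files = files_mentioning(edges, "person:ada")
```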

Extraction pipeline

When you upload a file via the /vertices/file endpoint, Veculo runs an asynchronous extraction pipeline:

Upload & store — The file is stored in GCS and a file vertex is created immediately with the file metadata.
Content extraction — A worker processes the file based on its content type: OCR for images, speech-to-text for audio, text extraction for PDFs, AST parsing for code.
Entity extraction — Named entities (people, organizations, locations, concepts) are identified and created as vertices with edges back to the source file.
Embedding generation — Text and content embeddings are generated and attached to the file vertex, enabling vector similarity search across all modalities.
Edge discovery — Cross-references to existing vertices are detected and edges are created, building a connected knowledge graph automatically.

Processing status

You can monitor the extraction pipeline via the GET /v1/{cluster_id}/jobs endpoint or the Jobs tab in the Explorer UI. Each job reports its status, whether extracted text is available, whether embeddings have been generated, and the page count.
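
A polling loop around the jobs endpoint could be structured as in the sketch below. Only the path /v1/{cluster_id}/jobs comes from this page. The base URL, the response shape (a list of objects with a "status" field), and the terminal status names "completed" and "failed" are assumptions, so the HTTP call is left to a caller-supplied function.

```python
import time
from typing import Callable, Dict, List, Sequence


def jobs_url(base_url: str, cluster_id: str) -> str:
    """Build the jobs endpoint URL; base_url is a placeholder for your API host."""
    return f"{base_url.rstrip('/')}/v1/{cluster_id}/jobs"


def wait_for_jobs(fetch_jobs: Callable[[], List[Dict]],
                  done_states: Sequence[str] = ("completed", "failed"),  # assumed names
                  interval: float = 2.0,
                  max_polls: int = 30,
                  sleep: Callable[[float], None] = time.sleep) -> List[Dict]:
    """Call fetch_jobs until every job is in a terminal state, then return the jobs."""
    for attempt in range(max_polls):
        jobs = fetch_jobs()
        if all(job.get("status") in done_states for job in jobs):
            return jobs
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"jobs still running after {max_polls} polls")


# Usage with a stand-in fetcher; a real one would issue an authenticated GET
# to jobs_url(...) and decode the JSON body.
fake_responses = iter([[{"status": "running"}], [{"status": "completed"}]])
finished = wait_for_jobs(lambda: next(fake_responses), sleep=lambda _seconds: None)
```

Passing the fetcher and the sleep function as parameters keeps the loop testable without network access or real delays.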

Cell-level security (ABAC)

Every vertex and edge in Veculo carries an optional visibility expression — a boolean expression that determines which users can see that piece of data. This is attribute-based access control (ABAC) enforced at the storage layer.

For details on visibility syntax and how to use it, see the Security & ABAC page.
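
To illustrate what evaluating a boolean visibility expression involves, here is a small evaluator. The syntax is an assumption: it borrows the Accumulo convention of `&` for AND, `|` for OR, and parentheses for grouping, and gives `&` higher precedence for simplicity. Veculo's actual syntax is defined on the Security & ABAC page and may differ; `is_visible` is a hypothetical name.

```python
import re
from typing import List, Set

_TOKEN = re.compile(r"\s*([A-Za-z0-9_.:\-]+|[&|()])")


def tokenize(expression: str) -> List[str]:
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression: str, attributes: Set[str]) -> bool:
    """Evaluate an assumed Accumulo-style expression against a user's attributes."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()   # parse first so short-circuiting never skips tokens
            value = value or rhs
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in attributes   # a label is true if the user holds it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


auditor_can_read = is_visible("admin&(finance|audit)", {"admin", "audit"})
admin_only_can_read = is_visible("admin&(finance|audit)", {"admin"})
```

In this sketch an empty expression raises an error rather than granting access, since an omitted visibility falls back to the cluster default rather than to a rule the client could evaluate.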

Veculo Units (VUs)

A Veculo Unit (VU) is the unit of compute and storage capacity for your cluster. Each VU provides a fixed amount of:

CPU and memory for query processing
Tablet server capacity for read/write throughput
Storage bandwidth for scan operations

You choose how many VUs your cluster runs. More VUs mean higher throughput, more concurrent connections, and faster scans. Veculo distributes your data across tablet servers in proportion to your VU count.

Tier	VUs	Best for
Starter	2	Development, prototyping, low-traffic applications
Growth	4 – 8	Production applications with moderate throughput
Scale	12 – 32	High-throughput workloads, large knowledge graphs
Enterprise	Custom	Dedicated infrastructure, compliance requirements

Scaling is live — you can add or remove VUs without downtime. Veculo rebalances tablets across the new tablet server count automatically.

Clusters

A cluster is a dedicated, isolated Veculo deployment. Each cluster runs its own:

Accumulo instance (manager, tablet servers, garbage collector)
ZooKeeper ensemble for coordination
Dedicated GCS storage prefix for data isolation

Clusters are fully isolated from each other — there is no shared infrastructure between tenants beyond the underlying cloud platform. This ensures strong security boundaries and predictable performance.

Each cluster is identified by a unique ID (e.g., cls_abc123) and runs in the region you select at creation time. Clusters can be paused and resumed to save costs during periods of inactivity.
