Core Concepts

The fundamental building blocks of Veculo: vertices, edges, vector embeddings, cell-level security, and cluster scaling.

What is a graph database?

A graph database stores data as vertices (nodes) and edges (relationships between nodes). Unlike relational databases, where relationships require expensive JOINs, graph databases make relationship traversal a first-class operation. This makes them ideal for:

  • Knowledge graphs — modeling entities and their relationships (people, organizations, documents, concepts)
  • RAG pipelines — combining semantic search with structural context for better retrieval
  • Recommendation engines — traversing user-item-tag graphs for personalized suggestions
  • Fraud detection — identifying suspicious patterns across interconnected entities

Veculo adds vector embeddings to the graph model, enabling hybrid queries that combine similarity search with graph traversal. Traditional graph databases generally do not support this natively.

Vertices

A vertex represents an entity in your graph. Every vertex has:

Field        Type      Description
id           string    Unique identifier. You choose the format; we recommend prefixed IDs like doc:arxiv-2401.001 or user:u_7f3a.
label        string    The type of entity (e.g., "document", "person", "concept"). Used for filtering and schema validation.
properties   object    Arbitrary key-value pairs. Values can be strings, numbers, booleans, or arrays.
visibility   string    A visibility expression controlling who can read this vertex. Optional; defaults to the cluster's default visibility.
embedding    float[]   Optional vector embedding for similarity search. See Vector embeddings.

Vertex IDs are unique within a cluster. Inserting a vertex with an existing ID will update the existing vertex (upsert behavior).
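
The upsert rule can be illustrated with a small in-memory model. This is a sketch only: `Vertex` and `VertexStore` are hypothetical names invented for illustration, not part of any Veculo client, and the sketch assumes an update replaces the whole record (the text above does not say whether properties are merged or replaced).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Vertex:
    """Mirrors the vertex fields listed in the table above."""
    id: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)
    visibility: Optional[str] = None          # None: the cluster default applies
    embedding: Optional[List[float]] = None   # optional, used for similarity search


class VertexStore:
    """Hypothetical in-memory stand-in for a cluster's vertex storage."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Vertex] = {}

    def upsert(self, vertex: Vertex) -> bool:
        """Insert the vertex, or overwrite the one with the same ID.

        Returns True if the ID was new, False if an existing vertex was updated.
        Assumption: the update replaces the full record rather than merging.
        """
        is_new = vertex.id not in self._by_id
        self._by_id[vertex.id] = vertex
        return is_new

    def get(self, vertex_id: str) -> Optional[Vertex]:
        return self._by_id.get(vertex_id)

    def __len__(self) -> int:
        return len(self._by_id)


store = VertexStore()
first = store.upsert(Vertex("doc:arxiv-2401.001", "document", {"title": "Draft"}))
second = store.upsert(Vertex("doc:arxiv-2401.001", "document", {"title": "Final"}))
```

After both calls the store still holds a single vertex, and its title is the one from the second call.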

Edges

An edge represents a directed relationship between two vertices. Edges connect a source vertex to a target vertex and have a type that describes the relationship.

Field        Type      Description
source       string    The ID of the source vertex.
target       string    The ID of the target vertex.
edge_type    string    The kind of relationship (e.g., "cites", "authored_by", "related_to").
properties   object    Arbitrary key-value pairs on the edge itself (e.g., weight, timestamp, context).
visibility   string    Visibility expression for the edge. Can differ from the vertices it connects.

Edges are directed: an edge from A to B is distinct from an edge from B to A. If you need a bidirectional relationship, create two edges.

Edge uniqueness

The combination of (source, target, edge_type) is unique. Inserting a duplicate updates the existing edge properties.
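
A minimal sketch of how direction and the (source, target, edge_type) key interact, using a plain dictionary. `EdgeStore` is a hypothetical name for illustration, and replacing (rather than merging) properties on a duplicate insert is an assumption.

```python
from typing import Any, Dict, Optional, Tuple

# The uniqueness key described above: (source, target, edge_type).
EdgeKey = Tuple[str, str, str]


class EdgeStore:
    """Hypothetical in-memory model of directed, typed edges."""

    def __init__(self) -> None:
        self._edges: Dict[EdgeKey, Dict[str, Any]] = {}

    def upsert(self, source: str, target: str, edge_type: str,
               properties: Optional[Dict[str, Any]] = None) -> bool:
        """Create the edge, or update its properties if the key already exists.

        Returns True when a new edge was created.
        """
        key = (source, target, edge_type)
        is_new = key not in self._edges
        self._edges[key] = dict(properties or {})
        return is_new

    def get(self, source: str, target: str, edge_type: str) -> Optional[Dict[str, Any]]:
        return self._edges.get((source, target, edge_type))

    def __len__(self) -> int:
        return len(self._edges)


edges = EdgeStore()
edges.upsert("doc:a", "doc:b", "cites", {"weight": 0.4})
edges.upsert("doc:a", "doc:b", "cites", {"weight": 0.9})  # same key: updates in place
edges.upsert("doc:a", "doc:b", "related_to")              # different type: new edge
edges.upsert("doc:b", "doc:a", "related_to")              # reverse direction: new edge
```

The four calls leave three edges: the duplicate "cites" insert only changes the weight, while the reversed "related_to" edge is stored separately, which is how a bidirectional relationship is modeled.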

Properties

Both vertices and edges can carry arbitrary key-value properties. These are stored as a JSON object and can contain:

  • Strings: "title": "Attention Is All You Need"
  • Numbers: "year": 2017
  • Booleans: "peer_reviewed": true
  • Arrays: "tags": ["nlp", "transformers"]

Properties are stored alongside the graph structure in Accumulo's sorted key-value store, ensuring they are always read together with the vertex or edge — no secondary lookups required.
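
Before sending data, it can help to check property values against the four listed types on the client side. The helper below is a hypothetical sketch, not a Veculo API. It assumes arrays hold only scalar values and that null and nested objects are not accepted, since neither appears in the list above.

```python
from typing import Any, Dict, List

# Scalar types from the list above. Note that bool is a subclass of int in Python.
_SCALARS = (str, int, float, bool)


def is_valid_property_value(value: Any) -> bool:
    """True for strings, numbers, booleans, or arrays of those scalars."""
    if isinstance(value, _SCALARS):
        return True
    if isinstance(value, list):
        # Assumption: arrays are flat and contain only scalars.
        return all(isinstance(item, _SCALARS) for item in value)
    return False


def invalid_property_keys(properties: Dict[str, Any]) -> List[str]:
    """Return the keys whose values fall outside the supported types."""
    return [key for key, value in properties.items()
            if not is_valid_property_value(value)]


good = {
    "title": "Attention Is All You Need",
    "year": 2017,
    "peer_reviewed": True,
    "tags": ["nlp", "transformers"],
}
bad = {"meta": {"nested": 1}, "missing": None, "year": 2017}

good_errors = invalid_property_keys(good)
bad_errors = invalid_property_keys(bad)
```

Here `good_errors` is empty, and `bad_errors` names the `meta` and `missing` keys.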

Vector embeddings

A vector embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of an entity. Embeddings are typically generated by an embedding model (such as OpenAI's text-embedding-3-small or Cohere's embed-v3).

When you attach an embedding to a vertex, Veculo indexes it for approximate nearest-neighbor (ANN) search. You can then query for vertices whose embeddings are closest to a given query vector.

Vertex with embedding (JSON)
{
  "id": "doc:arxiv-2401.001",
  "label": "document",
  "properties": {
    "title": "Attention Is All You Need"
  },
  "embedding": [0.023, -0.114, 0.891, 0.445, -0.067, ...],
  "visibility": "public"
}

Key characteristics of vector search in Veculo:

  • Dimension flexibility — Veculo supports embeddings of any dimension. All embeddings within a cluster must have the same dimension.
  • Cosine similarity — Results are ranked by cosine similarity, returned as a score between 0 and 1.
  • Graph-aware — Vector search results can be enriched with graph context by specifying an edge type and traversal depth.
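
To make the ranking concrete, here is a brute-force sketch of cosine-similarity search in plain Python. It is not Veculo's implementation: Veculo uses an approximate index, while this scans every vector exactly. The function names are illustrative. Raw cosine similarity lies in [-1, 1]; how Veculo maps scores into the 0 to 1 range is not described here, so the sketch returns the raw value.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        # Mirrors the rule that all embeddings in a cluster share one dimension.
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          embeddings: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Exact nearest neighbours: score every vertex, sort descending, keep k."""
    scored = [(vid, cosine_similarity(query, emb)) for vid, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


embeddings = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.6, 0.8],
    "doc:c": [0.0, 1.0],
}
results = top_k([1.0, 0.0], embeddings, k=2)
```

With the query pointing along the first axis, `doc:a` ranks first and `doc:b` second; `doc:c` is orthogonal to the query and falls outside the top two.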

Hybrid queries

The real power of Veculo is combining vector search with graph traversal. For example: find the 10 documents most semantically similar to a query, then for each result walk the "cites" edges 2 hops deep to discover related papers. This produces far richer context for RAG than vector search alone.
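
The two-stage pattern described above can be sketched in memory: rank by similarity, then walk a bounded number of hops along one edge type from each hit. Everything here is illustrative rather than Veculo's query API. The function names are hypothetical, the search is exact rather than approximate, and the traversal is assumed to follow outgoing edges only.

```python
import math
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def expand(seed: str, edges: List[Tuple[str, str, str]],
           edge_type: str, max_hops: int) -> Set[str]:
    """Breadth-first walk over outgoing edges of one type, up to max_hops deep."""
    outgoing: Dict[str, List[str]] = {}
    for source, target, etype in edges:
        if etype == edge_type:
            outgoing.setdefault(source, []).append(target)

    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget used up on this path
        for neighbour in outgoing.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(seed)
    return seen


def hybrid_query(query: Sequence[float],
                 embeddings: Dict[str, Sequence[float]],
                 edges: List[Tuple[str, str, str]],
                 k: int, edge_type: str, max_hops: int) -> Dict[str, Set[str]]:
    """Stage 1: top-k by similarity. Stage 2: graph context for each hit."""
    ranked = sorted(embeddings,
                    key=lambda vid: cosine(query, embeddings[vid]),
                    reverse=True)[:k]
    return {vid: expand(vid, edges, edge_type, max_hops) for vid in ranked}


embeddings = {"doc:a": [1.0, 0.0], "doc:b": [0.0, 1.0]}
edges = [
    ("doc:a", "doc:c", "cites"),
    ("doc:c", "doc:d", "cites"),
    ("doc:d", "doc:e", "cites"),       # three hops from doc:a: out of range
    ("doc:a", "doc:f", "related_to"),  # wrong edge type: ignored
]
context = hybrid_query([1.0, 0.0], embeddings, edges,
                       k=1, edge_type="cites", max_hops=2)
```

The single hit, `doc:a`, comes back with `doc:c` and `doc:d` as its citation context; `doc:e` is one hop too far and `doc:f` is linked by a different edge type.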

Cell-level security (ABAC)

Every vertex and edge in Veculo carries an optional visibility expression — a boolean expression that determines which users can see that piece of data. This is attribute-based access control (ABAC) enforced at the storage layer.

For details on visibility syntax and how to use it, see the Security & ABAC page.
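
As a rough illustration of what evaluating a visibility expression involves, the sketch below checks an expression against a user's set of attributes. The syntax is an assumption: it borrows the Accumulo convention of `&` for AND, `|` for OR, and parentheses for grouping. Veculo's actual syntax is defined on the Security & ABAC page and may differ. Unlike Accumulo, this sketch lets `&` bind tighter than `|` when parentheses are omitted.

```python
import re
from typing import List, Optional, Set

# One token is either an attribute name or one of the operators & | ( ).
_TOKEN = re.compile(r"\s*([A-Za-z0-9_.:\-]+|[&|()])")
_OPERATORS = {"&", "|", "(", ")"}


def tokenize(expression: str) -> List[str]:
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression: str, user_attributes: Set[str]) -> bool:
    """Evaluate the expression; an attribute is true if the user holds it."""
    tokens = tokenize(expression)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def atom() -> bool:
        token = take()
        if token == "(":
            value = or_expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if token in _OPERATORS:
            raise ValueError(f"unexpected operator {token!r}")
        return token in user_attributes

    def and_expr() -> bool:
        value = atom()
        while peek() == "&":
            take()
            right = atom()  # always parse the right side, then combine
            value = value and right
        return value

    def or_expr() -> bool:
        value = and_expr()
        while peek() == "|":
            take()
            right = and_expr()
            value = value or right
        return value

    result = or_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


allowed = is_visible("admin&(finance|legal)", {"admin", "legal"})
denied = is_visible("admin&(finance|legal)", {"legal"})
```

A user holding `admin` and `legal` satisfies the expression; a user holding only `legal` does not.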

Veculo Units (VUs)

A Veculo Unit (VU) is the unit of compute and storage capacity for your cluster. Each VU provides a fixed amount of:

  • CPU and memory for query processing
  • Tablet server capacity for read/write throughput
  • Storage bandwidth for scan operations

You choose how many VUs your cluster runs. More VUs means more throughput, more concurrent connections, and faster scan performance. Veculo distributes your data across tablet servers proportionally to your VU count.

Tier         VUs       Best for
Starter      2         Development, prototyping, low-traffic applications
Growth       4 – 8     Production applications with moderate throughput
Scale        12 – 32   High-throughput workloads, large knowledge graphs
Enterprise   Custom    Dedicated infrastructure, compliance requirements

Scaling is live — you can add or remove VUs without downtime. Veculo rebalances tablets across the new tablet server count automatically.
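
The effect of rebalancing can be pictured with a toy assignment function. This is not how Accumulo's balancer or Veculo's scaling actually works. It is a round-robin sketch with hypothetical tablet and server names, and it makes no claim about how many tablet servers one VU provides.

```python
from typing import Dict, List


def assign_tablets(tablets: List[str], servers: List[str]) -> Dict[str, List[str]]:
    """Spread tablets across servers round-robin so loads differ by at most one."""
    if not servers:
        raise ValueError("at least one tablet server is required")
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    for index, tablet in enumerate(tablets):
        assignment[servers[index % len(servers)]].append(tablet)
    return assignment


tablets = [f"tablet-{n}" for n in range(8)]

# Before scaling: two servers share eight tablets.
before = assign_tablets(tablets, ["ts-1", "ts-2"])

# After scaling: four servers, so each one holds fewer tablets.
after = assign_tablets(tablets, ["ts-1", "ts-2", "ts-3", "ts-4"])

load_before = {server: len(owned) for server, owned in before.items()}
load_after = {server: len(owned) for server, owned in after.items()}
```

Doubling the server count halves the tablets each server owns (four each before, two each after), which is the intuition behind more capacity yielding more scan and write throughput.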

Clusters

A cluster is a dedicated, isolated Veculo deployment. Each cluster runs its own:

  • Accumulo instance (manager, tablet servers, garbage collector)
  • ZooKeeper ensemble for coordination
  • Dedicated GCS storage prefix for data isolation

Clusters are fully isolated from each other — there is no shared infrastructure between tenants beyond the underlying cloud platform. This ensures strong security boundaries and predictable performance.

Each cluster is identified by a unique ID (e.g., cls_abc123) and runs in the region you select at creation time. Clusters can be paused and resumed to save costs during periods of inactivity.