Core Concepts
The fundamental building blocks of Veculo: vertices, edges, vector embeddings, multi-modal content, cell-level security, and cluster scaling.
What is a graph database?
A graph database stores data as vertices (nodes) and edges (relationships between nodes). Unlike relational databases, where relationships require expensive JOINs, graph databases make relationship traversal a first-class operation. This makes them ideal for:
- Knowledge graphs — modeling entities and their relationships (people, organizations, documents, concepts)
- RAG pipelines — combining semantic search with structural context for better retrieval
- Recommendation engines — traversing user-item-tag graphs for personalized suggestions
- Fraud detection — identifying suspicious patterns across interconnected entities
Veculo adds vector embeddings to the graph model, enabling hybrid queries that combine similarity search with graph traversal. This is something traditional graph databases cannot do natively.
Vertices
A vertex represents an entity in your graph. Every vertex has:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier. You choose the format; we recommend prefixed IDs like doc:arxiv-2401.001 or user:u_7f3a. |
| label | string | The type of entity (e.g., "document", "person", "concept"). Used for filtering and schema validation. |
| properties | object | Arbitrary key-value pairs. Values can be strings, numbers, booleans, or arrays. |
| visibility | string | A visibility expression controlling who can read this vertex. Optional; defaults to the cluster's default visibility. |
| embedding | float[] | Optional vector embedding for similarity search. See Vector embeddings. |
Vertex IDs are unique within a cluster. Inserting a vertex with an existing ID will update the existing vertex (upsert behavior).
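The upsert rule can be illustrated with a minimal in-memory sketch. The `store` dictionary and `upsert_vertex` helper are hypothetical stand-ins, not part of the Veculo API, and the sketch assumes an update replaces the whole vertex; whether Veculo merges or replaces properties on update is not stated here.

```python
from typing import Any

# Hypothetical in-memory stand-in for a cluster's vertex store (not the Veculo API).
store: dict[str, dict[str, Any]] = {}


def upsert_vertex(vertex: dict[str, Any]) -> str:
    """Insert the vertex, or overwrite the existing vertex that has the same ID.

    Assumption: an update replaces the stored vertex wholesale.
    """
    action = "updated" if vertex["id"] in store else "inserted"
    store[vertex["id"]] = vertex
    return action


first = upsert_vertex({
    "id": "doc:arxiv-2401.001",
    "label": "document",
    "properties": {"title": "Attention Is All You Need"},
})
second = upsert_vertex({
    "id": "doc:arxiv-2401.001",
    "label": "document",
    "properties": {"title": "Attention Is All You Need", "year": 2017},
})
```

After both calls the store still holds a single vertex, and the second call is reported as an update rather than an insert.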
Edges
An edge represents a directed relationship between two vertices. Edges connect a source vertex to a target vertex and have a type that describes the relationship.
| Field | Type | Description |
|---|---|---|
| source | string | The ID of the source vertex. |
| target | string | The ID of the target vertex. |
| edge_type | string | The kind of relationship (e.g., "cites", "authored_by", "related_to"). |
| properties | object | Arbitrary key-value pairs on the edge itself (e.g., weight, timestamp, context). |
| visibility | string | Visibility expression for the edge. Can differ from the vertices it connects. |
Edges are directed: an edge from A to B is distinct from an edge from B to A. If you need a bidirectional relationship, create two edges.
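As a rough illustration of modeling a bidirectional relationship with two directed edges, the sketch below builds both edge records from the fields in the table above. The helper names and the second document ID are hypothetical, and the dictionaries only mirror the documented field names rather than any client library.

```python
from typing import Any, Optional


def make_edge(source: str, target: str, edge_type: str,
              properties: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build one directed edge record using the documented field names."""
    return {
        "source": source,
        "target": target,
        "edge_type": edge_type,
        "properties": dict(properties or {}),
    }


def bidirectional(a: str, b: str, edge_type: str,
                  properties: Optional[dict[str, Any]] = None) -> list[dict[str, Any]]:
    """Return the two directed edges (a -> b and b -> a) that model a two-way link."""
    return [make_edge(a, b, edge_type, properties),
            make_edge(b, a, edge_type, properties)]


# "doc:arxiv-2401.002" is a made-up ID used only for this example.
edges = bidirectional("doc:arxiv-2401.001", "doc:arxiv-2401.002",
                      "related_to", {"weight": 0.8})
```

Each edge gets its own copy of the properties, so the two directions can later diverge (for example, different weights) without affecting each other.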
Edge uniqueness
Properties
Both vertices and edges can carry arbitrary key-value properties. These are stored as a JSON object and can contain:
- Strings: "title": "Attention Is All You Need"
- Numbers: "year": 2017
- Booleans: "peer_reviewed": true
- Arrays: "tags": ["nlp", "transformers"]
Properties are stored alongside the graph structure in Accumulo's sorted key-value store, ensuring they are always read together with the vertex or edge — no secondary lookups required.
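A client might want to check property values before sending them. The sketch below is one possible pre-flight check, not Veculo's own validation: `is_valid_property_value` is a hypothetical helper, and it assumes arrays may only contain scalar values, which the text above does not confirm.

```python
from typing import Any

# The scalar types listed above: strings, numbers, booleans.
SCALAR_TYPES = (str, bool, int, float)


def is_valid_property_value(value: Any) -> bool:
    """Return True for a string, number, boolean, or an array of those.

    Assumption: arrays hold scalars only (no nested arrays or objects).
    """
    if isinstance(value, SCALAR_TYPES):
        return True
    if isinstance(value, list):
        return all(isinstance(item, SCALAR_TYPES) for item in value)
    return False


properties = {
    "title": "Attention Is All You Need",
    "year": 2017,
    "peer_reviewed": True,
    "tags": ["nlp", "transformers"],
}
all_valid = all(is_valid_property_value(v) for v in properties.values())
```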
Vector embeddings
A vector embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of an entity. Embeddings are typically generated by an embedding model (such as OpenAI's text-embedding-3-small or Cohere's embed-v3).
When you attach an embedding to a vertex, Veculo indexes it for approximate nearest-neighbor (ANN) search. You can then query for vertices whose embeddings are closest to a given query vector.
```json
{
  "id": "doc:arxiv-2401.001",
  "label": "document",
  "properties": {
    "title": "Attention Is All You Need"
  },
  "embedding": [0.023, -0.114, 0.891, 0.445, -0.067, ...],
  "visibility": "public"
}
```
Key characteristics of vector search in Veculo:
- Dimension flexibility — Veculo supports embeddings of any dimension. All embeddings within a cluster must have the same dimension.
- Cosine similarity — Results are ranked by cosine similarity, returned as a score between 0 and 1.
- Graph-aware — Vector search results can be enriched with graph context by specifying an edge type and traversal depth.
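To make the ranking criterion concrete, here is a small brute-force sketch of cosine-similarity ranking. It is an exact search over a toy list, not the approximate nearest-neighbor index Veculo uses, and the function names and two-dimensional embeddings are illustrative assumptions. It returns the raw cosine value; how Veculo maps results into its reported score range is not specified here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], vertices: list[dict], k: int) -> list[tuple[str, float]]:
    """Exact top-k ranking: score every vertex, sort by descending similarity."""
    scored = [(v["id"], cosine_similarity(query, v["embedding"])) for v in vertices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vertices with made-up IDs and 2-dimensional embeddings.
vertices = [
    {"id": "doc:a", "embedding": [1.0, 0.0]},
    {"id": "doc:b", "embedding": [0.0, 1.0]},
    {"id": "doc:c", "embedding": [1.0, 1.0]},
]
ranked = rank_by_similarity([1.0, 0.0], vertices, k=2)
```

The dimension check mirrors the rule that all embeddings in a cluster share one dimension.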
Hybrid queries
Multi-Modal Content
Veculo supports uploading rich media files and automatically extracting structured knowledge from them. This turns unstructured content into queryable graph data with embeddings, entities, and cross-modal relationships.
Supported file types
| Modality | File Types | What Veculo Extracts |
|---|---|---|
| Documents | PDF, TXT, Markdown, CSV | Full text, page-level embeddings, named entities (people, orgs, locations, concepts), section structure |
| Images | JPEG, PNG, GIF, WebP | Image embeddings via CLIP, OCR text extraction, object detection labels, visual similarity edges |
| Audio | MP3, WAV, OGG | Speech-to-text transcription, speaker diarization, transcript embeddings, mentioned entities |
| Video | MP4, WebM | Keyframe extraction, audio transcription, scene embeddings, combined entity extraction |
| Code | Python, JavaScript, JSON, HTML | Code embeddings, function/class extraction, dependency edges, documentation extraction |
Cross-modal edge discovery
When entities appear across different modalities, Veculo automatically creates edges between them. For example, if a person mentioned in a PDF also appears in an audio transcript, Veculo creates a mentioned_in edge connecting the person entity to both source files. This enables cross-modal queries like "find all content related to this person across documents, images, and recordings."
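The idea can be sketched as follows, assuming entity extraction has already produced a set of entity IDs per file. The function names, IDs, and edge direction (entity to file) are illustrative assumptions based on the example above, not Veculo's internal implementation.

```python
def mentioned_in_edges(mentions: dict[str, set[str]]) -> list[dict[str, str]]:
    """Create one mentioned_in edge from each entity to each file that mentions it.

    `mentions` maps a file vertex ID to the entity IDs extracted from that file.
    """
    edges = []
    for file_id in sorted(mentions):
        for entity_id in sorted(mentions[file_id]):
            edges.append({"source": entity_id, "target": file_id,
                          "edge_type": "mentioned_in"})
    return edges


def files_mentioning(entity_id: str, edges: list[dict[str, str]]) -> list[str]:
    """Follow mentioned_in edges from an entity to every file, regardless of modality."""
    return [e["target"] for e in edges
            if e["source"] == entity_id and e["edge_type"] == "mentioned_in"]


# Made-up IDs: the same person appears in a PDF and in an audio transcript.
mentions = {
    "file:report.pdf": {"person:ada", "org:acme"},
    "file:interview.mp3": {"person:ada"},
}
edges = mentioned_in_edges(mentions)
ada_files = files_mentioning("person:ada", edges)
```

Because both files point back to the same person vertex, a single traversal from that vertex reaches content in every modality.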
Extraction pipeline
When you upload a file via the /vertices/file endpoint, Veculo runs an asynchronous extraction pipeline:
- Upload & store — The file is stored in GCS and a file vertex is created immediately with the file metadata.
- Content extraction — A worker processes the file based on its content type: OCR for images, speech-to-text for audio, text extraction for PDFs, AST parsing for code.
- Entity extraction — Named entities (people, organizations, locations, concepts) are identified and created as vertices with edges back to the source file.
- Embedding generation — Text and content embeddings are generated and attached to the file vertex, enabling vector similarity search across all modalities.
- Edge discovery — Cross-references to existing vertices are detected and edges are created, building a connected knowledge graph automatically.
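The content-extraction step dispatches on file type. A rough sketch of that dispatch is shown below; the extension list is an assumption derived from the supported file types table, the step names paraphrase the descriptions above, and `plan_extraction` is a hypothetical helper rather than anything exposed by Veculo.

```python
from pathlib import Path

# Assumed extensions for the supported file types; not an official list.
MODALITY_BY_EXTENSION = {
    ".pdf": "document", ".txt": "document", ".md": "document", ".csv": "document",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".gif": "image", ".webp": "image",
    ".mp3": "audio", ".wav": "audio", ".ogg": "audio",
    ".mp4": "video", ".webm": "video",
    ".py": "code", ".js": "code", ".json": "code", ".html": "code",
}

# Primary extraction step per modality, paraphrased from the text above.
EXTRACTION_STEP = {
    "document": "text extraction",
    "image": "OCR",
    "audio": "speech-to-text",
    "video": "keyframe extraction and audio transcription",
    "code": "AST parsing",
}


def plan_extraction(filename: str) -> tuple[str, str]:
    """Return (modality, extraction step) for a file, based on its extension."""
    extension = Path(filename).suffix.lower()
    modality = MODALITY_BY_EXTENSION.get(extension)
    if modality is None:
        raise ValueError(f"unsupported file type: {filename}")
    return modality, EXTRACTION_STEP[modality]


plan = plan_extraction("paper.PDF")
```

A real pipeline would more likely dispatch on the uploaded content type than on the file name; the extension lookup just keeps the sketch self-contained.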
Processing status
Track the progress of extraction jobs through the GET /v1/{cluster_id}/jobs endpoint or the Jobs tab in the Explorer UI. Each job reports its status, extracted text availability, embedding generation, and page count.
Cell-level security (ABAC)
Every vertex and edge in Veculo carries an optional visibility expression — a boolean expression that determines which users can see that piece of data. This is attribute-based access control (ABAC) enforced at the storage layer.
For details on visibility syntax and how to use it, see the Security & ABAC page.
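To give a feel for how a boolean visibility expression can gate reads, the sketch below evaluates a simple expression against a user's attributes. The syntax is an assumption loosely modeled on Accumulo-style labels combined with `&`, `|`, and parentheses; Veculo's actual syntax is defined on the Security & ABAC page and may differ. As a simplification, `&` binds tighter than `|` here, and `can_read` is a hypothetical name.

```python
import re

# A token is a label (letters, digits, underscore) or one of & | ( ).
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression: str) -> list[str]:
    expression = expression.rstrip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def can_read(expression: str, attributes: set[str]) -> bool:
    """Evaluate an assumed-syntax visibility expression against user attributes."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse, so the cursor advances
            result = result or right
        return result

    def parse_and() -> bool:
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in attributes  # a label is satisfied if the user holds it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


allowed = can_read("admin|(analyst&finance)", {"analyst", "finance"})
```

In Veculo this check happens at the storage layer rather than in client code; the sketch only illustrates the evaluation logic.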
Veculo Units (VUs)
A Veculo Unit (VU) is the unit of compute and storage capacity for your cluster. Each VU provides a fixed amount of:
- CPU and memory for query processing
- Tablet server capacity for read/write throughput
- Storage bandwidth for scan operations
You choose how many VUs your cluster runs. More VUs means more throughput, more concurrent connections, and faster scan performance. Veculo distributes your data across tablet servers proportionally to your VU count.
| Tier | VUs | Best for |
|---|---|---|
| Starter | 2 | Development, prototyping, low-traffic applications |
| Growth | 4 – 8 | Production applications with moderate throughput |
| Scale | 12 – 32 | High-throughput workloads, large knowledge graphs |
| Enterprise | Custom | Dedicated infrastructure, compliance requirements |
Scaling is live — you can add or remove VUs without downtime. Veculo rebalances tablets across the new tablet server count automatically.
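As a simplified picture of what rebalancing means, the sketch below spreads tablets evenly over a given number of tablet servers using round-robin assignment. This is an illustrative assumption, not the balancing algorithm Veculo or Accumulo actually uses, and the relationship between VU count and tablet server count is not specified here, so the server count is passed in directly.

```python
def balance_tablets(tablets: list[str], server_count: int) -> dict[int, list[str]]:
    """Assign tablets to servers round-robin so counts differ by at most one."""
    if server_count < 1:
        raise ValueError("at least one tablet server is required")
    assignment: dict[int, list[str]] = {server: [] for server in range(server_count)}
    for index, tablet in enumerate(tablets):
        assignment[index % server_count].append(tablet)
    return assignment


# Hypothetical tablet names; scaling from 2 to 4 servers halves the per-server load.
tablets = [f"tablet-{n}" for n in range(8)]
before = balance_tablets(tablets, server_count=2)
after = balance_tablets(tablets, server_count=4)
```

A production balancer would also try to minimize how many tablets move when the server count changes, which this sketch ignores.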
Clusters
A cluster is a dedicated, isolated Veculo deployment. Each cluster runs its own:
- Accumulo instance (manager, tablet servers, garbage collector)
- ZooKeeper ensemble for coordination
- Dedicated GCS storage prefix for data isolation
Clusters are fully isolated from each other — there is no shared infrastructure between tenants beyond the underlying cloud platform. This ensures strong security boundaries and predictable performance.
Each cluster is identified by a unique ID (e.g., cls_abc123) and runs in the region you select at creation time. Clusters can be paused and resumed to save costs during periods of inactivity.