Skip to main content

Hybrid Relational + Semantic Database Design – FacturaScan 360

Overview

FacturaScan 360 integrates AWS Textract, semantic validation, and conversational analytics through a modular AI agent system governed by the Model Context Protocol (MCP). This architecture requires a database that supports:

  • Structured and semi-structured data simultaneously.
  • Persistent traceability of OCR results and validations.
  • Storage and retrieval of semantic contexts exchanged via MCP agents.
  • Analytical querying for dashboards and AI reporting agents.

This document defines a relational-hybrid data model optimized for multi-tenant SaaS, semantic processing, and BI tool integration.


1. Core Table: invoices

The invoices table captures the full lifecycle of a scanned invoice, from raw OCR to contextual MCP representations.

ColumnTypeDescription
idUUID (PK)Unique invoice identifier
tenant_idUUID (FK)Foreign key to organizations.id
uploaded_byUUID (FK)User who submitted the document
file_urlTEXTLocation of the original PDF in S3
textract_rawJSONBRaw AWS Textract output
invoice_contextJSONBParsed semantic representation (MCP schema)
validation_contextJSONBValidation outcomes (MCP schema)
analytics_contextJSONBDerived metrics and business features (MCP schema)
statusTEXTEnum: 'pending', 'validated', 'error', 'analyzed'
created_atTIMESTAMPUpload timestamp

Indexes:

  • (tenant_id, created_at)
  • GIN(textract_raw) and GIN(invoice_context) for JSONB search (optional)

2. Supporting Tables

2.1. organizations

ColumnTypeDescription
idUUID (PK)Organization identifier
nameTEXTLegal or business name
created_atTIMESTAMPAccount creation date

2.2. users

ColumnTypeDescription
idUUID (PK)User ID
emailTEXTEmail (Cognito-based auth)
roleTEXTEnum: 'admin', 'editor', etc.
tenant_idUUID (FK)Link to organizations.id
is_activeBOOLEANActivation state

2.3. alerts

ColumnTypeDescription
idUUID (PK)Alert ID
invoice_idUUID (FK)Affected invoice
typeTEXT'email', 'slack', 'teams'
statusTEXT'sent', 'pending', 'failed'
triggered_atTIMESTAMPTime of alert dispatch

2.4. logs

Used for user actions and system auditing.

ColumnTypeDescription
idUUID (PK)Log ID
user_idUUID (FK)Associated user
action_typeTEXT'upload', 'validate', 'alert'
descriptionTEXTFree-text message
timestampTIMESTAMPAction time

3. Multi-Tenant Strategy

All tables share a unified schema with tenant_id as foreign key and enforcement via PostgreSQL Row-Level Security (RLS).

Policy Example

CREATE POLICY tenant_isolation ON invoices
USING (tenant_id = current_setting('app.current_tenant')::uuid);

4. Analytical Views (SQL-Based)

4.1. Monthly Totals per Tenant

CREATE VIEW view_invoice_totals_monthly AS
SELECT
tenant_id,
date_trunc('month', created_at) AS month,
COUNT(*) AS invoice_count,
SUM((invoice_context->>'subtotal')::numeric) AS subtotal,
SUM((invoice_context->>'vat')::numeric) AS vat,
SUM((invoice_context->>'total')::numeric) AS total
FROM invoices
GROUP BY tenant_id, month;

4.2. Error Rate per Rule

CREATE VIEW view_validation_error_rate AS
SELECT
tenant_id,
validation_context->>'rule_id' AS rule_id,
COUNT(*) AS occurrences
FROM invoices
WHERE status = 'error'
GROUP BY tenant_id, rule_id;

4.3. Duplicate Invoices Detected

CREATE VIEW view_possible_duplicates AS
SELECT
tenant_id,
invoice_context->>'supplier_name' AS supplier,
invoice_context->>'invoice_number' AS invoice_number,
COUNT(*) AS count
FROM invoices
GROUP BY tenant_id, supplier, invoice_number
HAVING COUNT(*) > 1;

5. JSONB Column Semantics (MCP Contexts)

Each context field (invoice_context, validation_context, analytics_context) adheres to a well-defined MCP schema. These contexts are:

  • Machine-readable
  • AI-agent-exchangeable
  • Traceable for audit and reprocessing

These will be detailed separately in the MCP schema documentation.


6. BI Compatibility and External Access

The data model supports direct integration with BI tools:

ToolAccess methodNotes
MetabasePostgreSQL read-only userCan explore both views and JSONB data
SupersetSQLAlchemy connectorUsed for advanced dashboards
TableauPostgreSQL driverCompatible with structured columns

7. Summary

The hybrid schema enables:

  • Long-term persistence of raw and semantic invoice data.
  • Modular integration of AI agents via MCP.
  • Structured access to key metrics for validation and analytics.
  • Extensibility for new agent types (e.g., risk scoring, audit compliance).

Next: see 04-mcp-schemas.md for definitions of InvoiceContext, ValidationContext, and AnalyticsContext.