Data Guardiuum
A modern Data Lakehouse solution with Open and Unified data processing platform for Data Lake and Data warehouse.
Data Guardiuum
Guardiuum is a centralised Data Governance solution, support on-premise as well as cloud agnostic. In the era of data-driven decision-making, Guardiuum emerges as a pivotal tool for enterprises aiming to harness the power of their data securely and efficiently. Guardiuum is a comprehensive data governance platform that serves as the backbone for managing your organization's data landscape. With its robust connectors, Guardiuum integrates seamlessly with various platforms, offering a unified view of schemas, namespaces, tables, and columns.
Key Features
- Centralised data management: Guardiumm will be the entry point for all data sources provisioned in the platform.
- Query Federation: Run queries directly on the database without moving the data across. Queries can be written to perform complicated join across various different data sources.
- SQL Everywhere: Any connected platform can be connected using Postgres compliant client and access data using SQL queries.
- Audit Logging: Captures every interation within the platform and generate insights into the data is being accessed.
- Role based Security policy: Apply individual level or group based security policy for accessing the data.
- Enterprise auth integration: Support for simple, Kerberose(Upcoming) and Active Directory and Azure Entra ID authentication.
- Zero Data Copy: With in-built query federation, data is not moved between platforms instead, it is directly queried against the source platform where data resides. For end user Guardiuum is the data provider but in the backend Guardiuum identifies the right source, executes the query on the right platform and returns the results back to the users. This will be very helpful in building a unified analytics dashboard, exporting enriched data for consumers etc.
- Inherently Support Data Mesh and Data Fabric: Guardiuum plays a crucial role in advancing the capabilities of both data mesh and data fabric architectures, offering tools that facilitate better data governance and seamless interconnectivity. By integrating with Guardiuum, organizations can leverage their distributed data assets more efficiently and securely.
Use Cases
E-commerce Multi-vendor Case Study
Executive Summary
This document outlines the implementation of a data governance solution for a major e-commerce platform managing marketing attribution across multiple external vendors. The solution addresses challenges in data security, transparency, and accurate commission calculations while maintaining data privacy and regulatory compliance.
Business Context
Initial Challenges
- Distributed Data Sources
- Business Requirements
- Security Concerns
Solution Architecture
Data Source Integration
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ CRM Data │ │ Internal DB │ │ Marketing Data │
│ - Customer │ │ - Transactions │ │ - Campaigns │
│ - Interactions │ │ - Orders │ │ - Clicks │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────────────────────┐
│ Query Federation Layer │
│ (Unified data access with security policies) │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Security Enforcement Layer │
│ (Row-level security, column masking, encryption) │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Vendor Access Layer │
│ (Restricted views, audit logging, analytics) │
└────────────────────────────────────────────────────────┘
Security Implementation
1. Query Federation
- Cross-database Queries
-- Example federated query structure
SELECT
t.order_id,
t.transaction_amount,
m.campaign_id,
m.vendor_id
FROM transactions.orders t
JOIN marketing.attributions m
ON t.attribution_id = m.attribution_id
WHERE m.vendor_id = :current_vendor_id2. Column Restrictions
- Data Masking
Customer Data:
- Full Name → Masked
- Email → Hashed
- Phone → Last 4 digits only
- Address → City/Region onlyPolicy Enforcement Framework
1. Access Control Matrix
Data Category | Marketing Vendor | Internal Analyst | Admin |
|---|---|---|---|
Customer PII | Masked | Full | Full |
Transaction Amount | Own Attribution | Full | Full |
Campaign Details | Own Campaigns | Summary | Full |
Commission Data | Own Calculation | Full | Full |
2. Data Classification
- Highly Sensitive
- Sensitive
- Internal Use
Implementation Details
1. Vendor Authentication
Authentication Flow:
1. Vendor portal login
2. JWT token generation
3. Role and policy attachment
4. Access token validation
5. Query execution with context
2. Attribution Model
Attribution Logic:
1. Click tracking via vendor UTM
2. Purchase event capture
3. Attribution window check
4. Commission calculation
5. Vendor notification
3. Audit System
- Real-time access logging
- Query pattern analysis
- Policy violation alerts
- Compliance reporting
Vendor Access Implementation
2. Data Access Patterns
- Daily Reports
- Real-time Dashboard
Security Measures
1. Data Protection
- End-to-end encryption
- Data masking rules
- Access time restrictions
- IP whitelisting
2. Compliance Controls
- GDPR compliance
- Data retention policies
- Consent management
- Audit trails
Outcomes and Benefits
1. Business Impact
- 100% transparency in attribution
- Reduced commission disputes
- Improved vendor relationships
- Enhanced data security
2. Technical Achievements
- Zero data breaches
- System availability
- Optimized query performance
- Automated compliance reporting
3. Vendor Satisfaction
- Real-time access to relevant data
- Clear attribution visibility
- Automated commission calculations
- Self-service analytics
Best Practices and Lessons Learned
1. Implementation Guidelines
- Start with strict policies
- Regular security audits
- Automated testing
- Performance monitoring
2. Maintenance Procedures
- Policy review cycle
- Access recertification
- Performance optimization
- Security updates
Connect Hive Server to Databricks Unity Catalog
Architecture
Query Gateway Service
PostgreSQL
Protocol
Cache
Refresh
Translated
HiveQL
Results
PostgreSQL
Results
Query Translator
PostgreSQL-Compliant
Query Gateway
Metadata Cache
Databricks
SQL Client
External
Hive Metastore
Hive Server
Problem Statement
Databricks users need to query data stored in external Hive deployments, but Databricks lacks native support for external Hive metastores or competitor products. Organizations need a seamless way to access their existing Hive data warehouse without migrating to Databricks' internal metastore.
Solution
Guardium Query Gateway - A PostgreSQL-compliant query gateway that:
Presents itself as a PostgreSQL database to Databricks Translates incoming PostgreSQL queries to HiveQL Routes queries to appropriate Hive servers Returns results in PostgreSQL-compatible format
Technical Components
1. PostgreSQL Protocol Handler
Implements PostgreSQL wire protocol (version 3.0) Handles connection management and authentication Supports common PostgreSQL data types Implements required PostgreSQL system catalogs
2. Query Translation Engine
Parses incoming PostgreSQL queries using libpg_query Transforms PostgreSQL AST to HiveQL AST Handles SQL dialect differences:
Date/time function translations Window function mappings Aggregate function conversions Type casting rules
3. Metadata Management
Caches Hive metadata locally Maps Hive schemas to PostgreSQL catalogs Maintains statistics for query optimization Handles schema evolution
4. Query Execution
Manages Hive JDBC connections Implements connection pooling Handles query timeouts and cancellation Manages result set streaming