Nexaris - Wa-Sul AI Studio
A modern data lakehouse solution that provides an open, unified data processing platform for data lakes and data warehouses.
Introduction
Wa-Sul AI Studio is a module of iiDrak that offers AI/ML tools for creating and deploying machine learning models, running LLM and retrieval-augmented generation (RAG) tasks, and handling vision-based tasks such as optical character recognition (OCR). It lets users retrieve and process data, build custom machine learning workflows, and extract information from various data sources. Its salient features are:
- Component based architecture
- Plug-and-play interface
- Generation of optimised code internally for execution
- Download of the corresponding Jupyter notebook code for local development and enhancement
- Real-time testing
- Continuous enhancements and improvements to stay up to date with current trends
ML Experiment - Machine Learning (ML) Model Creation and Deployment
Overview
AI Studio allows users to train and deploy machine learning models for various applications such as regression, classification, and clustering. The platform provides an easy-to-use interface for selecting data sources, transforming datasets, and choosing the right machine learning algorithms.
Key Features
- Data Source Selection: Connect to external databases, query from data warehouses, or run custom SQL queries to fetch data.
- Data Transformation: Customize datasets by selecting, merging, or splitting columns, handling outliers, and performing train-test splits.
- Algorithm Selection: Choose from a variety of machine learning algorithms, including Decision Trees, Gradient Boosting, K-Means, Linear and Logistic Regression, and more.
- Training and Evaluation: Train the model using transformed data and evaluate its performance with built-in metrics such as accuracy and precision.
- Model Deployment: Once trained, models can be accessed via Experiments to view detailed metrics and artifacts for easy deployment.
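The built-in metrics named above, accuracy and precision, can be illustrated with a minimal sketch. This is not AI Studio's internal code; the function names and the sample labels are assumptions made for illustration, and only the Python standard library is used.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Of all samples predicted positive, the fraction that are truly positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    pred_pos = sum(1 for p in y_pred if p == positive)
    # Convention chosen here: return 0.0 when nothing was predicted positive.
    return true_pos / pred_pos if pred_pos else 0.0


# Hypothetical labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), precision(y_true, y_pred))  # 0.75 0.75
```

Accuracy counts every correct prediction, while precision looks only at the samples predicted as positive, so the two can diverge sharply on imbalanced data.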
Steps for ML Model Creation
- Create AI Studio: Clicking on “+ AI STUDIO” will open a panel where you can input the required information. Once completed, a studio will be created, and you will be redirected to a blank canvas.
- Select Data Source: After creating the studio, users can add a data source. There are three options available: external database, query, and warehouse. Users can simply drag and drop their selected data source onto the canvas.
By selecting the external database option, you will be prompted to enter details such as the database fields, connection credentials, and the type of database. If you choose the query option, you will need to write your SQL query. For the warehouse option, you can select from available warehouses.
If you don't have connection credentials, navigate to Settings -> Connectors Configurations and click the Configuration button. There you can choose a data source from the available options and enter the required credentials.
- Data Transformation: Customize the data by selecting relevant columns, splitting columns, dropping columns, merging columns, handling outliers, and performing other transformations as needed. Users can drag and drop this component onto the canvas to transform the data sourced from their selected data source.
- Algorithm Selection: Once data preparation is complete, choose the machine learning algorithm that best fits the use case. Upcoming releases will focus on recommending the best ML model and cleanup/transformation techniques to the user for better accuracy.
- Run AI Studio Flow: When you click the Run button, the AI model will be executed. Additionally, each component on the canvas can be run independently. To view logs for a specific component, simply click on the success or failure icons associated with it.
- Model Deployment: Access your trained model through the Experiments section to review its performance and deploy it for production use.
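Two of the transformations in the Data Transformation step, outlier handling and the train-test split, can be sketched in plain Python. The code that AI Studio generates is not shown in this document, so the IQR clipping rule, the function names, and the default parameters below are assumptions chosen for illustration.

```python
import random
import statistics


def clip_outliers(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (the common IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical column with one obvious outlier (95).
column = [10, 12, 11, 13, 12, 11, 95, 10, 12, 13]
cleaned = clip_outliers(column)
train, test = train_test_split(cleaned)
print(len(train), len(test))  # 8 2
```

Clipping keeps the row count unchanged, which matters when the column must stay aligned with the other columns; dropping outlier rows instead is an equally valid choice.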
Model Experiments
Access Trained Models: In the iiDrak AI section, you can explore experiments to review and manage your trained models effectively. Additionally, you have the option to download model details as a CSV file.
Model Overview: Users can view detailed information and the current status of their models, as well as register new models using the Register modal button.
Metrics and Artifacts: Examine key performance metrics and access artifacts such as model files for future use.
Example Workflow 1
Optical Character Recognition (OCR) with AI Studio
Overview
AI Studio's OCR capabilities integrate with cloud-based solutions such as AWS Textract or open-source models such as Tesseract, and with various data sources such as S3, Azure Blob Storage, and ABFS. Users can extract text and structured data from scanned documents, including tables and forms, and can define specific rules and categories for classifying documents. Extracted data is then stored in Iceberg tables for advanced querying and analytics.
Key Features
- Multi-Source Integration: Fetch files from sources like S3, Azure, ABFS, and other supported data platforms.
- Custom Document Classification: Users can define classification rules within the AI Studio. For example:
Classification Example: "Classify the document as follows: Toll Violation as 0, Government ID as 1, Medical Prescription as 2, Parking/Parking Violation as 3, and Speeding/SPEEDING as 4. Return only the digits as JSON output with the key 'id' for the number and the key 'type' for the mentioned type."
- Text and Data Extraction: OCR extracts plain text, tables, and form data, applying the user-defined classification.
- Data Storage in Iceberg Tables: Once classified, the extracted data is stored in Iceberg tables, providing a structured format for querying and analytics.
- Hands-Free Analytics: Users can query Iceberg tables or generate reports to analyze the extracted data using any filter or classification, simplifying data analysis and integration across platforms.
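The classification example above asks the model to reply with JSON containing an 'id' and a 'type' key. A minimal sketch of validating such a reply before storing it might look like the following. The function name, the normalised labels ("Parking Violation", "Speeding"), and the error handling are assumptions, not AI Studio's actual behaviour.

```python
import json

# Category ids taken from the classification example; label spelling is normalised here.
CATEGORIES = {
    0: "Toll Violation",
    1: "Government ID",
    2: "Medical Prescription",
    3: "Parking Violation",
    4: "Speeding",
}


def parse_classification(raw: str) -> dict:
    """Parse a model reply such as '{"id": 3, "type": "Parking"}'.

    Raises ValueError if the reply is not valid JSON or the id is unknown.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {raw!r}") from exc
    if not isinstance(data, dict) or "id" not in data:
        raise ValueError("reply must be a JSON object with an 'id' key")
    try:
        doc_id = int(data["id"])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"id is not an integer: {data['id']!r}") from exc
    if doc_id not in CATEGORIES:
        raise ValueError(f"unknown category id: {doc_id}")
    # Trust the id and take the label from the local mapping.
    return {"id": doc_id, "type": CATEGORIES[doc_id]}


print(parse_classification('{"id": 3, "type": "PARKING"}'))
```

Deriving the label from the id, rather than trusting the returned 'type' string, keeps stored values consistent even when the model varies its capitalisation.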
Steps for OCR
- Source Files: Connect to external sources such as S3, Azure, or ABFS to retrieve scanned documents.
- Custom Classification: Define custom classification rules for the documents in the AI Studio.
- Text and Data Extraction: Apply OCR to extract text, tables, and form data based on the user-provided classification.
- Data Storage in Iceberg Tables: Store the classified data in Iceberg tables for further querying and reporting.
- Analytics and Reporting: Query the Iceberg tables or generate reports with full support for custom filters and classifications.
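The last two steps, storing classified results and querying them by classification, can be sketched with an in-memory SQLite table. SQLite is used here only as a stand-in so the example runs anywhere; it is not Iceberg, and the table name, column names, and sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for an Iceberg table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ocr_documents ("
    "file_name TEXT, class_id INTEGER, class_type TEXT, extracted_text TEXT)"
)

# Hypothetical sample rows, as they might look after OCR and classification.
rows = [
    ("scan_001.pdf", 3, "Parking Violation", "Vehicle parked in a no-parking zone"),
    ("scan_002.pdf", 1, "Government ID", "Name, date of birth, ID number"),
    ("scan_003.pdf", 3, "Parking Violation", "Expired parking meter"),
]
conn.executemany("INSERT INTO ocr_documents VALUES (?, ?, ?, ?)", rows)


def count_by_type(connection):
    """Return (class_type, document_count) pairs sorted by type."""
    cur = connection.execute(
        "SELECT class_type, COUNT(*) FROM ocr_documents "
        "GROUP BY class_type ORDER BY class_type"
    )
    return cur.fetchall()


def files_of_class(connection, class_id):
    """Return the file names classified under the given id."""
    cur = connection.execute(
        "SELECT file_name FROM ocr_documents WHERE class_id = ? ORDER BY file_name",
        (class_id,),
    )
    return [name for (name,) in cur.fetchall()]


print(count_by_type(conn))
print(files_of_class(conn, 3))
```

Storing both the numeric id and the label lets reports filter on either, and the same SQL shape would apply to any engine that queries the real tables.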
Example Workflow 2
Retrieval-Augmented Generation (RAG) with AI Studio
Overview
Retrieval-Augmented Generation (RAG) allows users to connect to multiple data sources, retrieve relevant information, and enhance the results with advanced AI techniques such as embeddings. RAG helps users interact with large datasets, perform efficient searches, and gain actionable insights through natural language queries.
Key Features
- Data Source Integration: Connect to various data sources like external databases, cloud storage, or web-based data.
- Text Splitting & Embeddings: Fetch data, split it into relevant sections, and generate embeddings for easier querying.
- Vector Databases: Store data as vectors in databases to improve search efficiency and accuracy.
- Interactive Chat Interface: Chat with your data in real time, making data exploration and decision-making more intuitive.
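The text splitting feature can be illustrated with a small word-based splitter that produces overlapping chunks. The splitting strategy AI Studio uses is not documented here, so the function name, the word-based approach, and the default sizes are assumptions.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(split_text(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Larger overlaps improve the chance that relevant context stays together, at the cost of storing and embedding more text.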
Steps for RAG Flow
- Connect Data Sources: Set up connections to cloud storage or provide an external web URL.
- Text Embeddings: Transform the retrieved data into embeddings for easier querying.
- Query Data: Use AI-powered queries to retrieve and enhance relevant information from vast datasets.
- Gain Insights: Interact with data using the chat interface to simplify analysis and decision-making.
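The embedding and query steps above can be sketched as a toy retrieval loop. A real deployment would use a learned embedding model and a vector database; the word-count "embedding" below is only a stand-in so the example runs with the standard library, and all names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (not a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks, top_k: int = 2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks, as produced by an earlier text-splitting step.
chunks = [
    "Iceberg tables store the extracted OCR data for analytics.",
    "The chat interface lets users ask questions about their data.",
    "Gradient Boosting is one of the available algorithms.",
]
print(retrieve("Which algorithms are available?", chunks, top_k=1))
```

In a full RAG flow, the retrieved chunks would then be passed to an LLM as context for answering the user's question in the chat interface.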