Creating a Multimodal AI Agent with Llama 3.2

Overview

This app is a fork of Multimodal RAG that leverages the latest Llama-3.2-3B, a small language model (SLM), and Llama-3.2-11B-Vision, a vision language model (VLM) from Meta, to extract and index information from documents including text files, PDFs, PowerPoint presentations, and images. Users can then query the processed data through an interactive Streamlit chat interface.

The system utilizes LlamaIndex for orchestration and for efficient indexing and retrieval of information, LlamaIndex's Hugging Face integration for generating inference output from the Llama 3.2 VLM and SLM, NIM microservices for high-performance inference with Google DePlot, and Milvus as a vector database for efficient storage and retrieval of embedding vectors. This combination of technologies enables the application to handle complex multimodal data, perform advanced queries, and deliver rapid, context-aware responses to user inquiries.
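
As a rough sketch of how these pieces might be wired together with LlamaIndex (the real setup lives in initialize_settings() in app.py; the model IDs, embedding model, and dimensionality below are assumptions rather than values taken from the repository):

from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.nvidia import NVIDIAEmbedding  # assumed embedding backend
from llama_index.vector_stores.milvus import MilvusVectorStore

# Llama-3.2-3B served through LlamaIndex's Hugging Face integration (assumed model ID)
Settings.llm = HuggingFaceLLM(model_name="meta-llama/Llama-3.2-3B-Instruct")

# Embedding model served by NIM microservices (assumed model with 1024-dim output)
Settings.embed_model = NVIDIAEmbedding(model="nvidia/nv-embedqa-e5-v5")

# Milvus stores and retrieves the embedding vectors
vector_store = MilvusVectorStore(host="127.0.0.1", port=19530, dim=1024)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# documents = ...  # produced by document_processors.py
# index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)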

Support for serving the Llama 3.2 language and vision models through NIM microservices will be added to this reference app soon.

Features

  • Multi-format Document Processing: Handles text files, PDFs, PowerPoint presentations, and images.
  • Advanced Text Extraction: Extracts text from PDFs and PowerPoint slides, including tables and embedded images.
  • Image Analysis: Uses a VLM (Llama-3.2-11B-Vision) running on Hugging Face transformers to describe images, and Google's DePlot, served via NIM microservices, for processing graphs and charts (see the sketch after this list).
  • Vector Store Indexing: Creates a searchable index of processed documents using the Milvus vector store; the vectorstore/ folder is auto-generated on execution.
  • Interactive Chat Interface: Allows users to query the processed information through a chat-like interface.
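
The image-description step can be illustrated with a minimal Hugging Face transformers sketch like the one below; the model ID and prompt are assumptions, and the actual implementation in utils.py may differ:

import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Assumed model ID; the checkpoint is gated and requires Hugging Face access approval
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def describe_image(path):
    """Return a text description of the image at `path` for indexing."""
    image = Image.open(path)
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, skipping the prompt
    return processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)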

Setup

  1. Clone the repository:
git clone https://github.com/jayrodge/Multimodal-RAG-with-Llama-3.2.git
cd Multimodal-RAG-with-Llama-3.2/
  2. (Optional) Create a conda environment or a virtual environment:

    • Using conda:

      conda create --name multimodal-rag python=3.10
      conda activate multimodal-rag
      
    • Using venv:

      python -m venv venv
      source venv/bin/activate
      
      
  3. Install the required packages:

pip install -r requirements.txt
  4. Set up your NVIDIA API key as an environment variable or define it in initialize_settings() in app.py (a sanity-check snippet follows these steps):
export NVIDIA_API_KEY="your-api-key-here"

Generate the NVIDIA API key on build.nvidia.com

  5. Refer to this tutorial to install and start the GPU-accelerated Milvus container:
sudo docker compose up -d
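
Before launching the app, you can sanity-check the setup with a short snippet like the one below (a sketch assuming a local Milvus deployment on the default port; it is not part of the repository):

import os
from pymilvus import connections, utility

# The key exported in step 4 (or hard-coded in initialize_settings())
assert os.environ.get("NVIDIA_API_KEY"), "NVIDIA_API_KEY is not set"

# The GPU-accelerated Milvus container started in step 5
connections.connect(host="127.0.0.1", port="19530")
print("Milvus server version:", utility.get_server_version())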

Usage

  1. Ensure the Milvus container is running:
docker ps
  2. Run the Streamlit app:
streamlit run app.py
  3. Open the provided URL in your web browser.

  4. Choose between uploading files or specifying a directory path containing your documents.

  5. Process the files by clicking the "Process Files" or "Process Directory" button.

  6. Once processing is complete, use the chat interface to query your documents.

File Structure

  • app.py: Main Streamlit application
  • utils.py: Utility functions for image processing and API interactions
  • document_processors.py: Functions for processing various document types
  • requirements.txt: List of Python dependencies
  • vectorstore/: Directory holding the index data generated from PDFs and PPTs; created automatically

GPU Acceleration for Vector Search

To utilize GPU acceleration in the vector database, ensure that:

  1. Your system has a compatible NVIDIA GPU.
  2. You're using the GPU-enabled version of Milvus (as shown in the setup instructions).
  3. There are enough concurrent requests to justify GPU usage. GPU acceleration typically shows significant benefits under high load conditions.

Note that GPU acceleration delivers meaningful benefits only when the volume of incoming requests is high. For more detailed information on GPU indexing and search in Milvus, refer to the official Milvus GPU Index documentation.

To connect the GPU-accelerated Milvus with LlamaIndex, update the MilvusVectorStore configuration in app.py:

from llama_index.vector_stores.milvus import MilvusVectorStore  # import path for llama-index >= 0.10

vector_store = MilvusVectorStore(
    host="127.0.0.1",                       # Milvus instance started via docker compose
    port=19530,
    dim=1024,                               # must match the embedding model's output dimension
    collection_name="your_collection_name",
    gpu_id=0  # Specify the GPU ID to use
)
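
If you manage the collection directly with pymilvus instead, a GPU index type such as GPU_IVF_FLAT can be requested when building the index; the field name and index parameters below are assumptions for illustration:

from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("your_collection_name")
collection.create_index(
    field_name="embedding",            # assumed name of the vector field
    index_params={
        "index_type": "GPU_IVF_FLAT",  # GPU-accelerated IVF index
        "metric_type": "L2",
        "params": {"nlist": 1024},
    },
)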
