AI Engineer Code Challenge Documentation

Welcome to the AI Engineer Code Challenge documentation. This project implements a document-based question-answering system that can ingest PDF documents and answer questions using local language models.

Contents:

Features

Local Operation: Works completely offline without internet connection
Multiple PDF Engines: Support for PyMuPDF, pdfminer, and pdfplumber
Configurable Chunking: Adjustable chunk size and overlap parameters
Vector Similarity Search: FAISS-based retrieval for efficient searching
Local LLM Integration: Support for transformers and llama-cpp backends
Modular Architecture: Clean separation of concerns with extensible components
Comprehensive Testing: Unit and integration tests with high coverage
Production Ready: Robust error handling and logging

Quick Start

# Install dependencies
pip install -r requirements.txt

# Ingest documents
python main.py --mode ingest --documents ./data/

# Ask a question
python main.py --mode query --query "What are the key features?"

Architecture

The system follows a modular RAG (Retrieval-Augmented Generation) architecture:

Document Ingestion: PDF processing and text chunking
Embedding Generation: Vector embeddings using sentence transformers
Index Storage: FAISS vector database for efficient retrieval
Query Processing: Semantic search and context retrieval
Answer Generation: Local LLM for answer synthesis

Technology Stack

PDF Processing: PyMuPDF, pdfminer.six, pdfplumber
Embeddings: sentence-transformers (all-MiniLM-L6-v2)
Vector Database: FAISS (IndexFlatIP)
Language Models: transformers, llama-cpp-python
Testing: pytest, pytest-cov
Code Quality: ruff, black, isort
Documentation: Sphinx, Read the Docs theme

AI Engineer Code Challenge Documentation

Features

Quick Start

Architecture

Technology Stack

Indices and tables