AI Engineer Code Challenge Documentation

Welcome to the AI Engineer Code Challenge documentation. This project implements a document-based question-answering system that can ingest PDF documents and answer questions using local language models.

Features

  • Local Operation: Works completely offline without internet connection

  • Multiple PDF Engines: Support for PyMuPDF, pdfminer, and pdfplumber

  • Configurable Chunking: Adjustable chunk size and overlap parameters

  • Vector Similarity Search: FAISS-based retrieval for efficient searching

  • Local LLM Integration: Support for transformers and llama-cpp backends

  • Modular Architecture: Clean separation of concerns with extensible components

  • Comprehensive Testing: Unit and integration tests with high coverage

  • Production Ready: Robust error handling and logging

Quick Start

# Install dependencies
pip install -r requirements.txt

# Ingest documents
python main.py --mode ingest --documents ./data/

# Ask a question
python main.py --mode query --query "What are the key features?"

Architecture

The system follows a modular RAG (Retrieval-Augmented Generation) architecture:

  1. Document Ingestion: PDF processing and text chunking

  2. Embedding Generation: Vector embeddings using sentence transformers

  3. Index Storage: FAISS vector database for efficient retrieval

  4. Query Processing: Semantic search and context retrieval

  5. Answer Generation: Local LLM for answer synthesis

Technology Stack

  • PDF Processing: PyMuPDF, pdfminer.six, pdfplumber

  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)

  • Vector Database: FAISS (IndexFlatIP)

  • Language Models: transformers, llama-cpp-python

  • Testing: pytest, pytest-cov

  • Code Quality: ruff, black, isort

  • Documentation: Sphinx, Read the Docs theme

Indices and tables