API Reference
This section provides detailed documentation for the AI Engineer Code Challenge API.
Core Modules
Document Ingestion Pipeline
Handles PDF text extraction, cleaning, chunking, and metadata storage. Supports multiple PDF engines and configurable chunking parameters.
- class src.ingest.ChunkMetadata(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)[source]
Bases: object
Metadata for a text chunk.
- __init__(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)
- class src.ingest.DocumentChunk(text, metadata)[source]
Bases: object
A chunk of text from a document with metadata.
- metadata: ChunkMetadata
- __init__(text, metadata)
- class src.ingest.PDFProcessor(engine='pymupdf')[source]
Bases: object
Handles PDF text extraction using different engines.
- __init__(engine='pymupdf')[source]
Initialize PDF processor.
- Parameters:
engine (str) – PDF processing engine ("pymupdf", "pdfminer", "pdfplumber")
- extract_text(pdf_path)[source]
Extract text from PDF with page numbers.
- Parameters:
pdf_path (Path) – Path to PDF file
- Returns:
List of (text, page_number) tuples
- Raises:
ValueError – If PDF engine is not supported
FileNotFoundError – If PDF file doesn't exist
- class src.ingest.TextCleaner(config)[source]
Bases: object
Handles text cleaning and normalization.
- class src.ingest.TextChunker(chunk_size=1000, chunk_overlap=200)[source]
Bases: object
Handles text chunking with sliding window.
- class src.ingest.DocumentIngester(config)[source]
Bases: object
Main class for document ingestion pipeline.
- ingest_documents(documents_path)[source]
Ingest all PDF documents from the given path.
- Parameters:
documents_path (Path) – Path to directory containing PDF files
- Returns:
List of all document chunks
- Raises:
ValueError – If documents_path doesn't exist or contains no PDFs
- src.ingest.ingest_documents(documents_path, config, args)[source]
Main function for document ingestion.
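A minimal usage sketch, assuming a configuration object has already been loaded (see the Configuration section below):
from pathlib import Path
from src.ingest import DocumentIngester

# Build chunks from every PDF under ./data/ (config loaded elsewhere).
ingester = DocumentIngester(config)
chunks = ingester.ingest_documents(Path("./data/"))
print(f"Ingested {len(chunks)} chunks")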
Embedding Pipeline
Handles vector embedding generation for document chunks and FAISS index management. Supports local embedding models and efficient similarity search.
- class src.embed.EmbeddingConfig(model_name, normalize_embeddings, device, similarity_threshold, top_k)[source]
Bases: object
Configuration for embedding generation.
- __init__(model_name, normalize_embeddings, device, similarity_threshold, top_k)
- class src.embed.EmbeddingModel(config)[source]
Bases: object
Handles embedding model loading and text embedding generation.
- __init__(config)[source]
Initialize embedding model.
- Parameters:
config (EmbeddingConfig) – Embedding configuration
- class src.embed.FAISSIndex(dimension, index_type='IndexFlatIP')[source]
Bases: object
Handles FAISS index creation and management.
- add_embeddings(embeddings, chunk_metadata)[source]
Add embeddings to the index.
- Parameters:
embeddings (ndarray) – numpy array of embeddings
chunk_metadata (list[ChunkMetadata]) – List of chunk metadata corresponding to embeddings
- get_chunk_by_index(index)[source]
Get chunk metadata by index.
- Parameters:
index (int) – Index in the metadata list
- Returns:
Chunk metadata or None if index is invalid
- class src.embed.EmbeddingPipeline(config)[source]
Bases: object
Main class for embedding generation and index management.
- create_embeddings_from_chunks(chunks)[source]
Create embeddings from document chunks and build FAISS index.
- Parameters:
chunks (list[DocumentChunk]) – List of document chunks
- src.embed.create_embeddings_from_chunks_file(chunks_file, config, output_path)[source]
Create embeddings from a chunks.json file.
- src.embed.load_embedding_pipeline(config, index_path)[source]
Load an embedding pipeline with existing index.
- Returns:
Loaded EmbeddingPipeline
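A minimal sketch of the two typical entry points, assuming config and the chunk list from the ingestion step above (the "./index" path is an assumption that mirrors the storage configuration shown later):
from src.embed import EmbeddingPipeline, load_embedding_pipeline

# Build a new FAISS index from ingested chunks.
pipeline = EmbeddingPipeline(config)
pipeline.create_embeddings_from_chunks(chunks)

# Or reload a previously persisted index instead of re-embedding.
pipeline = load_embedding_pipeline(config, index_path="./index")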
Query Engine
Handles query processing, similarity search, and chunk retrieval. Loads FAISS index and embedding model for efficient query processing.
- class src.query.QueryResult(query, chunks, similarities, total_chunks_searched, search_time_ms)[source]
Bases: object
Result of a query with relevant chunks and metadata.
- chunks: list[DocumentChunk]
- __init__(query, chunks, similarities, total_chunks_searched, search_time_ms)
- class src.query.QueryEngine(config, index_path=None)[source]
Bases: object
Main class for query processing and similarity search.
- search(query, top_k=None, similarity_threshold=None)[source]
Search for chunks similar to the query.
- class src.query.QueryProcessor(config, index_path=None)[source]
Bases: object
High-level query processor with additional functionality.
- process_query(query, top_k=None, similarity_threshold=None)[source]
Process a user query and return relevant chunks.
- format_results(result, include_metadata=True)[source]
Format query results as a readable string.
- Parameters:
result (QueryResult) – QueryResult to format
include_metadata (bool) – Whether to include chunk metadata
- Returns:
Formatted string representation of results
- get_relevant_context(result, max_chars=2000)[source]
Get relevant context from search results for LLM input.
- Parameters:
result (QueryResult) – QueryResult from search
max_chars (int) – Maximum characters to include
- Returns:
Formatted context string for LLM
- src.query.format_query_output(result, verbose=False)[source]
Format query results for output.
- Parameters:
result (QueryResult) – QueryResult to format
verbose (bool) – Whether to include detailed output
- Returns:
Formatted output string
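A minimal query sketch, assuming a loaded config; the index_path value is an assumption matching storage.index_dir in the configuration:
from src.query import QueryProcessor

processor = QueryProcessor(config, index_path="./index")
result = processor.process_query("What are the key features?", top_k=5, similarity_threshold=0.7)

# Human-readable results, then a trimmed context string for the LLM prompt.
print(processor.format_results(result, include_metadata=True))
context = processor.get_relevant_context(result, max_chars=2000)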
LLM Interface
Handles local LLM loading, prompt formatting, and answer generation. Supports multiple backends: transformers, llama-cpp, and OpenAI (optional).
- class src.llm.LLMConfig(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)[source]
Bases: object
Configuration for LLM settings.
- __init__(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)
- class src.llm.LLMResponse(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)[source]
Bases: object
Response from LLM with metadata.
- __init__(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)
- class src.llm.BaseLLM(config)[source]
Bases: object
Base class for LLM implementations.
- __init__(config)[source]
Initialize LLM with configuration.
- Parameters:
config (LLMConfig) – LLM configuration
- class src.llm.TransformersLLM(config)[source]
Bases: BaseLLM
LLM implementation using transformers library.
- class src.llm.LlamaCppLLM(config)[source]
Bases: BaseLLM
LLM implementation using llama-cpp-python.
- class src.llm.OpenAILLM(config)[source]
Bases: BaseLLM
LLM implementation using OpenAI API (optional).
- class src.llm.LLMInterface(config)[source]
Bases: object
Main interface for LLM operations.
- generate_answer(query, query_result)[source]
Generate answer from query and retrieved chunks.
- Parameters:
query (str) – User query
query_result (QueryResult) – QueryResult with retrieved chunks
- Returns:
LLMResponse with generated answer
- src.llm.create_llm_interface(config)[source]
Create LLM interface from configuration.
- Returns:
LLMInterface instance
- src.llm.generate_answer_from_query(query, query_result, config)[source]
Generate answer from query and query result.
- src.llm.format_llm_response(response, verbose=False)[source]
Format LLM response for output.
- Parameters:
response (LLMResponse) – LLMResponse to format
verbose (bool) – Whether to include metadata
- Returns:
Formatted output string
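A minimal answer-generation sketch, assuming config is loaded and result is a QueryResult from the query engine above:
from src.llm import create_llm_interface, format_llm_response

llm = create_llm_interface(config)
response = llm.generate_answer("What are the key features?", result)
print(format_llm_response(response, verbose=True))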
Utilities
Utility functions and logging configuration for the document-based question answering system.
- src.utils.setup_logging(log_level='INFO', log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')[source]
Set up centralized logging configuration.
- src.utils.log_performance(func)[source]
Decorator to log function performance metrics.
- Parameters:
func – Function to decorate
- Returns:
Decorated function
- src.utils.batch_process(items, batch_size, process_func, logger, description='Processing')[source]
Process items in batches with progress logging.
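A minimal sketch of the logging and batching helpers; the batch function and logger name are illustrative, and only the documented signatures are used:
import logging
from src.utils import setup_logging, log_performance, batch_process

setup_logging(log_level="DEBUG", log_file="pipeline.log")
logger = logging.getLogger(__name__)

@log_performance
def embed_batch(batch):
    # Placeholder work on one batch of items.
    return [len(item) for item in batch]

batch_process(["alpha", "beta", "gamma"], batch_size=2, process_func=embed_batch, logger=logger, description="Embedding")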
Configuration
The system uses YAML configuration files for all settings. Here’s the structure:
# PDF Processing Configuration
pdf:
  engine: "pymupdf"  # Options: "pymupdf", "pdfminer", "pdfplumber"
  chunk_size: 1000
  chunk_overlap: 200

# Embedding Configuration
embedding:
  model: "all-MiniLM-L6-v2"
  top_k: 5
  similarity_threshold: 0.7

# LLM Configuration
llm:
  backend: "llama-cpp"  # Options: "transformers", "llama-cpp", "openai"
  model_path: "./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
  temperature: 0.1
  max_tokens: 200
  top_p: 0.9
  repeat_penalty: 1.1
  context_window: 4096

# Storage Configuration
storage:
  index_dir: "./index"
  chunk_dir: "./index/chunks"

# System Configuration
system:
  log_level: "INFO"
  batch_size: 100
  max_workers: 4
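The Examples section below calls load_config("config.yaml"); that helper is not documented on this page, but reading the file amounts to a plain YAML load, roughly:
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)  # nested dict mirroring the structure above

print(config["llm"]["backend"])  # "llama-cpp"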
Command Line Interface
The main entry point provides a command-line interface:
# Ingest documents
python main.py --mode ingest --documents ./data/
# Query the system
python main.py --mode query --query "What are the key features?"
# With verbose output
python main.py --mode query --query "Your question" --verbose
# Override configuration
python main.py --mode query --query "Your question" --similarity-threshold 0.5
Available CLI Options:
--mode: Choose between 'ingest' or 'query'
--documents: Path to documents directory (for ingest mode)
--query: Your question (for query mode)
--verbose: Enable verbose output
--similarity-threshold: Override similarity threshold
--top-k: Override number of chunks to retrieve
--chunk-size: Override chunk size for ingestion
--chunk-overlap: Override chunk overlap for ingestion
--embedding-model: Override embedding model
--llm-backend: Override LLM backend
--llm-model: Override LLM model path
--temperature: Override LLM temperature
--max-tokens: Override LLM max tokens
Examples
Basic Usage
from src.ingest import ingest_documents
from src.query import process_query
from src.llm import generate_answer_from_query

# load_config and args are provided by main.py's CLI layer (not shown here).
config = load_config("config.yaml")

# Ingest documents
ingest_documents("./data/", config, args)

# Query the system
result = process_query("What is this about?", config, args)
answer = generate_answer_from_query("What is this about?", result, config)
Advanced Usage
from src.embed import EmbeddingPipeline
from src.query import QueryProcessor
from src.llm import LLMInterface
# Custom embedding pipeline (chunks come from the ingestion step)
embedding_pipeline = EmbeddingPipeline(config)
embedding_pipeline.create_embeddings_from_chunks(chunks)

# Custom query processing (index_path points at the persisted FAISS index)
query_processor = QueryProcessor(config, index_path)
result = query_processor.process_query("Your question", top_k=10, similarity_threshold=0.8)

# Custom LLM interface
llm_interface = LLMInterface(config)
response = llm_interface.generate_answer("Your question", result)
Error Handling
The system provides comprehensive error handling:
try:
    result = process_query("Your question", config, args)
except ValueError as e:
    print(f"Configuration error: {e}")
except FileNotFoundError as e:
    print(f"File not found: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Performance Optimization
For optimal performance:
Use appropriate chunk sizes: 1000-2000 characters work well for most documents
Adjust the similarity threshold: 0.7-0.8 provides a good balance
Batch processing: Use batch_size in the system config for large datasets
Model selection: Choose quantized models for faster inference (see the sketch after this list)
Hardware utilization: Use GPU if available for LLM inference
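As an illustration of the model-selection tip, an LLMConfig can be pointed at a quantized GGUF model directly; this is only a sketch, and the values mirror the YAML configuration shown above:
from src.llm import LLMConfig

# 4-bit quantized model served through the llama-cpp backend.
llm_config = LLMConfig(
    backend="llama-cpp",
    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.1,
    max_tokens=200,
    top_p=0.9,
    repeat_penalty=1.1,
    context_window=4096,
)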
Monitoring and Logging
The system provides comprehensive logging:
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
# Monitor performance
from src.utils import log_performance, log_memory_usage
@log_performance
def your_function():
    # Your code here
    pass
Testing
The system includes comprehensive tests:
# Run all tests
pytest tests/
# Run with coverage
pytest --cov=src tests/
# Run specific test categories
pytest -m unit tests/
pytest -m integration tests/