API Reference
This section contains the complete API reference for the Document-Based Question Answering System.
Core Modules
CLI Interface
The main entry point for the system is the main.py module, which provides a command-line interface for document ingestion and querying.
Document-based Question Answering System
A local, modular RAG (retrieval-augmented generation) system that processes PDF documents and enables natural language queries.
- Usage:
python main.py --mode ingest --documents ./data/
python main.py --mode query --query "What is Consult+ prediction for Tesla stock?"
- main.load_config(config_path='config.yaml')[source]
Load configuration from YAML file.
- Parameters:
config_path (str) – Path to the configuration file
- Returns:
Dictionary containing configuration parameters
- Raises:
FileNotFoundError – If config file doesn’t exist
yaml.YAMLError – If config file is malformed
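Example (a minimal sketch; the "chunk_size" key is an illustrative assumption, not part of the documented schema):

from main import load_config

config = load_config("config.yaml")  # raises FileNotFoundError if the file is missing,
                                     # yaml.YAMLError if it is malformed
print(config.get("chunk_size"))      # "chunk_size" is an assumed key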
- main.validate_ingest_args(args)[source]
Validate arguments for ingest mode.
- Parameters:
args (Namespace) – Parsed command line arguments
- Raises:
ValueError – If validation fails
- main.validate_query_args(args)[source]
Validate arguments for query mode.
- Parameters:
args (Namespace) – Parsed command line arguments
- Raises:
ValueError – If validation fails
Document Ingestion
The ingest module handles PDF document processing, text extraction, and chunking.
Document Ingestion Pipeline
Handles PDF text extraction, cleaning, chunking, and metadata storage. Supports multiple PDF engines and configurable chunking parameters.
- class src.ingest.ChunkMetadata(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)[source]
Bases: object
Metadata for a text chunk.
- __init__(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)
- class src.ingest.DocumentChunk(text, metadata)[source]
Bases: object
A chunk of text from a document with metadata.
- metadata: ChunkMetadata
- __init__(text, metadata)
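Example constructing both data classes (field values are illustrative; keyword arguments are assumed to be accepted, as with a dataclass):

from src.ingest import ChunkMetadata, DocumentChunk

meta = ChunkMetadata(file_name="report.pdf", page_number=3, chunk_index=0,
                     chunk_start=0, chunk_end=1000, chunk_size=1000,
                     text_length=1000)
chunk = DocumentChunk(text="Extracted page text...", metadata=meta)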
- class src.ingest.PDFProcessor(engine='pymupdf')[source]
Bases: object
Handles PDF text extraction using different engines.
- __init__(engine='pymupdf')[source]
Initialize PDF processor.
- Parameters:
engine (str) – PDF processing engine ("pymupdf", "pdfminer", "pdfplumber")
- extract_text(pdf_path)[source]
Extract text from PDF with page numbers.
- Parameters:
pdf_path (Path) – Path to PDF file
- Returns:
List of (text, page_number) tuples
- Raises:
ValueError – If PDF engine is not supported
FileNotFoundError – If PDF file doesn’t exist
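Example (the file path is illustrative):

from pathlib import Path
from src.ingest import PDFProcessor

processor = PDFProcessor(engine="pymupdf")
for text, page_number in processor.extract_text(Path("./data/report.pdf")):
    print(f"page {page_number}: {len(text)} characters")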
- class src.ingest.TextCleaner(config)[source]
Bases: object
Handles text cleaning and normalization.
- class src.ingest.TextChunker(chunk_size=1000, chunk_overlap=200)[source]
Bases: object
Handles text chunking with sliding window.
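TextChunker's methods are not documented above, so the following standalone sketch only illustrates the sliding-window arithmetic implied by the defaults: with chunk_size=1000 and chunk_overlap=200, each chunk starts 800 characters after the previous one and consecutive chunks share 200 characters of context.

def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    # Each window advances by chunk_size - chunk_overlap characters.
    step = chunk_size - chunk_overlap
    for start in range(0, max(len(text) - chunk_overlap, 1), step):
        yield text[start:start + chunk_size]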
- class src.ingest.DocumentIngester(config)[source]
Bases: object
Main class for document ingestion pipeline.
- ingest_documents(documents_path)[source]
Ingest all PDF documents from the given path.
- Parameters:
documents_path (Path) – Path to directory containing PDF files
- Returns:
List of all document chunks
- Raises:
ValueError – If documents_path doesn’t exist or contains no PDFs
- src.ingest.ingest_documents(documents_path, config, args)[source]
Main function for document ingestion.
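End-to-end ingestion sketch (the config keys mirror the constructor defaults above, but the exact schema is an assumption):

from pathlib import Path
from src.ingest import DocumentIngester

config = {"engine": "pymupdf", "chunk_size": 1000, "chunk_overlap": 200}
ingester = DocumentIngester(config)
chunks = ingester.ingest_documents(Path("./data/"))  # raises ValueError if no PDFs are found
print(f"{len(chunks)} chunks produced")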
Embedding Generation
The embed module manages embedding model loading, vector generation, and FAISS indexing.
Embedding Pipeline
Handles vector embedding generation for document chunks and FAISS index management. Supports local embedding models and efficient similarity search.
- class src.embed.EmbeddingConfig(model_name, normalize_embeddings, device, similarity_threshold, top_k)[source]
Bases: object
Configuration for embedding generation.
- __init__(model_name, normalize_embeddings, device, similarity_threshold, top_k)
- class src.embed.EmbeddingModel(config)[source]
Bases: object
Handles embedding model loading and text embedding generation.
- __init__(config)[source]
Initialize embedding model.
- Parameters:
config (EmbeddingConfig) – Embedding configuration
- class src.embed.FAISSIndex(dimension, index_type='IndexFlatIP')[source]
Bases: object
Handles FAISS index creation and management.
- add_embeddings(embeddings, chunk_metadata)[source]
Add embeddings to the index.
- Parameters:
embeddings (ndarray) – numpy array of embeddings
chunk_metadata (list[ChunkMetadata]) – List of chunk metadata corresponding to embeddings
- get_chunk_by_index(index)[source]
Get chunk metadata by index.
- Parameters:
index (int) – Index in the metadata list
- Returns:
Chunk metadata or None if index is invalid
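Index-construction sketch (the dimension of 384, the random vectors, and the metadata values are illustrative):

import numpy as np
from src.embed import FAISSIndex
from src.ingest import ChunkMetadata

embeddings = np.random.rand(10, 384).astype("float32")  # stand-in vectors
chunk_metadata = [
    ChunkMetadata(file_name="report.pdf", page_number=1, chunk_index=i,
                  chunk_start=i * 800, chunk_end=i * 800 + 1000,
                  chunk_size=1000, text_length=1000)
    for i in range(10)
]

index = FAISSIndex(dimension=384, index_type="IndexFlatIP")
index.add_embeddings(embeddings, chunk_metadata)  # one metadata entry per row
meta = index.get_chunk_by_index(0)                # None if the index is invalid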
- class src.embed.EmbeddingPipeline(config)[source]
Bases: object
Main class for embedding generation and index management.
- create_embeddings_from_chunks(chunks)[source]
Create embeddings from document chunks and build FAISS index.
- Parameters:
chunks (list[DocumentChunk]) – List of document chunks
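Pipeline-level sketch; the model name is illustrative, and it is an assumption that EmbeddingPipeline accepts an EmbeddingConfig and DocumentIngester a plain dict:

from pathlib import Path
from src.ingest import DocumentIngester
from src.embed import EmbeddingConfig, EmbeddingPipeline

config = EmbeddingConfig(model_name="all-MiniLM-L6-v2", normalize_embeddings=True,
                         device="cpu", similarity_threshold=0.3, top_k=5)
chunks = DocumentIngester({"engine": "pymupdf"}).ingest_documents(Path("./data/"))
pipeline = EmbeddingPipeline(config)
pipeline.create_embeddings_from_chunks(chunks)  # builds the FAISS index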
Query Processing
The query module handles query embedding, similarity search, and result ranking.
Query Engine
Handles query processing, similarity search, and chunk retrieval. Loads FAISS index and embedding model for efficient query processing.
- class src.query.QueryResult(query, chunks, similarities, total_chunks_searched, search_time_ms)[source]
Bases: object
Result of a query with relevant chunks and metadata.
- chunks: list[DocumentChunk]
- __init__(query, chunks, similarities, total_chunks_searched, search_time_ms)
- class src.query.QueryEngine(config, index_path=None)[source]
Bases: object
Main class for query processing and similarity search.
- search(query, top_k=None, similarity_threshold=None)[source]
Search for chunks similar to the query.
- class src.query.QueryProcessor(config, index_path=None)[source]
Bases: object
High-level query processor with additional functionality.
- process_query(query, top_k=None, similarity_threshold=None)[source]
Process a user query and return relevant chunks.
- format_results(result, include_metadata=True)[source]
Format query results as a readable string.
- Parameters:
result (QueryResult) – QueryResult to format
include_metadata (bool) – Whether to include chunk metadata
- Returns:
Formatted string representation of results
- get_relevant_context(result, max_chars=2000)[source]
Get relevant context from search results for LLM input.
- Parameters:
result (QueryResult) – QueryResult from search
max_chars (int) – Maximum characters to include
- Returns:
Formatted context string for LLM
- src.query.format_query_output(result, verbose=False)[source]
Format query results for output.
- Parameters:
result (QueryResult) – QueryResult to format
verbose (bool) – Whether to include detailed output
- Returns:
Formatted output string
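Query-side sketch (the index path and threshold values are illustrative):

from main import load_config
from src.query import QueryProcessor, format_query_output

config = load_config("config.yaml")
processor = QueryProcessor(config, index_path="./index")  # path is an assumption
result = processor.process_query("What is Consult+ prediction for Tesla stock?",
                                 top_k=5, similarity_threshold=0.3)
print(processor.format_results(result, include_metadata=True))
context = processor.get_relevant_context(result, max_chars=2000)
print(format_query_output(result, verbose=True))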
LLM Interface
The llm module provides interfaces for different LLM backends and answer generation.
LLM Interface
Handles local LLM loading, prompt formatting, and answer generation. Supports multiple backends: transformers, llama-cpp, and OpenAI (optional).
- class src.llm.LLMConfig(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)[source]
Bases: object
Configuration for LLM settings.
- __init__(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)
- class src.llm.LLMResponse(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)[source]
Bases: object
Response from LLM with metadata.
- __init__(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)
- class src.llm.BaseLLM(config)[source]
Bases: object
Base class for LLM implementations.
- __init__(config)[source]
Initialize LLM with configuration.
- Parameters:
config (LLMConfig) – LLM configuration
- class src.llm.TransformersLLM(config)[source]
Bases: BaseLLM
LLM implementation using transformers library.
- class src.llm.LlamaCppLLM(config)[source]
Bases: BaseLLM
LLM implementation using llama-cpp-python.
- class src.llm.OpenAILLM(config)[source]
Bases: BaseLLM
LLM implementation using OpenAI API (optional).
- class src.llm.LLMInterface(config)[source]
Bases: object
Main interface for LLM operations.
- generate_answer(query, query_result)[source]
Generate answer from query and retrieved chunks.
- Parameters:
query (str) – User query
query_result (QueryResult) – QueryResult with retrieved chunks
- Returns:
LLMResponse with generated answer
- src.llm.create_llm_interface(config)[source]
Create LLM interface from configuration.
- Parameters:
config – System configuration
- Returns:
LLMInterface instance
- src.llm.generate_answer_from_query(query, query_result, config)[source]
Generate answer from query and query result.
- src.llm.format_llm_response(response, verbose=False)[source]
Format LLM response for output.
- Parameters:
response (LLMResponse) – LLMResponse to format
verbose (bool) – Whether to include metadata
- Returns:
Formatted output string
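Answer-generation sketch, chaining the query and LLM layers (paths and values, as in the examples above, are illustrative):

from main import load_config
from src.query import QueryProcessor
from src.llm import create_llm_interface, format_llm_response

config = load_config("config.yaml")
query = "What is Consult+ prediction for Tesla stock?"
query_result = QueryProcessor(config).process_query(query)
llm = create_llm_interface(config)
response = llm.generate_answer(query, query_result)
print(format_llm_response(response, verbose=True))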
Utilities
The utils module contains utility functions for logging, performance monitoring, and system information.
Utility functions and logging configuration for the document-based question answering system.
- src.utils.setup_logging(log_level='INFO', log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')[source]
Set up centralized logging configuration.
- src.utils.log_performance(func)[source]
Decorator to log function performance metrics.
- Parameters:
func – Function to decorate
- Returns:
Decorated function
- src.utils.batch_process(items, batch_size, process_func, logger, description='Processing')[source]
Process items in batches with progress logging.
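Logging sketch; whether setup_logging returns a logger is not documented above, so this assumes it configures the standard logging module in place:

import logging
from src.utils import setup_logging, log_performance

setup_logging(log_level="DEBUG", log_file="run.log")  # log file name is illustrative

@log_performance
def embed_batch(texts):
    # Stand-in workload; the decorator logs its runtime.
    return [t.lower() for t in texts]

embed_batch(["alpha", "beta"])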
Data Structures
Core data structures used throughout the system:
- src.ingest.ChunkMetadata – Metadata for a text chunk
- src.ingest.DocumentChunk – A chunk of text from a document with metadata
- src.embed.EmbeddingConfig – Configuration for embedding generation
- src.query.QueryResult – Result of a query with relevant chunks and metadata
- src.llm.LLMConfig – Configuration for LLM settings
- src.llm.LLMResponse – Response from LLM with metadata
Full documentation for each class appears in its module section above.
Configuration
The system uses YAML configuration files. See ../configuration for detailed configuration options.
Error Handling
The system provides comprehensive error handling for the following scenarios (a defensive calling pattern is sketched after this list):
- FileNotFoundError: When documents or models are not found
- ValueError: When configuration is invalid
- RuntimeError: When models fail to load or process
- MemoryError: When the system runs out of memory
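A defensive calling pattern, assuming ingestion is the failure-prone step and reusing the assumed config schema from the ingestion example:

from pathlib import Path
from src.ingest import DocumentIngester

ingester = DocumentIngester({"engine": "pymupdf"})
try:
    chunks = ingester.ingest_documents(Path("./data/"))
except FileNotFoundError as exc:
    print(f"documents or models not found: {exc}")
except ValueError as exc:
    print(f"invalid configuration or no PDFs: {exc}")
except (RuntimeError, MemoryError) as exc:
    print(f"model or memory failure: {exc}")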
Performance Considerations
- Memory Usage: Monitor memory usage with log_memory_usage()
- Batch Processing: Use batch processing for large datasets
- Caching: Enable caching for frequently accessed data
- Optimization: Use optimize_memory() for memory cleanup
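Sketch combining these utilities; batch_process's signature appears in the Utilities section, but log_memory_usage() and optimize_memory() are only referenced here, so their zero-argument form and the per-batch process_func semantics are assumptions:

import logging
from src.utils import batch_process, log_memory_usage, optimize_memory

logger = logging.getLogger(__name__)
texts = ["alpha", "beta", "gamma"]  # illustrative workload

log_memory_usage()  # assumed zero-argument
results = batch_process(items=texts, batch_size=2,
                        process_func=lambda batch: [len(t) for t in batch],
                        logger=logger, description="Measuring")
optimize_memory()   # assumed zero-argument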
Examples
See ../user_guide/examples for practical usage examples.