API Reference

This section contains the complete API reference for the Document-Based Question Answering System.

Core Modules

CLI Interface

The main entry point for the system is the main.py module, which provides a command-line interface for document ingestion and querying.

Document-based Question Answering System

A local, modular RAG (retrieval-augmented generation) system that processes PDF documents and enables natural language queries.

Usage:

python main.py --mode ingest --documents ./data/
python main.py --mode query --query "What is Consult+ prediction for Tesla stock?"

main.load_config(config_path='config.yaml')[source]

Load configuration from YAML file.

Parameters:

config_path (str) – Path to the configuration file

Return type:

dict

Returns:

Dictionary containing configuration parameters

Raises:
  • FileNotFoundError – If config file doesn’t exist

  • yaml.YAMLError – If config file is malformed
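
A minimal usage sketch of the documented behavior (config.yaml is the default path from the signature):

from main import load_config

try:
    config = load_config("config.yaml")
except FileNotFoundError:
    raise SystemExit("config.yaml not found")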

main.validate_ingest_args(args)[source]

Validate arguments for ingest mode.

Parameters:

args (Namespace) – Parsed command line arguments

Raises:

ValueError – If validation fails

Return type:

None

main.validate_query_args(args)[source]

Validate arguments for query mode.

Parameters:

args (Namespace) – Parsed command line arguments

Raises:

ValueError – If validation fails

Return type:

None

main.merge_config_with_args(config, args)[source]

Merge CLI arguments with configuration; CLI arguments take precedence.

Parameters:
  • config (Dict[str, Any]) – Configuration dictionary

  • args (Namespace) – Parsed command line arguments

Return type:

Dict[str, Any]

Returns:

Updated configuration dictionary
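
A sketch of the precedence behavior. The --chunk-size flag and the chunk_size config key below are hypothetical; the real flag and key names come from main.py and config.yaml:

import argparse
from main import load_config, merge_config_with_args

parser = argparse.ArgumentParser()
parser.add_argument("--chunk-size", type=int)      # hypothetical flag
args = parser.parse_args(["--chunk-size", "500"])

config = load_config("config.yaml")                # e.g. {"chunk_size": 1000, ...}
merged = merge_config_with_args(config, args)      # the CLI value (500) should win here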

main.main()[source]

Main entry point for the document-based question answering system.

Handles command line argument parsing and routes to appropriate functionality based on the selected mode.

Return type:

None

Document Ingestion

The ingest module handles PDF document processing, text extraction, and chunking.

Document Ingestion Pipeline

Handles PDF text extraction, cleaning, chunking, and metadata storage. Supports multiple PDF engines and configurable chunking parameters.

class src.ingest.ChunkMetadata(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)[source]

Bases: object

Metadata for a text chunk.

file_name: str
page_number: int
chunk_index: int
chunk_start: int
chunk_end: int
chunk_size: int
text_length: int
__init__(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)
class src.ingest.DocumentChunk(text, metadata)[source]

Bases: object

A chunk of text from a document with metadata.

text: str
metadata: ChunkMetadata
__init__(text, metadata)
class src.ingest.PDFProcessor(engine='pymupdf')[source]

Bases: object

Handles PDF text extraction using different engines.

__init__(engine='pymupdf')[source]

Initialize PDF processor.

Parameters:

engine (str) – PDF processing engine (“pymupdf”, “pdfminer”, “pdfplumber”)

extract_text(pdf_path)[source]

Extract text from PDF with page numbers.

Parameters:

pdf_path (Path) – Path to PDF file

Return type:

list[tuple[str, int]]

Returns:

List of (text, page_number) tuples
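
A short sketch of the extraction API (the PDF path is a placeholder):

from pathlib import Path
from src.ingest import PDFProcessor

processor = PDFProcessor(engine="pymupdf")
for text, page_number in processor.extract_text(Path("./data/report.pdf")):
    print(f"page {page_number}: {len(text)} characters")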

class src.ingest.TextCleaner(config)[source]

Bases: object

Handles text cleaning and normalization.

__init__(config)[source]

Initialize text cleaner.

Parameters:

config (dict[str, Any]) – Configuration dictionary with cleaning parameters

clean_text(text)[source]

Clean and normalize text.

Parameters:

text (str) – Raw text to clean

Return type:

str

Returns:

Cleaned text

class src.ingest.TextChunker(chunk_size=1000, chunk_overlap=200)[source]

Bases: object

Handles text chunking with sliding window.

__init__(chunk_size=1000, chunk_overlap=200)[source]

Initialize text chunker.

Parameters:
  • chunk_size (int) – Size of each chunk in characters

  • chunk_overlap (int) – Overlap between chunks in characters

chunk_text(text, file_name, page_number)[source]

Split text into overlapping chunks.

Parameters:
  • text (str) – Text to chunk

  • file_name (str) – Name of the source file

  • page_number (int) – Page number

Return type:

list[DocumentChunk]

Returns:

List of DocumentChunk objects
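
For example, with the defaults (1000-character chunks, 200-character overlap), consecutive chunks share 200 characters of boundary text. A sketch with placeholder input:

from src.ingest import TextChunker

chunker = TextChunker(chunk_size=1000, chunk_overlap=200)
chunks = chunker.chunk_text(text="Lorem ipsum. " * 300, file_name="report.pdf", page_number=1)
for chunk in chunks:
    meta = chunk.metadata
    print(meta.chunk_index, meta.chunk_start, meta.chunk_end)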

class src.ingest.DocumentIngester(config)[source]

Bases: object

Main class for document ingestion pipeline.

__init__(config)[source]

Initialize document ingester.

Parameters:

config (dict[str, Any]) – Configuration dictionary

ingest_documents(documents_path)[source]

Ingest all PDF documents from the given path.

Parameters:

documents_path (Path) – Path to directory containing PDF files

Return type:

list[DocumentChunk]

Returns:

List of all document chunks

Raises:

ValueError – If documents_path doesn’t exist or contains no PDFs

save_chunks(chunks, output_path)[source]

Save chunks and metadata to disk.

Parameters:
  • chunks (list[DocumentChunk]) – List of document chunks to save

  • output_path (Path) – Destination path for the chunks and metadata

Return type:

None
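
Putting the ingestion pipeline together (paths are placeholders; the config comes from main.load_config):

from pathlib import Path
from main import load_config
from src.ingest import DocumentIngester

config = load_config("config.yaml")
ingester = DocumentIngester(config)
chunks = ingester.ingest_documents(Path("./data"))   # raises ValueError if no PDFs found
ingester.save_chunks(chunks, Path("./index/chunks"))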

src.ingest.ingest_documents(documents_path, config, args)[source]

Main function for document ingestion.

Parameters:
  • documents_path (str) – Path to documents directory

  • config (dict[str, Any]) – Configuration dictionary

  • args (Any) – Command line arguments

Return type:

None

Embedding Generation

The embed module manages embedding model loading, vector generation, and FAISS indexing.

Embedding Pipeline

Handles vector embedding generation for document chunks and FAISS index management. Supports local embedding models and efficient similarity search.

class src.embed.EmbeddingConfig(model_name, normalize_embeddings, device, similarity_threshold, top_k)[source]

Bases: object

Configuration for embedding generation.

model_name: str
normalize_embeddings: bool
device: str
similarity_threshold: float
top_k: int
__init__(model_name, normalize_embeddings, device, similarity_threshold, top_k)
class src.embed.EmbeddingModel(config)[source]

Bases: object

Handles embedding model loading and text embedding generation.

__init__(config)[source]

Initialize embedding model.

Parameters:

config (EmbeddingConfig) – Embedding configuration

generate_embeddings(texts)[source]

Generate embeddings for a list of texts.

Parameters:

texts (list[str]) – List of text strings to embed

Return type:

ndarray

Returns:

numpy array of embeddings

generate_single_embedding(text)[source]

Generate embedding for a single text.

Parameters:

text (str) – Text string to embed

Return type:

ndarray

Returns:

numpy array of embedding

class src.embed.FAISSIndex(dimension, index_type='IndexFlatIP')[source]

Bases: object

Handles FAISS index creation and management.

__init__(dimension, index_type='IndexFlatIP')[source]

Initialize FAISS index.

Parameters:
  • dimension (int) – Dimension of embeddings

  • index_type (str) – Type of FAISS index to use

add_embeddings(embeddings, chunk_metadata)[source]

Add embeddings to the index.

Parameters:
  • embeddings (ndarray) – numpy array of embeddings

  • chunk_metadata (list[ChunkMetadata]) – List of chunk metadata corresponding to embeddings

Return type:

None

search(query_embedding, k)[source]

Search for similar embeddings.

Parameters:
  • query_embedding (ndarray) – Query embedding

  • k (int) – Number of results to return

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (distances, indices)
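
A sketch of index construction and search. The 384-dimensional random embeddings and toy metadata are stand-ins; the real dimension depends on the embedding model:

import numpy as np
from src.embed import FAISSIndex
from src.ingest import ChunkMetadata

index = FAISSIndex(dimension=384, index_type="IndexFlatIP")
embeddings = np.random.rand(10, 384).astype("float32")
metadata = [ChunkMetadata("report.pdf", 1, i, 0, 100, 100, 100) for i in range(10)]
index.add_embeddings(embeddings, metadata)
distances, indices = index.search(embeddings[0], k=3)   # query shape handling is assumed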

get_chunk_by_index(index)[source]

Get chunk metadata by index.

Parameters:

index (int) – Index in the metadata list

Return type:

ChunkMetadata | None

Returns:

Chunk metadata or None if index is invalid

get_total_embeddings()[source]

Get total number of embeddings in index.

Return type:

int

save_index(index_path)[source]

Save FAISS index and metadata to disk.

Parameters:

index_path (Path) – Path to save index

Return type:

None

load_index(index_path)[source]

Load FAISS index and metadata from disk.

Parameters:

index_path (Path) – Path to load index from

Return type:

None

class src.embed.EmbeddingPipeline(config)[source]

Bases: object

Main class for embedding generation and index management.

__init__(config)[source]

Initialize embedding pipeline.

Parameters:

config (dict[str, Any]) – Configuration dictionary

create_embeddings_from_chunks(chunks)[source]

Create embeddings from document chunks and build FAISS index.

Parameters:

chunks (list[DocumentChunk]) – List of document chunks

Return type:

None

save_index(index_path)[source]

Save the FAISS index and metadata.

Parameters:

index_path (Path) – Path to save index

Return type:

None

load_index(index_path)[source]

Load the FAISS index and metadata.

Parameters:

index_path (Path) – Path to load index from

Return type:

None

search_similar_chunks(query, top_k=None)[source]

Search for chunks similar to the query.

Parameters:
  • query (str) – Query text

  • top_k (int | None) – Number of results to return (uses config default if None)

Return type:

list[tuple[DocumentChunk, float]]

Returns:

List of (chunk, similarity_score) tuples
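
A sketch of the high-level flow, reusing the ingestion steps above (paths and query text are placeholders):

from pathlib import Path
from main import load_config
from src.ingest import DocumentIngester
from src.embed import EmbeddingPipeline

config = load_config("config.yaml")
chunks = DocumentIngester(config).ingest_documents(Path("./data"))

pipeline = EmbeddingPipeline(config)
pipeline.create_embeddings_from_chunks(chunks)
pipeline.save_index(Path("./index"))

for chunk, score in pipeline.search_similar_chunks("Tesla stock outlook", top_k=3):
    print(f"{score:.3f}  {chunk.metadata.file_name} p.{chunk.metadata.page_number}")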

get_index_stats()[source]

Get statistics about the index.

Return type:

dict[str, Any]

Returns:

Dictionary with index statistics

src.embed.create_embeddings_from_chunks_file(chunks_file, config, output_path)[source]

Create embeddings from a chunks.json file.

Parameters:
  • chunks_file (Path) – Path to chunks.json file

  • config (dict[str, Any]) – Configuration dictionary

  • output_path (Path) – Path to save index

Return type:

None

src.embed.load_embedding_pipeline(config, index_path)[source]

Load an embedding pipeline with existing index.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path) – Path to index directory

Return type:

EmbeddingPipeline

Returns:

Loaded EmbeddingPipeline
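
To reuse a previously saved index (the index directory is a placeholder):

from pathlib import Path
from main import load_config
from src.embed import load_embedding_pipeline

config = load_config("config.yaml")
pipeline = load_embedding_pipeline(config, Path("./index"))
results = pipeline.search_similar_chunks("revenue forecast")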

Query Processing

The query module handles query embedding, similarity search, and result ranking.

Query Engine

Handles query processing, similarity search, and chunk retrieval. Loads FAISS index and embedding model for efficient query processing.

class src.query.QueryResult(query, chunks, similarities, total_chunks_searched, search_time_ms)[source]

Bases: object

Result of a query with relevant chunks and metadata.

query: str
chunks: list[DocumentChunk]
similarities: list[float]
total_chunks_searched: int
search_time_ms: float
__init__(query, chunks, similarities, total_chunks_searched, search_time_ms)
class src.query.QueryEngine(config, index_path=None)[source]

Bases: object

Main class for query processing and similarity search.

__init__(config, index_path=None)[source]

Initialize query engine.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path | None) – Path to FAISS index (if None, will use config default)

search(query, top_k=None, similarity_threshold=None)[source]

Search for chunks similar to the query.

Parameters:
  • query (str) – User query text

  • top_k (int | None) – Number of results to return (uses config default if None)

  • similarity_threshold (float | None) – Minimum similarity score (uses config default if None)

Return type:

QueryResult

Returns:

QueryResult with relevant chunks and metadata

get_index_stats()[source]

Get statistics about the loaded index.

Return type:

dict[str, Any]

Returns:

Dictionary with index statistics

validate_index()[source]

Validate that the index is properly loaded and functional.

Return type:

bool

Returns:

True if index is valid, False otherwise

class src.query.QueryProcessor(config, index_path=None)[source]

Bases: object

High-level query processor with additional functionality.

__init__(config, index_path=None)[source]

Initialize query processor.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path | None) – Path to FAISS index

process_query(query, top_k=None, similarity_threshold=None)[source]

Process a user query and return relevant chunks.

Parameters:
  • query (str) – User query text

  • top_k (int | None) – Number of results to return

  • similarity_threshold (float | None) – Minimum similarity score

Return type:

QueryResult

Returns:

QueryResult with relevant chunks and metadata

format_results(result, include_metadata=True)[source]

Format query results as a readable string.

Parameters:
  • result (QueryResult) – QueryResult to format

  • include_metadata (bool) – Whether to include chunk metadata

Return type:

str

Returns:

Formatted string representation of results

get_relevant_context(result, max_chars=2000)[source]

Get relevant context from search results for LLM input.

Parameters:
  • result (QueryResult) – QueryResult from search

  • max_chars (int) – Maximum characters to include

Return type:

str

Returns:

Formatted context string for LLM
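
A sketch of the query path (query text is illustrative; a None index path falls back to the config default):

from main import load_config
from src.query import QueryProcessor

config = load_config("config.yaml")
processor = QueryProcessor(config)
result = processor.process_query("What is Consult+ prediction for Tesla stock?", top_k=5)
print(processor.format_results(result, include_metadata=True))
context = processor.get_relevant_context(result, max_chars=2000)   # feeds the LLM prompt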

src.query.process_query(query, config, args)[source]

Main function for query processing.

Parameters:
  • query (str) – User query text

  • config (dict[str, Any]) – Configuration dictionary

  • args (Any) – Command line arguments

Return type:

QueryResult

Returns:

QueryResult with relevant chunks

src.query.format_query_output(result, verbose=False)[source]

Format query results for output.

Parameters:
  • result (QueryResult) – QueryResult to format

  • verbose (bool) – Whether to include detailed output

Return type:

str

Returns:

Formatted output string

LLM Interface

The llm module provides interfaces for different LLM backends and answer generation.

LLM Interface

Handles local LLM loading, prompt formatting, and answer generation. Supports multiple backends: transformers, llama-cpp, and OpenAI (optional).

class src.llm.LLMConfig(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)[source]

Bases: object

Configuration for LLM settings.

backend: str
model_path: str
temperature: float
max_tokens: int
top_p: float
repeat_penalty: float
context_window: int
__init__(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)
class src.llm.LLMResponse(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)[source]

Bases: object

Response from LLM with metadata.

answer: str
prompt_tokens: int
response_tokens: int
generation_time_ms: float
model_used: str
__init__(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)
class src.llm.BaseLLM(config)[source]

Bases: object

Base class for LLM implementations.

__init__(config)[source]

Initialize LLM with configuration.

Parameters:

config (LLMConfig) – LLM configuration

generate(prompt)[source]

Generate response from prompt.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.TransformersLLM(config)[source]

Bases: BaseLLM

LLM implementation using transformers library.

generate(prompt)[source]

Generate response using transformers.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.LlamaCppLLM(config)[source]

Bases: BaseLLM

LLM implementation using llama-cpp-python.

generate(prompt)[source]

Generate response using llama-cpp.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.OpenAILLM(config)[source]

Bases: BaseLLM

LLM implementation using OpenAI API (optional).

generate(prompt)[source]

Generate response using OpenAI API.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.LLMInterface(config)[source]

Bases: object

Main interface for LLM operations.

__init__(config)[source]

Initialize LLM interface.

Parameters:

config (dict[str, Any]) – Configuration dictionary

format_prompt(query, context)[source]

Format prompt with query and context.

Parameters:
  • query (str) – User query

  • context (str) – Retrieved document context

Return type:

str

Returns:

Formatted prompt

generate_answer(query, query_result)[source]

Generate answer from query and retrieved chunks.

Parameters:
  • query (str) – User query

  • query_result (QueryResult) – QueryResult with retrieved chunks

Return type:

LLMResponse

Returns:

LLMResponse with generated answer

get_model_info()[source]

Get information about the loaded model.

Return type:

dict[str, Any]

Returns:

Dictionary with model information

src.llm.create_llm_interface(config)[source]

Create LLM interface from configuration.

Parameters:

config (dict[str, Any]) – Configuration dictionary

Return type:

LLMInterface

Returns:

LLMInterface instance

src.llm.generate_answer_from_query(query, query_result, config)[source]

Generate answer from query and query result.

Parameters:
  • query (str) – User query

  • query_result (QueryResult) – QueryResult with retrieved chunks

  • config (dict[str, Any]) – Configuration dictionary

Return type:

str

Returns:

Generated answer string
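
Combining retrieval and generation into one flow (a sketch; config and args are assumed to be prepared as in main.main):

from src.query import process_query
from src.llm import generate_answer_from_query

query = "What is Consult+ prediction for Tesla stock?"
result = process_query(query, config, args)
print(generate_answer_from_query(query, result, config))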

src.llm.format_llm_response(response, verbose=False)[source]

Format LLM response for output.

Parameters:
  • response (LLMResponse) – LLMResponse to format

  • verbose (bool) – Whether to include metadata

Return type:

str

Returns:

Formatted output string

Utilities

The utils module contains utility functions for logging, performance monitoring, and system information.

Utility functions and logging configuration for the document-based question answering system.

src.utils.setup_logging(log_level='INFO', log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')[source]

Set up centralized logging configuration.

Parameters:
  • log_level (str) – Logging level (DEBUG, INFO, WARNING, ERROR)

  • log_file (str | None) – Optional log file path

  • log_format (str) – Log message format

Return type:

None

src.utils.get_logger(name)[source]

Get a logger instance with the given name.

Parameters:

name (str) – Logger name

Return type:

Logger

Returns:

Configured logger instance
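
Typical setup at program start (the log file path is a placeholder):

from src.utils import setup_logging, get_logger

setup_logging(log_level="DEBUG", log_file="./logs/app.log")
logger = get_logger(__name__)
logger.info("pipeline initialized")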

src.utils.log_memory_usage(logger, context='')[source]

Log current memory usage.

Parameters:
  • logger (Logger) – Logger instance

  • context (str) – Context string for the log message

Return type:

None

src.utils.log_performance(func)[source]

Decorator to log function performance metrics.

Parameters:

func – Function to decorate

Returns:

Decorated function
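
A usage sketch (build_index is a hypothetical function):

from src.utils import log_performance

@log_performance
def build_index(chunks):
    ...   # timing and metrics are logged by the decorator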

src.utils.batch_process(items, batch_size, process_func, logger, description='Processing')[source]

Process items in batches with progress logging.

Parameters:
  • items (list) – List of items to process

  • batch_size (int) – Size of each batch

  • process_func – Function to apply to each batch

  • logger (Logger) – Logger instance

  • description (str) – Description for progress logging

Return type:

list

Returns:

List of processed results

src.utils.optimize_memory()[source]

Perform memory optimization operations.

src.utils.create_cache_directory(cache_dir)[source]

Create and validate cache directory.

Parameters:

cache_dir (str) – Cache directory path

Return type:

Path

Returns:

Path to cache directory

src.utils.get_system_info()[source]

Get system information for logging.

Return type:

dict[str, Any]

Returns:

Dictionary with system information

src.utils.log_system_info(logger)[source]

Log system information.

Parameters:

logger (Logger) – Logger instance

Return type:

None

class src.utils.ProgressTracker(total_items, logger, description='Processing')[source]

Bases: object

Track and log progress of long-running operations.

__init__(total_items, logger, description='Processing')[source]

update(count=1)[source]

Update progress and log periodically.

Return type:

None

finish()[source]

Log completion statistics.

Return type:

None
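
A sketch of tracking a long-running loop (the workload is a placeholder):

from src.utils import get_logger, ProgressTracker

logger = get_logger(__name__)
items = range(1000)
tracker = ProgressTracker(total_items=len(items), logger=logger, description="Embedding")
for item in items:
    tracker.update()   # logs progress periodically
tracker.finish()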

Data Structures

Core data structures used throughout the system:

  • ChunkMetadata and DocumentChunk – defined in src.ingest (see Document Ingestion above)

  • EmbeddingConfig – defined in src.embed (see Embedding Generation above)

  • QueryResult – defined in src.query (see Query Processing above)

  • LLMConfig and LLMResponse – defined in src.llm (see LLM Interface above)

Configuration

The system uses YAML configuration files. See ../configuration for detailed configuration options.

Error Handling

The system provides comprehensive error handling for various scenarios:

  • FileNotFoundError: When documents or models are not found

  • ValueError: When configuration is invalid

  • RuntimeError: When models fail to load or process

  • MemoryError: When system runs out of memory

Performance Considerations

  • Memory Usage: Monitor memory usage with log_memory_usage()

  • Batch Processing: Use batch processing for large datasets

  • Caching: Enable caching for frequently accessed data

  • Optimization: Use optimize_memory() for memory cleanup

Examples

See ../user_guide/examples for practical usage examples.