API Reference

This section contains the complete API reference for the Document-Based Question Answering System.

Core Modules

CLI Interface

The main entry point for the system is the main.py module, which provides a command-line interface for document ingestion and querying.

Document-based Question Answering System

A local, modular RAG (retrieval-augmented generation) system that processes PDF documents and enables natural language queries.

Usage:

python main.py --mode ingest --documents ./data/
python main.py --mode query --query "What is Consult+ prediction for Tesla stock?"

main.load_config(config_path='config.yaml')[source]

Load configuration from YAML file.

Parameters:

config_path (str) – Path to the configuration file

Return type:

dict

Returns:

Dictionary containing configuration parameters

Raises:
  • FileNotFoundError – If config file doesn’t exist

  • yaml.YAMLError – If config file is malformed
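
A minimal usage sketch of the documented behavior (config.yaml is the default path from the signature):

from main import load_config

try:
    config = load_config("config.yaml")
except FileNotFoundError:
    raise SystemExit("config.yaml not found")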

main.validate_ingest_args(args)[source]

Validate arguments for ingest mode.

Parameters:

args (Namespace) – Parsed command line arguments

Raises:

ValueError – If validation fails

Return type:

None

main.validate_query_args(args)[source]

Validate arguments for query mode.

Parameters:

args (Namespace) – Parsed command line arguments

Raises:

ValueError – If validation fails

Return type:

None

main.merge_config_with_args(config, args)[source]

Merge CLI arguments with configuration; CLI arguments take precedence.

Parameters:
  • config (Dict[str, Any]) – Configuration dictionary

  • args (Namespace) – Parsed command line arguments

Return type:

Dict[str, Any]

Returns:

Updated configuration dictionary
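
A sketch of the precedence behavior. The --chunk-size flag and the chunk_size config key below are hypothetical; the real flag and key names come from main.py and config.yaml:

import argparse
from main import load_config, merge_config_with_args

parser = argparse.ArgumentParser()
parser.add_argument("--chunk-size", type=int)      # hypothetical flag
args = parser.parse_args(["--chunk-size", "500"])

config = load_config("config.yaml")                # e.g. {"chunk_size": 1000, ...}
merged = merge_config_with_args(config, args)      # the CLI value (500) should win here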

main.main()[source]

Main entry point for the document-based question answering system.

Handles command line argument parsing and routes to appropriate functionality based on the selected mode.

Return type:

None

Document Ingestion

The ingest module handles PDF document processing, text extraction, and chunking.

Document Ingestion Pipeline

Handles PDF text extraction, cleaning, chunking, and metadata storage. Supports multiple PDF engines and configurable chunking parameters.

class src.ingest.ChunkMetadata(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)[source]

Bases: object

Metadata for a text chunk.

file_name: str
page_number: int
chunk_index: int
chunk_start: int
chunk_end: int
chunk_size: int
text_length: int
__init__(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)
class src.ingest.DocumentChunk(text, metadata)[source]

Bases: object

A chunk of text from a document with metadata.

text: str
metadata: ChunkMetadata
__init__(text, metadata)
class src.ingest.PDFProcessor(engine='pymupdf')[source]

Bases: object

Handles PDF text extraction using different engines.

__init__(engine='pymupdf')[source]

Initialize PDF processor.

Parameters:

engine (str) – PDF processing engine (“pymupdf”, “pdfminer”, “pdfplumber”)

extract_text(pdf_path)[source]

Extract text from PDF with page numbers.

Parameters:

pdf_path (Path) – Path to PDF file

Return type:

list[tuple[str, int]]

Returns:

List of (text, page_number) tuples
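
A short sketch of the extraction API (the PDF path is a placeholder):

from pathlib import Path
from src.ingest import PDFProcessor

processor = PDFProcessor(engine="pymupdf")
for text, page_number in processor.extract_text(Path("./data/report.pdf")):
    print(f"page {page_number}: {len(text)} characters")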

class src.ingest.TextCleaner(config)[source]

Bases: object

Handles text cleaning and normalization.

__init__(config)[source]

Initialize text cleaner.

Parameters:

config (dict[str, Any]) – Configuration dictionary with cleaning parameters

clean_text(text)[source]

Clean and normalize text.

Parameters:

text (str) – Raw text to clean

Return type:

str

Returns:

Cleaned text

class src.ingest.TextChunker(chunk_size=1000, chunk_overlap=200)[source]

Bases: object

Handles text chunking with sliding window.

__init__(chunk_size=1000, chunk_overlap=200)[source]

Initialize text chunker.

Parameters:
  • chunk_size (int) – Size of each chunk in characters

  • chunk_overlap (int) – Overlap between chunks in characters

chunk_text(text, file_name, page_number)[source]

Split text into overlapping chunks.

Parameters:
  • text (str) – Text to chunk

  • file_name (str) – Name of the source file

  • page_number (int) – Page number

Return type:

list[DocumentChunk]

Returns:

List of DocumentChunk objects
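
For example, with the defaults (1000-character chunks, 200-character overlap), consecutive chunks share 200 characters of boundary text. A sketch with placeholder input:

from src.ingest import TextChunker

chunker = TextChunker(chunk_size=1000, chunk_overlap=200)
chunks = chunker.chunk_text(text="Lorem ipsum. " * 300, file_name="report.pdf", page_number=1)
for chunk in chunks:
    meta = chunk.metadata
    print(meta.chunk_index, meta.chunk_start, meta.chunk_end)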

class src.ingest.DocumentIngester(config)[source]

Bases: object

Main class for document ingestion pipeline.

__init__(config)[source]

Initialize document ingester.

Parameters:

config (dict[str, Any]) – Configuration dictionary

ingest_documents(documents_path)[source]

Ingest all PDF documents from the given path.

Parameters:

documents_path (Path) – Path to directory containing PDF files

Return type:

list[DocumentChunk]

Returns:

List of all document chunks

Raises:

ValueError – If documents_path doesn’t exist or contains no PDFs

save_chunks(chunks, output_path)[source]

Save chunks and metadata to disk.

Parameters:
  • chunks (list[DocumentChunk]) – List of document chunks to save

  • output_path (Path) – Destination path for the chunks and metadata

Return type:

None
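
Putting the ingestion pipeline together (paths are placeholders; the config comes from main.load_config):

from pathlib import Path
from main import load_config
from src.ingest import DocumentIngester

config = load_config("config.yaml")
ingester = DocumentIngester(config)
chunks = ingester.ingest_documents(Path("./data"))   # raises ValueError if no PDFs found
ingester.save_chunks(chunks, Path("./index/chunks"))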

src.ingest.ingest_documents(documents_path, config, args)[source]

Main function for document ingestion.

Parameters:
  • documents_path (str) – Path to documents directory

  • config (dict[str, Any]) – Configuration dictionary

  • args (Any) – Command line arguments

Return type:

None

Embedding Generation

The embed module manages embedding model loading, vector generation, and FAISS indexing.

Embedding Pipeline

Handles vector embedding generation for document chunks and FAISS index management. Supports local embedding models and efficient similarity search.

class src.embed.EmbeddingConfig(model_name, normalize_embeddings, device, similarity_threshold, top_k)[source]

Bases: object

Configuration for embedding generation.

model_name: str
normalize_embeddings: bool
device: str
similarity_threshold: float
top_k: int
__init__(model_name, normalize_embeddings, device, similarity_threshold, top_k)
class src.embed.EmbeddingModel(config)[source]

Bases: object

Handles embedding model loading and text embedding generation.

__init__(config)[source]

Initialize embedding model.

Parameters:

config (EmbeddingConfig) – Embedding configuration

generate_embeddings(texts)[source]

Generate embeddings for a list of texts.

Parameters:

texts (list[str]) – List of text strings to embed

Return type:

ndarray

Returns:

numpy array of embeddings

generate_single_embedding(text)[source]

Generate embedding for a single text.

Parameters:

text (str) – Text string to embed

Return type:

ndarray

Returns:

numpy array of embedding

class src.embed.FAISSIndex(dimension, index_type='IndexFlatIP')[source]

Bases: object

Handles FAISS index creation and management.

__init__(dimension, index_type='IndexFlatIP')[source]

Initialize FAISS index.

Parameters:
  • dimension (int) – Dimension of embeddings

  • index_type (str) – Type of FAISS index to use

add_embeddings(embeddings, chunk_metadata)[source]

Add embeddings to the index.

Parameters:
  • embeddings (ndarray) – numpy array of embeddings

  • chunk_metadata (list[ChunkMetadata]) – List of chunk metadata corresponding to embeddings

Return type:

None

search(query_embedding, k)[source]

Search for similar embeddings.

Parameters:
  • query_embedding (ndarray) – Query embedding

  • k (int) – Number of results to return

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (distances, indices)
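
A sketch of index construction and search. The 384-dimensional random embeddings and toy metadata are stand-ins; the real dimension depends on the embedding model:

import numpy as np
from src.embed import FAISSIndex
from src.ingest import ChunkMetadata

index = FAISSIndex(dimension=384, index_type="IndexFlatIP")
embeddings = np.random.rand(10, 384).astype("float32")
metadata = [ChunkMetadata("report.pdf", 1, i, 0, 100, 100, 100) for i in range(10)]
index.add_embeddings(embeddings, metadata)
distances, indices = index.search(embeddings[0], k=3)   # query shape handling is assumed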

get_chunk_by_index(index)[source]

Get chunk metadata by index.

Parameters:

index (int) – Index in the metadata list

Return type:

ChunkMetadata | None

Returns:

Chunk metadata or None if index is invalid

get_total_embeddings()[source]

Get total number of embeddings in index.

Return type:

int

save_index(index_path)[source]

Save FAISS index and metadata to disk.

Parameters:

index_path (Path) – Path to save index

Return type:

None

load_index(index_path)[source]

Load FAISS index and metadata from disk.

Parameters:

index_path (Path) – Path to load index from

Return type:

None

class src.embed.EmbeddingPipeline(config)[source]

Bases: object

Main class for embedding generation and index management.

__init__(config)[source]

Initialize embedding pipeline.

Parameters:

config (dict[str, Any]) – Configuration dictionary

create_embeddings_from_chunks(chunks)[source]

Create embeddings from document chunks and build FAISS index.

Parameters:

chunks (list[DocumentChunk]) – List of document chunks

Return type:

None

save_index(index_path)[source]

Save the FAISS index and metadata.

Parameters:

index_path (Path) – Path to save index

Return type:

None

load_index(index_path)[source]

Load the FAISS index and metadata.

Parameters:

index_path (Path) – Path to load index from

Return type:

None

search_similar_chunks(query, top_k=None)[source]

Search for chunks similar to the query.

Parameters:
  • query (str) – Query text

  • top_k (int | None) – Number of results to return (uses config default if None)

Return type:

list[tuple[DocumentChunk, float]]

Returns:

List of (chunk, similarity_score) tuples
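
A sketch of the high-level flow, reusing the ingestion steps above (paths and query text are placeholders):

from pathlib import Path
from main import load_config
from src.ingest import DocumentIngester
from src.embed import EmbeddingPipeline

config = load_config("config.yaml")
chunks = DocumentIngester(config).ingest_documents(Path("./data"))

pipeline = EmbeddingPipeline(config)
pipeline.create_embeddings_from_chunks(chunks)
pipeline.save_index(Path("./index"))

for chunk, score in pipeline.search_similar_chunks("Tesla stock outlook", top_k=3):
    print(f"{score:.3f}  {chunk.metadata.file_name} p.{chunk.metadata.page_number}")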

get_index_stats()[source]

Get statistics about the index.

Return type:

dict[str, Any]

Returns:

Dictionary with index statistics

src.embed.create_embeddings_from_chunks_file(chunks_file, config, output_path)[source]

Create embeddings from a chunks.json file.

Parameters:
  • chunks_file (Path) – Path to chunks.json file

  • config (dict[str, Any]) – Configuration dictionary

  • output_path (Path) – Path to save index

Return type:

None

src.embed.load_embedding_pipeline(config, index_path)[source]

Load an embedding pipeline with existing index.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path) – Path to index directory

Return type:

EmbeddingPipeline

Returns:

Loaded EmbeddingPipeline
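
To reuse a previously saved index (the index directory is a placeholder):

from pathlib import Path
from main import load_config
from src.embed import load_embedding_pipeline

config = load_config("config.yaml")
pipeline = load_embedding_pipeline(config, Path("./index"))
results = pipeline.search_similar_chunks("revenue forecast")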

Query Processing

The query module handles query embedding, similarity search, and result ranking.

Query Engine

Handles query processing, similarity search, and chunk retrieval. Loads FAISS index and embedding model for efficient query processing.

class src.query.QueryResult(query, chunks, similarities, total_chunks_searched, search_time_ms)[source]

Bases: object

Result of a query with relevant chunks and metadata.

query: str
chunks: list[DocumentChunk]
similarities: list[float]
total_chunks_searched: int
search_time_ms: float
__init__(query, chunks, similarities, total_chunks_searched, search_time_ms)
class src.query.QueryEngine(config, index_path=None)[source]

Bases: object

Main class for query processing and similarity search.

__init__(config, index_path=None)[source]

Initialize query engine.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path | None) – Path to FAISS index (if None, will use config default)

search(query, top_k=None, similarity_threshold=None)[source]

Search for chunks similar to the query.

Parameters:
  • query (str) – User query text

  • top_k (int | None) – Number of results to return (uses config default if None)

  • similarity_threshold (float | None) – Minimum similarity score (uses config default if None)

Return type:

QueryResult

Returns:

QueryResult with relevant chunks and metadata

get_index_stats()[source]

Get statistics about the loaded index.

Return type:

dict[str, Any]

Returns:

Dictionary with index statistics

validate_index()[source]

Validate that the index is properly loaded and functional.

Return type:

bool

Returns:

True if index is valid, False otherwise

class src.query.QueryProcessor(config, index_path=None)[source]

Bases: object

High-level query processor with additional functionality.

__init__(config, index_path=None)[source]

Initialize query processor.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path | None) – Path to FAISS index

process_query(query, top_k=None, similarity_threshold=None)[source]

Process a user query and return relevant chunks.

Parameters:
  • query (str) – User query text

  • top_k (int | None) – Number of results to return

  • similarity_threshold (float | None) – Minimum similarity score

Return type:

QueryResult

Returns:

QueryResult with relevant chunks and metadata

format_results(result, include_metadata=True)[source]

Format query results as a readable string.

Parameters:
  • result (QueryResult) – QueryResult to format

  • include_metadata (bool) – Whether to include chunk metadata

Return type:

str

Returns:

Formatted string representation of results

get_relevant_context(result, max_chars=2000)[source]

Get relevant context from search results for LLM input.

Parameters:
  • result (QueryResult) – QueryResult from search

  • max_chars (int) – Maximum characters to include

Return type:

str

Returns:

Formatted context string for LLM
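
A sketch of the query path (query text is illustrative; a None index path falls back to the config default):

from main import load_config
from src.query import QueryProcessor

config = load_config("config.yaml")
processor = QueryProcessor(config)
result = processor.process_query("What is Consult+ prediction for Tesla stock?", top_k=5)
print(processor.format_results(result, include_metadata=True))
context = processor.get_relevant_context(result, max_chars=2000)   # feeds the LLM prompt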

src.query.process_query(query, config, args)[source]

Main function for query processing.

Parameters:
  • query (str) – User query text

  • config (dict[str, Any]) – Configuration dictionary

  • args (Any) – Command line arguments

Return type:

QueryResult

Returns:

QueryResult with relevant chunks

src.query.format_query_output(result, verbose=False)[source]

Format query results for output.

Parameters:
  • result (QueryResult) – QueryResult to format

  • verbose (bool) – Whether to include detailed output

Return type:

str

Returns:

Formatted output string

LLM Interface

The llm module provides interfaces for different LLM backends and answer generation.

LLM Interface

Handles local LLM loading, prompt formatting, and answer generation. Supports multiple backends: transformers, llama-cpp, and OpenAI (optional).

class src.llm.LLMConfig(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)[source]

Bases: object

Configuration for LLM settings.

backend: str
model_path: str
temperature: float
max_tokens: int
top_p: float
repeat_penalty: float
context_window: int
__init__(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)
class src.llm.LLMResponse(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)[source]

Bases: object

Response from LLM with metadata.

answer: str
prompt_tokens: int
response_tokens: int
generation_time_ms: float
model_used: str
__init__(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)
class src.llm.BaseLLM(config)[source]

Bases: object

Base class for LLM implementations.

__init__(config)[source]

Initialize LLM with configuration.

Parameters:

config (LLMConfig) – LLM configuration

generate(prompt)[source]

Generate response from prompt.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.TransformersLLM(config)[source]

Bases: BaseLLM

LLM implementation using transformers library.

generate(prompt)[source]

Generate response using transformers.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.LlamaCppLLM(config)[source]

Bases: BaseLLM

LLM implementation using llama-cpp-python.

generate(prompt)[source]

Generate response using llama-cpp.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.OpenAILLM(config)[source]

Bases: BaseLLM

LLM implementation using OpenAI API (optional).

generate(prompt)[source]

Generate response using OpenAI API.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.LLMInterface(config)[source]

Bases: object

Main interface for LLM operations.

__init__(config)[source]

Initialize LLM interface.

Parameters:

config (dict[str, Any]) – Configuration dictionary

format_prompt(query, context)[source]

Format prompt with query and context.

Parameters:
  • query (str) – User query

  • context (str) – Retrieved document context

Return type:

str

Returns:

Formatted prompt

generate_answer(query, query_result)[source]

Generate answer from query and retrieved chunks.

Parameters:
  • query (str) – User query

  • query_result (QueryResult) – QueryResult with retrieved chunks

Return type:

LLMResponse

Returns:

LLMResponse with generated answer

get_model_info()[source]

Get information about the loaded model.

Return type:

dict[str, Any]

Returns:

Dictionary with model information

src.llm.create_llm_interface(config)[source]

Create LLM interface from configuration.

Parameters:

config (dict[str, Any]) – Configuration dictionary

Return type:

LLMInterface

Returns:

LLMInterface instance

src.llm.generate_answer_from_query(query, query_result, config)[source]

Generate answer from query and query result.

Parameters:
  • query (str) – User query

  • query_result (QueryResult) – QueryResult with retrieved chunks

  • config (dict[str, Any]) – Configuration dictionary

Return type:

str

Returns:

Generated answer string
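
Combining retrieval and generation into one flow (a sketch; config and args are assumed to be prepared as in main.main):

from src.query import process_query
from src.llm import generate_answer_from_query

query = "What is Consult+ prediction for Tesla stock?"
result = process_query(query, config, args)
print(generate_answer_from_query(query, result, config))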

src.llm.format_llm_response(response, verbose=False)[source]

Format LLM response for output.

Parameters:
  • response (LLMResponse) – LLMResponse to format

  • verbose (bool) – Whether to include metadata

Return type:

str

Returns:

Formatted output string

Utilities

The utils module contains utility functions for logging, performance monitoring, and system information.

Utility functions and logging configuration for the document-based question answering system.

src.utils.setup_logging(log_level='INFO', log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')[source]

Set up centralized logging configuration.

Parameters:
  • log_level (str) – Logging level (DEBUG, INFO, WARNING, ERROR)

  • log_file (str | None) – Optional log file path

  • log_format (str) – Log message format

Return type:

None

src.utils.get_logger(name)[source]

Get a logger instance with the given name.

Parameters:

name (str) – Logger name

Return type:

Logger

Returns:

Configured logger instance
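
Typical setup at program start (the log file path is a placeholder):

from src.utils import setup_logging, get_logger

setup_logging(log_level="DEBUG", log_file="./logs/app.log")
logger = get_logger(__name__)
logger.info("pipeline initialized")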

src.utils.log_memory_usage(logger, context='')[source]

Log current memory usage.

Parameters:
  • logger (Logger) – Logger instance

  • context (str) – Context string for the log message

Return type:

None

src.utils.log_performance(func)[source]

Decorator to log function performance metrics.

Parameters:

func – Function to decorate

Returns:

Decorated function
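
A usage sketch (build_index is a hypothetical function):

from src.utils import log_performance

@log_performance
def build_index(chunks):
    ...   # timing and metrics are logged by the decorator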

src.utils.batch_process(items, batch_size, process_func, logger, description='Processing')[source]

Process items in batches with progress logging.

Parameters:
  • items (list) – List of items to process

  • batch_size (int) – Size of each batch

  • process_func – Function to apply to each batch

  • logger (Logger) – Logger instance

  • description (str) – Description for progress logging

Return type:

list

Returns:

List of processed results

src.utils.optimize_memory()[source]

Perform memory optimization operations.

src.utils.create_cache_directory(cache_dir)[source]

Create and validate cache directory.

Parameters:

cache_dir (str) – Cache directory path

Return type:

Path

Returns:

Path to cache directory

src.utils.get_system_info()[source]

Get system information for logging.

Return type:

dict[str, Any]

Returns:

Dictionary with system information

src.utils.log_system_info(logger)[source]

Log system information.

Parameters:

logger (Logger) – Logger instance

Return type:

None

class src.utils.ProgressTracker(total_items, logger, description='Processing')[source]

Bases: object

Track and log progress of long-running operations.

__init__(total_items, logger, description='Processing')[source]

update(count=1)[source]

Update progress and log periodically.

Return type:

None

finish()[source]

Log completion statistics.

Return type:

None
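
A sketch of tracking a long-running loop (the workload is a placeholder):

from src.utils import get_logger, ProgressTracker

logger = get_logger(__name__)
items = range(1000)
tracker = ProgressTracker(total_items=len(items), logger=logger, description="Embedding")
for item in items:
    tracker.update()   # logs progress periodically
tracker.finish()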

Data Structures

Core data structures used throughout the system:

  • ChunkMetadata and DocumentChunk – defined in src.ingest (see Document Ingestion above)

  • EmbeddingConfig – defined in src.embed (see Embedding Generation above)

  • QueryResult – defined in src.query (see Query Processing above)

  • LLMConfig and LLMResponse – defined in src.llm (see LLM Interface above)

Configuration

The system uses YAML configuration files. See ../configuration for detailed configuration options.

Error Handling

The system provides comprehensive error handling for various scenarios:

  • FileNotFoundError: When documents or models are not found

  • ValueError: When configuration is invalid

  • RuntimeError: When models fail to load or process

  • MemoryError: When system runs out of memory

Performance Considerations

  • Memory Usage: Monitor memory usage with log_memory_usage()

  • Batch Processing: Use batch processing for large datasets

  • Caching: Enable caching for frequently accessed data

  • Optimization: Use optimize_memory() for memory cleanup

Examples

See ../user_guide/examples for practical usage examples.