src package

Submodules

src.embed module

Embedding Pipeline

Handles vector embedding generation for document chunks and FAISS index management. Supports local embedding models and efficient similarity search.

class src.embed.EmbeddingConfig(model_name, normalize_embeddings, device, similarity_threshold, top_k)[source]

Bases: object

Configuration for embedding generation.

model_name: str
normalize_embeddings: bool
device: str
similarity_threshold: float
top_k: int
__init__(model_name, normalize_embeddings, device, similarity_threshold, top_k)
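
A minimal construction sketch. The field names come from the class above; the specific values (model identifier, device string, thresholds) are illustrative assumptions:

   from src.embed import EmbeddingConfig

   # Field names match the class above; all values are assumptions.
   config = EmbeddingConfig(
       model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed model id
       normalize_embeddings=True,  # unit-length vectors suit inner-product search
       device="cpu",               # or "cuda" if a GPU is available
       similarity_threshold=0.3,
       top_k=5,
   )
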
class src.embed.EmbeddingModel(config)[source]

Bases: object

Handles embedding model loading and text embedding generation.

__init__(config)[source]

Initialize embedding model.

Parameters:

config (EmbeddingConfig) – Embedding configuration

generate_embeddings(texts)[source]

Generate embeddings for a list of texts.

Parameters:

texts (list[str]) – List of text strings to embed

Return type:

ndarray

Returns:

numpy array of embeddings

generate_single_embedding(text)[source]

Generate embedding for a single text.

Parameters:

text (str) – Text string to embed

Return type:

ndarray

Returns:

numpy array of embedding
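
A short usage sketch for the two generation methods, reusing an EmbeddingConfig like the one shown above (model name and values remain assumptions):

   from src.embed import EmbeddingConfig, EmbeddingModel

   config = EmbeddingConfig(
       model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed
       normalize_embeddings=True, device="cpu",
       similarity_threshold=0.3, top_k=5,
   )
   model = EmbeddingModel(config)

   # Batch: one row per input text.
   vectors = model.generate_embeddings(["first chunk", "second chunk"])
   print(vectors.shape)  # (2, embedding_dimension)

   # Single text: one embedding.
   query_vector = model.generate_single_embedding("a user question")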

class src.embed.FAISSIndex(dimension, index_type='IndexFlatIP')[source]

Bases: object

Handles FAISS index creation and management.

__init__(dimension, index_type='IndexFlatIP')[source]

Initialize FAISS index.

Parameters:
  • dimension (int) – Dimension of embeddings

  • index_type (str) – Type of FAISS index to use

add_embeddings(embeddings, chunk_metadata)[source]

Add embeddings to the index.

Parameters:
  • embeddings (ndarray) – numpy array of embeddings

  • chunk_metadata (list[ChunkMetadata]) – List of chunk metadata corresponding to embeddings

Return type:

None

search(query_embedding, k)[source]

Search for similar embeddings.

Parameters:
  • query_embedding (ndarray) – Query embedding

  • k (int) – Number of results to return

Return type:

tuple[ndarray, ndarray]

Returns:

Tuple of (distances, indices)

get_chunk_by_index(index)[source]

Get chunk metadata by index.

Parameters:

index (int) – Index in the metadata list

Return type:

ChunkMetadata | None

Returns:

Chunk metadata or None if index is invalid

get_total_embeddings()[source]

Get total number of embeddings in index.

Return type:

int

save_index(index_path)[source]

Save FAISS index and metadata to disk.

Parameters:

index_path (Path) – Path to save index

Return type:

None

load_index(index_path)[source]

Load FAISS index and metadata from disk.

Parameters:

index_path (Path) – Path to load index from

Return type:

None
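
A lifecycle sketch for the index: create, add, search, persist. The dimension, metadata values, and save path are assumptions. Note that the default "IndexFlatIP" scores by inner product, so normalized embeddings make the returned scores cosine similarities:

   import numpy as np
   from pathlib import Path
   from src.embed import FAISSIndex
   from src.ingest import ChunkMetadata

   dim = 384  # assumed embedding dimension
   index = FAISSIndex(dimension=dim)  # default "IndexFlatIP": exact inner-product search

   embeddings = np.random.rand(3, dim).astype("float32")  # stand-in vectors
   metadata = [
       ChunkMetadata(file_name="doc.pdf", page_number=1, chunk_index=i,
                     chunk_start=i * 800, chunk_end=i * 800 + 1000,
                     chunk_size=1000, text_length=1000)
       for i in range(3)
   ]
   index.add_embeddings(embeddings, metadata)

   distances, indices = index.search(embeddings[0], k=2)
   best = index.get_chunk_by_index(int(indices.flat[0]))  # ChunkMetadata or None
   print(index.get_total_embeddings())                    # 3
   index.save_index(Path("data/index"))                   # assumed location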

class src.embed.EmbeddingPipeline(config)[source]

Bases: object

Main class for embedding generation and index management.

__init__(config)[source]

Initialize embedding pipeline.

Parameters:

config (dict[str, Any]) – Configuration dictionary

create_embeddings_from_chunks(chunks)[source]

Create embeddings from document chunks and build FAISS index.

Parameters:

chunks (list[DocumentChunk]) – List of document chunks

Return type:

None

save_index(index_path)[source]

Save the FAISS index and metadata.

Parameters:

index_path (Path) – Path to save index

Return type:

None

load_index(index_path)[source]

Load the FAISS index and metadata.

Parameters:

index_path (Path) – Path to load index from

Return type:

None

search_similar_chunks(query, top_k=None)[source]

Search for chunks similar to the query.

Parameters:
  • query (str) – Query text

  • top_k (int | None) – Number of results to return (uses config default if None)

Return type:

list[tuple[DocumentChunk, float]]

Returns:

List of (chunk, similarity_score) tuples

get_index_stats()[source]

Get statistics about the index.

Return type:

dict[str, Any]

Returns:

Dictionary with index statistics
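
An end-to-end sketch: build an index from ingested chunks, persist it, and query it. The documented signature only promises a plain config dict, so an empty dict stands in for the project's real configuration here (an assumption), and the paths are illustrative:

   from pathlib import Path
   from src.embed import EmbeddingPipeline
   from src.ingest import DocumentIngester

   config = {}  # stand-in for the project's configuration dict (schema not documented here)

   chunks = DocumentIngester(config).ingest_documents(Path("docs/"))  # list[DocumentChunk]

   pipeline = EmbeddingPipeline(config)
   pipeline.create_embeddings_from_chunks(chunks)
   pipeline.save_index(Path("data/index"))

   for chunk, score in pipeline.search_similar_chunks("what is covered?", top_k=3):
       print(f"{score:.3f}  {chunk.metadata.file_name} p.{chunk.metadata.page_number}")
   print(pipeline.get_index_stats())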

src.embed.create_embeddings_from_chunks_file(chunks_file, config, output_path)[source]

Create embeddings from a chunks.json file.

Parameters:
  • chunks_file (Path) – Path to chunks.json file

  • config (dict[str, Any]) – Configuration dictionary

  • output_path (Path) – Path to save index

Return type:

None

src.embed.load_embedding_pipeline(config, index_path)[source]

Load an embedding pipeline with existing index.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path) – Path to index directory

Return type:

EmbeddingPipeline

Returns:

Loaded EmbeddingPipeline
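
The two module-level helpers cover the offline/online split: build an index from a saved chunks.json once, then reload it for querying. Paths are illustrative and the config dict is a stand-in:

   from pathlib import Path
   from src.embed import create_embeddings_from_chunks_file, load_embedding_pipeline

   config = {}  # stand-in for the project's configuration dict

   # Offline: chunks.json -> FAISS index on disk.
   create_embeddings_from_chunks_file(Path("data/chunks.json"), config, Path("data/index"))

   # Online: reload and search.
   pipeline = load_embedding_pipeline(config, Path("data/index"))
   results = pipeline.search_similar_chunks("example query")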

src.ingest module

Document Ingestion Pipeline

Handles PDF text extraction, cleaning, chunking, and metadata storage. Supports multiple PDF engines and configurable chunking parameters.

class src.ingest.ChunkMetadata(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)[source]

Bases: object

Metadata for a text chunk.

file_name: str
page_number: int
chunk_index: int
chunk_start: int
chunk_end: int
chunk_size: int
text_length: int
__init__(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)
class src.ingest.DocumentChunk(text, metadata)[source]

Bases: object

A chunk of text from a document with metadata.

text: str
metadata: ChunkMetadata
__init__(text, metadata)
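
A construction sketch showing how the two classes fit together; the values are illustrative:

   from src.ingest import ChunkMetadata, DocumentChunk

   meta = ChunkMetadata(
       file_name="report.pdf",
       page_number=3,
       chunk_index=0,
       chunk_start=0,
       chunk_end=1000,
       chunk_size=1000,
       text_length=982,  # actual character count after cleaning
   )
   chunk = DocumentChunk(text="First chunk of page 3 ...", metadata=meta)
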
class src.ingest.PDFProcessor(engine='pymupdf')[source]

Bases: object

Handles PDF text extraction using different engines.

__init__(engine='pymupdf')[source]

Initialize PDF processor.

Parameters:

engine (str) – PDF processing engine (“pymupdf”, “pdfminer”, “pdfplumber”)

extract_text(pdf_path)[source]

Extract text from PDF with page numbers.

Parameters:

pdf_path (Path) – Path to PDF file

Return type:

list[tuple[str, int]]

Returns:

List of (text, page_number) tuples

Raises:

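A usage sketch; the file path is illustrative, and the engine argument selects among the three supported extractors:

   from pathlib import Path
   from src.ingest import PDFProcessor

   processor = PDFProcessor(engine="pymupdf")  # or "pdfminer" / "pdfplumber"
   for text, page_number in processor.extract_text(Path("docs/report.pdf")):
       print(f"page {page_number}: {len(text)} characters")
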
class src.ingest.TextCleaner(config)[source]

Bases: object

Handles text cleaning and normalization.

__init__(config)[source]

Initialize text cleaner.

Parameters:

config (dict[str, Any]) – Configuration dictionary with cleaning parameters

clean_text(text)[source]

Clean and normalize text.

Parameters:

text (str) – Raw text to clean

Return type:

str

Returns:

Cleaned text
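
A minimal sketch. The cleaning options carried by the config dict are not documented here, so an empty dict stands in for the project defaults (an assumption):

   from src.ingest import TextCleaner

   cleaner = TextCleaner(config={})  # assumed: defaults apply when no options are given
   print(cleaner.clean_text("Some   raw\n\nextracted   text"))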

class src.ingest.TextChunker(chunk_size=1000, chunk_overlap=200)[source]

Bases: object

Handles text chunking with sliding window.

__init__(chunk_size=1000, chunk_overlap=200)[source]

Initialize text chunker.

Parameters:
  • chunk_size (int) – Size of each chunk in characters

  • chunk_overlap (int) – Overlap between chunks in characters

chunk_text(text, file_name, page_number)[source]

Split text into overlapping chunks.

Parameters:
  • text (str) – Text to chunk

  • file_name (str) – Name of the source file

  • page_number (int) – Page number

Return type:

list[DocumentChunk]

Returns:

List of DocumentChunk objects
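
A sketch of the sliding window: with chunk_size=1000 and chunk_overlap=200, each window starts 800 characters after the previous one, so consecutive chunks share 200 characters:

   from src.ingest import TextChunker

   chunker = TextChunker(chunk_size=1000, chunk_overlap=200)
   long_text = "x" * 2500  # stand-in for cleaned page text
   chunks = chunker.chunk_text(long_text, file_name="report.pdf", page_number=1)

   # The window advances chunk_size - chunk_overlap = 800 characters per step.
   for c in chunks:
       print(c.metadata.chunk_start, c.metadata.chunk_end)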

class src.ingest.DocumentIngester(config)[source]

Bases: object

Main class for document ingestion pipeline.

__init__(config)[source]

Initialize document ingester.

Parameters:

config (dict[str, Any]) – Configuration dictionary

ingest_documents(documents_path)[source]

Ingest all PDF documents from the given path.

Parameters:

documents_path (Path) – Path to directory containing PDF files

Return type:

list[DocumentChunk]

Returns:

List of all document chunks

Raises:

ValueError – If documents_path doesn’t exist or contains no PDFs

save_chunks(chunks, output_path)[source]

Save chunks and metadata to disk.

Parameters:
  • chunks (list[DocumentChunk]) – List of document chunks to save

  • output_path (Path) – Path to save chunks and metadata

Return type:

None
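
An ingestion sketch; the config dict is a stand-in for the project's real configuration, and the paths are illustrative:

   from pathlib import Path
   from src.ingest import DocumentIngester

   config = {}  # stand-in for the project's configuration dict
   ingester = DocumentIngester(config)

   chunks = ingester.ingest_documents(Path("docs/"))  # ValueError if the path has no PDFs
   ingester.save_chunks(chunks, Path("data/chunks.json"))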

src.ingest.ingest_documents(documents_path, config, args)[source]

Main function for document ingestion.

Parameters:
  • documents_path (str) – Path to documents directory

  • config (dict[str, Any]) – Configuration dictionary

  • args (Any) – Command line arguments

Return type:

None

src.llm module

LLM Interface

Handles local LLM loading, prompt formatting, and answer generation. Supports multiple backends: transformers, llama-cpp, and OpenAI (optional).

class src.llm.LLMConfig(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)[source]

Bases: object

Configuration for LLM settings.

backend: str
model_path: str
temperature: float
max_tokens: int
top_p: float
repeat_penalty: float
context_window: int
__init__(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)
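
A construction sketch. The field names come from the class above; the values, including the exact backend identifier and model path, are assumptions:

   from src.llm import LLMConfig

   config = LLMConfig(
       backend="llama-cpp",             # assumed identifier; transformers/OpenAI also supported
       model_path="models/model.gguf",  # assumed local path
       temperature=0.2,
       max_tokens=512,
       top_p=0.9,
       repeat_penalty=1.1,
       context_window=4096,
   )
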
class src.llm.LLMResponse(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)[source]

Bases: object

Response from LLM with metadata.

answer: str
prompt_tokens: int
response_tokens: int
generation_time_ms: float
model_used: str
__init__(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)
class src.llm.BaseLLM(config)[source]

Bases: object

Base class for LLM implementations.

__init__(config)[source]

Initialize LLM with configuration.

Parameters:

config (LLMConfig) – LLM configuration

generate(prompt)[source]

Generate response from prompt.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.TransformersLLM(config)[source]

Bases: BaseLLM

LLM implementation using transformers library.

generate(prompt)[source]

Generate response using transformers.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.LlamaCppLLM(config)[source]

Bases: BaseLLM

LLM implementation using llama-cpp-python.

generate(prompt)[source]

Generate response using llama-cpp.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.OpenAILLM(config)[source]

Bases: BaseLLM

LLM implementation using OpenAI API (optional).

generate(prompt)[source]

Generate response using OpenAI API.

Parameters:

prompt (str) – Input prompt

Return type:

LLMResponse

Returns:

LLMResponse with answer and metadata

class src.llm.LLMInterface(config)[source]

Bases: object

Main interface for LLM operations.

__init__(config)[source]

Initialize LLM interface.

Parameters:

config (dict[str, Any]) – Configuration dictionary

format_prompt(query, context)[source]

Format prompt with query and context.

Parameters:
  • query (str) – User query

  • context (str) – Retrieved document context

Return type:

str

Returns:

Formatted prompt

generate_answer(query, query_result)[source]

Generate answer from query and retrieved chunks.

Parameters:
  • query (str) – User query

  • query_result (QueryResult) – QueryResult with retrieved chunks

Return type:

LLMResponse

Returns:

LLMResponse with generated answer

get_model_info()[source]

Get information about the loaded model.

Return type:

dict[str, Any]

Returns:

Dictionary with model information
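
A usage sketch: format a prompt manually, or let generate_answer consume a QueryResult from src.query directly. The empty config dict is a stand-in (its schema is not documented here):

   from src.llm import LLMInterface
   from src.query import QueryProcessor

   config = {}  # stand-in for the project's configuration dict
   llm = LLMInterface(config)

   # Manual prompt assembly:
   prompt = llm.format_prompt(query="What is X?", context="...retrieved chunk text...")

   # Or consume a QueryResult from src.query directly:
   query_result = QueryProcessor(config).process_query("What is X?")
   response = llm.generate_answer("What is X?", query_result)
   print(response.answer, f"({response.generation_time_ms:.0f} ms)")
   print(llm.get_model_info())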

src.llm.create_llm_interface(config)[source]

Create LLM interface from configuration.

Parameters:

config (dict[str, Any]) – Configuration dictionary

Return type:

LLMInterface

Returns:

LLMInterface instance

src.llm.generate_answer_from_query(query, query_result, config)[source]

Generate answer from query and query result.

Parameters:
  • query (str) – User query

  • query_result (QueryResult) – QueryResult with retrieved chunks

  • config (dict[str, Any]) – Configuration dictionary

Return type:

str

Returns:

Generated answer string

src.llm.format_llm_response(response, verbose=False)[source]

Format LLM response for output.

Parameters:
  • response (LLMResponse) – LLMResponse to format

  • verbose (bool) – Whether to include metadata

Return type:

str

Returns:

Formatted output string
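
The module-level helpers wrap the same flow; a sketch with the same stand-in config dict:

   from src.llm import create_llm_interface, generate_answer_from_query, format_llm_response
   from src.query import QueryProcessor

   config = {}  # stand-in for the project's configuration dict
   query = "What is X?"
   query_result = QueryProcessor(config).process_query(query)

   # One-call convenience: returns just the answer string.
   answer = generate_answer_from_query(query, query_result, config)

   # Or keep the full LLMResponse and render it with metadata.
   llm = create_llm_interface(config)
   response = llm.generate_answer(query, query_result)
   print(format_llm_response(response, verbose=True))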

src.query module

Query Engine

Handles query processing, similarity search, and chunk retrieval. Loads FAISS index and embedding model for efficient query processing.

class src.query.QueryResult(query, chunks, similarities, total_chunks_searched, search_time_ms)[source]

Bases: object

Result of a query with relevant chunks and metadata.

query: str
chunks: list[DocumentChunk]
similarities: list[float]
total_chunks_searched: int
search_time_ms: float
__init__(query, chunks, similarities, total_chunks_searched, search_time_ms)
class src.query.QueryEngine(config, index_path=None)[source]

Bases: object

Main class for query processing and similarity search.

__init__(config, index_path=None)[source]

Initialize query engine.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path | None) – Path to FAISS index (if None, will use config default)

search(query, top_k=None, similarity_threshold=None)[source]

Search for chunks similar to the query.

Parameters:
  • query (str) – User query text

  • top_k (int | None) – Number of results to return (uses config default if None)

  • similarity_threshold (float | None) – Minimum similarity score (uses config default if None)

Return type:

QueryResult

Returns:

QueryResult with relevant chunks and metadata

get_index_stats()[source]

Get statistics about the loaded index.

Return type:

dict[str, Any]

Returns:

Dictionary with index statistics

validate_index()[source]

Validate that the index is properly loaded and functional.

Return type:

bool

Returns:

True if index is valid, False otherwise
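
A search sketch; the config dict is a stand-in and the index path is illustrative:

   from pathlib import Path
   from src.query import QueryEngine

   config = {}  # stand-in for the project's configuration dict
   engine = QueryEngine(config, index_path=Path("data/index"))

   if engine.validate_index():
       result = engine.search("what is X?", top_k=5, similarity_threshold=0.3)
       for chunk, score in zip(result.chunks, result.similarities):
           print(f"{score:.3f}  {chunk.metadata.file_name} p.{chunk.metadata.page_number}")
       print(f"searched {result.total_chunks_searched} chunks in {result.search_time_ms:.1f} ms")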

class src.query.QueryProcessor(config, index_path=None)[source]

Bases: object

High-level query processor with additional functionality.

__init__(config, index_path=None)[source]

Initialize query processor.

Parameters:
  • config (dict[str, Any]) – Configuration dictionary

  • index_path (Path | None) – Path to FAISS index

process_query(query, top_k=None, similarity_threshold=None)[source]

Process a user query and return relevant chunks.

Parameters:
  • query (str) – User query text

  • top_k (int | None) – Number of results to return

  • similarity_threshold (float | None) – Minimum similarity score

Return type:

QueryResult

Returns:

QueryResult with relevant chunks and metadata

format_results(result, include_metadata=True)[source]

Format query results as a readable string.

Parameters:
  • result (QueryResult) – QueryResult to format

  • include_metadata (bool) – Whether to include chunk metadata

Return type:

str

Returns:

Formatted string representation of results

get_relevant_context(result, max_chars=2000)[source]

Get relevant context from search results for LLM input.

Parameters:
  • result (QueryResult) – QueryResult from search

  • max_chars (int) – Maximum characters to include

Return type:

str

Returns:

Formatted context string for LLM
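
A sketch of the higher-level flow, including trimming retrieved text to a character budget before prompting an LLM; the config dict is again a stand-in:

   from src.query import QueryProcessor

   config = {}  # stand-in for the project's configuration dict
   processor = QueryProcessor(config)  # index path falls back to the config default

   result = processor.process_query("what is X?", top_k=5)
   print(processor.format_results(result, include_metadata=True))

   # Budgeted context string for the LLM prompt.
   context = processor.get_relevant_context(result, max_chars=2000)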

src.query.process_query(query, config, args)[source]

Main function for query processing.

Parameters:
  • query (str) – User query text

  • config (dict[str, Any]) – Configuration dictionary

  • args (Any) – Command line arguments

Return type:

QueryResult

Returns:

QueryResult with relevant chunks

src.query.format_query_output(result, verbose=False)[source]

Format query results for output.

Parameters:
  • result (QueryResult) – QueryResult to format

  • verbose (bool) – Whether to include detailed output

Return type:

str

Returns:

Formatted output string

src.utils module

Utility functions and logging configuration for the document-based question answering system.

src.utils.setup_logging(log_level='INFO', log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')[source]

Set up centralized logging configuration.

Parameters:
  • log_level (str) – Logging level (DEBUG, INFO, WARNING, ERROR)

  • log_file (str | None) – Optional log file path

  • log_format (str) – Log message format

Return type:

None

src.utils.get_logger(name)[source]

Get a logger instance with the given name.

Parameters:

name (str) – Logger name

Return type:

Logger

Returns:

Configured logger instance
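
A typical setup sketch; the log file path is illustrative:

   from src.utils import setup_logging, get_logger

   setup_logging(log_level="DEBUG", log_file="logs/app.log")  # assumed path
   logger = get_logger(__name__)
   logger.info("pipeline starting")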

src.utils.log_memory_usage(logger, context='')[source]

Log current memory usage.

Parameters:
  • logger (Logger) – Logger instance

  • context (str) – Context string for the log message

Return type:

None

src.utils.log_performance(func)[source]

Decorator to log function performance metrics.

Parameters:

func – Function to decorate

Returns:

Decorated function
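
A decorator sketch; the wrapped function here is hypothetical:

   from src.utils import log_performance

   @log_performance
   def embed_batch(texts):
       # hypothetical workload; timing is logged by the decorator
       return [t.lower() for t in texts]

   embed_batch(["Alpha", "Beta"])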

src.utils.batch_process(items, batch_size, process_func, logger, description='Processing')[source]

Process items in batches with progress logging.

Parameters:
  • items (list) – List of items to process

  • batch_size (int) – Size of each batch

  • process_func – Function to apply to each batch

  • logger (Logger) – Logger instance

  • description (str) – Description for progress logging

Return type:

list

Returns:

List of processed results
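
A sketch. Whether per-batch results are flattened or collected one entry per batch is not specified above, so treat the output shape as implementation-defined:

   from src.utils import batch_process, get_logger

   logger = get_logger(__name__)
   results = batch_process(
       items=list(range(1000)),
       batch_size=100,
       process_func=lambda batch: [x * x for x in batch],  # applied to each batch
       logger=logger,
       description="Squaring",
   )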

src.utils.optimize_memory()[source]

Perform memory optimization operations.

src.utils.create_cache_directory(cache_dir)[source]

Create and validate cache directory.

Parameters:

cache_dir (str) – Cache directory path

Return type:

Path

Returns:

Path to cache directory

src.utils.get_system_info()[source]

Get system information for logging.

Return type:

dict[str, Any]

Returns:

Dictionary with system information

src.utils.log_system_info(logger)[source]

Log system information.

Parameters:

logger (Logger) – Logger instance

Return type:

None

class src.utils.ProgressTracker(total_items, logger, description='Processing')[source]

Bases: object

Track and log progress of long-running operations.

__init__(total_items, logger, description='Processing')[source]
update(count=1)[source]

Update progress and log periodically.

Return type:

None

finish()[source]

Log completion statistics.

Return type:

None
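
A usage sketch:

   from src.utils import ProgressTracker, get_logger

   logger = get_logger(__name__)
   tracker = ProgressTracker(total_items=500, logger=logger, description="Embedding")
   for _ in range(500):
       tracker.update()  # logs progress periodically
   tracker.finish()      # logs completion statistics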

Module contents

Document-based Question Answering System

A local, modular RAG (retrieval-augmented generation) system.
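
How the four submodules compose into the RAG loop, as a sketch: the empty config dict stands in for the project's real configuration, and all paths are illustrative:

   from pathlib import Path
   from src.ingest import DocumentIngester
   from src.embed import EmbeddingPipeline
   from src.query import QueryProcessor
   from src.llm import LLMInterface

   config = {}  # stand-in for the project's configuration dict

   # 1. Ingest: PDFs -> cleaned, chunked text with metadata.
   chunks = DocumentIngester(config).ingest_documents(Path("docs/"))

   # 2. Embed: chunks -> FAISS index on disk.
   pipeline = EmbeddingPipeline(config)
   pipeline.create_embeddings_from_chunks(chunks)
   pipeline.save_index(Path("data/index"))

   # 3. Retrieve: query -> most similar chunks.
   processor = QueryProcessor(config, index_path=Path("data/index"))
   result = processor.process_query("What does the report conclude?")

   # 4. Generate: query + retrieved chunks -> grounded answer.
   response = LLMInterface(config).generate_answer("What does the report conclude?", result)
   print(response.answer)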