src package
Submodules
src.embed module
Embedding Pipeline
Handles vector embedding generation for document chunks and FAISS index management. Supports local embedding models and efficient similarity search.
- class src.embed.EmbeddingConfig(model_name, normalize_embeddings, device, similarity_threshold, top_k)[source]
Bases: object
Configuration for embedding generation.
- __init__(model_name, normalize_embeddings, device, similarity_threshold, top_k)
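Example (a minimal construction sketch; every field value below is an illustrative assumption, since the source documents the fields but no defaults):

```python
from src.embed import EmbeddingConfig

# All values are illustrative assumptions, not documented defaults.
config = EmbeddingConfig(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed local model
    normalize_embeddings=True,  # normalized vectors suit inner-product (cosine) search
    device="cpu",
    similarity_threshold=0.3,
    top_k=5,
)
```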
- class src.embed.EmbeddingModel(config)[source]
Bases: object
Handles embedding model loading and text embedding generation.
- __init__(config)[source]
Initialize embedding model.
- Parameters:
  config (EmbeddingConfig) – Embedding configuration
- class src.embed.FAISSIndex(dimension, index_type='IndexFlatIP')[source]
Bases: object
Handles FAISS index creation and management.
- add_embeddings(embeddings, chunk_metadata)[source]
Add embeddings to the index.
- Parameters:
  embeddings (ndarray) – numpy array of embeddings
  chunk_metadata (list[ChunkMetadata]) – List of chunk metadata corresponding to embeddings
- Return type: None
- get_chunk_by_index(index)[source]
Get chunk metadata by index.
- Parameters:
  index (int) – Index in the metadata list
- Return type: ChunkMetadata | None
- Returns: Chunk metadata or None if index is invalid
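Example (a usage sketch assuming 384-dimensional embeddings and borrowing ChunkMetadata from src.ingest; the concrete values are placeholders):

```python
import numpy as np

from src.embed import FAISSIndex
from src.ingest import ChunkMetadata

# The dimension must match the embedding model's output size (384 is assumed).
index = FAISSIndex(dimension=384, index_type="IndexFlatIP")

embeddings = np.random.rand(3, 384).astype("float32")  # placeholder vectors
metadata = [
    ChunkMetadata(file_name="report.pdf", page_number=1, chunk_index=i,
                  chunk_start=i * 800, chunk_end=i * 800 + 1000,
                  chunk_size=1000, text_length=1000)
    for i in range(3)
]

index.add_embeddings(embeddings, metadata)
meta = index.get_chunk_by_index(0)  # ChunkMetadata, or None if out of range
```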
- class src.embed.EmbeddingPipeline(config)[source]
Bases: object
Main class for embedding generation and index management.
- create_embeddings_from_chunks(chunks)[source]
Create embeddings from document chunks and build FAISS index.
- Parameters:
  chunks (list[DocumentChunk]) – List of document chunks
- Return type: None
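Example (continuing the EmbeddingConfig sketch above; where the chunks come from is indicated in a comment):

```python
from src.embed import EmbeddingPipeline

pipeline = EmbeddingPipeline(config)  # `config` from the EmbeddingConfig sketch above

# `chunks` would normally come from src.ingest.DocumentIngester.ingest_documents().
chunks = []  # placeholder; see the src.ingest section below
pipeline.create_embeddings_from_chunks(chunks)
```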
src.ingest module
Document Ingestion Pipeline
Handles PDF text extraction, cleaning, chunking, and metadata storage. Supports multiple PDF engines and configurable chunking parameters.
- class src.ingest.ChunkMetadata(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)[source]
Bases: object
Metadata for a text chunk.
- __init__(file_name, page_number, chunk_index, chunk_start, chunk_end, chunk_size, text_length)
- class src.ingest.DocumentChunk(text, metadata)[source]
Bases: object
A chunk of text from a document with metadata.
- metadata: ChunkMetadata
- __init__(text, metadata)
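Example (a construction sketch with illustrative values; the field names follow the documented __init__ signatures):

```python
from src.ingest import ChunkMetadata, DocumentChunk

meta = ChunkMetadata(
    file_name="report.pdf",  # illustrative values throughout
    page_number=3,
    chunk_index=0,
    chunk_start=0,
    chunk_end=1000,
    chunk_size=1000,
    text_length=842,
)
chunk = DocumentChunk(text="extracted page text ...", metadata=meta)
```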
- class src.ingest.PDFProcessor(engine='pymupdf')[source]
Bases: object
Handles PDF text extraction using different engines.
- __init__(engine='pymupdf')[source]
Initialize PDF processor.
- Parameters:
  engine (str) – PDF processing engine ("pymupdf", "pdfminer", "pdfplumber")
- extract_text(pdf_path)[source]
Extract text from PDF with page numbers.
- Parameters:
  pdf_path (Path) – Path to PDF file
- Return type: list[tuple[str, int]]
- Returns: List of (text, page_number) tuples
- Raises:
  ValueError – If PDF engine is not supported
  FileNotFoundError – If PDF file doesn't exist
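Example (a usage sketch; the file path is a placeholder):

```python
from pathlib import Path

from src.ingest import PDFProcessor

processor = PDFProcessor(engine="pymupdf")  # or "pdfminer" / "pdfplumber"

# Raises FileNotFoundError if the file is missing (path is a placeholder).
pages = processor.extract_text(Path("documents/report.pdf"))
for text, page_number in pages:
    print(page_number, len(text))
```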
- class src.ingest.TextCleaner(config)[source]
Bases: object
Handles text cleaning and normalization.
- class src.ingest.TextChunker(chunk_size=1000, chunk_overlap=200)[source]
Bases: object
Handles text chunking with sliding window.
- class src.ingest.DocumentIngester(config)[source]
Bases: object
Main class for document ingestion pipeline.
- ingest_documents(documents_path)[source]
Ingest all PDF documents from the given path.
- Parameters:
  documents_path (Path) – Path to directory containing PDF files
- Return type: list[DocumentChunk]
- Returns: List of all document chunks
- Raises:
  ValueError – If documents_path doesn't exist or contains no PDFs
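Example (an end-to-end ingestion sketch; the shape of `config` is not documented in this section, so it is left as a placeholder):

```python
from pathlib import Path

from src.ingest import DocumentIngester

config = ...  # project ingestion config; its structure is not documented here
ingester = DocumentIngester(config)

chunks = ingester.ingest_documents(Path("documents/"))  # list[DocumentChunk]
print(f"Ingested {len(chunks)} chunks")
```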
src.llm module
LLM Interface
Handles local LLM loading, prompt formatting, and answer generation. Supports multiple backends: transformers, llama-cpp, and OpenAI (optional).
- class src.llm.LLMConfig(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)[source]
Bases: object
Configuration for LLM settings.
- __init__(backend, model_path, temperature, max_tokens, top_p, repeat_penalty, context_window)
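Example (a construction sketch; the backend names follow the module description above, while the remaining values are illustrative assumptions):

```python
from src.llm import LLMConfig

config = LLMConfig(
    backend="llama-cpp",  # or "transformers" / "openai", per the module description
    model_path="models/model.gguf",  # assumed local model path
    temperature=0.2,
    max_tokens=512,
    top_p=0.9,
    repeat_penalty=1.1,
    context_window=4096,
)
```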
- class src.llm.LLMResponse(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)[source]
Bases: object
Response from LLM with metadata.
- __init__(answer, prompt_tokens, response_tokens, generation_time_ms, model_used)
- class src.llm.BaseLLM(config)[source]
Bases: object
Base class for LLM implementations.
- __init__(config)[source]
Initialize LLM with configuration.
- Parameters:
  config (LLMConfig) – LLM configuration
- class src.llm.TransformersLLM(config)[source]
Bases: BaseLLM
LLM implementation using transformers library.
- class src.llm.LlamaCppLLM(config)[source]
Bases: BaseLLM
LLM implementation using llama-cpp-python.
- class src.llm.OpenAILLM(config)[source]
Bases: BaseLLM
LLM implementation using OpenAI API (optional).
- class src.llm.LLMInterface(config)[source]
Bases: object
Main interface for LLM operations.
- generate_answer(query, query_result)[source]
Generate answer from query and retrieved chunks.
- Parameters:
  query (str) – User query
  query_result (QueryResult) – QueryResult with retrieved chunks
- Return type: LLMResponse
- Returns: LLMResponse with generated answer
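Example (a sketch combining query retrieval with answer generation; `llm_config` and `query_config` stand in for the project's configuration objects and are assumptions here):

```python
from src.llm import LLMInterface
from src.query import QueryProcessor

llm_config = ...    # LLMConfig, as sketched above
query_config = ...  # query/embedding configuration (assumed)

processor = QueryProcessor(query_config)
result = processor.process_query("What does the report conclude?")

interface = LLMInterface(llm_config)
response = interface.generate_answer("What does the report conclude?", result)
print(response.answer)
```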
- src.llm.create_llm_interface(config)[source]
Create LLM interface from configuration.
- Parameters:
  config – LLM configuration
- Return type: LLMInterface
- Returns: LLMInterface instance
- src.llm.generate_answer_from_query(query, query_result, config)[source]
Generate answer from query and query result.
- src.llm.format_llm_response(response, verbose=False)[source]
Format LLM response for output.
- Parameters:
  response (LLMResponse) – LLMResponse to format
  verbose (bool) – Whether to include metadata
- Return type: str
- Returns: Formatted output string
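Example (the module-level helpers compress the flow above into a few calls; `config` and `query_result` are placeholders for an LLMConfig and a QueryResult):

```python
from src.llm import create_llm_interface, format_llm_response, generate_answer_from_query

config = ...  # an LLMConfig, as sketched earlier (assumed)
interface = create_llm_interface(config)

# One-shot helper, bypassing the explicit interface object:
query = "What does the report conclude?"
query_result = ...  # QueryResult from src.query (assumed)
response = generate_answer_from_query(query, query_result, config)
print(format_llm_response(response, verbose=True))
```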
src.query module
Query Engine
Handles query processing, similarity search, and chunk retrieval. Loads FAISS index and embedding model for efficient query processing.
- class src.query.QueryResult(query, chunks, similarities, total_chunks_searched, search_time_ms)[source]
Bases: object
Result of a query with relevant chunks and metadata.
- chunks: list[DocumentChunk]
- __init__(query, chunks, similarities, total_chunks_searched, search_time_ms)
- class src.query.QueryEngine(config, index_path=None)[source]
Bases: object
Main class for query processing and similarity search.
- search(query, top_k=None, similarity_threshold=None)[source]
Search for chunks similar to the query.
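Example (a search sketch, assuming search returns a QueryResult, consistent with the QueryResult fields above; the config and index path are placeholders):

```python
from src.query import QueryEngine

config = ...  # query/embedding configuration (assumed)
engine = QueryEngine(config, index_path="index/")  # placeholder path

result = engine.search("vector similarity search", top_k=5, similarity_threshold=0.3)
for chunk, score in zip(result.chunks, result.similarities):
    print(f"{score:.3f}  {chunk.metadata.file_name} p.{chunk.metadata.page_number}")
```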
- class src.query.QueryProcessor(config, index_path=None)[source]
Bases: object
High-level query processor with additional functionality.
- process_query(query, top_k=None, similarity_threshold=None)[source]
Process a user query and return relevant chunks.
- format_results(result, include_metadata=True)[source]
Format query results as a readable string.
- Parameters:
  result (QueryResult) – QueryResult to format
  include_metadata (bool) – Whether to include chunk metadata
- Return type: str
- Returns: Formatted string representation of results
- get_relevant_context(result, max_chars=2000)[source]
Get relevant context from search results for LLM input.
- Parameters:
  result (QueryResult) – QueryResult from search
  max_chars (int) – Maximum characters to include
- Return type: str
- Returns: Formatted context string for LLM
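Example (a processor-level sketch covering the three documented calls; `config` is a placeholder):

```python
from src.query import QueryProcessor

config = ...  # query/embedding configuration (assumed)
processor = QueryProcessor(config)  # index_path defaults to None

result = processor.process_query("What is retrieval-augmented generation?")
print(processor.format_results(result, include_metadata=True))

context = processor.get_relevant_context(result, max_chars=2000)  # LLM prompt context
```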
- src.query.format_query_output(result, verbose=False)[source]
Format query results for output.
- Parameters:
  result (QueryResult) – QueryResult to format
  verbose (bool) – Whether to include detailed output
- Return type: str
- Returns: Formatted output string
src.utils module
Utility functions and logging configuration for the document-based question answering system.
- src.utils.setup_logging(log_level='INFO', log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')[source]
Set up centralized logging configuration.
- src.utils.log_performance(func)[source]
Decorator to log function performance metrics.
- Parameters:
  func – Function to decorate
- Returns: Decorated function
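Example (a decorator sketch; the decorated function and its workload are illustrative):

```python
from src.utils import log_performance, setup_logging

setup_logging(log_level="DEBUG")

@log_performance
def lowercase_all(texts):
    # Illustrative workload; the decorator logs its performance metrics.
    return [t.lower() for t in texts]

lowercase_all(["Alpha", "Beta"])
```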
- src.utils.batch_process(items, batch_size, process_func, logger, description='Processing')[source]
Process items in batches with progress logging.
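Example (a sketch of batched processing; whether process_func receives one batch or one item is not documented, so a per-batch signature is assumed here):

```python
import logging

from src.utils import batch_process

logger = logging.getLogger(__name__)

results = batch_process(
    items=list(range(1000)),
    batch_size=100,
    process_func=lambda batch: [x * 2 for x in batch],  # per-batch signature assumed
    logger=logger,
    description="Doubling",
)
```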
Module contents
Document-based Question Answering System
A local, modular RAG (retrieval-augmented generation) system.