
Efficient knowledge work starts with understanding raw documents. Good to Know’s AI assistant does this on its own, turning long reports, contracts or scientific papers into material you can search and question immediately. Below is a concise walk-through of the three main steps it performs and why they matter.
1. Chunking — small, well-defined sections
Large files are divided into logical passages ("chunks"). Each chunk keeps enough neighbouring text so the original context is preserved. Working with smaller units has two benefits:
- Performance – passages fit comfortably inside the context window of modern language models, keeping latency and cost predictable.
- Precision – answers can be traced back to specific portions of the source, not the entire document.
The splitting rules differ by format—headings for Word, page breaks for PDF, and structural tags for HTML—but the goal is the same: produce clean sections that are easy to reference later.
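To make the idea concrete, here is a minimal sketch of overlap-preserving chunking. The function, sizes, and overlap window are illustrative assumptions, not Good to Know's actual splitting rules:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into passages of roughly max_chars, each carrying a
    small tail of the previous chunk so neighbouring context survives."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry a tail of context forward
        current = (current + "\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks
```

A production splitter would branch on format (headings, page breaks, tags) before falling back to paragraph boundaries like this.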
2. Metadata extraction — facts at a glance
While chunking runs, another service scans for descriptive fields: author, creation date, customer names, invoice numbers and any organisation-specific labels you configure. These values become metadata that can be used to filter queries (for example, “only technical reports from 2024”) or enrich answers with quick facts.
Extraction combines pattern matching with entity-recognition models. When the system encounters an unfamiliar field, you can label a few examples; the model will learn to pick it up in future uploads.
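The pattern-matching half of that pipeline can be sketched in a few lines. The field names and regular expressions below are illustrative examples, not the product's configured labels; an entity-recognition model would contribute names and organisations through the same interface:

```python
import re

# Hypothetical example fields; real deployments configure their own.
PATTERNS = {
    "invoice_number": re.compile(r"\bINV-\d{4,}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_metadata(text: str) -> dict[str, list[str]]:
    """Return every pattern match per field, ready to attach as metadata."""
    return {field: pat.findall(text) for field, pat in PATTERNS.items()}
```

For example, `extract_metadata("Invoice INV-20431 issued 2024-03-15")` yields `{"invoice_number": ["INV-20431"], "date": ["2024-03-15"]}`.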
3. Semantic search — matching by meaning
Every chunk is converted into a high-dimensional vector—an array of numbers that captures the meaning of the text. A dedicated vector database stores these embeddings. When you ask a question, the query is embedded the same way and compared against the collection. The closest vectors (highest similarity) point to the passages most likely to contain the answer.
This approach recognises concepts rather than exact wording—"vehicle" will match "car", "truck" or "EV"—making search far more resilient than keyword methods.
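A toy nearest-neighbour search shows the mechanics. Real systems use a vector database and model-generated embeddings with hundreds of dimensions; the hand-made 3-d vectors here only stand in for them:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Rank stored chunks by similarity to the query vector."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

store = {
    "chunk-car": [0.9, 0.1, 0.0],    # passage about a car
    "chunk-truck": [0.8, 0.2, 0.1],  # passage about a truck
    "chunk-recipe": [0.0, 0.1, 0.9], # unrelated passage
}
print(top_k([1.0, 0.0, 0.0], store))  # a "vehicle"-like query ranks car and truck first
```

Because similar meanings land near each other in the vector space, the car and truck passages outrank the recipe even though no keywords match.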
This technique is called RAG (Retrieval-Augmented Generation). To provide natural-language answers with citations, it follows these steps:
- Chunk the document.
- Extract metadata.
- Embed chunks and search semantically.
- Respond via the chat interface, always citing the relevant chunks.
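The steps above can be wired together in a few lines. This is a deliberately simplified sketch: `embed` is a toy bag-of-words stand-in for a real embedding model, and the "answer" is the retrieved passage plus a citation rather than generated text:

```python
def embed(text: str) -> dict[str, int]:
    """Toy embedding: word counts instead of a learned vector."""
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def similarity(a: dict[str, int], b: dict[str, int]) -> int:
    return sum(a[w] * b[w] for w in a if w in b)

def answer(question: str, chunks: list[str]) -> str:
    vectors = [embed(c) for c in chunks]  # embed every chunk
    q = embed(question)                   # embed the query the same way
    best = max(range(len(chunks)), key=lambda i: similarity(q, vectors[i]))
    return f"{chunks[best]} [source: chunk {best}]"  # always cite the chunk

chunks = ["The invoice total is 420 EUR.", "The contract ends in 2025."]
print(answer("When does the contract end?", chunks))
```

In the real pipeline, the retrieved chunks are handed to a language model that writes the answer, with the citations pointing back to the passages it used.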
All three processes run in parallel, so most documents are ready to query within seconds of upload.
Good to Know’s pipeline removes manual tagging and complex setup, giving researchers, analysts and managers immediate access to the information stored in their files. If you spend more time looking for data than using it, let the assistant do the reading for you—join the waitlist here.