Files

2026-05-09 10:31:28 +00:00

11 KiB

Raw Blame History

🧭 AGENT ARCHITECTURE MAP (LIVING DOCUMENT)

Đây là tài liệu dẫn đường dành riêng cho các AI Agent tương lai và lập trình viên bảo trì. Không quét toàn bộ code, hãy đọc file này trước.

Lần cập nhật cuối: Phase 8 Complete (DCE, Text Extraction, ACL, SSO, Logging) Trạng thái Dự án: Phase 8 hoàn thành. Sẵn sàng cho Phase 9 (Production Ready).

1. Bản Đồ Kiến Trúc Lõi (Core Architecture Patterns)

Pipeline hiện tại (ĐÃ HOẠT ĐỘNG)

SharePoint → Ingestion → DCE → [OCR/Extract/Skip] → Chunking → OpenSearch → Search → RAG Chat → FastAPI → Frontend

A. Tầng Ingestion (Thu thập dữ liệu) - Mẫu Modular Provider Pattern

Mục tiêu: Tách biệt lõi hệ thống khỏi nền tảng lưu trữ (SharePoint, Google Drive, v.v.).
Interface: ingestion/providers/base_provider.py (fetch_changes, download_file, get_item_details, get_item_permissions).
Implement hiện tại: ingestion/providers/sharepoint_provider.py. Bọc lại GraphClient, tự động xử lý pagination Delta Query.
Sync Engine: ingestion/sync.py → SyncEngine nhận BaseStorageProvider qua constructor, provider-agnostic.
Nếu cần thêm nguồn dữ liệu mới: Chỉ cần tạo class mới kế thừa BaseStorageProvider.

B. Tầng Extraction (Xử lý chữ & Ảnh) - Mẫu Distributed VLM Pattern

Lịch sử: Đã từng dùng PaddleOCR + VietOCR nhưng gặp lỗi "Rụng dấu" và "Ảo giác". Đã loại bỏ hoàn toàn.
Kiến trúc hiện tại: Hệ thống đóng vai trò VLM Client.
Cách hoạt động: extraction/ocr_service.py render PDF thành ảnh (Matrix=1.2), nén Base64, POST sang server LAN (10.202.50.3:8080) chạy llama.cpp + Vintern-3B.
Lợi ích: Giải phóng RAM cho máy chủ RAG, lấy được Markdown nguyên bản.

C. Tầng Chunking & Vector DB (Semantic Indexing)

Chunking: chunking/markdown_chunker.py chia nhỏ bằng Markdown Rules (Header #, overlap), theo dõi page_from, page_to.
Embedding: sentence-transformers với model keepitreal/vietnamese-sbert (local, 768 chiều).
Database: indexing/vector_store.py → OpenSearch k-NN HNSW. Index: poc_sharepoint_docs.
Dedup: VectorStore.delete_by_file_id() xóa chunks cũ trước khi nạp lại.

D. Tầng Search & RAG Chat

Retriever: search/retriever.py → Semantic Search (k-NN vector) trên OpenSearch.
RAG Engine: chat/rag_engine.py → Search → Augment Context → LLM Generate.
LLM Factory: chat/llm_factory.py → Hỗ trợ Gemini, Groq, Local (config trong .env).

E. Tầng API & Frontend

Backend: api/main.py → FastAPI tại port 8000. Endpoint: /health, /auth/login (SSO), /auth/callback, /auth/login-email, /chat, /sync, /sync/status.
Frontend: frontend/ → Glassmorphism UI với SSO login + email fallback + sync button. Gọi http://localhost:8000.

F. Tầng Cấu hình (Decoupled Configuration)

Toàn bộ thông số trong .env. Load qua core/config.py.
Tuyệt đối KHÔNG hardcode URL, Token hay Password trong code.

2. Bản Đồ File & Thư Mục Hoàn Chỉnh

📁 poc_system/
├── 📁 core/
│   ├── config.py              # ⚙️ Trái tim cấu hình (Load từ .env)
│   ├── models.py              # 🧩 Data Classes (OCRPageResult, DocumentChunk, IngestedDocument)
│   └── logging.py             # 📝 Structured logging (JSON/human formatter)
├── 📁 ingestion/
│   ├── sync.py                # 🔄 SyncEngine (Provider-agnostic)
│   ├── graph_client.py        # 🌐 Microsoft Graph API Client
│   └── 📁 providers/
│       ├── base_provider.py   # 🔌 Interface: fetch_changes, download_file, get_item_details
│       └── sharepoint_provider.py
├── 📁 extraction/
│   ├── dce.py               # 🏷️ Document Classification Engine (phân loại trước khi xử lý)
│   ├── pdf_inspector.py     # 🔎 PDF Inspection (TEXT_PDF / SCAN_PDF / DRAWING_PDF)
│   ├── magic_numbers.py     # 🔢 Magic Number validation (chống giả extension)
│   ├── text_extractor.py    # 📄 Text extraction: DOCX (python-docx), XLSX (openpyxl), TXT
│   └── ocr_service.py       # 👁️ VLM Client (PDF → Markdown qua LAN)
├── 📁 chunking/
│   └── markdown_chunker.py    # ✂️ Semantic Chunking theo Markdown rules
├── 📁 indexing/
│   └── vector_store.py        # 📦 OpenSearch k-NN Index + Embedding
├── 📁 search/
│   └── retriever.py           # 🔍 Semantic Search (k-NN vector)
├── 📁 chat/
│   ├── rag_engine.py          # 🤖 RAG: Search → Context → LLM
│   ├── llm_factory.py         # 🏭 Factory: Gemini / Groq / Local
│   └── 📁 llm_providers/
│       ├── base_llm.py
│       ├── gemini_llm.py
│       ├── groq_llm.py
│       └── local_llm.py
├── 📁 api/
│   └── main.py                # 🚀 FastAPI Backend (port 8000)
├── 📁 frontend/
│   ├── index.html             # 🎨 Glassmorphism UI (Login + Chat + Sync)
│   ├── app.js                 # 💬 Chat, Auth, Sync logic
│   └── style.css              # 🖌️ CSS
├── 📁 doc/                    # 📚 Tài liệu dự án
│   ├── 00.AGENT_ARCHITECTURE_MAP.md  # Bản đồ kiến trúc
│   ├── AGENT_HANDOVER_PROTOCOL.md    # Protocol cho AI Agent
│   ├── DEPLOYMENT_GUIDE.md           # Hướng dẫn triển khai & cấu hình
│   └── ...                           # Các tài liệu khác
├── .env                       # 🔑 Chìa khoá (KHÔNG commit)
├── docker-compose.yml         # 🐳 OpenSearch
├── Dockerfile
├── requirements.txt
├── test_rag_pipeline.py       # 🧪 Test toàn bộ pipeline
├── test_graph_smoke.py        # 🧪 Test kết nối Graph API
├── test_modular_architecture.py
├── test_chat.py
├── test_ocr.py
└── test_dce_pipeline.py

3. Lịch Sử Các Lỗi Khét Tiếng & Cách Xử Lý (Known Gotchas)

Lỗi 401 Unauthorized khi tải file từ SharePoint:
- Nguyên nhân: Microsoft chặn download trực tiếp bằng @microsoft.graph.downloadUrl nếu dùng App-Only Token.
- Giải pháp: Dùng endpoint .../items/{item_id}/content kèm Bearer Token.
Lỗi 500 Internal Server Error từ Llama.cpp VLM:
- Nguyên nhân: Ảnh có độ phân giải quá cao (Matrix 2.0) làm tràn Context Window.
- Giải pháp: Hạ Matrix xuống 1.2, hoặc khởi chạy Server với -c 8192.
Lỗi Rụng dấu / Ảo giác của VietOCR:
- Nguyên nhân: PaddleOCR bắt khung quá khít, mô hình vgg_seq2seq nội suy sai.
- Giải pháp triệt để: Đã loại bỏ hoàn toàn VietOCR, chuyển sang VLM (Vintern-3B).
Lỗi UTF-8 Surrogate (\udcc3) trong Terminal WSL:
- Giải pháp: Dùng sys.stdin.buffer.readline() cho CLI. Web API (FastAPI) không bị ảnh hưởng.
Lỗi Link SharePoint không ổn định (Bug #101):
- Nguyên nhân: Delta Query không trả về webUrl và @microsoft.graph.downloadUrl.
- Giải pháp: Thêm get_item_details() vào graph_client.py, base_provider.py, sharepoint_provider.py.
Lỗi Chunks trùng lặp khi chạy lại pipeline:
- Hiện tượng: Mỗi lần chạy test_rag_pipeline.py, chunks mới được thêm chồng lên chunks cũ (cùng file).
- Nguyên nhân: chunk_id dùng UUID ngẫu nhiên, không có bước xóa cũ.
- Giải pháp: VectorStore.delete_by_file_id(file_id) gọi trước embed_and_index().
Lỗi DCE download PDF 401 Unauthorized:
- Hiện tượng: DCE không phân loại được PDF vì download file bị 401.
- Nguyên nhân: DCE dùng httpx trực tiếp với @microsoft.graph.downloadUrl (không có Bearer Token).
- Giải pháp: Truyền provider (BaseStorageProvider) vào DCE constructor, dùng provider.download_file() thay vì httpx.
Lỗi DCE download 404 (items/None/content):
- Hiện tượng: DCE download PDF bị 404 vì URL có items/None/content.
- Nguyên nhân: ingestion_output.json dùng key item_id nhưng download_file() cần id.
- Giải pháp: DCE tự chuẩn hóa item_id → id khi thiếu.
Lỗi OpenSearch hostname không resolve khi chạy ngoài Docker:
- Hiện tượng: ConnectionError: Failed to resolve 'opensearch'.
- Nguyên nhân: Config .env có opensearch_host=opensearch (Docker hostname).
- Giải pháp: VectorStore và SearchRetriever tự detect: nếu host là "opensearch" và ENV != "docker" → đổi sang "localhost".
Lỗi k-NN query format sai cho OpenSearch 2.x:
- Hiện tượng: Unknown key for a START_OBJECT in [knn].
- Nguyên nhân: Đặt knn ở top level thay vì trong query.
- Giải pháp: Đặt knn bên trong query object.

4. Nhiệm Vụ Tiếp Theo (Phase 9 - Production Ready)

Đã hoàn thành ✅

Ingestion: SharePoint Provider + Delta Query + Pagination
DCE: Document Classification Engine (phân loại file theo extension + PDF inspection)
PDF Inspection: Detect text layer, classify TEXT_PDF / SCAN_PDF / DRAWING_PDF
Conditional OCR: Chỉ OCR SCAN_PDF, TEXT_PDF extract trực tiếp, skip DRAWING/UNSUPPORTED
Extraction: VLM OCR (Vintern-3B qua LAN)
Chunking: Semantic Markdown Chunker
Indexing: OpenSearch k-NN HNSW + vietnamese-sbert
Search: Semantic Retriever
RAG Chat: LLM Factory (Gemini/Groq/Local)
API: FastAPI Backend (/chat, /health, /sync, /sync/status)
Frontend: Glassmorphism UI
Bug fixes: SharePoint links, Chunk dedup
Refactor: SyncEngine provider-agnostic
Logging: Structured logging utility (core/logging.py)
Permission: ACL extraction từ SharePoint + filter search theo user
Auth UI: Simple email login + SSO Azure AD + user context cho API calls
DOCX Text Extraction: python-docx (paragraphs + tables)
XLSX Text Extraction: openpyxl (sheets + cells)

Chưa triển khai (Phase 9 - Production Ready)

Ưu tiên trung bình

Cấu hình Azure AD cho SSO: Thêm Redirect URI http://localhost:8000/auth/callback và bật "ID tokens" trong App Registration.

Ưu tiên thấp

Monitoring Dashboard: Health metrics, ingestion status, OCR success rate.
Multi-tenant: Hỗ trợ nhiều SharePoint site/tenant.

5. Tiêu chuẩn Lập trình & Môi trường (Coding Standards)

A. Quản lý Mã hóa (Encoding)

Quy tắc vàng: Luôn sử dụng encoding='utf-8' trong mọi lệnh open().
Môi trường: PYTHONIOENCODING=utf-8 trong Docker/WSL.

B. Mẫu Provider (Provider Pattern)

Mọi kết nối tới dịch vụ bên thứ ba (Storage, LLM) phải thông qua Interface/BaseClass.
BaseStorageProvider cho Storage, BaseLLMProvider cho LLM.

C. Quy tắc an toàn

Không commit .env, không hardcode secrets.
Không thay đổi kiến trúc đã chốt trong doc/14.Project-Bridge-Context-for-New-Chat.md mà không có lý do kỹ thuật rõ ràng.

11 KiB Raw Blame History