# 🧭 AGENT ARCHITECTURE MAP (LIVING DOCUMENT)
*Đây là tài liệu dẫn đường dành riêng cho các AI Agent tương lai và lập trình viên bảo trì. Không quét toàn bộ code, hãy đọc file này trước.*

**Lần cập nhật cuối:** Phase 8 Complete (DCE, Text Extraction, ACL, SSO, Logging)
**Trạng thái Dự án:** Phase 8 hoàn thành. Sẵn sàng cho Phase 9 (Production Ready).

---

## 1. Bản Đồ Kiến Trúc Lõi (Core Architecture Patterns)

### Pipeline hiện tại (ĐÃ HOẠT ĐỘNG)
```
SharePoint → Ingestion → DCE → [OCR/Extract/Skip] → Chunking → OpenSearch → Search → RAG Chat → FastAPI → Frontend
```

### A. Tầng Ingestion (Thu thập dữ liệu) - Mẫu Modular Provider Pattern
- **Mục tiêu:** Tách biệt lõi hệ thống khỏi nền tảng lưu trữ (SharePoint, Google Drive, v.v.).
- **Interface:** `ingestion/providers/base_provider.py` (`fetch_changes`, `download_file`, `get_item_details`, `get_item_permissions`).
- **Implement hiện tại:** `ingestion/providers/sharepoint_provider.py`. Bọc lại `GraphClient`, tự động xử lý pagination Delta Query.
- **Sync Engine:** `ingestion/sync.py` → `SyncEngine` nhận `BaseStorageProvider` qua constructor, provider-agnostic.
- **Nếu cần thêm nguồn dữ liệu mới:** Chỉ cần tạo class mới kế thừa `BaseStorageProvider`.

### B. Tầng Extraction (Xử lý chữ & Ảnh) - Mẫu Distributed VLM Pattern
- **Lịch sử:** Đã từng dùng PaddleOCR + VietOCR nhưng gặp lỗi "Rụng dấu" và "Ảo giác". Đã loại bỏ hoàn toàn.
- **Kiến trúc hiện tại:** Hệ thống đóng vai trò **VLM Client**.
- **Cách hoạt động:** `extraction/ocr_service.py` render PDF thành ảnh (Matrix=1.2), nén Base64, POST sang server LAN (`10.202.50.3:8080`) chạy `llama.cpp` + `Vintern-3B`.
- **Lợi ích:** Giải phóng RAM cho máy chủ RAG, lấy được Markdown nguyên bản.

### C. Tầng Chunking & Vector DB (Semantic Indexing)
- **Chunking:** `chunking/markdown_chunker.py` chia nhỏ bằng Markdown Rules (Header `#`, overlap), theo dõi `page_from`, `page_to`.
- **Embedding:** `sentence-transformers` với model `keepitreal/vietnamese-sbert` (local, 768 chiều).
- **Database:** `indexing/vector_store.py` → OpenSearch `k-NN HNSW`. Index: `poc_sharepoint_docs`.
- **Dedup:** `VectorStore.delete_by_file_id()` xóa chunks cũ trước khi nạp lại.

### D. Tầng Search & RAG Chat
- **Retriever:** `search/retriever.py` → Semantic Search (k-NN vector) trên OpenSearch.
- **RAG Engine:** `chat/rag_engine.py` → Search → Augment Context → LLM Generate.
- **LLM Factory:** `chat/llm_factory.py` → Hỗ trợ Gemini, Groq, Local (config trong `.env`).

### E. Tầng API & Frontend
- **Backend:** `api/main.py` → FastAPI tại port 8000. Endpoint: `/health`, `/auth/login` (SSO), `/auth/callback`, `/auth/login-email`, `/chat`, `/sync`, `/sync/status`, `/sync/history`, `/sync/history/{run_id}`.
- **Frontend:** `frontend/` → Glassmorphism UI với SSO login + email fallback + sync button + sync history panel. Gọi `http://localhost:8000`.
- **Audit:** `audit/sync_audit.py` → Lưu lịch sử sync vào `audit/sync_log.json`.

### F. Tầng Cấu hình (Decoupled Configuration)
- Toàn bộ thông số trong `.env`. Load qua `core/config.py`.
- **Tuyệt đối KHÔNG hardcode URL, Token hay Password trong code.**

---

## 2. Bản Đồ File & Thư Mục Hoàn Chỉnh

```text
📁 poc_system/
├── 📁 core/
│   ├── config.py              # ⚙️ Trái tim cấu hình (Load từ .env)
│   ├── models.py              # 🧩 Data Classes (OCRPageResult, DocumentChunk, IngestedDocument)
│   └── logging.py             # 📝 Structured logging (JSON/human formatter)
├── 📁 ingestion/
│   ├── sync.py                # 🔄 SyncEngine (Provider-agnostic)
│   ├── graph_client.py        # 🌐 Microsoft Graph API Client
│   └── 📁 providers/
│       ├── base_provider.py   # 🔌 Interface: fetch_changes, download_file, get_item_details
│       └── sharepoint_provider.py
├── 📁 extraction/
│   ├── dce.py               # 🏷️ Document Classification Engine (phân loại trước khi xử lý)
│   ├── pdf_inspector.py     # 🔎 PDF Inspection (TEXT_PDF / SCAN_PDF / DRAWING_PDF)
│   ├── magic_numbers.py     # 🔢 Magic Number validation (chống giả extension)
│   ├── text_extractor.py    # 📄 Text extraction: DOCX (python-docx), XLSX (openpyxl), TXT
│   └── ocr_service.py       # 👁️ VLM Client (PDF → Markdown qua LAN)
├── 📁 chunking/
│   └── markdown_chunker.py    # ✂️ Semantic Chunking theo Markdown rules
├── 📁 indexing/
│   └── vector_store.py        # 📦 OpenSearch k-NN Index + Embedding
├── 📁 search/
│   └── retriever.py           # 🔍 Semantic Search (k-NN vector)
├── 📁 chat/
│   ├── rag_engine.py          # 🤖 RAG: Search → Context → LLM
│   ├── llm_factory.py         # 🏭 Factory: Gemini / Groq / Local
│   └── 📁 llm_providers/
│       ├── base_llm.py
│       ├── gemini_llm.py
│       ├── groq_llm.py
│       └── local_llm.py
├── 📁 api/
│   └── main.py                # 🚀 FastAPI Backend (port 8000)
├── 📁 audit/
│   ├── sync_audit.py          # 📋 Sync audit logging (ghi lịch sử sync)
│   └── sync_log.json          # 📄 Audit log data
├── 📁 frontend/
│   ├── index.html             # 🎨 Glassmorphism UI (Login + Chat + Sync)
│   ├── app.js                 # 💬 Chat, Auth, Sync logic
│   └── style.css              # 🖌️ CSS
├── 📁 doc/                    # 📚 Tài liệu dự án
│   ├── 00.AGENT_ARCHITECTURE_MAP.md  # Bản đồ kiến trúc
│   ├── AGENT_HANDOVER_PROTOCOL.md    # Protocol cho AI Agent
│   ├── DEPLOYMENT_GUIDE.md           # Hướng dẫn triển khai & cấu hình
│   └── ...                           # Các tài liệu khác
├── .env                       # 🔑 Chìa khoá (KHÔNG commit)
├── docker-compose.yml         # 🐳 OpenSearch
├── Dockerfile
├── requirements.txt
├── test_rag_pipeline.py       # 🧪 Test toàn bộ pipeline
├── test_graph_smoke.py        # 🧪 Test kết nối Graph API
├── test_modular_architecture.py
├── test_chat.py
├── test_ocr.py
└── test_dce_pipeline.py
```

---

## 3. Lịch Sử Các Lỗi Khét Tiếng & Cách Xử Lý (Known Gotchas)

1. **Lỗi 401 Unauthorized khi tải file từ SharePoint:**
   - *Nguyên nhân:* Microsoft chặn download trực tiếp bằng `@microsoft.graph.downloadUrl` nếu dùng App-Only Token.
   - *Giải pháp:* Dùng endpoint `.../items/{item_id}/content` kèm Bearer Token.

2. **Lỗi 500 Internal Server Error từ Llama.cpp VLM:**
   - *Nguyên nhân:* Ảnh có độ phân giải quá cao (Matrix 2.0) làm tràn Context Window.
   - *Giải pháp:* Hạ `Matrix` xuống `1.2`, hoặc khởi chạy Server với `-c 8192`.

3. **Lỗi Rụng dấu / Ảo giác của VietOCR:**
   - *Nguyên nhân:* PaddleOCR bắt khung quá khít, mô hình `vgg_seq2seq` nội suy sai.
   - *Giải pháp triệt để:* Đã loại bỏ hoàn toàn VietOCR, chuyển sang VLM (Vintern-3B).

4. **Lỗi UTF-8 Surrogate (\udcc3) trong Terminal WSL:**
   - *Giải pháp:* Dùng `sys.stdin.buffer.readline()` cho CLI. Web API (FastAPI) không bị ảnh hưởng.

5. **Lỗi Link SharePoint không ổn định (Bug #101):**
   - *Nguyên nhân:* Delta Query không trả về `webUrl` và `@microsoft.graph.downloadUrl`.
   - *Giải pháp:* Thêm `get_item_details()` vào `graph_client.py`, `base_provider.py`, `sharepoint_provider.py`.

6. **Lỗi Chunks trùng lặp khi chạy lại pipeline:**
   - *Hiện tượng:* Mỗi lần chạy `test_rag_pipeline.py`, chunks mới được thêm chồng lên chunks cũ (cùng file).
   - *Nguyên nhân:* `chunk_id` dùng UUID ngẫu nhiên, không có bước xóa cũ.
   - *Giải pháp:* `VectorStore.delete_by_file_id(file_id)` gọi trước `embed_and_index()`.

7. **Lỗi DCE download PDF 401 Unauthorized:**
   - *Hiện tượng:* DCE không phân loại được PDF vì download file bị 401.
   - *Nguyên nhân:* DCE dùng httpx trực tiếp với `@microsoft.graph.downloadUrl` (không có Bearer Token).
   - *Giải pháp:* Truyền `provider` (BaseStorageProvider) vào DCE constructor, dùng `provider.download_file()` thay vì httpx.

8. **Lỗi DCE download 404 (items/None/content):**
   - *Hiện tượng:* DCE download PDF bị 404 vì URL có `items/None/content`.
   - *Nguyên nhân:* `ingestion_output.json` dùng key `item_id` nhưng `download_file()` cần `id`.
   - *Giải pháp:* DCE tự chuẩn hóa `item_id` → `id` khi thiếu.

9. **Lỗi OpenSearch hostname không resolve khi chạy ngoài Docker:**
   - *Hiện tượng:* `ConnectionError: Failed to resolve 'opensearch'`.
   - *Nguyên nhân:* Config `.env` có `opensearch_host=opensearch` (Docker hostname).
   - *Giải pháp:* `VectorStore` và `SearchRetriever` tự detect: nếu host là "opensearch" và ENV != "docker" → đổi sang "localhost".

10. **Lỗi k-NN query format sai cho OpenSearch 2.x:**
    - *Hiện tượng:* `Unknown key for a START_OBJECT in [knn]`.
    - *Nguyên nhân:* Đặt `knn` ở top level thay vì trong `query`.
    - *Giải pháp:* Đặt `knn` bên trong `query` object.

---

## 4. Nhiệm Vụ Tiếp Theo (Phase 9 - Production Ready)

### Đã hoàn thành ✅
- [x] Ingestion: SharePoint Provider + Delta Query + Pagination
- [x] DCE: Document Classification Engine (phân loại file theo extension + PDF inspection)
- [x] PDF Inspection: Detect text layer, classify TEXT_PDF / SCAN_PDF / DRAWING_PDF
- [x] Conditional OCR: Chỉ OCR SCAN_PDF, TEXT_PDF extract trực tiếp, skip DRAWING/UNSUPPORTED
- [x] Extraction: VLM OCR (Vintern-3B qua LAN)
- [x] Chunking: Semantic Markdown Chunker
- [x] Indexing: OpenSearch k-NN HNSW + vietnamese-sbert
- [x] Search: Semantic Retriever
- [x] RAG Chat: LLM Factory (Gemini/Groq/Local)
- [x] API: FastAPI Backend (/chat, /health, /sync, /sync/status)
- [x] Frontend: Glassmorphism UI
- [x] Bug fixes: SharePoint links, Chunk dedup
- [x] Refactor: SyncEngine provider-agnostic
- [x] Logging: Structured logging utility (`core/logging.py`)
- [x] Permission: ACL extraction từ SharePoint + filter search theo user
- [x] Auth UI: Simple email login + SSO Azure AD + user context cho API calls
- [x] DOCX Text Extraction: python-docx (paragraphs + tables)
- [x] XLSX Text Extraction: openpyxl (sheets + cells)
- [x] Sync Audit: Lịch sử sync persist vào file + API + Frontend panel

### Chưa triển khai (Phase 9 - Production Ready)

#### Ưu tiên thấp
- [ ] **Monitoring Dashboard:** Health metrics, ingestion status, OCR success rate.
- [ ] **Multi-tenant:** Hỗ trợ nhiều SharePoint site/tenant.

---

## 5. Tiêu chuẩn Lập trình & Môi trường (Coding Standards)

### A. Quản lý Mã hóa (Encoding)
- **Quy tắc vàng:** Luôn sử dụng `encoding='utf-8'` trong mọi lệnh `open()`.
- **Môi trường:** `PYTHONIOENCODING=utf-8` trong Docker/WSL.

### B. Mẫu Provider (Provider Pattern)
- Mọi kết nối tới dịch vụ bên thứ ba (Storage, LLM) phải thông qua Interface/BaseClass.
- `BaseStorageProvider` cho Storage, `BaseLLMProvider` cho LLM.

### C. Quy tắc an toàn
- Không commit `.env`, không hardcode secrets.
- Không thay đổi kiến trúc đã chốt trong `doc/14.Project-Bridge-Context-for-New-Chat.md` mà không có lý do kỹ thuật rõ ràng.