Handle SSO

2026-05-09 10:31:28 +00:00
parent 9d04e7484c
commit f937d1a98e
21 changed files with 2515 additions and 271 deletions
--- a/doc/AGENT_HANDOVER_PROTOCOL.md
+++ b/doc/AGENT_HANDOVER_PROTOCOL.md
@@ -0,0 +1,98 @@
+# 🤖 AGENT HANDOVER PROTOCOL (for AI Agents)
+
+> **IMPORTANT:** If you are a new AI Agent, read this file together with `doc/00.AGENT_ARCHITECTURE_MAP.md` before writing any code.
+
+## 1. Project "Memory" Summary (Memory Snapshot)
+This project is an **Enterprise RAG** (Retrieval-Augmented Generation) system with the following technical characteristics:
+- **Distributed VLM OCR:** A LAN server (`10.202.50.3:8080`) runs `Vintern-3B` to extract Markdown from PDFs.
+- **Modular Provider Pattern:** Storage (SharePoint) is decoupled from the LLM (Gemini, Groq, Local).
+- **Semantic Indexing:** `vietnamese-sbert` (local) produces 768-dimensional vectors, stored in OpenSearch k-NN HNSW.
+- **FastAPI Backend:** API on port 8000. Endpoints: `/health`, `/chat`.
+- **Glassmorphism UI:** Web interface in `frontend/`, which calls `http://localhost:8000/chat`.
+
+**Current pipeline (WORKING):**
+```
+SharePoint → Ingestion → VLM OCR → Chunking → OpenSearch → Search → RAG Chat → FastAPI → Frontend
+```
+
+## 2. Implementation Status
+
+### ✅ Completed
+| Module | File | Description |
+|--------|------|-------------|
+| Ingestion | `ingestion/providers/sharepoint_provider.py` | Delta Query + Pagination + get_item_details |
+| Sync Engine | `ingestion/sync.py` | Provider-agnostic; accepts a BaseStorageProvider |
+| DCE | `extraction/dce.py` | Document Classification Engine (classifies files) |
+| PDF Inspector | `extraction/pdf_inspector.py` | TEXT_PDF / SCAN_PDF / DRAWING_PDF / AMBIGUOUS_PDF |
+| Magic Numbers | `extraction/magic_numbers.py` | Header byte validation |
+| OCR | `extraction/ocr_service.py` | VLM Client (Vintern-3B qua LAN) |
+| Chunking | `chunking/markdown_chunker.py` | Semantic Markdown rules + page tracking |
+| Indexing | `indexing/vector_store.py` | OpenSearch k-NN + delete_by_file_id dedup |
+| Search | `search/retriever.py` | Semantic k-NN vector search |
+| RAG Chat | `chat/rag_engine.py` | Search → Context → LLM |
+| LLM Factory | `chat/llm_factory.py` | Gemini / Groq / Local |
+| API | `api/main.py` | FastAPI port 8000 |
+| Frontend | `frontend/` | Glassmorphism UI (HTML/CSS/JS) |
+| Bug fixes | Multiple files | SharePoint links (Bug #101), Chunk dedup |
+
+### ❌ Not Yet Implemented (Phase 8)
+- **DOCX Text Extraction:** Extract text from DOCX without OCR
+- **XLSX Text Extraction:** Extract headers + key columns from Excel
+- **Permission Enforcement:** ACL filtering by user/group
+- **Authentication UI:** OAuth2 login
+- **Ingestion API:** Trigger sync from the frontend
+- **Logging & Audit:** Structured logging
+
+## 3. Guidelines for the Next AI Agent
+1.  **Always check `.env`:** All configuration lives there. Never hardcode values.
+2.  **Use `core/config.py`:** It is the single gateway for accessing settings.
+3.  **UTF-8:** All I/O must specify `encoding='utf-8'`. Set `export PYTHONIOENCODING=utf-8`.
+4.  **Update the documentation:** When you complete a Phase or change the architecture, you MUST update this file and `00.AGENT_ARCHITECTURE_MAP.md`.
+5.  **Read `doc/14.Project-Bridge-Context-for-New-Chat.md`:** This is the "architecture contract" - do not change decisions that have already been finalized.
+
+## 4. How to Update the Documentation (Protocol for Updates)
+- **Step 1:** Update the status in `doc/00.AGENT_ARCHITECTURE_MAP.md` (mark the checkbox with ✅).
+- **Step 2:** If you find a new bug, record it, along with its fix, in the section **"Lịch sử các lỗi khét tiếng"** (History of Notorious Bugs).
+- **Step 3:** Update the **Implementation Status** section in this file.
+
+## 5. Quick Start Commands
+```bash
+# Start OpenSearch
+docker-compose up -d opensearch
+
+# Run the backend (FastAPI on port 8000)
+python3 api/main.py
+
+# Open the frontend
+# Open frontend/index.html in a browser (or use Live Server)
+
+# Ingest data: SharePoint → OCR → Chunk → Index
+python3 test_rag_pipeline.py
+```
+
+## 6. Quick Verification
+```bash
+# 1. Check Python syntax
+python3 -m py_compile ingestion/graph_client.py
+python3 -m py_compile ingestion/providers/sharepoint_provider.py
+python3 -m py_compile ingestion/sync.py
+python3 -m py_compile indexing/vector_store.py
+python3 -m py_compile api/main.py
+python3 -m py_compile test_rag_pipeline.py
+
+# 2. Test the Graph API connection
+python3 test_graph_smoke.py
+
+# 3. Test the full pipeline (requires OpenSearch + VLM server)
+python3 test_rag_pipeline.py
+
+# 4. Check metadata
+cat ingestion_output.json | python3 -m json.tool | grep -E '"web_url"|"download_url"'
+
+# 5. Test the API endpoints
+curl http://localhost:8000/health
+curl -X POST http://localhost:8000/chat -H "Content-Type: application/json" -d '{"query":"test"}'
+```
+
+---
+*Good luck, fellow Agent! The RAG pipeline is working. Next up: DCE, Permission, Hardening.*
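
Note on section 3 of the handover file above: rules 1 to 3 (all configuration in `.env`, a single access point in `core/config.py`, UTF-8 on every read) could look roughly like the sketch below. The contents of `core/config.py` are not shown in this diff, so everything here is an assumption for illustration: the helper names `load_env_file`, `Settings`, and `get_settings`, and the variable names `VLM_OCR_URL` and `API_PORT`, are hypothetical and may not match the real module.

```python
import os
from dataclasses import dataclass
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from an env file, always reading as UTF-8."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    with env_path.open(encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


@dataclass(frozen=True)
class Settings:
    # Hypothetical settings; the real project may expose different fields.
    vlm_ocr_url: str
    api_port: int


def get_settings(env_path: str = ".env") -> Settings:
    """Single access point: process environment overrides the env file."""
    merged = {**load_env_file(env_path), **os.environ}
    return Settings(
        vlm_ocr_url=merged.get("VLM_OCR_URL", ""),
        api_port=int(merged.get("API_PORT", "8000")),
    )
```

Other modules would then call `get_settings()` instead of reading `os.environ` or hardcoding addresses such as the OCR server URL. Letting the process environment override the file is one possible design choice, not something the handover file prescribes.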