upload all project

Commit 3bb240fb authored Jul 11, 2024 by Cao Duc Anh
39 changed files
--- a/README.md
+++ b/README.md
-# Vietnamese text moderation
+# Vietnamese text content moderation
+
+## Classification label definitions
+
+- **Phản động (reactionary, `phan_dong`):** In Vietnamese history, "phản động" has usually been used to describe forces or individuals who oppose the revolutionary government or the socialist regime. <br>
+In other countries, the term can describe people or groups who oppose social change or progress in order to preserve the old order or system.
+- **Thù ghét (hateful, `thu_ghet`):** Words or phrases that can upset, insult, or hurt the listener or reader. <br>
+This includes profanity, vulgar language, and any wording meant to hurt or demean others; language that incites violence or hatred against a specific group or individual; and words that express contempt, disdain, or disrespect toward others.
+- **Khiêu dâm (pornographic, `khieu_dam`):** Material or text containing descriptions, images, or information intended to arouse sexually in a direct and explicit way. <br>
+Such content usually contains specific, explicit descriptions of sexual acts, including details of bodies and sexual behavior.
+- **Khác (other, `khac`):** Anything not covered by the labels above.
+
+## Dataset
+https://drive.google.com/drive/folders/13EC69eegzPDL4ZCUSsbf-OPf4cLNerif?usp=sharing <br>
+The training-ready data files are in *.csv format with 2 columns: <br>
+- text: the text data
+- label: the classification label, lowercase without diacritics, with words separated by "_"
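The two-column layout described above can be checked before training. The following is a minimal sketch, not part of the commit: it assumes the valid labels are exactly the four identifiers listed under `classes` in config.yaml, and `validate_rows` is a hypothetical helper name.

```python
import csv
import io

# Label identifiers as listed under `classes` in config.yaml.
CLASSES = ["khac", "phan_dong", "thu_ghet", "khieu_dam"]


def validate_rows(csv_text):
    """Return (valid_rows, errors) for CSV text with `text` and `label` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or set(reader.fieldnames) != {"text", "label"}:
        raise ValueError(f"expected columns text,label; got {reader.fieldnames}")
    valid, errors = [], []
    # Rows are numbered with the header counted as row 1.
    for row_no, row in enumerate(reader, start=2):
        text = (row["text"] or "").strip()
        label = (row["label"] or "").strip()
        if not text:
            errors.append((row_no, "empty text"))
        elif label not in CLASSES:
            errors.append((row_no, f"unknown label: {label!r}"))
        else:
            valid.append({"text": text, "label": label})
    return valid, errors


sample = "text,label\nxin chào,khac\nví dụ,Khac\n,thu_ghet\n"
valid, errors = validate_rows(sample)
print(valid)
print(errors)
```

Rejecting `Khac` here reflects the "lowercase without diacritics" rule; whether the training server normalizes labels itself is not shown in this diff.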
+
+## Quick deployment guide
+
+### 1. Download the pretrained model
+Download pytorch_model.bin: https://huggingface.co/vinai/phobert-base/resolve/main/pytorch_model.bin?download=true <br>
+then save it as **./phobert-base/pytorch_model.bin**
+
+### 2. Build Docker images
+Inference server
+```
+docker build -f infer.Dockerfile -t vn-text-moderation .
+```
+Training server
+```
+docker build -f train.Dockerfile -t vn-text-moderation-train .
+```
+Data management server
+```
+docker build -f data.Dockerfile -t vn-text-moderation-data .
+```
+
+### 3. Run
+```
+docker compose up
+``` 
+
+## API list
+### 1. Text classification API
+**Request** <br>
+*Limited to 10000 characters. Adjust `limit_infer_length` in config.yaml to change this.*
+```
+curl -X POST http://10.3.2.100:8001/text-classify \
+     -H "Content-Type: application/json" \
+     -d '{
+           "paragraph": "Cũng từ thời điểm này, luật Lực lượng tham gia bảo vệ ANTT ở cơ sở chính thức có hiệu lực. Không chỉ tại TP.HCM, cùng ngày, các địa phương trên cả nước đồng loạt ra mắt lực lượng này. Đây là lực lượng được kiện toàn từ 3 lực lượng: bảo vệ dân phố, công an xã bán chuyên trách và đội trưởng, đội phó dân phòng.\nThì ra hắn lật tới trang 17 của cuốn sách của công ty Remax hắn bắt gặp nhỏ Mỹ Hạnh, nhỏ này xinh ghê, nhìn mặt mày sáng sủa nhưng nhìn là biết dâm rồi, chân mày con nhỏ đậm ghê, trời trời nó chụp hình mà còn lòi cái vùng trăng trắng ở ngực ra nữa chứ. Điệu này là chết ngắt với thằng quỷ Cường này rồi. Kể ra thì cũng tội nghiệp nó, nó mới chia tay với con bồ Khánh Ly mới có 1 tuần mà gặp đàn bà con gái là nó thèm rỏ dãi. Nhớ cái thứ Bảy tuần rồi hắn đi chợ 88 nhìn cái mông lắc lắc của con bé mới có 20 mà tâm hồn hắn tê dại, hắn muốn nhào tới bóp mông con quỷ sứ khêu gợi 1 cái cho bỏ ghét nhưng hắn tự kiềm chế. Sống ở cái xứ Canada quỷ này, chạm ba cái vùng cấm đó là bị thưa như chơi. Nó tức nó tại bữa nọ lớn tiếng với con bồ vì lỡ tay làm bể cái chậu cá yêu qúy của nó, làm con nhỏ tội nghiệp giận nó không thèm tới nhà nó cho… nó đụ nữa.\nTạm dừng dùng Facebook làm gì anh ơi... , dm mặc kệ miệng lưỡi thiên hạ họ nói gì thì nói , đâu phải ai cũng là người trong cuộc đâu . Vì vậy anh hãy tiếp tục sử dụng Facebook đi, để còn biết trên Facebook anh đang bị cđm chửi sml chứ"
+         }'
+```
+**Response** <br>
+Status 200 OK
+```
+[
+{
+"chunk":"Cũng từ thời điểm này, luật Lực lượng tham gia bảo vệ ANTT ở cơ sở chính thức có hiệu lực . Không chỉ tại TP.HCM, cùng ngày, các địa phương trên cả nước đồng loạt ra mắt lực lượng này .",
+"label":"khac",
+"confidence":5.4834747314453125
+},
+{
+"chunk":"Đây là lực lượng được kiện toàn từ 3 lực lượng: bảo vệ dân phố, công an xã bán chuyên trách và đội trưởng, đội phó dân phòng .",
+"label":"khac",
+"confidence":1.2335361242294312
+},
+{
+"chunk":"Thì ra hắn lật tới trang 17 của cuốn sách của công ty Remax hắn bắt gặp nhỏ Mỹ Hạnh, nhỏ này xinh ghê, nhìn mặt mày sáng sủa nhưng nhìn là biết dâm rồi, chân mày con nhỏ đậm ghê, trời trời nó chụp hình mà còn lòi cái vùng trăng trắng ở ngực ra nữa chứ .",
+"label":"khieu_dam",
+"confidence":7.714803218841553
+},
+{
+"chunk":"Điệu này là chết ngắt với thằng quỷ Cường này rồi . Kể ra thì cũng tội nghiệp nó, nó mới chia tay với con bồ Khánh Ly mới có 1 tuần mà gặp đàn bà con gái là nó thèm rỏ dãi .",
+"label":"thu_ghet",
+"confidence":6.8846821784973145
+},
+{
+"chunk":"Nhớ cái thứ Bảy tuần rồi hắn đi chợ 88 nhìn cái mông lắc lắc của con bé mới có 20 mà tâm hồn hắn tê dại, hắn muốn nhào tới bóp mông con quỷ sứ khêu gợi 1 cái cho bỏ ghét nhưng hắn tự kiềm chế .",
+"label":"khieu_dam",
+"confidence":10.557592391967773
+},
+{
+"chunk":"Sống ở cái xứ Canada quỷ này, chạm ba cái vùng cấm đó là bị thưa như chơi . Nó tức nó tại bữa nọ lớn tiếng với con bồ vì lỡ tay làm bể cái chậu cá yêu qúy của nó, làm con nhỏ tội nghiệp giận nó không thèm tới nhà nó cho… nó đụ nữa .",
+"label":"khieu_dam",
+"confidence":10.554232597351074
+},
+{
+"chunk":"Tạm dừng dùng Facebook làm gì anh ơi.. . , dm mặc kệ miệng lưỡi thiên hạ họ nói gì thì nói , đâu phải ai cũng là người trong cuộc đâu . Vì vậy anh hãy tiếp tục sử dụng Facebook đi, để còn biết trên Facebook anh đang bị cđm chửi sml chứ",
+"label":"thu_ghet",
+"confidence":6.883141994476318
+}
+]
+```
+Status 400 Bad Request
+```
+{
+    "detail": "Max length: 10000"
+}
+```
+Status 500 Internal Server Error
+```
+{
+    "detail": "detail of error"
+}
+```
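The request and response shapes above can also be driven from Python. This is a minimal sketch using only the standard library; the base URL is copied from the curl example and will differ per deployment, and `build_classify_request` and `flagged_chunks` are hypothetical helper names. The `confidence` values in the sample response exceed 1, so they are presumably unnormalized scores rather than probabilities, and the sketch does not threshold on them.

```python
import json
import urllib.request

# Base URL copied from the curl example; adjust for your deployment.
BASE_URL = "http://10.3.2.100:8001"
LIMIT_INFER_LENGTH = 10000  # mirrors limit_infer_length in config.yaml


def build_classify_request(paragraph, base_url=BASE_URL):
    """Build the POST request for /text-classify without sending it."""
    # Client-side guard; assumes the server counts characters the same way.
    if len(paragraph) > LIMIT_INFER_LENGTH:
        raise ValueError(f"Max length: {LIMIT_INFER_LENGTH}")
    body = json.dumps({"paragraph": paragraph}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/text-classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flagged_chunks(response_items):
    """Keep the chunks whose label is anything other than "khac"."""
    return [item for item in response_items if item["label"] != "khac"]


request = build_classify_request("Xin chào")
# To send it: json.load(urllib.request.urlopen(request))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server.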
+### 2. Add labeled data API
+**Request**
+```
+curl -X POST http://10.3.2.100:8008/add-data \
+     -H "Content-Type: application/json" \
+     -d '{
+           "paragraph": "Cũng từ thời điểm này, luật Lực lượng tham gia bảo vệ ANTT ở cơ sở chính thức có hiệu lực. Không chỉ tại TP.HCM, cùng ngày, các địa phương trên cả nước đồng loạt ra mắt lực lượng này.",
+           "label": "khac"
+         }'
+
+```
+**Response** <br>
+Status 200 OK
+```
+{
+    "paragraph": "Cũng từ thời điểm này, luật Lực lượng tham gia bảo vệ ANTT ở cơ sở chính thức có hiệu lực. Không chỉ tại TP.HCM, cùng ngày, các địa phương trên cả nước đồng loạt ra mắt lực lượng này.",
+    "label": "khac"
+}
+```
+Status 500 Internal Server Error
+```
+{
+    "detail": "detail of error"
+}
+```
+### 3. Model training API
+**Request**
+```
+curl -X POST http://10.3.2.100:8000/start-training \
+     -H "Content-Type: application/json" \
+     -d '{
+           "pretrain": "latest"
+         }'
+```
+pretrain:
+- "" : train the model from scratch
+- "latest" : continue training from the most recently trained model
+- "model_name.pth" : continue training from a specific model
+
+**Response** <br>
+Status 200 OK
+```
+{
+    "status": "Training started",
+    "message": "Tracking and visualizing metrics on TensorBoard UI: http://localhost:6006/"
+}
+```
+Status 400 Bad Request
+```
+{
+    "detail": "Training already in progress"
+}
+```
+Status 500 Internal Server Error
+```
+{
+    "detail": "detail of error"
+}
+```
+### 4. Stop model training API
+**Request**
+```
+curl -X POST http://10.3.2.100:8000/stop-training 
+```
+**Response** <br>
+Status 200 OK
+```
+{
+    "status": "Training stopped",
+    "message": "VRam GPU are released"
+}
+```
+Status 400 Bad Request
+```
+{
+    "detail": "No training in progress"
+}
+```
+
+## System components
+### MinIO
+Manages models and training data.
+### PostgreSQL
+Stores labeled data from content moderators on the social network platform; it is added to continuously so the model can keep evolving.
+### Adminer
+Administration tool for the data in the PostgreSQL DB.
+### DataManager
+Server that automatically initializes the MinIO bucket and the database tables, and provides the API for adding data.
+### NlpCore
+Server that provides the text classification API.
+### Nginx
+Performs load balancing: distributes user requests across the NlpCore instances.
+### NlpTraining
+Server that provides the model training API.
+### Tensorboard
+UI for tracking metrics during model training.
+
+## Future improvements
+The training data is not yet uniform: it mixes long and short sentences.
+Currently, when a paragraph is submitted for inference, it is split into smaller segments according to chunk_size. The right chunk_size depends on the training data: the closer it is to the sentence length in the training data, the higher the accuracy.
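The splitting step described above can be illustrated with a small sketch. It is an assumption, not the server's code: the sample response suggests whole sentences are grouped up to roughly `chunk_size` words, so this version packs sentences greedily and counts whitespace-separated words. The real service may segment with VnCoreNLP and count tokens differently; `split_into_chunks` is a hypothetical name.

```python
import re

CHUNK_SIZE = 64  # mirrors chunk_size in config.yaml


def split_into_chunks(paragraph, chunk_size=CHUNK_SIZE):
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    A single sentence longer than chunk_size becomes its own oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Câu một có năm từ. Câu hai cũng năm từ. Câu ba ngắn."
print(split_into_chunks(text, chunk_size=10))
```

Running the same function over the training texts and comparing word counts per row against `chunk_size` is one way to judge how well the two match.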

--- a/config.yaml
+++ b/config.yaml
+device: cuda
+classes: ["khac", "phan_dong", "thu_ghet", "khieu_dam"]
+model_checkpoint: /src/phobert-base/checkpoint_best.pth
+chunk_size: 64
+limit_infer_length: 10000
+
+vocab: 'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0123456789!"#$%&''()*+,-./:;<=>?@[\]^_`{|}~ '
+
+minio:
+  server: minio:9000
+  data_labeled: data-annotated
+  model_trained: model
+
+sqldb:
+  server: sqldb:5432
+  table: data_annotated
+
+vncorenlp:
+  save_dir: /src/VnCoreNLP/
+
+phobert_base:
+  save_dir: /src/phobert-base/
+  max_token_length: 256
+
+training: 
+  epoch: 100
+  batch_size: 8
+  load_data_worker: 2
+  k_fold: 5
+  test_ratio: 0.1
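To make the `k_fold` and `test_ratio` settings concrete, here is a minimal sketch of one common reading: hold out `test_ratio` of the samples as a test set, then split the remainder into `k_fold` train/validation folds. Whether the training server splits data this way is an assumption; `split_indices` and its `seed` parameter are hypothetical.

```python
import random


def split_indices(n_samples, test_ratio=0.1, k_fold=5, seed=0):
    """Hold out a test set, then build (train, val) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    test, rest = indices[:n_test], indices[n_test:]
    # Deal the remaining indices round-robin into k_fold folds.
    folds = [rest[i::k_fold] for i in range(k_fold)]
    splits = []
    for i in range(k_fold):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return test, splits


test, splits = split_indices(100)
print(len(test), [(len(train), len(val)) for train, val in splits])
```

With 100 samples and the values from config.yaml, this yields 10 test samples and five folds of 72 training and 18 validation samples each.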
--- a/data.Dockerfile
+++ b/data.Dockerfile
+FROM python:3.11
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV PYTHONUNBUFFERED=True \
+    PORT=9090
+
+# Install dependencies
+RUN apt-get update \
+    && apt-get install -y git 
+
+WORKDIR /src
+COPY ./server_manage_data/requirements.txt /src/
+RUN pip install --no-cache-dir -r requirements.txt
+COPY ./server_manage_data/*.py /src/
+
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8001"]
\ No newline at end of file
--- a/data/chong_pha_nha_nuoc.xlsx
+++ b/data/chong_pha_nha_nuoc.xlsx
--- a/data/crawl_data.ipynb
+++ b/data/crawl_data.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'url': 'https://truyensexcogiao.com/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/2/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/3/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/4/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/5/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/6/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/7/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/8/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/9/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/10/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/11/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/12/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/13/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/14/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensexcogiao.com/page/15/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/2/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/3/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/4/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/5/', 'div_class': 'noibat', 
'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/6/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/7/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/8/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}, {'url': 'https://truyensex88.net/page/9/', 'div_class': 'noibat', 'child_div_class': 'ndtruyen'}]\n",
+      "Empty DataFrame\n",
+      "Columns: [text, label]\n",
+      "Index: []\n"
+     ]
+    }
+   ],
+   "source": [
+    "import json\n",
+    "import pandas as pd\n",
+    "                    \n",
+    "with open('list_data_sources.json', 'r') as f:         \n",
+    "    data_sources = json.load(f)\n",
+    "print(data_sources)\n",
+    "\n",
+    "df = pd.DataFrame(columns=[\n",
+    "    'text', 'label'\n",
+    "])\n",
+    "print(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "https://truyensexcogiao.com/co-vo-ngoan-hien-cua-toi/\n",
+      "https://truyensexcogiao.com/tag/number-seven/\n",
+      "https://truyensexcogiao.com/vo-va-anh-tho-ho-nha-hang-xom/\n",
+      "https://truyensexcogiao.com/ga-tinh-co-giao-han/\n",
+      "https://truyensexcogiao.com/can-nha-toi-loi/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/chi-toi-5/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/cha-nuoi-anh-nuoi-va-em-gai/\n",
+      "https://truyensexcogiao.com/tag/buom-trang/\n",
+      "https://truyensexcogiao.com/tiet-hoc-sang-khoai/\n",
+      "https://truyensexcogiao.com/anh-re-toi-2/\n",
+      "https://truyensexcogiao.com/tag/thu-minh/\n",
+      "https://truyensexcogiao.com/ba-me-con-5/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/cuop-vo-ong-hang-xom/\n",
+      "https://truyensexcogiao.com/tag/kz-vo-danh/\n",
+      "https://truyensexcogiao.com/be-ti/\n",
+      "https://truyensexcogiao.com/tag/anh-khoa/\n",
+      "https://truyensexcogiao.com/ben-bo-ho-washington/\n",
+      "https://truyensexcogiao.com/cai-so-cua-ba/\n",
+      "https://truyensexcogiao.com/tag/tuyet-van/\n",
+      "https://truyensexcogiao.com/song-kiem-hop-bich/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/tinh-hoc-tro/\n",
+      "https://truyensexcogiao.com/vo-va-anh-tho-lat-gach-nha/\n",
+      "https://truyensexcogiao.com/co-ban-lop-ben/\n",
+      "https://truyensexcogiao.com/co-ban-nam-lun/\n",
+      "https://truyensexcogiao.com/tag/darking-tozora/\n",
+      "https://truyensexcogiao.com/chuyen-di-bien-cua-nguoi-me/\n",
+      "https://truyensexcogiao.com/amy/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/anh-re-toi/\n",
+      "https://truyensexcogiao.com/tag/thu-minh/\n",
+      "https://truyensexcogiao.com/nu-hoang-khoai-lac/\n",
+      "https://truyensexcogiao.com/tag/lam-linh/\n",
+      "https://truyensexcogiao.com/quynh-hoa-5/\n",
+      "https://truyensexcogiao.com/tag/ngoc-suong/\n",
+      "https://truyensexcogiao.com/phan-ma-hong/\n",
+      "https://truyensexcogiao.com/tag/jun-ying/\n",
+      "https://truyensexcogiao.com/buom-chua/\n",
+      "https://truyensexcogiao.com/tag/ngoc-linh/\n",
+      "https://truyensexcogiao.com/ngoi-sao-phuong-dong/\n",
+      "https://truyensexcogiao.com/ngua-hong/\n",
+      "https://truyensexcogiao.com/moi-tinh-dau-5/\n",
+      "https://truyensexcogiao.com/tag/kitotana/\n",
+      "https://truyensexcogiao.com/chuyen-trong-nha-ve-sinh/\n",
+      "https://truyensexcogiao.com/tag/number-seven/\n",
+      "https://truyensexcogiao.com/con-trai-la-chong-tuong-lai/\n",
+      "https://truyensexcogiao.com/tag/kz-vo-danh/\n",
+      "https://truyensexcogiao.com/doi-nguoi-yeu-voi-cu-em-khi-di-du-lich/\n",
+      "https://truyensexcogiao.com/me-man/\n",
+      "https://truyensexcogiao.com/lam-tinh-voi-ma/\n",
+      "https://truyensexcogiao.com/tag/anh-khoa/\n",
+      "https://truyensexcogiao.com/con-ut/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/ma-vu-dai/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/me-cua-nguoi-tinh/\n",
+      "https://truyensexcogiao.com/trai-doi/\n",
+      "https://truyensexcogiao.com/dong-tinh-luyen-ai/\n",
+      "https://truyensexcogiao.com/tag/cuong-vu/\n",
+      "https://truyensexcogiao.com/khoanh-khac-5/\n",
+      "https://truyensexcogiao.com/tag/mph/\n",
+      "https://truyensexcogiao.com/lam-lo/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/dieu-gi-do-khong-den/\n",
+      "https://truyensexcogiao.com/tag/nam-phuong/\n",
+      "https://truyensexcogiao.com/doi-em-loan/\n",
+      "https://truyensexcogiao.com/tag/steven-tran/\n",
+      "https://truyensexcogiao.com/truyen-sex-gay-cha-va-toi/\n",
+      "https://truyensexcogiao.com/tag/bruce-nguyen/\n",
+      "https://truyensexcogiao.com/dia-nguc-tran-gian/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/ba-me-con-hang-xom/\n",
+      "https://truyensexcogiao.com/tag/ngoc-linh/\n",
+      "https://truyensexcogiao.com/choi-di-o-khach-san/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/em-ke-18-la-cua-toi/\n",
+      "https://truyensexcogiao.com/loan-luan-voi-di-ut-dep/\n",
+      "https://truyensexcogiao.com/thang-ban-than/\n",
+      "https://truyensexcogiao.com/thang-chan-trau-va-con-di/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/dit-len-co-dau-ngay-chup-anh-cuoi/\n",
+      "https://truyensexcogiao.com/con-duong-ba-chu-quyen-15/\n",
+      "https://truyensexcogiao.com/tag/akay-hau/\n",
+      "https://truyensexcogiao.com/truyen-sex-gay-anh-em-ket-nghia/\n",
+      "https://truyensexcogiao.com/con-di-ben-ha/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/dua-em-sang-song/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/ba-chi-dau-dam-loan/\n",
+      "https://truyensexcogiao.com/tag/ngoc-linh/\n",
+      "https://truyensexcogiao.com/em-tiep-vien-hang-khong/\n",
+      "https://truyensexcogiao.com/con-di-5/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/co-lang-gieng-5/\n",
+      "https://truyensexcogiao.com/tag/phuong-nam/\n",
+      "https://truyensexcogiao.com/du-chi-ho/\n",
+      "https://truyensexcogiao.com/loi-tai-anh/\n",
+      "https://truyensexcogiao.com/gia-dinh-hanh-phuc-8/\n",
+      "https://truyensexcogiao.com/tag/may-may/\n",
+      "https://truyensexcogiao.com/chuyen-mot-gia-dinh/\n",
+      "https://truyensexcogiao.com/tag/paris/\n",
+      "https://truyensexcogiao.com/chich-chi-cua-ban-gai/\n",
+      "https://truyensexcogiao.com/chuyen-nguoi-nu-pham/\n",
+      "https://truyensexcogiao.com/tag/jamenet/\n",
+      "https://truyensexcogiao.com/pha-trinh-em-sinh-vien-nam-nhat/\n",
+      "https://truyensexcogiao.com/be-gai/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/ben-dong-song-trem/\n",
+      "https://truyensexcogiao.com/tag/ky-phong/\n",
+      "https://truyensexcogiao.com/cay-man-sau-nha/\n",
+      "https://truyensexcogiao.com/tag/kinh-bich-lich/\n",
+      "https://truyensexcogiao.com/ca-mau-yeu-dau/\n",
+      "https://truyensexcogiao.com/tag/blue-jeans/\n",
+      "https://truyensexcogiao.com/dua-chau-dam-cua-ai-nghi/\n",
+      "https://truyensexcogiao.com/muon-dit-em-lien/\n",
+      "https://truyensexcogiao.com/chi-toi-va-co-thu-ky-cua-bo-toi/\n",
+      "https://truyensexcogiao.com/tag/ngoc-linh/\n",
+      "https://truyensexcogiao.com/ba-nam-tinh-cu/\n",
+      "https://truyensexcogiao.com/tag/nha-que/\n",
+      "https://truyensexcogiao.com/he-thong-tinh-duc/\n",
+      "https://truyensexcogiao.com/cac-ba-chi-ho-cua-toi/\n",
+      "https://truyensexcogiao.com/tag/ngoc-linh/\n",
+      "https://truyensexcogiao.com/cuu-canh-luc-nua-dem-2/\n",
+      "https://truyensexcogiao.com/di-toi-6/\n",
+      "https://truyensexcogiao.com/co-em-vo-tui-than-2/\n",
+      "https://truyensexcogiao.com/tag/le-cuong/\n",
+      "https://truyensexcogiao.com/em-duoc-sep-bu-lon-2/\n",
+      "https://truyensexcogiao.com/tag/le-cuong/\n",
+      "https://truyensexcogiao.com/ngoai-tinh-voi-ban-than-chong-2/\n",
+      "https://truyensexcogiao.com/4-tieng-may-mua-cung-em-vang-chong-2/\n",
+      "https://truyensexcogiao.com/tam-su-cua-9x-sap-lay-chong-2/\n",
+      "https://truyensexcogiao.com/vo-dam-va-nhung-nguoi-ban-cua-chong-2/\n",
+      "https://truyensexcogiao.com/pha-trinh-gai-to-2/\n",
+      "https://truyensexcogiao.com/tim-lai-bau-troi-2/\n",
+      "https://truyensexcogiao.com/toi-thay-ba-cham-soc-me-2/\n",
+      "https://truyensexcogiao.com/mong-hoi-xuan-2/\n",
+      "https://truyensexcogiao.com/gai-dam-tu-truyen-2/\n",
+      "https://truyensexcogiao.com/tren-tinh-ban-duoi-tinh-yeu-2/\n",
+      "https://truyensexcogiao.com/tag/trichua18/\n",
+      "https://truyensexcogiao.com/chuyen-tinh-thoi-tam-quoc-2/\n",
+      "https://truyensexcogiao.com/tam-su-cua-mot-dan-choi-2/\n",
+      "https://truyensexcogiao.com/vo-toi-me-cho-duc-2/\n",
+      "https://truyensexcogiao.com/ba-me-dam-dang-2/\n",
+      "https://truyensexcogiao.com/mai-gia-me-trai-2/\n",
+      "https://truyensexcogiao.com/chuyen-tinh-tuoi-xe-2/\n",
+      "https://truyensexcogiao.com/gai-mien-son-cuoc-2/\n",
+      "https://truyensexcogiao.com/chup-anh-cho-gai-dam-3/\n",
+      "https://truyensexcogiao.com/len-lut-cung-vo-thay-mung-3-tet-3/\n",
+      "https://truyensexcogiao.com/tag/thichcogiao/\n",
+      "https://truyensexcogiao.com/ban-tinh-4/\n",
+      "https://truyensexcogiao.com/ban-tinh-3/\n",
+      "https://truyensexcogiao.com/vao-doi-3/\n",
+      "https://truyensexcogiao.com/anh-chi-va-cau-ut-3/\n",
+      "https://truyensexcogiao.com/than-tham-3/\n",
+      "https://truyensexcogiao.com/di-va-chu-3/\n",
+      "https://truyensexcogiao.com/nang-dau-dam-dang-3/\n",
+      "https://truyensexcogiao.com/mua-dich-covid-19-3/\n",
+      "https://truyensexcogiao.com/nhung-con-cho-cai-2/\n",
+      "https://truyensexcogiao.com/vet-so-long-2/\n",
+      "https://truyensexcogiao.com/tag/le-cuong/\n",
+      "https://truyensexcogiao.com/cuckold-bat-dau-the-nao-voi-chung-toi-2/\n",
+      "https://truyensexcogiao.com/nguoi-chi-thuong-nho-2/\n",
+      "https://truyensexcogiao.com/hau-cung-tran-the-2/\n",
+      "https://truyensexcogiao.com/co-em-vo-da-tinh-2/\n",
+      "https://truyensexcogiao.com/tag/le-cuong/\n",
+      "https://truyensexcogiao.com/ky-niem-voi-ba-giao-chu-nhiem-2/\n",
+      "https://truyensexcogiao.com/tan-anh-hung-xa-dieu-2/\n",
+      "https://truyensexcogiao.com/chi-ha-3/\n",
+      "https://truyensexcogiao.com/mau-public-3/\n",
+      "https://truyensexcogiao.com/gia-dinh-hanh-phuc-7/\n",
+      "https://truyensexcogiao.com/cuong-loan-mua-covid-3/\n",
+      "https://truyensexcogiao.com/tag/le-cuong/\n",
+      "https://truyensexcogiao.com/di-cuoi-ma-hai-than-3/\n",
+      "https://truyensexcogiao.com/ngoai-tinh-voi-chi-u37-3/\n",
+      "https://truyensexcogiao.com/vo-dam-va-tinh-cu-2/\n",
+      "https://truyensexcogiao.com/ngoai-tinh-voi-chi-hang-xom-2/\n",
+      "https://truyensexcogiao.com/chich-gai-dam-da-co-chong-2/\n",
+      "https://truyensexcogiao.com/chich-em-linh-ke-toan-2/\n",
+      "https://truyensexcogiao.com/truyen-sex-tru-tien-3/\n",
+      "https://truyensexcogiao.com/vua-xem-len-vua-du-2/\n",
+      "https://truyensexcogiao.com/tam-su-chuyen-bo-me-2/\n",
+      "https://truyensexcogiao.com/bi-ban-chong-pha-trinh-trong-dem-tan-hon-2/\n",
+      "https://truyensexcogiao.com/so-lon-chi-ho-vo-2/\n",
+      "https://truyensexcogiao.com/ngoai-tinh-tu-buoi-hop-lop-2/\n",
+      "https://truyensexcogiao.com/thu-tinh-nhung-be-dang-co-nguoi-yeu-2/\n",
+      "https://truyensexcogiao.com/thu-dam-cung-co-giao-2/\n",
+      "https://truyensexcogiao.com/tam-su-cua-mot-tai-xe-xe-cong-nghe-2/\n",
+      "https://truyensexcogiao.com/can-nha-ngoai-o-2/\n",
+      "https://truyensexcogiao.com/chich-co-giao-cu-2/\n",
+      "https://truyensexcogiao.com/su-menh-canh-sat-dau-tranh-cho-cong-bang-2/\n",
+      "https://truyensexcogiao.com/tag/nhu-linh/\n",
+      "https://truyensexcogiao.com/co-my-dung-2/\n",
+      "https://truyensexcogiao.com/be-dam-cung-day-tro-2/\n",
+      "https://truyensexcogiao.com/dit-em-moi-gioi-chung-khoan-2/\n",
+      "https://truyensexcogiao.com/ngu-ngo-lon-vung-dai-2/\n",
+      "https://truyensexcogiao.com/tag/le-cuong/\n",
+      "https://truyensexcogiao.com/toi-va-em-nguoi-yeu-2/\n",
+      "https://truyensexcogiao.com/meo-con-miet-vuon-3/\n",
+      "https://truyensexcogiao.com/nguyen-tac-cua-em-2/\n",
+      "https://truyensexcogiao.com/dit-em-gai-gian-chong-2/\n",
+      "https://truyensexcogiao.com/ky-niem-voi-gai-ca-mau-2/\n",
+      "https://truyensexcogiao.com/ong-chanh-2/\n",
+      "https://truyensexcogiao.com/bi-nham-la-di-2/\n",
+      "https://truyensexcogiao.com/ngoc-huyen-2/\n",
+      "https://truyensexcogiao.com/nhung-em-rau-dang-nho-2/\n",
+      "https://truyensex88.net/tinh-hoc-tro/\n",
+      "https://truyensex88.net/vo-va-anh-tho-lat-gach-nha/\n",
+      "https://truyensex88.net/co-ban-lop-ben/\n",
+      "https://truyensex88.net/co-ban-nam-lun/\n",
+      "https://truyensex88.net/tag/darking-tozora/\n",
+      "https://truyensex88.net/chuyen-di-bien-cua-nguoi-me/\n",
+      "https://truyensex88.net/amy/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/anh-re-toi/\n",
+      "https://truyensex88.net/tag/thu-minh/\n",
+      "https://truyensex88.net/ba-me-con-2/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/cuop-vo-ong-hang-xom/\n",
+      "https://truyensex88.net/tag/kz-vo-danh/\n",
+      "https://truyensex88.net/ben-bo-ho-washington/\n",
+      "https://truyensex88.net/be-ti/\n",
+      "https://truyensex88.net/tag/anh-khoa/\n",
+      "https://truyensex88.net/quynh-hoa-2/\n",
+      "https://truyensex88.net/tag/ngoc-suong/\n",
+      "https://truyensex88.net/phan-ma-hong/\n",
+      "https://truyensex88.net/tag/jun-ying/\n",
+      "https://truyensex88.net/song-kiem-hop-bich/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/moi-tinh-dau-2/\n",
+      "https://truyensex88.net/tag/kitotana/\n",
+      "https://truyensex88.net/chuyen-trong-nha-ve-sinh/\n",
+      "https://truyensex88.net/tag/number-seven/\n",
+      "https://truyensex88.net/con-trai-la-chong-tuong-lai/\n",
+      "https://truyensex88.net/tag/kz-vo-danh/\n",
+      "https://truyensex88.net/doi-nguoi-yeu-voi-cu-em-khi-di-du-lich/\n",
+      "https://truyensex88.net/me-man/\n",
+      "https://truyensex88.net/buom-chua/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/nu-hoang-khoai-lac/\n",
+      "https://truyensex88.net/tag/lam-linh/\n",
+      "https://truyensex88.net/lam-tinh-voi-ma/\n",
+      "https://truyensex88.net/tag/anh-khoa/\n",
+      "https://truyensex88.net/con-ut/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/ma-vu-dai/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/me-cua-nguoi-tinh/\n",
+      "https://truyensex88.net/ngoi-sao-phuong-dong/\n",
+      "https://truyensex88.net/ngua-hong/\n",
+      "https://truyensex88.net/trai-doi/\n",
+      "https://truyensex88.net/khoanh-khac-2/\n",
+      "https://truyensex88.net/tag/mph/\n",
+      "https://truyensex88.net/lam-lo/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/dieu-gi-do-khong-den/\n",
+      "https://truyensex88.net/tag/nam-phuong/\n",
+      "https://truyensex88.net/doi-em-loan/\n",
+      "https://truyensex88.net/tag/steven-tran/\n",
+      "https://truyensex88.net/dong-tinh-luyen-ai-2/\n",
+      "https://truyensex88.net/tag/cuong-vu/\n",
+      "https://truyensex88.net/dong-tinh-luyen-ai/\n",
+      "https://truyensex88.net/tag/cuong-vu/\n",
+      "https://truyensex88.net/ba-me-con-hang-xom/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/thang-chan-trau-va-con-di/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/dia-nguc-tran-gian/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/truyen-sex-gay-cha-va-toi/\n",
+      "https://truyensex88.net/tag/bruce-nguyen/\n",
+      "https://truyensex88.net/choi-di-o-khach-san/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/con-duong-ba-chu-quyen-15/\n",
+      "https://truyensex88.net/tag/akay-hau/\n",
+      "https://truyensex88.net/em-ke-18-la-cua-toi/\n",
+      "https://truyensex88.net/loan-luan-voi-di-ut-dep/\n",
+      "https://truyensex88.net/thang-ban-than/\n",
+      "https://truyensex88.net/dit-len-co-dau-ngay-chup-anh-cuoi/\n",
+      "https://truyensex88.net/truyen-sex-gay-anh-em-ket-nghia-2/\n",
+      "https://truyensex88.net/truyen-sex-gay-anh-em-ket-nghia/\n",
+      "https://truyensex88.net/dua-em-sang-song/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/em-tiep-vien-hang-khong/\n",
+      "https://truyensex88.net/con-di-2/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/du-chi-ho/\n",
+      "https://truyensex88.net/con-di-ben-ha/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/ba-chi-dau-dam-loan/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/chuyen-nguoi-nu-pham/\n",
+      "https://truyensex88.net/tag/jamenet/\n",
+      "https://truyensex88.net/pha-trinh-em-sinh-vien-nam-nhat/\n",
+      "https://truyensex88.net/loi-tai-anh/\n",
+      "https://truyensex88.net/co-lang-gieng-2/\n",
+      "https://truyensex88.net/tag/phuong-nam/\n",
+      "https://truyensex88.net/be-gai/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/ben-dong-song-trem/\n",
+      "https://truyensex88.net/tag/ky-phong/\n",
+      "https://truyensex88.net/cay-man-sau-nha/\n",
+      "https://truyensex88.net/tag/kinh-bich-lich/\n",
+      "https://truyensex88.net/ca-mau-yeu-dau/\n",
+      "https://truyensex88.net/tag/blue-jeans/\n",
+      "https://truyensex88.net/gia-dinh-hanh-phuc-3/\n",
+      "https://truyensex88.net/tag/may-may/\n",
+      "https://truyensex88.net/chuyen-mot-gia-dinh/\n",
+      "https://truyensex88.net/tag/paris/\n",
+      "https://truyensex88.net/chich-chi-cua-ban-gai/\n",
+      "https://truyensex88.net/dua-chau-dam-cua-ai-nghi/\n",
+      "https://truyensex88.net/muon-dit-em-lien/\n",
+      "https://truyensex88.net/chi-toi-va-co-thu-ky-cua-bo-toi/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/ba-nam-tinh-cu-2/\n",
+      "https://truyensex88.net/tag/nha-que/\n",
+      "https://truyensex88.net/ba-nam-tinh-cu/\n",
+      "https://truyensex88.net/tag/nha-que/\n",
+      "https://truyensex88.net/stalker/\n",
+      "https://truyensex88.net/me-toi-tho-san-cu/\n",
+      "https://truyensex88.net/tag/sazham/\n",
+      "https://truyensex88.net/phan-dien-dau-pha/\n",
+      "https://truyensex88.net/tag/tieu-ca-ca/\n",
+      "https://truyensex88.net/he-thong-tinh-duc/\n",
+      "https://truyensex88.net/cac-ba-chi-ho-cua-toi/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/ba-chu-nha-tot-bung/\n",
+      "https://truyensex88.net/tag/kim-lien/\n",
+      "https://truyensex88.net/chich-ban-em-nguoi-yeu-cu/\n",
+      "https://truyensex88.net/cau-hai-duc/\n",
+      "https://truyensex88.net/tag/mai-suong-suong/\n",
+      "https://truyensex88.net/bien-co-cuoc-doi/\n",
+      "https://truyensex88.net/lam-thue/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/ky-niem-kho-phai/\n",
+      "https://truyensex88.net/tag/v/\n",
+      "https://truyensex88.net/be-gai-loli-duoc-chu-giao-giuc-gioi-tinh-2/\n",
+      "https://truyensex88.net/tag/dien-vy/\n",
+      "https://truyensex88.net/be-gai-loli-duoc-chu-giao-giuc-gioi-tinh/\n",
+      "https://truyensex88.net/tag/dien-vy/\n",
+      "https://truyensex88.net/lao-an-may/\n",
+      "https://truyensex88.net/tag/shadow/\n",
+      "https://truyensex88.net/truyen-co-that/\n",
+      "https://truyensex88.net/hanh-phuc-gia-dinh-2/\n",
+      "https://truyensex88.net/tag/ngoc-linh/\n",
+      "https://truyensex88.net/doi-thong-da-lat-2/\n",
+      "https://truyensex88.net/doi-thong-da-lat/\n",
+      "https://truyensex88.net/mua-va-em/\n",
+      "https://truyensex88.net/chi-em-loan-luan-2/\n",
+      "https://truyensex88.net/vang-chong-2/\n",
+      "https://truyensex88.net/buoi-hop-lop/\n"
+     ]
+    }
+   ],
+   "source": [
+    "import requests\n",
+    "from bs4 import BeautifulSoup\n",
+    "import operator\n",
+    "from collections import Counter\n",
+    "import time\n",
+    "\n",
+    "for data_source in data_sources:\n",
+    "    source_html = requests.get(data_source['url']).text\n",
+    "    soup = BeautifulSoup(source_html, 'html.parser')\n",
+    "\n",
+    "    for div in soup.findAll('div', {'class': data_source['div_class']}):\n",
+    "        a_tags = div.find_all('a', href=True)\n",
+    "        for a_tag in a_tags:\n",
+    "            link = a_tag['href']\n",
+    "            print(link)\n",
+    "            response = requests.get(link)\n",
+    "            response.encoding = 'utf-8'\n",
+    "            content_html = response.text\n",
+    "            content_soup = BeautifulSoup(content_html, 'html.parser')\n",
+    "\n",
+    "            for child_div in content_soup.findAll('div', {'class': data_source['child_div_class']}):\n",
+    "                p_tags = child_div.find_all('p')\n",
+    "                for p_tag in p_tags:\n",
+    "                    # print(p_tag.text)\n",
+    "                    df.loc[len(df)] = [p_tag.text, '']\n",
+    "                    \n",
+    "            # To avoid being blocked or overloading the server\n",
+    "            time.sleep(1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Long làm việc chiều tối mới về. Không biết sao...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>– Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>– Long ngày nào cũng gọi nói nhảm vậy đó hả? –...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12432</th>\n",
+       "      <td>– Ahhh!!! Anh chịu hết nổi rồi… Chắc anh ra… m...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12433</th>\n",
+       "      <td>Nói xong nàng gồng người xuất tinh chan chứa. ...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12434</th>\n",
+       "      <td>Tôi ghì chặt lấy Hương một lúc lâu cho đến khi...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12435</th>\n",
+       "      <td>– Không chịu! Anh nói là xuất tinh cho em thấy...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12436</th>\n",
+       "      <td>Nói xong tôi hôn lên môi nàng một lần nữa, rồi...</td>\n",
+       "      <td></td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>12437 rows × 2 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                    text label\n",
+       "0      Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...      \n",
+       "1      Long làm việc chiều tối mới về. Không biết sao...      \n",
+       "2      – Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...      \n",
+       "3      Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...      \n",
+       "4      – Long ngày nào cũng gọi nói nhảm vậy đó hả? –...      \n",
+       "...                                                  ...   ...\n",
+       "12432  – Ahhh!!! Anh chịu hết nổi rồi… Chắc anh ra… m...      \n",
+       "12433  Nói xong nàng gồng người xuất tinh chan chứa. ...      \n",
+       "12434  Tôi ghì chặt lấy Hương một lúc lâu cho đến khi...      \n",
+       "12435  – Không chịu! Anh nói là xuất tinh cho em thấy...      \n",
+       "12436  Nói xong tôi hôn lên môi nàng một lần nữa, rồi...      \n",
+       "\n",
+       "[12437 rows x 2 columns]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "DataFrame has been saved to van_ban_khieu_dam.xlsx\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Save the DataFrame to an Excel file\n",
+    "file_path = 'van_ban_khieu_dam.xlsx'  # Specify the file path where you want to save the Excel file\n",
+    "df.to_excel(file_path, index=False)\n",
+    "\n",
+    "print(f'DataFrame has been saved to {file_path}')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "bert",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/data/data_all.csv
+++ b/data/data_all.csv
--- a/data/list_data_sources.json
+++ b/data/list_data_sources.json
+[
+    {
+        "url": "https://truyensexcogiao.com/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/2/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/3/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/4/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/5/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/6/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/7/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/8/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/9/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/10/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/11/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/12/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/13/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/14/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensexcogiao.com/page/15/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/2/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/3/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/4/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/5/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/6/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/7/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/8/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    },
+    {
+        "url": "https://truyensex88.net/page/9/",
+        "div_class": "noibat",
+        "child_div_class": "ndtruyen"
+    }
+]
--- a/data/ngon_tu_thu_ghet_all.xlsx
+++ b/data/ngon_tu_thu_ghet_all.xlsx
--- a/data/pre_anotation.ipynb
+++ b/data/pre_anotation.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd \n",
+    "\n",
+    "def get_data(path):\n",
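+    "    # Pick the reader from the file extension; fail loudly on anything else\n",
+    "    if path.lower().endswith(\".csv\"):\n",
+    "        df = pd.read_csv(path)\n",
+    "    elif path.lower().endswith(\".xlsx\"):\n",
+    "        df = pd.read_excel(path, sheet_name='Sheet1')\n",
+    "    else:\n",
+    "        raise ValueError(f'Unsupported file type: {path}')\n",
+    "    return df\n",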
+    "\n",
+    "df = get_data('van_ban_khieu_dam.xlsx')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Long làm việc chiều tối mới về. Không biết sao...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>– Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>– Long ngày nào cũng gọi nói nhảm vậy đó hả? –...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>Long hoảng hốt tôi đang nghe gì thế này, hình ...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>– Phụt! Em nói thật đi, cái gì, nãy em nói tý ...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>Long cảm giác hơi choáng. Rõ ràng là giọng Lin...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>– Bú chim anh đi! – Giong nam lại cất lên… – D...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>Long cảm thấy cơn ghen tuông nổi lên, cơn tức ...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                text  label\n",
+       "0  Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...    NaN\n",
+       "1  Long làm việc chiều tối mới về. Không biết sao...    NaN\n",
+       "2  – Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...    NaN\n",
+       "3  Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...    NaN\n",
+       "4  – Long ngày nào cũng gọi nói nhảm vậy đó hả? –...    NaN\n",
+       "5  Long hoảng hốt tôi đang nghe gì thế này, hình ...    NaN\n",
+       "6  – Phụt! Em nói thật đi, cái gì, nãy em nói tý ...    NaN\n",
+       "7  Long cảm giác hơi choáng. Rõ ràng là giọng Lin...    NaN\n",
+       "8  – Bú chim anh đi! – Giong nam lại cất lên… – D...    NaN\n",
+       "9  Long cảm thấy cơn ghen tuông nổi lên, cơn tức ...    NaN"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['địt', 'xuất tinh', 'dương vật', 'âm đạo', 'dâm thủy', 'lồn', 'cặc', 'buồi', 'đụ', 'xoa vú', 'nứng', 'cặp mông', 'làm tình', 'mông', 'hậu môn', 'bú chim', 'chim to', 'nhấp liên tục', 'con cu', 'tử cung', 'bím', 'hột le', 'đầu vú', 'bầu vú', 'nắc liên tục', 'núm vú', 'âm hộ', 'bú vú', 'cặp nhũ hoa', 'cặp vú', 'bóp chặt cu', 'truyensex', 'vào háng']\n"
+     ]
+    }
+   ],
+   "source": [
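+    "sex_words = []\n",
+    "# Read one keyword per line; the file holds Vietnamese text, so decode it as UTF-8\n",
+    "with open(\"tu_khieu_dam.txt\", 'r', encoding='utf-8') as f:\n",
+    "    for line in f:\n",
+    "        sex_words.append(line.replace('\\n', ''))\n",
+    "print(sex_words)"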
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/tmp/ipykernel_3229728/1826687499.py:4: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'khieu_dam' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.\n",
+      "  df.loc[index, 'label'] = 'khieu_dam'\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Long làm việc chiều tối mới về. Không biết sao...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>– Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>– Long ngày nào cũng gọi nói nhảm vậy đó hả? –...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>Long hoảng hốt tôi đang nghe gì thế này, hình ...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>– Phụt! Em nói thật đi, cái gì, nãy em nói tý ...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>Long cảm giác hơi choáng. Rõ ràng là giọng Lin...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>– Bú chim anh đi! – Giong nam lại cất lên… – D...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>Long cảm thấy cơn ghen tuông nổi lên, cơn tức ...</td>\n",
+       "      <td>NaN</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                text      label\n",
+       "0  Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...        NaN\n",
+       "1  Long làm việc chiều tối mới về. Không biết sao...        NaN\n",
+       "2  – Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...        NaN\n",
+       "3  Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...        NaN\n",
+       "4  – Long ngày nào cũng gọi nói nhảm vậy đó hả? –...  khieu_dam\n",
+       "5  Long hoảng hốt tôi đang nghe gì thế này, hình ...        NaN\n",
+       "6  – Phụt! Em nói thật đi, cái gì, nãy em nói tý ...  khieu_dam\n",
+       "7  Long cảm giác hơi choáng. Rõ ràng là giọng Lin...  khieu_dam\n",
+       "8  – Bú chim anh đi! – Giong nam lại cất lên… – D...  khieu_dam\n",
+       "9  Long cảm thấy cơn ghen tuông nổi lên, cơn tức ...        NaN"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Label a row as soon as any keyword occurs in its lowercased text\n",
+    "for index, row in df.iterrows():\n",
+    "    for word in sex_words:\n",
+    "        if word in row['text'].lower():\n",
+    "            df.loc[index, 'label'] = 'khieu_dam'\n",
+    "            break\n",
+    "\n",
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3658\n"
+     ]
+    }
+   ],
+   "source": [
+    "label_count = df['label'].value_counts().get('khieu_dam', 0)\n",
+    "print(label_count)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_excel(\"van_ban_khieu_dam_processed.xlsx\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "bert",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
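
The pre-annotation notebook above labels every row whose text contains a keyword from a word list. A minimal standalone sketch of that idea, using only the standard library, might look like the following. The function name `pre_label`, the `default` argument, and the sample texts and keywords are illustrative assumptions, not part of the project.

```python
from typing import Iterable, List, Optional


def pre_label(texts: Iterable[str], keywords: Iterable[str], label: str,
              default: Optional[str] = None) -> List[Optional[str]]:
    """Return `label` for each text containing any keyword, else `default`."""
    # Normalize once so matching is case-insensitive; skip blank keywords,
    # since an empty string is a substring of every text
    needles = [k.strip().lower() for k in keywords if k.strip()]
    labels = []
    for text in texts:
        haystack = str(text).lower()
        # any() stops at the first hit, like an early exit from the keyword loop
        labels.append(label if any(n in haystack for n in needles) else default)
    return labels


# Placeholder data (hypothetical, not taken from the project's word list)
sample_texts = ["Totally harmless sentence", "This one has a BADWORD in it"]
sample_labels = pre_label(sample_texts, ["badword", ""], "flagged", default="khac")
print(sample_labels)
```

Plain substring matching is cheap but can over-match when a short keyword occurs inside a longer, unrelated word, so labels produced this way are probably best treated as a first pass for manual review.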
--- a/data/tu_khieu_dam.txt
+++ b/data/tu_khieu_dam.txt
+địt
+xuất tinh
+dương vật
+âm đạo
+dâm thủy
+lồn
+cặc
+buồi
+đụ
+xoa vú
+nứng
+cặp mông
+làm tình
+mông
+hậu môn
+bú chim
+chim to
+nhấp liên tục
+con cu
+tử cung
+bím
+hột le
+đầu vú
+bầu vú
+nắc liên tục
+núm vú
+âm hộ
+bú vú
+cặp nhũ hoa
+cặp vú
+bóp chặt cu
+truyensex
+vào háng
--- a/data/van_ban_khieu_dam.xlsx
+++ b/data/van_ban_khieu_dam.xlsx
--- a/docker-compose.yml
+++ b/docker-compose.yml
+version: '3.9' 
+
+# Settings and configurations that are common for containers
+x-nlpcore-common: &nlpcore-common
+  image: vn-text-moderation:latest
+  restart: always 
+  env_file: 
+    - env_file/minio.env
+  depends_on:
+    - minio
+  volumes:
+      - ./config.yaml:/src/config.yaml
+  deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              device_ids: ['0']
+              capabilities: [gpu]
+
+services:
+  minio:
+    image: minio/minio
+    restart: always
+    env_file: 
+      - env_file/minio.env
+    ports:
+      - "9090:9000"
+      - "9091:9001"
+    volumes:
+      - ./minio_data:/data
+    command: server --console-address ":9001" /data
+
+  sqldb:
+    image: postgres:13
+    restart: always
+    env_file: 
+      - env_file/sql.env
+      - env_file/minio.env
+    ports:
+      - "5432:5432"
+    volumes:
+      - ./postgres_data:/var/lib/postgresql/data
+
+  adminer:
+    image: adminer
+    environment:
+      ADMINER_DEFAULT_SERVER: sqldb
+    ports:
+      - 8080:8080
+
+  datamanager:
+    image: vn-text-moderation-data
+    restart: always 
+    env_file: 
+      - env_file/sql.env
+      - env_file/minio.env
+    depends_on:
+      - minio
+      - sqldb
+    volumes:
+      - ./config.yaml:/src/config.yaml
+    ports:
+      - "8008:8001"
+
+  nlpcore01:
+    <<: *nlpcore-common
+    hostname: nlpcore01 
+    ports: 
+      - "8002:8001"
+
+  nlpcore02:
+    <<: *nlpcore-common
+    hostname: nlpcore02 
+    ports: 
+      - "8003:8001"
+
+  # Load-balance the API across the nlpcore containers with nginx
+  nginx:
+    image: nginx:1.25.0
+    restart: always 
+    depends_on:
+      - nlpcore01
+      - nlpcore02
+    volumes:
+      - ./nginx/conf.d:/etc/nginx/conf.d
+      - ./nginx/log:/var/log/nginx/
+    ports:
+      - "8001:8001"
+
+  nlptraining:
+    image: vn-text-moderation-train:latest
+    restart: always
+    env_file: 
+      - env_file/minio.env
+    volumes:
+      - ./config.yaml:/src/config.yaml
+      - ./runs:/runs
+    ports: 
+      - "8000:8000"
+    deploy:
+        resources:
+          reservations:
+            devices:
+              - driver: nvidia
+                device_ids: ['0']
+                capabilities: [gpu]
+
+  tensorboard:
+    image: tensorflow/tensorflow:latest-py3
+    command: tensorboard --logdir=/logs --host 0.0.0.0
+    ports:
+      - "6006:6006"
+    volumes:
+      - ./runs:/logs  # Directory containing the TensorBoard logs
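
The compose file above publishes ten host ports across nine services, so a quick check that no two services claim the same host port can be useful before running `docker compose up`. The sketch below copies the mappings by hand from the file. The helper name `host_port_collisions` is hypothetical, and it assumes the simple `HOST:CONTAINER` form only (no IP prefixes or port ranges).

```python
from typing import Dict, List

# Host:container mappings copied from the compose file above
PORTS: Dict[str, List[str]] = {
    "minio": ["9090:9000", "9091:9001"],
    "sqldb": ["5432:5432"],
    "adminer": ["8080:8080"],
    "datamanager": ["8008:8001"],
    "nlpcore01": ["8002:8001"],
    "nlpcore02": ["8003:8001"],
    "nginx": ["8001:8001"],
    "nlptraining": ["8000:8000"],
    "tensorboard": ["6006:6006"],
}


def host_port_collisions(ports: Dict[str, List[str]]) -> Dict[int, List[str]]:
    """Map each host port claimed more than once to the services claiming it."""
    claimed: Dict[int, List[str]] = {}
    for service, mappings in ports.items():
        for mapping in mappings:
            # The host port is the part before the colon
            host_port = int(mapping.split(":")[0])
            claimed.setdefault(host_port, []).append(service)
    return {port: names for port, names in claimed.items() if len(names) > 1}


collisions = host_port_collisions(PORTS)
print(collisions or "no host port collisions")
```

Several services reuse container port 8001, which is fine because only the host side of each mapping has to be unique.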
--- a/env_file/minio.env
+++ b/env_file/minio.env
+MINIO_ROOT_USER=vivas
+MINIO_ROOT_PASSWORD=pad12345
\ No newline at end of file
--- a/env_file/sql.env
+++ b/env_file/sql.env
+POSTGRES_USER=vivas
+POSTGRES_PASSWORD=pad12345
+POSTGRES_DB=text_moderation
\ No newline at end of file
--- a/helper_tools/edit_data.ipynb
+++ b/helper_tools/edit_data.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Edit label data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd \n",
+    "\n",
+    "def get_data(path):\n",
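+    "    # Pick the reader from the file extension; fail loudly on anything else\n",
+    "    if path.lower().endswith(\".csv\"):\n",
+    "        df = pd.read_csv(path)\n",
+    "    elif path.lower().endswith(\".xlsx\"):\n",
+    "        df = pd.read_excel(path, sheet_name=\"Sheet1\")\n",
+    "    else:\n",
+    "        raise ValueError(f'Unsupported file type: {path}')\n",
+    "    return df"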
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Long làm việc chiều tối mới về. Không biết sao...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>– Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>– Long ngày nào cũng gọi nói nhảm vậy đó hả? –...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12432</th>\n",
+       "      <td>– Ahhh!!! Anh chịu hết nổi rồi… Chắc anh ra… m...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12433</th>\n",
+       "      <td>Nói xong nàng gồng người xuất tinh chan chứa. ...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12434</th>\n",
+       "      <td>Tôi ghì chặt lấy Hương một lúc lâu cho đến khi...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12435</th>\n",
+       "      <td>– Không chịu! Anh nói là xuất tinh cho em thấy...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12436</th>\n",
+       "      <td>Nói xong tôi hôn lên môi nàng một lần nữa, rồi...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>12437 rows × 2 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                    text      label\n",
+       "0      Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...       khac\n",
+       "1      Long làm việc chiều tối mới về. Không biết sao...       khac\n",
+       "2      – Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...       khac\n",
+       "3      Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...       khac\n",
+       "4      – Long ngày nào cũng gọi nói nhảm vậy đó hả? –...  khieu_dam\n",
+       "...                                                  ...        ...\n",
+       "12432  – Ahhh!!! Anh chịu hết nổi rồi… Chắc anh ra… m...       khac\n",
+       "12433  Nói xong nàng gồng người xuất tinh chan chứa. ...  khieu_dam\n",
+       "12434  Tôi ghì chặt lấy Hương một lúc lâu cho đến khi...  khieu_dam\n",
+       "12435  – Không chịu! Anh nói là xuất tinh cho em thấy...  khieu_dam\n",
+       "12436  Nói xong tôi hôn lên môi nàng một lần nữa, rồi...       khac\n",
+       "\n",
+       "[12437 rows x 2 columns]"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df = get_data('/home/anhcd/Products/kiem-duyet-noi-dung-van-ban-tieng-viet/data/van_ban_khieu_dam.xlsx')\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Long làm việc chiều tối mới về. Không biết sao...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>– Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>– Long ngày nào cũng gọi nói nhảm vậy đó hả? –...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                text      label\n",
+       "0  Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...       khac\n",
+       "1  Long làm việc chiều tối mới về. Không biết sao...       khac\n",
+       "2  – Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...       khac\n",
+       "3  Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...       khac\n",
+       "4  – Long ngày nào cũng gọi nói nhảm vậy đó hả? –...  khieu_dam"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.rename(columns={'content': 'text'}, inplace=True)\n",
+    "df.rename(columns={'index_spans': 'label'}, inplace=True)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def process_row(row):\n",
+    "    # Map the legacy label 'phan_cam' to 'thu_ghet'; other labels are kept as-is\n",
+    "    if row['label'] == \"phan_cam\":\n",
+    "        return \"thu_ghet\"\n",
+    "    else:\n",
+    "        return row['label']\n",
+    "\n",
+    "df['label'] = df.apply(process_row, axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>text</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Long làm việc chiều tối mới về. Không biết sao...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>– Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>– Long ngày nào cũng gọi nói nhảm vậy đó hả? –...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12432</th>\n",
+       "      <td>– Ahhh!!! Anh chịu hết nổi rồi… Chắc anh ra… m...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12433</th>\n",
+       "      <td>Nói xong nàng gồng người xuất tinh chan chứa. ...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12434</th>\n",
+       "      <td>Tôi ghì chặt lấy Hương một lúc lâu cho đến khi...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12435</th>\n",
+       "      <td>– Không chịu! Anh nói là xuất tinh cho em thấy...</td>\n",
+       "      <td>khieu_dam</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12436</th>\n",
+       "      <td>Nói xong tôi hôn lên môi nàng một lần nữa, rồi...</td>\n",
+       "      <td>khac</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>12437 rows × 2 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                    text      label\n",
+       "0      Hôm nay ngày đẹp trời. Long tâm trạng cực kỳ v...       khac\n",
+       "1      Long làm việc chiều tối mới về. Không biết sao...       khac\n",
+       "2      – Alo anh yêu à! Em nghe nè! Có chuyện gì khôn...       khac\n",
+       "3      Long nhớ vợ quá! Mới gọi chút cũng chả đỡ nhớ,...       khac\n",
+       "4      – Long ngày nào cũng gọi nói nhảm vậy đó hả? –...  khieu_dam\n",
+       "...                                                  ...        ...\n",
+       "12432  – Ahhh!!! Anh chịu hết nổi rồi… Chắc anh ra… m...       khac\n",
+       "12433  Nói xong nàng gồng người xuất tinh chan chứa. ...  khieu_dam\n",
+       "12434  Tôi ghì chặt lấy Hương một lúc lâu cho đến khi...  khieu_dam\n",
+       "12435  – Không chịu! Anh nói là xuất tinh cho em thấy...  khieu_dam\n",
+       "12436  Nói xong tôi hôn lên môi nàng một lần nữa, rồi...       khac\n",
+       "\n",
+       "[12437 rows x 2 columns]"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_excel(\"/home/anhcd/Products/kiem-duyet-noi-dung-van-ban-tieng-viet/data/van_ban_khieu_dam.xlsx\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Append DataFrames (concatenate rows)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd \n",
+    "\n",
+    "def get_data(path):\n",
+    "    if path.lower().endswith(\".csv\"):\n",
+    "        df = pd.read_csv(path)\n",
+    "    elif path.lower().endswith(\".xlsx\"):\n",
+    "        df = pd.read_excel(path, sheet_name=\"Sheet1\")\n",
+    "    else:\n",
+    "        raise ValueError(f\"Unsupported file type: {path}\")\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df1 = get_data('/home/anhcd/Products/kiem-duyet-noi-dung-van-ban-tieng-viet/data/ngon_tu_phan_cam1.xlsx')\n",
+    "df2 = get_data('/home/anhcd/Products/kiem-duyet-noi-dung-van-ban-tieng-viet/data/ngon_tu_phan_cam2.xlsx')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "8844"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(df1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "1106"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(df2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = [df1, df2]\n",
+    "df = pd.concat(data, ignore_index=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "9950"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_excel(\"/home/anhcd/Products/kiem-duyet-noi-dung-van-ban-tieng-viet/data/ngon_tu_phan_cam_all.xlsx\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "bert",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/infer.Dockerfile
+++ b/infer.Dockerfile
+FROM pytorch/pytorch:2.3.1-cuda11.8-cudnn8-runtime
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV PYTHONUNBUFFERED=True \
+    PORT=9090
+
+# Install dependencies
+RUN apt-get update \
+    && apt-get install -y git default-jre default-jdk
+
+WORKDIR /src
+RUN git clone https://github.com/vncorenlp/VnCoreNLP.git
+RUN git clone https://huggingface.co/vinai/phobert-base/
+COPY ./phobert-base/pytorch_model.bin /src/phobert-base/pytorch_model.bin 
+COPY ./server_infer/requirements.txt /src/requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+COPY ./server_infer/*.py /src/
+
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8001"]
\ No newline at end of file
--- a/lab/finetune_bert_classify.ipynb
+++ b/lab/finetune_bert_classify.ipynb
--- a/lab/preprocess_data.ipynb
+++ b/lab/preprocess_data.ipynb
--- a/nginx/conf.d/manager.conf
+++ b/nginx/conf.d/manager.conf
+upstream manager {
+    server nlpcore01:8001;
+    server nlpcore02:8001;
+}
+
+server {
+    listen 8001;
+
+    location / {
+        proxy_set_header        X-Real-IP  $remote_addr;
+        proxy_set_header        Host $host;
+        proxy_connect_timeout   1;
+        proxy_pass              http://manager;
+    }
+}
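The proxy above listens on port 8001 and forwards to the `manager` upstream, and the access log in this commit shows clients posting to `/text-classify`. Below is a minimal client sketch using only the standard library. The base URL `http://localhost:8001` and the helper names `build_classify_request` and `classify` are assumptions for illustration; the `paragraph` field comes from the `ParagraphRequest` model in `server_infer/server.py`.

```python
import json
import urllib.request


def build_classify_request(paragraph, base_url="http://localhost:8001"):
    """Build (but do not send) a POST request for the /text-classify endpoint."""
    # base_url is an assumed address; point it at wherever the nginx proxy is reachable.
    body = json.dumps({"paragraph": paragraph}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text-classify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(paragraph, base_url="http://localhost:8001", timeout=30):
    """Send the request and decode the JSON body returned by the service."""
    req = build_classify_request(paragraph, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Going by the `ChunkLabelResponse` model, a successful call to `classify` should return a list of objects, each with `chunk`, `label`, and `confidence` fields.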
--- a/nginx/log/access.log
+++ b/nginx/log/access.log
+10.3.3.60 - - [03/Jul/2024:09:42:06 +0000] "POST /text-classify HTTP/1.1" 502 157 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [03/Jul/2024:10:00:38 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [03/Jul/2024:10:01:27 +0000] "POST /text-classify HTTP/1.1" 499 0 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [03/Jul/2024:10:01:29 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [04/Jul/2024:08:30:25 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [04/Jul/2024:08:38:10 +0000] "\xFF\xF4\xFF\xFD\x06" 400 157 "-" "-" "-"
+10.3.3.60 - - [04/Jul/2024:08:38:20 +0000] "]" 400 157 "-" "-" "-"
+10.3.3.60 - - [04/Jul/2024:08:41:38 +0000] "POST /start-training HTTP/1.1" 404 22 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:03:50:04 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:46 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:49 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:50 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:52 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:53 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:53 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:54 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:55 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:04:10:56 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:09:29:33 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [05/Jul/2024:10:25:57 +0000] "POST /text-classify HTTP/1.1" 200 2309 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:09:46:05 +0000] "POST /text-classify HTTP/1.1" 200 2323 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:09:46:09 +0000] "POST /text-classify HTTP/1.1" 200 2328 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:09:46:10 +0000] "POST /text-classify HTTP/1.1" 200 2326 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:09:46:11 +0000] "POST /text-classify HTTP/1.1" 200 2327 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:09:46:12 +0000] "POST /text-classify HTTP/1.1" 200 2326 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:09:56:27 +0000] "POST /text-classify HTTP/1.1" 200 2306 "-" "PostmanRuntime/7.37.3" "-"
+10.3.3.60 - - [09/Jul/2024:10:33:07 +0000] "POST /text-classify HTTP/1.1" 200 2306 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:06:52:22 +0000] "POST /text-classify HTTP/1.1" 200 2313 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:08:40:49 +0000] "POST /text-classify HTTP/1.1" 200 2313 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:08:40:52 +0000] "POST /text-classify HTTP/1.1" 200 2313 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:09:59:05 +0000] "POST /text-classify HTTP/1.1" 200 2300 "-" "curl/7.81.0" "-"
+10.3.2.100 - - [10/Jul/2024:10:23:22 +0000] "POST /text-classify HTTP/1.1" 502 157 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:10:23:25 +0000] "POST /text-classify HTTP/1.1" 502 157 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:10:23:29 +0000] "POST /text-classify HTTP/1.1" 502 157 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [10/Jul/2024:10:23:35 +0000] "POST /text-classify HTTP/1.1" 200 2318 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [11/Jul/2024:06:33:03 +0000] "POST /text-classify HTTP/1.1" 200 2318 "-" "PostmanRuntime/7.37.3" "-"
+10.3.2.100 - - [11/Jul/2024:06:41:13 +0000] "POST /text-classify HTTP/1.1" 200 2318 "-" "PostmanRuntime/7.37.3" "-"
--- a/nginx/log/error.log
+++ b/nginx/log/error.log
--- a/server_infer/bert_model.py
+++ b/server_infer/bert_model.py
+import torch.nn as nn
+from transformers import AutoModel
+
+class BERTClassifier(nn.Module):
+    def __init__(self, model_bert, n_classes):
+        super(BERTClassifier, self).__init__()
+        self.bert = AutoModel.from_pretrained(model_bert, local_files_only=True)
+        self.drop = nn.Dropout(p=0.3)
+        self.fc = nn.Linear(self.bert.config.hidden_size, n_classes)
+        nn.init.normal_(self.fc.weight, std=0.02)
+        nn.init.normal_(self.fc.bias, 0)
+
+    def forward(self, input_ids, attention_mask):
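+        # return_dict=False makes the encoder return a (last_hidden_state, pooler_output) tuple.
+        # Without it, unpacking the ModelOutput yields its string keys and dropout fails.
+        _, output = self.bert(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            return_dict=False
+        )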
+
+        x = self.drop(output)
+        x = self.fc(x)
+        return x
--- a/server_infer/preprocess.py
+++ b/server_infer/preprocess.py
+import py_vncorenlp
+from utils import get_data_from_yaml
+
+config = get_data_from_yaml("config.yaml")
+VNCORENLP_DIR = config.get("vncorenlp")["save_dir"]
+
+
+# Load the word and sentence segmentation component
+rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=['wseg'], save_dir=VNCORENLP_DIR)
+
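+def preprocess_text(text):
+    # word_segment returns one segmented string per sentence;
+    # join with spaces so tokens at sentence boundaries do not fuse
+    content = rdrsegmenter.word_segment(text)
+    content = ' '.join(content)
+    return content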
\ No newline at end of file
--- a/server_infer/requirements.txt
+++ b/server_infer/requirements.txt
+py_vncorenlp
+fastapi
+uvicorn
+numpy
+transformers
+minio==7.2.7
\ No newline at end of file
--- a/server_infer/server.py
+++ b/server_infer/server.py
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+import torch
+from transformers import AutoTokenizer
+from contextlib import asynccontextmanager
+from minio import Minio
+from minio.error import S3Error
+import os
+from typing import List
+
+from bert_model import BERTClassifier
+from utils import get_data_from_yaml, split_chunk
+from preprocess import preprocess_text
+
+config = get_data_from_yaml("/src/config.yaml")
+DEVICE = config.get("device")
+CLASSES = config.get("classes")
+MINIO_SERVER = config.get("minio")["server"]
+MINIO_DATA_LABELED = config.get("minio")["data_labeled"]
+MINIO_MODEL_TRAINED = config.get("minio")["model_trained"]
+VNCORENLP_DIR = config.get("vncorenlp")["save_dir"]
+PHOBERTBASE_DIR = config.get("phobert_base")["save_dir"]
+MAX_TOKEN_LENGTH = config.get("phobert_base")["max_token_length"]
+MODEL_CHECKPOINT = config.get("model_checkpoint")
+CHUNK_SIZE = config.get("chunk_size")
+INFER_LENGTH = config.get("limit_infer_length")
+
+minio_client = Minio(
+    endpoint=MINIO_SERVER,
+    access_key=os.getenv("MINIO_ROOT_USER"),
+    secret_key=os.getenv("MINIO_ROOT_PASSWORD"),
+    secure=False
+)
+
+def download_latest_model():
+    # List objects in the bucket
+    objects = minio_client.list_objects(MINIO_MODEL_TRAINED)
+    latest_obj = None
+    latest_time = None
+    
+    for obj in objects:
+        if "best" in obj.object_name:
+            if latest_time is None or obj.last_modified > latest_time:
+                latest_time = obj.last_modified
+                latest_obj = obj
+
+    if latest_obj is not None:
+        try:
+            minio_client.fget_object(MINIO_MODEL_TRAINED, latest_obj.object_name, MODEL_CHECKPOINT)
+        except S3Error as exc:
+            print(f"Error occurred: {exc}")
+        return latest_obj.object_name
+    else:
+        raise Exception("No *best* models found in the bucket")
+
+
+
+tokenizer = AutoTokenizer.from_pretrained(PHOBERTBASE_DIR, local_files_only=True, use_fast=False)
+
+def infer(text, model, tokenizer, class_names, max_len=MAX_TOKEN_LENGTH+2):
+    encoded_review = tokenizer.encode_plus(
+        text,
+        max_length=max_len,
+        truncation=True,
+        add_special_tokens=True,
+        padding='max_length',
+        return_attention_mask=True,
+        return_token_type_ids=False,
+        return_tensors='pt',
+    )
+
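+    input_ids = encoded_review['input_ids'].to(DEVICE)
+    attention_mask = encoded_review['attention_mask'].to(DEVICE)
+
+    # Inference only: skip autograd bookkeeping
+    with torch.no_grad():
+        output = model(input_ids, attention_mask)
+    # Convert logits to probabilities so the reported confidence lies in [0, 1]
+    probs = torch.softmax(output, dim=1)
+    conf, y_pred = torch.max(probs, dim=1)
+
+    # Return plain Python values rather than one-element tensors
+    return conf.item(), class_names[y_pred.item()]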
+
+
+model = BERTClassifier(model_bert=PHOBERTBASE_DIR, n_classes=len(CLASSES))
+model.to(DEVICE)
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global model
+    try:
+        update_model = download_latest_model()
+        # map_location lets a checkpoint saved on GPU load on the configured device
+        model.load_state_dict(torch.load(MODEL_CHECKPOINT, map_location=DEVICE))
+        model.eval()
+        print(f"Model updated: {update_model}")
+    except Exception as e:
+        print(f"An error occurred: {e}")
+    yield
+    torch.cuda.empty_cache()
+    torch.cuda.ipc_collect()
+
+class ParagraphRequest(BaseModel):
+    paragraph: str
+
+class ChunkLabelResponse(BaseModel):
+    chunk: str
+    label: str
+    confidence: float
+
+app = FastAPI(lifespan=lifespan)
+
+@app.post("/text-classify", response_model=List[ChunkLabelResponse])
+async def process_paragraph(request: ParagraphRequest):
+    if len(request.paragraph) > INFER_LENGTH:
+        raise HTTPException(status_code=400, detail=f"Max length: {INFER_LENGTH}")
+    chunks = split_chunk(request.paragraph, max_words=CHUNK_SIZE)
+    response = []
+    for chunk in chunks:
+        text = preprocess_text(chunk)
+        confidence, processed_label = infer(text, model, tokenizer, CLASSES)
+        response.append(ChunkLabelResponse(chunk=chunk, label=processed_label, confidence=confidence))
+    return response
+
+
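A classifier head emits one logit per class, and softmax turns those logits into probabilities whose maximum can be reported as a confidence score. Below is a minimal, dependency-free sketch of that conversion. The helper names `softmax` and `top_label` are hypothetical, and the label list is only illustrative; the service reads its real class list from `classes` in `config.yaml`.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow on large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, class_names):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


if __name__ == "__main__":
    # Illustrative labels only; not the service's configured class list.
    print(top_label([0.0, 2.0, 1.0], ["khac", "khieu_dam", "thu_ghet"]))
```

A probability is easier to threshold or show to a reviewer than a raw logit, since it always lies between 0 and 1.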
--- a/server_infer/utils.py
+++ b/server_infer/utils.py
+import yaml 
+import re
+
+def get_data_from_yaml(filename):
+    try:
+        with open(filename, 'r') as f:
+            data = yaml.safe_load(f)
+    except IOError:
+        raise IOError(f"Error opening file: {filename}")
+
+    return data
+
+def split_chunk(text, max_words=200):
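+    # Match sentence-ending punctuation followed by whitespace, or a newline.
+    # The capturing group makes split() also return the punctuation as its own segment
+    # (and None when the newline branch matches).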
+    sentence_endings = re.compile(r'([.!?])\s+|\n')
+    
+    # Split text into segments based on the regular expression
+    segments = sentence_endings.split(text)
+    
+    paragraphs = []
+    current_paragraph = []
+    current_word_count = 0
+    
+    for segment in segments:
+        if segment is not None:
+            words = segment.split()
+            word_count = len(words)
+            
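+            # Check if adding this segment would exceed the max_words limit
+            # (skip the flush while the current paragraph is empty, to avoid emitting an empty chunk)
+            if current_paragraph and current_word_count + word_count > max_words:
+                # If so, finalize the current paragraph and start a new one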
+                paragraphs.append(' '.join(current_paragraph))
+                current_paragraph = words
+                current_word_count = word_count
+            else:
+                # Add the segment to the current paragraph
+                current_paragraph.extend(words)
+                current_word_count += word_count
+    
+    # Add the last paragraph if any
+    if current_paragraph:
+        paragraphs.append(' '.join(current_paragraph))
+    
+    return paragraphs
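The chunking idea in `split_chunk` (split into sentences, then pack them greedily up to a word budget) can be sketched as below. This is a hypothetical variant, not the committed function: the name `split_chunk_sketch` is an assumption, and it differs in using a lookbehind so punctuation stays attached to its sentence and in never emitting an empty chunk.

```python
import re


def split_chunk_sketch(text, max_words=200):
    """Greedy sentence packing: a hypothetical variant of split_chunk."""
    # Split after sentence-ending punctuation or at newlines; the lookbehind
    # keeps the punctuation attached to the sentence it ends.
    parts = re.split(r'(?<=[.!?])\s+|\n', text)
    sentences = [p for p in parts if p and p.strip()]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # Flush only when there is something to flush, so no empty chunk is produced.
        if current and count + len(words) > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)

    # Keep whatever remains after the last sentence.
    if current:
        chunks.append(' '.join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five! Six seven eight nine?\nTen."
    print(split_chunk_sketch(sample, max_words=5))
```

A single sentence longer than `max_words` is kept whole in this sketch, on the assumption that the tokenizer's own truncation handles the overflow downstream.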
--- a/server_manage_data/config.py
+++ b/server_manage_data/config.py
+from utils import get_data_from_yaml
+
+config = get_data_from_yaml("/src/config.yaml")
+DB_SERVER = config.get("sqldb")["server"]
+DB_TABLENAME = config.get("sqldb")["table"]
+MINIO_SERVER = config.get("minio")["server"]
+MINIO_DATA_LABELED = config.get("minio")["data_labeled"]
+MINIO_MODEL_TRAINED = config.get("minio")["model_trained"]
\ No newline at end of file
--- a/server_manage_data/database.py
+++ b/server_manage_data/database.py
+from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
+from sqlalchemy.orm import sessionmaker
+import os
+from config import DB_SERVER
+
+DATABASE_URL = f'postgresql+asyncpg://{os.getenv("POSTGRES_USER")}:{os.getenv("POSTGRES_PASSWORD")}@{DB_SERVER}/{os.getenv("POSTGRES_DB")}'
+
+engine = create_async_engine(DATABASE_URL, echo=True)
+SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine, class_=AsyncSession)
+
+async def get_db():
+    async with SessionLocal() as session:
+        yield session
--- a/server_manage_data/models.py
+++ b/server_manage_data/models.py
+from sqlalchemy import Column, Integer, String
+from sqlalchemy.orm import declarative_base
+
+from config import DB_TABLENAME
+
+Base = declarative_base()
+
+class LabeledData(Base):
+    __tablename__ = DB_TABLENAME
+    id = Column(Integer, primary_key=True, index=True)
+    paragraph = Column(String, nullable=False)
+    label = Column(String, nullable=False)
--- a/server_manage_data/requirements.txt
+++ b/server_manage_data/requirements.txt
+fastapi 
+sqlalchemy 
+asyncpg 
+psycopg2-binary
+minio==7.2.7
+pandas==2.2.1
--- a/server_manage_data/schemas.py
+++ b/server_manage_data/schemas.py
+from pydantic import BaseModel
+
+class LabeledDataCreate(BaseModel):
+    paragraph: str
+    label: str
--- a/server_manage_data/server.py
+++ b/server_manage_data/server.py
+import os
+from contextlib import asynccontextmanager
+
+from fastapi import FastAPI, Depends, HTTPException
+from minio import Minio
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from config import MINIO_SERVER, MINIO_MODEL_TRAINED, MINIO_DATA_LABELED
+from database import engine, get_db
+from models import Base, LabeledData
+from schemas import LabeledDataCreate
+from utils import check_bucket
+
+minio_client = Minio(
+    endpoint=MINIO_SERVER,
+    access_key=os.getenv("MINIO_ROOT_USER"),
+    secret_key=os.getenv("MINIO_ROOT_PASSWORD"),
+    secure=False
+)
+bucket_names = [MINIO_DATA_LABELED, MINIO_MODEL_TRAINED]
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    check_bucket(minio_client=minio_client, bucket_names=bucket_names)
+    async with engine.begin() as conn:
+        await conn.run_sync(Base.metadata.create_all)
+    yield
+
+app = FastAPI(lifespan=lifespan)
+
+@app.post("/add-data", response_model=LabeledDataCreate)
+async def create_labeled_data(data: LabeledDataCreate, db: AsyncSession = Depends(get_db)):
+    new_data = LabeledData(paragraph=data.paragraph, label=data.label)
+    try:
+        db.add(new_data)
+        await db.commit()
+        await db.refresh(new_data)
+    except Exception as e:
+        # Roll back the failed transaction before reporting the error
+        await db.rollback()
+        raise HTTPException(status_code=500, detail=str(e))
+    return new_data
+
--- a/server_manage_data/utils.py
+++ b/server_manage_data/utils.py
+import yaml 
+
+def get_data_from_yaml(filename):
+    try:
+        with open(filename, 'r') as f:
+            data = yaml.safe_load(f)
+    except IOError:
+        raise IOError(f"Error opening file: {filename}")
+
+    return data
+
+def check_bucket(minio_client, bucket_names):
+    # Check if bucket exists
+    for bucket_name in bucket_names:
+        found = minio_client.bucket_exists(bucket_name)
+        if not found:
+            # Create bucket
+            minio_client.make_bucket(bucket_name)
+            print(f"Bucket '{bucket_name}' created successfully.")
+        else:
+            print(f"Bucket '{bucket_name}' already exists.")
\ No newline at end of file
--- a/server_train/bert_model.py
+++ b/server_train/bert_model.py
+import torch.nn as nn
+from transformers import AutoModel
+
+class BERTClassifier(nn.Module):
+    def __init__(self, model_bert, n_classes):
+        super(BERTClassifier, self).__init__()
+        self.bert = AutoModel.from_pretrained(model_bert, local_files_only=True)
+        self.drop = nn.Dropout(p=0.3)
+        self.fc = nn.Linear(self.bert.config.hidden_size, n_classes)
+        nn.init.normal_(self.fc.weight, std=0.02)
+        nn.init.normal_(self.fc.bias, 0)
+
+    def forward(self, input_ids, attention_mask):
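+        # return_dict=False makes the encoder return a (last_hidden_state, pooler_output) tuple.
+        # Without it, unpacking the ModelOutput yields its string keys and dropout fails.
+        _, output = self.bert(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            return_dict=False
+        )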
+
+        x = self.drop(output)
+        x = self.fc(x)
+        return x
--- a/server_train/preprocess.py
+++ b/server_train/preprocess.py
+import py_vncorenlp
+
+from utils import get_data_from_yaml
+
+config = get_data_from_yaml("/src/config.yaml")
+CLASSES = config.get("classes")
+VNCORENLP_DIR = config.get("vncorenlp")["save_dir"]
+
+# Load the word and sentence segmentation component
+rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=['wseg'], save_dir=VNCORENLP_DIR)
+
+def preprocess_row(row):
+    # Check whether the label is valid
+    if row['label'] in CLASSES:
+        content = rdrsegmenter.word_segment(row['text'])
+        # word_segment returns one string per sentence; join with spaces so boundary tokens do not fuse
+        content = ' '.join(content)
+        return content
+    else:
+        # Invalid label: return None instead of the segmented text
+        return None
\ No newline at end of file
--- a/server_train/requirements.txt
+++ b/server_train/requirements.txt
+celery[redis]
+py_vncorenlp==0.1.4
+fastapi
+uvicorn
+numpy==1.24.3
+transformers==4.39.2
+pandas==2.2.1
+minio==7.2.7
+scikit-learn==1.4.1.post1
+scipy==1.12.0
+gensim==4.3.2
+tensorboard
\ No newline at end of file
--- a/server_train/server.py
+++ b/server_train/server.py
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+import threading
+import torch
+import numpy as np
+from transformers import AutoTokenizer
+from sklearn.model_selection import train_test_split
+import pandas as pd
+import os
+
+from gensim.utils import simple_preprocess
+from sklearn.model_selection import StratifiedKFold
+from contextlib import asynccontextmanager
+
+import torch.nn as nn
+from torch.optim import AdamW
+from torch.utils.data import Dataset, DataLoader
+from torch.utils.tensorboard import SummaryWriter
+from transformers import get_linear_schedule_with_warmup, AutoModel
+from io import BytesIO
+import sys
+import logging
+import datetime
+from typing import List
+from minio import Minio
+from minio.error import S3Error
+
+from bert_model import BERTClassifier
+from utils import get_data_from_yaml, seed_everything
+from preprocess import preprocess_row
+
+# Global variables to manage the training thread and the stop flag
+train_thread = None
+is_training = False
+stop_training_flag = threading.Event()
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+handler = logging.StreamHandler(sys.stderr)
+formatter = logging.Formatter('%(asctime)s %(levelname)s: %(message)s')
+handler.setFormatter(formatter)
+
+logger.addHandler(handler)
+
+
+config = get_data_from_yaml("/src/config.yaml")
+DEVICE = config.get("device")
+CLASSES = config.get("classes")
+MINIO_SERVER = config.get("minio")["server"]
+MINIO_DATA_LABELED = config.get("minio")["data_labeled"]
+MINIO_MODEL_TRAINED = config.get("minio")["model_trained"]
+VNCORENLP_DIR = config.get("vncorenlp")["save_dir"]
+PHOBERTBASE_DIR = config.get("phobert_base")["save_dir"]
+MAX_TOKEN_LENGTH = config.get("phobert_base")["max_token_length"]
+MODEL_CHECKPOINT = config.get("model_checkpoint")
+CHUNK_SIZE = config.get("chunk_size")
+EPOCH = config.get("training")["epoch"]
+K_FOLD = config.get("training")["k_fold"]
+TEST_RATIO = config.get("training")["test_ratio"]
+BATCH_SIZE = config.get("training")["batch_size"]
+LOAD_DATA_WORKER = config.get("training")["load_data_worker"]
+
+minio_client = Minio(
+    endpoint=MINIO_SERVER,
+    access_key=os.getenv("MINIO_ROOT_USER"),
+    secret_key=os.getenv("MINIO_ROOT_PASSWORD"),
+    secure=False
+)
+bucket_names = [MINIO_DATA_LABELED, MINIO_MODEL_TRAINED]
+
+def download_latest_model():
+    # List objects in the bucket
+    objects = minio_client.list_objects(MINIO_MODEL_TRAINED)
+    latest_obj = None
+    latest_time = None
+    
+    for obj in objects:
+        if "last" in obj.object_name:
+            if latest_time is None or obj.last_modified > latest_time:
+                latest_time = obj.last_modified
+                latest_obj = obj
+
+    if latest_obj is not None:
+        try:
+            minio_client.fget_object(MINIO_MODEL_TRAINED, latest_obj.object_name, MODEL_CHECKPOINT)
+        except S3Error as exc:
+            print(f"Error occurred: {exc}")
+        return latest_obj.object_name
+    else:
+        raise Exception("No *last* models found in the bucket")
+    
+
+def release_training_vram():
+    # Release the GPU memory used by the training process
+    torch.cuda.empty_cache()
+    torch.cuda.ipc_collect()
+
+class ClassifyDataset(Dataset):
+    def __init__(self, df, tokenizer, classes, max_len):
+        self.df = df
+        self.max_len = max_len
+        self.tokenizer = tokenizer
+        self.classes = classes
+    
+    def __len__(self):
+        return len(self.df)
+
+    def __getitem__(self, index):
+        """
+        To customize dataset, inherit from Dataset class and implement
+        __len__ & __getitem__
+        __getitem__ should return 
+            data:
+                input_ids
+                attention_masks
+                text
+                targets
+        """
+        row = self.df.iloc[index]
+        text, label = self.get_input_data(row)
+
+        # encode_plus will:
+        # (1) Split the text into tokens
+        # (2) Add the special start and end tokens ('<s>' and '</s>' for PhoBERT)
+        # (3) Truncate/pad the sequence to max_length
+        # (4) Map tokens to their IDs
+        # (5) Create the attention mask
+        # (6) Return a dictionary of outputs
+        encoding = self.tokenizer.encode_plus(
+            text,
+            truncation=True,
+            add_special_tokens=True,
+            max_length=self.max_len,
+            padding='max_length',
+            return_attention_mask=True,
+            return_token_type_ids=False,
+            return_tensors='pt',
+        )
+        
+        return {
+            'text': text,
+            'input_ids': encoding['input_ids'].flatten(),
+            'attention_masks': encoding['attention_mask'].flatten(),
+            'targets': torch.tensor(label, dtype=torch.long),
+        }
+
+
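+    def labelencoder(self, text):
+        # Map a label string to its index in self.classes
+        for i in range(len(self.classes)):
+            if text == self.classes[i]:
+                return i
+        raise ValueError(f"Unknown label: {text!r}")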
+
+    def get_input_data(self, row):
+        # Preprocessing: {remove icon, special character, lower}
+        text = row['text']
+        text = ' '.join(simple_preprocess(text))
+        label = self.labelencoder(row['label'])
+
+        return text, label
+
+def eval(model, valid_loader, criterion, device):
+    model.eval()
+    losses = []
+    correct = 0
+
+    with torch.no_grad():
+        for data in valid_loader:
+            input_ids = data['input_ids'].to(device)
+            attention_mask = data['attention_masks'].to(device)
+            targets = data['targets'].to(device)
+
+            outputs = model(
+                input_ids=input_ids,
+                attention_mask=attention_mask
+            )
+
+            _, pred = torch.max(outputs, dim=1)
+
+            loss = criterion(outputs, targets)
+            correct += torch.sum(pred == targets)
+            losses.append(loss.item())
+    
+    logger.info(f'Valid Accuracy: {correct.double()/len(valid_loader.dataset)} Loss: {np.mean(losses)}')
+    return correct.double()/len(valid_loader.dataset)
+
+def prepare_loaders(df, fold, tokenizer, classes, max_len, batch_size, num_workers):
+    df_train = df[df.kfold != fold].reset_index(drop=True)
+    df_valid = df[df.kfold == fold].reset_index(drop=True)
+    
+    train_dataset = ClassifyDataset(df_train, tokenizer, classes, max_len=max_len)
+    valid_dataset = ClassifyDataset(df_valid, tokenizer, classes, max_len=max_len)
+    
+    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)
+    valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)
+    
+    return train_loader, valid_loader
+
+
+def train_model(skf, train_df, tokenizer, use_pretrain=False):
+    global stop_training_flag, is_training
+    train_version = datetime.datetime.now().strftime("%d-%m-%Y_%Hh%Mm")
+    
+    writer = SummaryWriter(log_dir=f'/runs/{train_version}')
+
+    logger.info("START TRAINING PROCESS")
+    model = BERTClassifier(model_bert=PHOBERTBASE_DIR, n_classes=len(CLASSES)).to(DEVICE)
+    if use_pretrain:
+        model.load_state_dict(torch.load(MODEL_CHECKPOINT))
+
+    epoch_each_fold = int(EPOCH/skf.n_splits)
+    cur_epoch = 0
+
+    for fold in range(skf.n_splits):
+        if stop_training_flag.is_set():
+            print("Training stopped.")
+            writer.close()
+            return "Training stopped"
+        logger.info(f'-----------Fold: {fold+1} ------------------')
+        train_loader, valid_loader = prepare_loaders(train_df, fold=fold, tokenizer=tokenizer, classes=CLASSES, \
+                                                     max_len=MAX_TOKEN_LENGTH, batch_size=BATCH_SIZE, num_workers=LOAD_DATA_WORKER)
+        criterion = nn.CrossEntropyLoss()
+        # Recommendation by BERT: lr: 5e-5, 2e-5, 3e-5
+        # Batchsize: 16, 32
+        optimizer = AdamW(model.parameters(), lr=2e-5)
+        
+        lr_scheduler = get_linear_schedule_with_warmup( 
+                    optimizer, 
+                    num_warmup_steps=0, 
+                    num_training_steps=len(train_loader)*epoch_each_fold
+                )
+        best_acc = 0
+        for e in range(epoch_each_fold):
+            if stop_training_flag.is_set():
+                print("Training stopped.")
+                writer.close()
+                return "Training stopped"
+            logger.info(f'Fold {fold+1} Epoch {cur_epoch+1}/{EPOCH}')
+            logger.info('-'*30)
+            # Train ----------------------------------------------------------------------------------
+            model.train()
+            losses = []
+            correct = 0
+
+            for data in train_loader:
+                if stop_training_flag.is_set():
+                    print("Training stopped.")
+                    writer.close()
+                    return "Training stopped"
+                if data is not None:         
+                    input_ids = data['input_ids'].to(DEVICE)
+                    attention_mask = data['attention_masks'].to(DEVICE)
+                    targets = data['targets'].to(DEVICE)
+
+                    optimizer.zero_grad()
+                    outputs = model(
+                        input_ids=input_ids,
+                        attention_mask=attention_mask
+                    )
+
+                    loss = criterion(outputs, targets)
+                    _, pred = torch.max(outputs, dim=1)
+
+                    correct += torch.sum(pred == targets)
+                    losses.append(loss.item())
+                    loss.backward()
+                    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+                    optimizer.step()
+                    lr_scheduler.step()
+                else:
+                    logger.warning("Warning: Empty data received from data loader. Skipping this iteration.")
+
+            train_acc = correct.double()/len(train_loader.dataset)
+            train_loss = np.mean(losses)
+            logger.info(f'Train Accuracy: {train_acc} Loss: {train_loss}')
+            writer.add_scalar("Accuracy/train", train_acc, cur_epoch)
+            writer.add_scalar("Loss/train", train_loss, cur_epoch)
+            # End train ----------------------------------------------------------------------------------
+            if stop_training_flag.is_set():
+                print("Training stopped.")
+                writer.close()
+                return "Training stopped"
+            # Valid --------------------------------------------------------------------------------------
+            model.eval()
+            losses = []
+            correct = 0
+
+            with torch.no_grad():
+                for data in valid_loader:
+                    if stop_training_flag.is_set():
+                        print("Training stopped.")
+                        writer.close()
+                        return "Training stopped"
+                    input_ids = data['input_ids'].to(DEVICE)
+                    attention_mask = data['attention_masks'].to(DEVICE)
+                    targets = data['targets'].to(DEVICE)
+
+                    outputs = model(
+                        input_ids=input_ids,
+                        attention_mask=attention_mask
+                    )
+
+                    _, pred = torch.max(outputs, dim=1)
+
+                    loss = criterion(outputs, targets)
+                    correct += torch.sum(pred == targets)
+                    losses.append(loss.item())
+            
+            val_acc = correct.double()/len(valid_loader.dataset)
+            val_loss = np.mean(losses)
+            logger.info(f'Valid Accuracy: {val_acc} Loss: {val_loss}')
+            writer.add_scalar("Accuracy/valid", val_acc, cur_epoch)
+            writer.add_scalar("Loss/valid", val_loss, cur_epoch)
+            # End valid ----------------------------------------------------------------------------------
+            
+            if stop_training_flag.is_set():
+                print("Training stopped.")
+                writer.close()
+                return "Training stopped"
+            
+            # Save checkpoint
+            torch.save(model.state_dict(), f'/src/phobert_last.pth')
+
+            if val_acc > best_acc:
+                torch.save(model.state_dict(), f'/src/phobert_best.pth')
+                best_acc = val_acc
+                checkpoint_best = f'phobert_best_{train_version}.pth'
+                try:
+                    upload_best = minio_client.fput_object(
+                        MINIO_MODEL_TRAINED,      # Bucket name
+                        checkpoint_best,      # Object name (name in MinIO)
+                        '/src/phobert_best.pth'      # Path to the file you want to upload
+                    )
+                except S3Error as exc:
+                    print(f"Error occurred: {exc}")
+                else:
+                    print(f"File uploaded successfully. {checkpoint_best}")
+
+            if stop_training_flag.is_set():
+                print("Training stopped.")
+                writer.close()
+                return "Training stopped"
+            
+            # Upload file
+            checkpoint_last = f'phobert_last_{train_version}.pth'
+            try:
+                upload_last = minio_client.fput_object(
+                    MINIO_MODEL_TRAINED,      # Bucket name
+                    checkpoint_last,      # Object name (name in MinIO)
+                    '/src/phobert_last.pth'      # Path to the file you want to upload
+                )
+            except S3Error as exc:
+                print(f"Error occurred: {exc}")
+            else:
+                print(f"File uploaded successfully. {checkpoint_last}")
+
+            cur_epoch = cur_epoch + 1
+
+    print("Training completed.")
+    writer.close()
+    release_training_vram()
+    is_training = False
+    return "Training completed"
+
+
+    
+class TrainingRequest(BaseModel):
+    pretrain: str
+
+class TrainingResponse(BaseModel):
+    status: str
+    message: str
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    print(f'CUDA available: {torch.cuda.is_available()}')
+    yield
+    torch.cuda.empty_cache()
+    torch.cuda.ipc_collect()
+
+app = FastAPI(lifespan=lifespan)
+
+@app.post("/start-training", response_model=TrainingResponse)
+async def start_training(request: TrainingRequest):
+    global train_thread, stop_training_flag, is_training
+    if is_training:
+        raise HTTPException(status_code=400, detail="Training already in progress")
+    is_training = True
+    
+    use_pretrain = False
+
+    if request.pretrain == "":
+        print("Training from scratch")
+    elif request.pretrain == "latest":
+        try:
+            latest_model = download_latest_model()
+        except Exception as exc:
+            is_training = False
+            raise HTTPException(status_code=500, detail=str(exc))
+        use_pretrain = True
+        print(f"Training from latest: {latest_model}")
+    else:
+        try:
+            minio_client.fget_object(MINIO_MODEL_TRAINED, request.pretrain, MODEL_CHECKPOINT)
+        except Exception as exc:
+            is_training = False
+            raise HTTPException(status_code=500, detail=str(exc))
+        use_pretrain = True
+        print(f"Training from : {request.pretrain}")
+
+    objects = minio_client.list_objects(MINIO_DATA_LABELED, recursive=True)
+
+    dataframes = []
+    
+    for obj in objects:
+        logger.info(f'Data found: {obj.object_name}')
+        try:
+            response = minio_client.get_object(MINIO_DATA_LABELED, obj.object_name)
+            file_data = BytesIO(response.read())
+            response.close()
+            response.release_conn()
+            df = pd.read_csv(file_data)
+            df = df[['text', 'label']]
+            dataframes.append(df)
+        except Exception as exc:
+            is_training = False
+            raise HTTPException(status_code=500, detail=str(exc))
+
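+    if not dataframes:
+        is_training = False
+        raise HTTPException(status_code=500, detail="No labeled data found in the bucket")
+    df = pd.concat(dataframes)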
+    df = df.dropna()
+
+    logger.info("Pull data done")
+
+    # Preprocessing: word segmentation
+    try:
+        df['text'] = df.apply(preprocess_row, axis=1)
+    except Exception as e:
+        logger.error(f"An error occurred: {e}")
+        is_training = False
+        raise HTTPException(status_code=500, detail=str(e))
+    df = df.dropna()
+    label_counts = df['label'].value_counts()
+    logger.info(f'Data label count: {label_counts}')
+    
+    # Use train_test_split to split the data
+    train_df, test_df = train_test_split(df, test_size=TEST_RATIO, stratify=df['label'], random_state=42)
+    train_df = train_df.reset_index(drop=True)
+    test_df = test_df.reset_index(drop=True)
+
+    # Check the result
+    logger.info(f"Train DataFrame size: {train_df.shape}")
+    logger.info(f"Test DataFrame size: {test_df.shape}")
+
+    # We will use Kfold later
+    skf = StratifiedKFold(n_splits=K_FOLD)
+    for fold, (_, val_) in enumerate(skf.split(X=train_df, y=train_df.label)):
+        train_df.loc[val_, "kfold"] = fold
+
+    tokenizer = AutoTokenizer.from_pretrained(PHOBERTBASE_DIR, local_files_only=True, use_fast=False)
+    # This ensures that any random number generation in NumPy, PyTorch (CPU and GPU), and cuDNN will be consistent
+    seed_everything(86)
+
+    stop_training_flag.clear()
+    train_thread = threading.Thread(target=train_model, args=(skf, train_df, tokenizer, use_pretrain))
+    train_thread.start()
+
+    return TrainingResponse(status="Training started", message="Tracking and visualizing metrics on TensorBoard UI: http://localhost:6006/")
+
+@app.post("/stop-training", response_model=TrainingResponse)
+async def stop_training():
+    global stop_training_flag, train_thread, is_training
+
+    if not train_thread or not train_thread.is_alive():
+        raise HTTPException(status_code=400, detail="No training in progress")
+
+    stop_training_flag.set()
+    train_thread.join()  # Wait for the thread to finish
+    release_training_vram()
+    is_training = False
+    return TrainingResponse(status="Training stopped", message="GPU VRAM has been released")
\ No newline at end of file
--- a/server_train/utils.py
+++ b/server_train/utils.py
+import numpy as np
+import torch
+import yaml 
+
+def get_data_from_yaml(filename):
+    try:
+        with open(filename, 'r') as f:
+            data = yaml.safe_load(f)
+    except IOError:
+        raise IOError(f"Error opening file: {filename}")
+
+    return data
+
+def seed_everything(seed_value):
+    np.random.seed(seed_value)
+    torch.manual_seed(seed_value)
+    
+    if torch.cuda.is_available(): 
+        print("CUDA available")
+        torch.cuda.manual_seed(seed_value)
+        torch.cuda.manual_seed_all(seed_value)
+        torch.backends.cudnn.deterministic = True
+        # benchmark mode auto-tunes kernels and can pick non-deterministic ones, so keep it off
+        torch.backends.cudnn.benchmark = False
--- a/train.Dockerfile
+++ b/train.Dockerfile
+FROM pytorch/pytorch:2.3.1-cuda11.8-cudnn8-runtime
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV PYTHONUNBUFFERED=True \
+    PORT=9090
+
+# Install dependencies
+RUN apt-get update \
+    && apt-get install -y git default-jre default-jdk
+
+WORKDIR /src
+RUN git clone https://github.com/vncorenlp/VnCoreNLP.git
+RUN git clone https://huggingface.co/vinai/phobert-base/
+COPY ./phobert-base/pytorch_model.bin /src/phobert-base/pytorch_model.bin
+COPY ./server_train/requirements.txt /src/requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+COPY ./server_train/*.py /src/
+
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
\ No newline at end of file