Init commit
commit 9b2a5497d9

.gitignore
@@ -0,0 +1,79 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
ENV/
env/
.venv/
.env/

# PyInstaller
*.manifest
*.spec

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# VS Code
.vscode/

# macOS
.DS_Store

# Logs
*.log

# dotenv
.env
.env.*

# Local settings
local_settings.py

# System files
Thumbs.db
ehthumbs.db
Desktop.ini

README.md
@@ -0,0 +1,314 @@

# 🧠 AI Lab – Transformers CLI Playground

> A **pedagogical and technical project** designed for AI practitioners and students to experiment with Hugging Face Transformers through an **interactive Command‑Line Interface (CLI)**.
> This playground provides ready‑to‑use NLP pipelines (Sentiment Analysis, Named Entity Recognition, Text Generation, Fill‑Mask, Moderation, etc.) in a modular, extensible, and educational codebase.

---

## 📚 Overview

The **AI Lab – Transformers CLI Playground** lets you explore multiple natural language processing tasks directly from the terminal.
Each task (e.g., sentiment, NER, text generation) is implemented as a **Command Module**, which interacts with a **Pipeline Module** built on top of the `transformers` library.

The lab is intentionally structured to demonstrate **clean software design for ML codebases** — with strict separation between configuration, pipelines, CLI logic, and display formatting.

---

## 🗂️ Project Structure

```text
src/
├── __init__.py
├── main.py              # CLI entry point
│
├── cli/
│   ├── __init__.py
│   ├── base.py          # CLICommand base class & interactive shell handler
│   └── display.py       # Console formatting utilities (tables, colors, results)
│
├── commands/            # User-facing commands wrapping pipeline logic
│   ├── __init__.py
│   ├── sentiment.py     # Sentiment analysis command
│   ├── fillmask.py      # Masked token prediction command
│   ├── textgen.py       # Text generation command
│   ├── ner.py           # Named Entity Recognition command
│   └── moderation.py    # Toxicity / content moderation command
│
├── pipelines/           # Machine learning logic (Hugging Face Transformers)
│   ├── __init__.py
│   ├── template.py      # Blueprint for creating new pipelines
│   ├── sentiment.py
│   ├── fillmask.py
│   ├── textgen.py
│   ├── ner.py
│   └── moderation.py
│
└── config/
    ├── __init__.py
    └── settings.py      # Global configuration (default models, parameters)
```

---

## ⚙️ Installation

### 🧾 Option 1 – Using Poetry (Recommended)

> Poetry is used as the main dependency manager.

```bash
# 1. Create and activate a new virtual environment
poetry shell

# 2. Install dependencies
poetry install
```

This automatically installs all dependencies declared in `pyproject.toml`, including **transformers** and **torch**.

To run the CLI inside the Poetry environment:

```bash
poetry run python src/main.py
```

---

### 📦 Option 2 – Using pip and requirements.txt

If you prefer to manage dependencies manually with `requirements.txt`:

```bash
# 1. Create a virtual environment
python -m venv .venv

# 2. Activate it
# Linux/macOS
source .venv/bin/activate
# Windows PowerShell
.venv\Scripts\Activate.ps1

# 3. Install dependencies
pip install -r requirements.txt
```

---

## ▶️ Usage

Once installed, launch the CLI with:

```bash
python -m src.main
# or, if using Poetry
poetry run python src/main.py
```

You’ll see an interactive menu listing the available commands:

```
Welcome to AI Lab - Transformers CLI Playground
Available commands:
  • sentiment  – Analyze the sentiment of a text
  • fillmask   – Predict masked words in a sentence
  • textgen    – Generate text from a prompt
  • ner        – Extract named entities from text
  • moderation – Detect toxic or unsafe content
```

### Example Sessions

#### 🔹 Sentiment Analysis
```text
💬 Enter text: I absolutely love this project!
→ Sentiment: POSITIVE (score: 0.998)
```

#### 🔹 Fill‑Mask
```text
💬 Enter text: The capital of France is [MASK].
→ Predictions:
   1) Paris   score: 0.87
   2) Lyon    score: 0.04
   3) London  score: 0.02
```

#### 🔹 Text Generation
```text
💬 Prompt: Once upon a time
→ Output: Once upon a time there was a young AI learning to code...
```

#### 🔹 NER (Named Entity Recognition)
```text
💬 Enter text: Elon Musk founded SpaceX in California.
→ Entities:
   - Elon Musk   (PER)
   - SpaceX      (ORG)
   - California  (LOC)
```

#### 🔹 Moderation
```text
💬 Enter text: I hate everything!
→ Result: FLAGGED (toxic content detected)
```

---

## 🧠 Architecture Overview

The internal structure follows a clean **Command ↔ Pipeline ↔ Display** pattern:

```text
┌──────────────────────┐
│    InteractiveCLI    │
│  (src/cli/base.py)   │
└──────────┬───────────┘
           │
           ▼
  ┌─────────────────┐
  │  Command Layer  │  ← e.g. sentiment.py
  │ (user commands) │
  └───────┬─────────┘
          │
          ▼
  ┌─────────────────┐
  │  Pipeline Layer │  ← e.g. pipelines/sentiment.py
  │    (ML logic)   │
  └───────┬─────────┘
          │
          ▼
  ┌─────────────────┐
  │  Display Layer  │  ← cli/display.py
  │ (format output) │
  └─────────────────┘
```
### Key Concepts

| Layer | Description |
|-------|-------------|
| **CLI** | Manages user input/output, help menus, and navigation between commands. |
| **Command** | Encapsulates a single user-facing operation (e.g., run sentiment). |
| **Pipeline** | Wraps Hugging Face’s `transformers.pipeline()` to perform inference. |
| **Display** | Handles clean console rendering (colored output, tables, JSON formatting). |
| **Config** | Centralizes model names, limits, and global constants. |
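
To make the pattern concrete, here is a minimal, hypothetical command that touches each layer once. The class name and sample text are illustrative only; the imports and method signatures match the code in this repository:

```python
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.sentiment import SentimentAnalyzer


class EchoSentimentCommand(CLICommand):
    """Illustrative command: classify one hard-coded sentence."""

    def __init__(self):
        self.analyzer = None  # pipeline is loaded lazily on first run

    @property
    def name(self) -> str:
        return "echo-sentiment"

    @property
    def description(self) -> str:
        return "Classify a fixed example sentence"

    def run(self) -> None:
        if self.analyzer is None:
            self.analyzer = SentimentAnalyzer()           # Pipeline layer
        result = self.analyzer.analyze("I love this!")    # ML inference
        print(DisplayFormatter.format_sentiment_result(result))  # Display layer
```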
---

## ⚙️ Configuration

All configuration is centralized in `src/config/settings.py`.

Example (matching the defaults shipped in `settings.py`):

```python
class Config:
    DEFAULT_MODELS = {
        "sentiment": "cardiffnlp/twitter-roberta-base-sentiment-latest",
        "fillmask": "distilbert-base-uncased",
        "textgen": "gpt2",
        "moderation": "unitary/toxic-bert",
        "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    }
    MAX_BATCH_SIZE = 32
    DEFAULT_MAX_LENGTH = 512
```

You can easily modify model names to experiment with different checkpoints.
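
Pipelines resolve their default checkpoint through `Config.get_model()` (defined in `src/config/settings.py`), so swapping a model is a one-line change: either edit `DEFAULT_MODELS` or pass a checkpoint explicitly. A small sketch (the alternative checkpoint is just an example):

```python
from src.config import Config
from src.pipelines.sentiment import SentimentAnalyzer

# Resolve the default checkpoint configured for a task
print(Config.get_model("sentiment"))  # -> cardiffnlp/twitter-roberta-base-sentiment-latest

# Or override it per instance without touching the config
analyzer = SentimentAnalyzer(model_name="distilbert-base-uncased-finetuned-sst-2-english")
```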
---

## 🧩 Extending the Playground

To create a new experiment (e.g., keyword extraction):

1. **Duplicate** `src/pipelines/template.py` → `src/pipelines/keywords.py`
   Implement the `process()` logic (or a task-specific method) using a new Hugging Face pipeline, as in the sketch after this list.

2. **Create a Command** in `src/commands/keywords.py` to interact with users.

3. **Register the command** inside `src/main.py`:

   ```python
   from src.commands.keywords import KeywordsCommand
   cli.register_command(KeywordsCommand())
   ```

4. Optionally, add a model name in `Config.DEFAULT_MODELS`.
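
As a sketch of step 1, a hypothetical `src/pipelines/keywords.py` could look like this. `KeywordsExtractor`, the placeholder model name, and the `token-classification` task choice are illustrative assumptions, not part of this repository; follow the structure of `template.py`:

```python
from transformers import pipeline
from typing import Dict, Optional


class KeywordsExtractor:
    """Illustrative pipeline: extract keyphrases via token classification."""

    def __init__(self, model_name: Optional[str] = None):
        # Placeholder checkpoint: substitute a real keyphrase-extraction model
        self.model_name = model_name or "your-keyphrase-model"
        self.pipeline = pipeline(
            "token-classification",
            model=self.model_name,
            aggregation_strategy="simple",
        )

    def process(self, text: str) -> Dict:
        if not text.strip():
            return {"error": "Empty text"}
        try:
            results = self.pipeline(text)
            # Deduplicate the predicted spans and return them sorted
            return {"text": text, "keywords": sorted({r["word"] for r in results})}
        except Exception as e:
            return {"error": f"Processing error: {str(e)}"}
```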

---

## 🧪 Testing

You can use `pytest` for lightweight validation:

```bash
pip install pytest
pytest -q
```

Recommended structure:

```
tests/
├── test_sentiment.py
├── test_textgen.py
└── ...
```
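
As a starting point, the pure formatting helpers can be tested without downloading any model. A minimal sketch (`tests/test_display.py` is a suggested file, not part of this commit; the expected strings mirror `src/cli/display.py`):

```python
from src.cli.display import DisplayFormatter


def test_format_sentiment_result_positive():
    result = {"sentiment": "POSITIVE", "confidence": 0.998}
    formatted = DisplayFormatter.format_sentiment_result(result)
    assert "POSITIVE" in formatted
    assert "99.80%" in formatted  # confidence is rendered with {:.2%}


def test_format_sentiment_result_error():
    # Error dictionaries short-circuit to a single ❌ line
    assert DisplayFormatter.format_sentiment_result({"error": "Empty text"}) == "❌ Empty text"
```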

---

## 🧰 Troubleshooting

| Issue | Cause / Solution |
|-------|------------------|
| **`transformers` not found** | Check that your virtual environment is activated. |
| **Torch fails to install** | Install the CPU-only build from the PyTorch index, e.g. `pip install torch --index-url https://download.pytorch.org/whl/cpu`. |
| **Models download slowly** | Hugging Face caches them after the first run. |
| **Unicode / accents broken** | Ensure your terminal encoding is UTF‑8. |

---

## 🧭 Development Guidelines

- Keep **Command** classes lightweight — no ML logic inside them.
- Reuse the **Pipeline Template** for new experiments.
- Format outputs consistently via the `DisplayFormatter`.
- Document all new models or commands in `README.md` and `settings.py`.

---

## 🧱 Roadmap

- [ ] Add non-interactive CLI flags (`--text`, `--task`)
- [ ] Add multilingual model options
- [ ] Add automatic test coverage
- [ ] Add logging and profiling utilities
- [ ] Add export of results to JSON/CSV

---

## 🪪 License

You can include a standard open-source license such as **MIT** or **Apache 2.0**, depending on your use case.

---

## 🤝 Contributing

This repository is meant as an **educational sandbox** for experimenting with Transformers.
Pull requests are welcome for new models, better CLI UX, or educational improvements.

---

### ✨ Key Takeaways

- Modular and pedagogical design for training environments
- Clean separation between **I/O**, **ML logic**, and **UX**
- Easily extensible architecture for adding custom pipelines
- A practical sandbox for students, researchers, and developers to learn modern NLP tools

---

> 🧩 Built for experimentation. Learn, break, and rebuild.

File diff suppressed because it is too large

pyproject.toml
@@ -0,0 +1,27 @@
[project]
name = "ai-lab"
version = "0.1.0"
description = "Lab for testing different uses of transformers"
authors = [{ name = "Cyril", email = "decostanzicyril@gmail.com" }]

[tool.poetry]
name = "ai-lab"
version = "0.1.0"
description = "Lab for testing different uses of transformers"
authors = ["Cyril"]
packages = [{ include = "src" }]

[tool.poetry.dependencies]
python = ">=3.12,<3.14"
torch = "^2.0.0"
transformers = "^4.30.0"
tokenizers = "^0.13.0"
numpy = "^1.24.0"
accelerate = "^0.20.0"

[tool.poetry.scripts]
ai-lab = "src.main:main"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

requirements.txt
@@ -0,0 +1,4 @@
torch>=2.0.0
transformers>=4.30.0
tokenizers>=0.13.0
numpy>=1.24.0

src/__init__.py
@@ -0,0 +1,4 @@
"""
AI Lab - Transformers Experimentation
"""
__version__ = "0.1.0"

src/cli/__init__.py
@@ -0,0 +1,7 @@
"""
CLI utilities for AI Lab
"""
from .base import CLICommand, InteractiveCLI
from .display import DisplayFormatter

__all__ = ['CLICommand', 'InteractiveCLI', 'DisplayFormatter']

src/cli/base.py
@@ -0,0 +1,87 @@
from abc import ABC, abstractmethod
from typing import Dict

from src.config import Config


class CLICommand(ABC):
    """Base class for CLI commands"""

    @property
    @abstractmethod
    def name(self) -> str:
        """Command name"""
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        """Command description"""
        pass

    @abstractmethod
    def run(self) -> None:
        """Execute the command"""
        pass


class InteractiveCLI:
    """Interactive CLI handler"""

    def __init__(self):
        self.commands: Dict[str, CLICommand] = {}

    def register_command(self, command: CLICommand):
        """Register a new command"""
        self.commands[command.name] = command

    def show_menu(self):
        """Display available commands"""
        print(Config.CLI_BANNER)
        print(Config.CLI_SEPARATOR)
        print("Available commands:")
        for name, cmd in self.commands.items():
            print(f"  📌 {name}: {cmd.description}")
        print("  📌 quit: Exit application")
        print("  📌 help: Show this help")
        print("-" * 50)

    def show_help(self):
        """Show detailed help"""
        print("\n📚 Detailed Help")
        print("-" * 30)
        print("Navigation:")
        print("  - Type a command name to execute it")
        print("  - Type 'back' in a command to return to menu")
        print("  - Type 'quit' or Ctrl+C to exit")
        print("\nAvailable commands:")
        for name, cmd in self.commands.items():
            print(f"  {name}: {cmd.description}")

    def run(self):
        """Run the interactive CLI"""
        self.show_menu()

        while True:
            try:
                choice = input("\n💬 Choose a command: ").strip().lower()

                if choice in ['quit', 'exit', 'q']:
                    print("👋 Goodbye!")
                    break

                if choice in ['help', 'h', '?']:
                    self.show_help()
                    continue

                if choice in self.commands:
                    print()  # Empty line for readability
                    self.commands[choice].run()
                    print()  # Empty line after command
                else:
                    print("❌ Unknown command. Type 'help' to see available commands.")

            except KeyboardInterrupt:
                print("\n👋 Stopping program")
                break
            except Exception as e:
                print(f"❌ Error: {e}")

src/cli/display.py
@@ -0,0 +1,192 @@
from typing import Dict, Any


class DisplayFormatter:
    """Utility class for formatting display output"""

    @staticmethod
    def format_sentiment_result(result: Dict[str, Any]) -> str:
        """Format sentiment analysis result for display"""
        if "error" in result:
            return f"❌ {result['error']}"

        sentiment = result["sentiment"]
        confidence = result["confidence"]
        emoji = "😊" if sentiment == "POSITIVE" else "😞"

        return f"{emoji} Sentiment: {sentiment}\n📊 Confidence: {confidence:.2%}"

    @staticmethod
    def show_loading(message: str = "Analysis in progress..."):
        """Show loading message"""
        print(f"\n🔍 {message}")

    @staticmethod
    def show_warning(message: str):
        """Show warning message"""
        print(f"⚠️ {message}")

    @staticmethod
    def show_error(message: str):
        """Show error message"""
        print(f"❌ {message}")

    @staticmethod
    def show_success(message: str):
        """Show success message"""
        print(f"✅ {message}")

    @staticmethod
    def format_fillmask_result(result: Dict[str, Any]) -> str:
        """Format fill-mask prediction result for display"""
        if "error" in result:
            return f"❌ {result['error']}"

        output = []
        output.append(f"📝 Original: {result['original_text']}")
        output.append(f"🎭 Masks found: {result['masks_count']}")
        output.append("")

        if result['masks_count'] == 1:
            # Single mask
            output.append("🔮 Predictions:")
            for i, pred in enumerate(result['predictions'], 1):
                confidence_bar = "█" * int(pred['score'] * 10)
                output.append(f"  {i}. '{pred['token']}' ({pred['score']:.1%}) {confidence_bar}")
                output.append(f"     → {pred['sequence']}")
        else:
            # Multiple masks
            for mask_info in result['predictions']:
                output.append(f"🔮 Mask #{mask_info['mask_position']} predictions:")
                for i, pred in enumerate(mask_info['predictions'], 1):
                    confidence_bar = "█" * int(pred['score'] * 10)
                    output.append(f"  {i}. '{pred['token']}' ({pred['score']:.1%}) {confidence_bar}")
                output.append("")

        return "\n".join(output)

    @staticmethod
    def format_textgen_result(result: Dict[str, Any]) -> str:
        """Format text generation result for display"""
        if "error" in result:
            return f"❌ {result['error']}"

        output = []
        output.append(f"📝 Prompt: {result['prompt']}")
        output.append(f"⚙️ Parameters: max_length={result['parameters']['max_length']}, "
                      f"temperature={result['parameters']['temperature']}")
        output.append("-" * 50)

        for i, gen in enumerate(result['generations'], 1):
            if len(result['generations']) > 1:
                output.append(f"🎯 Generation {i}:")

            output.append(f"📄 Full text: {gen['text']}")
            if gen['continuation']:
                output.append(f"✨ Continuation: {gen['continuation']}")

            if i < len(result['generations']):
                output.append("-" * 30)

        return "\n".join(output)

    @staticmethod
    def format_moderation_result(result: Dict[str, Any]) -> str:
        """Format content moderation result for display"""
        if "error" in result:
            return f"❌ {result['error']}"

        output = []
        output.append(f"📝 Original: {result['original_text']}")

        if result['is_modified']:
            output.append(f"🛡️ Moderated: {result['moderated_text']}")
            output.append(f"⚠️ Status: Content modified ({result['words_replaced']} words replaced)")
            status_emoji = "🔴"
        else:
            output.append("✅ Status: Content approved (no modifications needed)")
            status_emoji = "🟢"

        # Toxicity score bar
        score = result['toxic_score']
        score_bar = "█" * int(score * 10)
        output.append(f"{status_emoji} Toxicity Score: {score:.1%} {score_bar}")

        return "\n".join(output)

    @staticmethod
    def format_ner_result(result: Dict[str, Any]) -> str:
        """Format NER result for display"""
        if "error" in result:
            return f"❌ {result['error']}"

        output = []
        output.append(f"📝 Original: {result['original_text']}")
        output.append(f"✨ Highlighted: {result['highlighted_text']}")
        output.append(f"🎯 Found {result['total_entities']} entities (threshold: {result['confidence_threshold']:.2f})")

        if result['entities']:
            output.append("\n📋 Detected Entities:")
            for entity in result['entities']:
                confidence_bar = "█" * int(entity['confidence'] * 10)
                output.append(f"  {entity['emoji']} {entity['text']} → {entity['label']} "
                              f"({entity['confidence']:.1%}) {confidence_bar}")

        if result['entity_stats']:
            output.append("\n📊 Entity Statistics:")
            for entity_type, stats in result['entity_stats'].items():
                unique_entities = list(set(stats['entities']))
                # Fall back to a generic tag, then use the emoji of the first
                # detected entity that matches this type
                emoji = "🏷️"
                for ent in result['entities']:
                    if ent['label'] == entity_type:
                        emoji = ent['emoji']
                        break

                output.append(f"  {emoji} {entity_type}: {stats['count']} occurrences")
                if len(unique_entities) <= 3:
                    output.append(f"     → {', '.join(unique_entities)}")
                else:
                    output.append(f"     → {', '.join(unique_entities[:3])}... (+{len(unique_entities) - 3} more)")

        return "\n".join(output)

    @staticmethod
    def format_ner_analysis(result: Dict[str, Any]) -> str:
        """Format comprehensive NER document analysis"""
        if "error" in result:
            return f"❌ {result['error']}"

        output = []
        output.append("📊 Document Analysis Results")
        output.append("=" * 50)

        # Document statistics
        stats = result['document_stats']
        output.append(f"📄 Document: {stats['word_count']} words, {stats['char_count']} characters")
        output.append(f"📝 Structure: ~{stats['sentence_count']} sentences")
        output.append(f"🎯 Entity Density: {stats['entity_density']:.2%} (entities per word)")

        # Most common entity type
        if 'most_common_entity_type' in result:
            common = result['most_common_entity_type']
            output.append(f"🏆 Most Common: {common['emoji']} {common['type']} ({common['count']} occurrences)")

        output.append("\n✨ Highlighted Text:")
        output.append(result['highlighted_text'])

        if result['entity_stats']:
            output.append("\n📈 Detailed Statistics:")
            for entity_type, stats in result['entity_stats'].items():
                unique_entities = list(set(stats['entities']))
                emoji = "🏷️"
                for ent in result['entities']:
                    if ent['label'] == entity_type:
                        emoji = ent['emoji']
                        break

                output.append(f"\n{emoji} {entity_type} ({stats['count']} total):")
                for entity in unique_entities:
                    count = stats['entities'].count(entity)
                    output.append(f"  • {entity} ({count}x)")

        return "\n".join(output)

src/commands/__init__.py
@@ -0,0 +1,10 @@
"""
AI Lab commands
"""
from .sentiment import SentimentCommand
from .fillmask import FillMaskCommand
from .textgen import TextGenCommand
from .moderation import ModerationCommand
from .ner import NERCommand

__all__ = ['SentimentCommand', 'FillMaskCommand', 'TextGenCommand', 'ModerationCommand', 'NERCommand']

src/commands/fillmask.py
@@ -0,0 +1,84 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.fillmask import FillMaskAnalyzer


class FillMaskCommand(CLICommand):
    """Interactive fill-mask prediction command"""

    def __init__(self):
        self.analyzer = None

    @property
    def name(self) -> str:
        return "fillmask"

    @property
    def description(self) -> str:
        return "Interactive fill-mask token prediction"

    def _initialize_analyzer(self):
        """Lazy initialization of the analyzer"""
        if self.analyzer is None:
            print("🔄 Loading fill-mask model...")
            self.analyzer = FillMaskAnalyzer()
            DisplayFormatter.show_success("Model loaded!")

    def _show_instructions(self):
        """Show usage instructions"""
        print("\n📝 Fill-Mask Prediction")
        print("Replace words with [MASK] token and get predictions")
        print("\nExamples:")
        print("  - The weather today is [MASK]")
        print("  - I love to [MASK] music")
        print("  - Paris is the capital of [MASK]")
        print("\nType 'back' to return to main menu")
        print("Type 'help' to see these instructions again")
        print("-" * 50)

    def _get_top_k(self) -> int:
        """Get number of predictions from user"""
        while True:
            try:
                top_k_input = input("📊 Number of predictions (1-10, default=5): ").strip()
                if not top_k_input:
                    return 5

                top_k = int(top_k_input)
                if 1 <= top_k <= 10:
                    return top_k
                else:
                    DisplayFormatter.show_warning("Please enter a number between 1 and 10")
            except ValueError:
                DisplayFormatter.show_warning("Please enter a valid number")

    def run(self):
        """Run interactive fill-mask prediction"""
        self._initialize_analyzer()
        self._show_instructions()

        while True:
            text = input("\n💬 Enter text with [MASK]: ").strip()

            if text.lower() in ['back', 'return']:
                break

            if text.lower() == 'help':
                self._show_instructions()
                continue

            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue

            if "[MASK]" not in text:
                DisplayFormatter.show_warning("Text must contain [MASK] token")
                continue

            # Get number of predictions
            top_k = self._get_top_k()

            DisplayFormatter.show_loading("Predicting tokens...")
            result = self.analyzer.predict(text, top_k=top_k)
            formatted_result = DisplayFormatter.format_fillmask_result(result)
            print(formatted_result)

src/commands/moderation.py
@@ -0,0 +1,73 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.moderation import ContentModerator


class ModerationCommand(CLICommand):
    """Interactive content moderation command"""

    def __init__(self):
        self.moderator = None

    @property
    def name(self) -> str:
        return "moderation"

    @property
    def description(self) -> str:
        return "Content moderation and filtering"

    def _initialize_moderator(self):
        """Lazy initialization of the moderator"""
        if self.moderator is None:
            print("🔄 Loading content moderation model...")
            self.moderator = ContentModerator()
            DisplayFormatter.show_success("Moderation model loaded!")

    def run(self):
        """Run interactive content moderation"""
        self._initialize_moderator()

        print("\n🛡️ Content Moderation")
        print("Type 'back' to return to main menu")
        print("Type 'settings' to adjust moderation sensitivity")
        print("-" * 40)

        while True:
            text = input("\n📝 Enter text to moderate: ").strip()

            if text.lower() in ['back', 'return']:
                break

            if text.lower() == 'settings':
                self._show_settings()
                continue

            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue

            DisplayFormatter.show_loading("Analyzing content...")
            result = self.moderator.moderate(text)
            formatted_result = DisplayFormatter.format_moderation_result(result)
            print(formatted_result)

    def _show_settings(self):
        """Show and allow modification of moderation settings"""
        print("\n⚙️ Current Settings:")
        print(f"Toxicity threshold: {self.moderator.toxicity_threshold:.2f}")
        print("\nOptions:")
        print("1. Change threshold (0.0 = very strict, 1.0 = very permissive)")
        print("2. Back to moderation")

        choice = input("\nChoose option (1-2): ").strip()

        if choice == "1":
            try:
                new_threshold = float(input("Enter new threshold (0.0-1.0): "))
                self.moderator.set_threshold(new_threshold)
                DisplayFormatter.show_success(f"Threshold set to {new_threshold:.2f}")
            except ValueError:
                DisplayFormatter.show_error("Invalid threshold value")
        elif choice != "2":
            DisplayFormatter.show_warning("Invalid option")

src/commands/ner.py
@@ -0,0 +1,137 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.ner import NamedEntityRecognizer


class NERCommand(CLICommand):
    """Interactive Named Entity Recognition command"""

    def __init__(self):
        self.recognizer = None
        self.confidence_threshold = 0.9

    @property
    def name(self) -> str:
        return "ner"

    @property
    def description(self) -> str:
        return "Named Entity Recognition - Extract people, places, organizations"

    def _initialize_recognizer(self):
        """Lazy initialization of the recognizer"""
        if self.recognizer is None:
            print("🔄 Loading NER model...")
            self.recognizer = NamedEntityRecognizer()
            DisplayFormatter.show_success("NER model loaded!")

    def _show_instructions(self):
        """Show usage instructions and examples"""
        print("\n🎯 Named Entity Recognition")
        print("Extract and classify entities like people, organizations, locations, etc.")
        print("\n📝 Examples to try:")
        print("  - Apple Inc. was founded by Steve Jobs in Cupertino, California.")
        print("  - Barack Obama visited Paris in 2015 to meet Emmanuel Macron.")
        print("  - Microsoft acquired GitHub for $7.5 billion in June 2018.")
        print("\n🎛️ Commands:")
        print("  'back'     - Return to main menu")
        print("  'help'     - Show these instructions")
        print("  'settings' - Adjust confidence threshold")
        print("  'types'    - Show entity types")
        print("  'analyze'  - Detailed document analysis mode")
        print("-" * 60)

    def _show_entity_types(self):
        """Show available entity types"""
        entity_types = self.recognizer.get_entity_types()
        print("\n🏷️ Entity Types:")
        type_descriptions = {
            "PER": "Person names",
            "ORG": "Organizations, companies",
            "LOC": "Locations, places",
            "MISC": "Miscellaneous entities",
            "DATE": "Dates and time periods",
            "TIME": "Specific times",
            "MONEY": "Monetary amounts",
            "PERCENT": "Percentages"
        }

        for entity_type, emoji in entity_types.items():
            description = type_descriptions.get(entity_type, "Other entities")
            print(f"  {emoji} {entity_type}: {description}")

    def _adjust_settings(self):
        """Allow user to adjust confidence threshold"""
        print(f"\n⚙️ Current confidence threshold: {self.confidence_threshold:.2f}")
        print("Lower values = more entities detected (but less accurate)")
        print("Higher values = fewer entities detected (but more accurate)")

        try:
            new_threshold = input(f"Enter new threshold (0.1-1.0, current: {self.confidence_threshold}): ").strip()
            if new_threshold:
                threshold = float(new_threshold)
                if 0.1 <= threshold <= 1.0:
                    self.confidence_threshold = threshold
                    DisplayFormatter.show_success(f"Threshold set to {threshold:.2f}")
                else:
                    DisplayFormatter.show_warning("Threshold must be between 0.1 and 1.0")
        except ValueError:
            DisplayFormatter.show_error("Invalid threshold value")

    def _analyze_mode(self):
        """Document analysis mode with detailed statistics"""
        print("\n📊 Document Analysis Mode")
        print("Enter longer text for comprehensive entity analysis")
        print("Type 'done' when finished")
        print("-" * 40)

        lines = []
        while True:
            line = input("📝 ").strip()
            if line.lower() == 'done':
                break
            if line:
                lines.append(line)

        if not lines:
            DisplayFormatter.show_warning("No text entered")
            return

        document = " ".join(lines)
        DisplayFormatter.show_loading("Analyzing document...")

        result = self.recognizer.analyze_document(document, self.confidence_threshold)
        formatted_result = DisplayFormatter.format_ner_analysis(result)
        print(formatted_result)

    def run(self):
        """Run interactive NER"""
        self._initialize_recognizer()
        self._show_instructions()

        while True:
            text = input("\n💬 Enter text to analyze: ").strip()

            if text.lower() == 'back':
                break
            elif text.lower() == 'help':
                self._show_instructions()
                continue
            elif text.lower() == 'settings':
                self._adjust_settings()
                continue
            elif text.lower() == 'types':
                self._show_entity_types()
                continue
            elif text.lower() == 'analyze':
                self._analyze_mode()
                continue

            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue

            DisplayFormatter.show_loading("Extracting entities...")
            result = self.recognizer.recognize(text, self.confidence_threshold)
            formatted_result = DisplayFormatter.format_ner_result(result)
            print(formatted_result)

src/commands/sentiment.py
@@ -0,0 +1,48 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.sentiment import SentimentAnalyzer


class SentimentCommand(CLICommand):
    """Interactive sentiment analysis command"""

    def __init__(self):
        self.analyzer = None

    @property
    def name(self) -> str:
        return "sentiment"

    @property
    def description(self) -> str:
        return "Interactive sentiment analysis"

    def _initialize_analyzer(self):
        """Lazy initialization of the analyzer"""
        if self.analyzer is None:
            print("🔄 Loading sentiment model...")
            self.analyzer = SentimentAnalyzer()
            DisplayFormatter.show_success("Model loaded!")

    def run(self):
        """Run interactive sentiment analysis"""
        self._initialize_analyzer()

        print("\n📝 Sentiment Analysis")
        print("Type 'back' to return to main menu")
        print("-" * 30)

        while True:
            text = input("\n💬 Enter your text: ").strip()

            if text.lower() in ['back', 'return']:
                break

            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue

            DisplayFormatter.show_loading()
            result = self.analyzer.analyze(text)
            formatted_result = DisplayFormatter.format_sentiment_result(result)
            print(formatted_result)

src/commands/textgen.py
@@ -0,0 +1,95 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.textgen import TextGenerator


class TextGenCommand(CLICommand):
    """Interactive text generation command"""

    def __init__(self):
        self.generator = None
        self.default_params = {
            'max_length': 100,
            'num_return_sequences': 1,
            'temperature': 1.0,
            'do_sample': True
        }

    @property
    def name(self) -> str:
        return "textgen"

    @property
    def description(self) -> str:
        return "Interactive text generation"

    def _initialize_generator(self):
        """Lazy initialization of the generator"""
        if self.generator is None:
            print("🔄 Loading text generation model...")
            self.generator = TextGenerator()
            DisplayFormatter.show_success("Model loaded!")

    def _show_parameters(self):
        """Show current generation parameters"""
        print("\n⚙️ Current parameters:")
        for key, value in self.default_params.items():
            print(f"  {key}: {value}")

    def _update_parameters(self):
        """Allow user to update generation parameters"""
        print("\n🔧 Update parameters (press Enter to keep current value):")

        try:
            max_length = input(f"Max length ({self.default_params['max_length']}): ").strip()
            if max_length:
                self.default_params['max_length'] = int(max_length)

            num_sequences = input(f"Number of sequences ({self.default_params['num_return_sequences']}): ").strip()
            if num_sequences:
                self.default_params['num_return_sequences'] = int(num_sequences)

            temperature = input(f"Temperature ({self.default_params['temperature']}): ").strip()
            if temperature:
                self.default_params['temperature'] = float(temperature)

            do_sample = input(f"Use sampling ({self.default_params['do_sample']}): ").strip().lower()
            if do_sample in ['true', 'false']:
                self.default_params['do_sample'] = do_sample == 'true'

            DisplayFormatter.show_success("Parameters updated!")

        except ValueError as e:
            DisplayFormatter.show_error(f"Invalid parameter value: {e}")

    def run(self):
        """Run interactive text generation"""
        self._initialize_generator()

        print("\n📝 Text Generation")
        print("Commands:")
        print("  'back'   - Return to main menu")
        print("  'params' - Show current parameters")
        print("  'config' - Update parameters")
        print("-" * 40)

        while True:
            prompt = input("\n💬 Enter your prompt: ").strip()

            if prompt.lower() == 'back':
                break
            elif prompt.lower() == 'params':
                self._show_parameters()
                continue
            elif prompt.lower() == 'config':
                self._update_parameters()
                continue

            if not prompt:
                DisplayFormatter.show_warning("Please enter a prompt")
                continue

            DisplayFormatter.show_loading("Generating text...")
            result = self.generator.generate(prompt, **self.default_params)
            formatted_result = DisplayFormatter.format_textgen_result(result)
            print(formatted_result)

src/config/__init__.py
@@ -0,0 +1,6 @@
"""
Project configuration
"""
from .settings import Config

__all__ = ['Config']

src/config/settings.py
@@ -0,0 +1,40 @@
"""
Global project configuration
"""
from pathlib import Path
from typing import Dict


class Config:
    """Global application configuration"""

    # Paths
    PROJECT_ROOT = Path(__file__).parent.parent.parent
    SRC_DIR = PROJECT_ROOT / "src"

    # Default models
    DEFAULT_MODELS = {
        "sentiment": "cardiffnlp/twitter-roberta-base-sentiment-latest",
        "fillmask": "distilbert-base-uncased",
        "textgen": "gpt2",
        "moderation": "unitary/toxic-bert",
        "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    }

    # Interface
    CLI_BANNER = "🤖 AI Lab - Transformers Experimentation"
    CLI_SEPARATOR = "=" * 50

    # Performance
    MAX_BATCH_SIZE = 32
    DEFAULT_MAX_LENGTH = 512

    @classmethod
    def get_model(cls, pipeline_name: str) -> str:
        """Get default model for a pipeline"""
        return cls.DEFAULT_MODELS.get(pipeline_name, "")

    @classmethod
    def get_all_models(cls) -> Dict[str, str]:
        """Get all configured models"""
        return cls.DEFAULT_MODELS.copy()

src/main.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python3
"""
CLI entry point for AI Lab
"""
import sys
from pathlib import Path

# Add parent directory to PYTHONPATH so `src.*` imports resolve when the
# script is run directly (e.g. `python src/main.py`)
sys.path.insert(0, str(Path(__file__).parent.parent))

from src.cli import InteractiveCLI
from src.commands import SentimentCommand, FillMaskCommand, TextGenCommand, ModerationCommand, NERCommand


def main():
    """Main CLI function"""
    try:
        # Create CLI interface
        cli = InteractiveCLI()

        # Register available commands
        cli.register_command(SentimentCommand())
        cli.register_command(FillMaskCommand())
        cli.register_command(TextGenCommand())
        cli.register_command(ModerationCommand())
        cli.register_command(NERCommand())

        # Launch interactive interface
        cli.run()

    except KeyboardInterrupt:
        print("\n👋 Stopping program")
    except Exception as e:
        print(f"❌ Error: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()

src/pipelines/__init__.py
@@ -0,0 +1,11 @@
"""
Experimentation pipelines with transformers
"""
from .sentiment import SentimentAnalyzer
from .fillmask import FillMaskAnalyzer
from .textgen import TextGenerator
from .moderation import ContentModerator
from .ner import NamedEntityRecognizer
from .template import TemplatePipeline

__all__ = ['SentimentAnalyzer', 'FillMaskAnalyzer', 'TextGenerator', 'ContentModerator', 'NamedEntityRecognizer', 'TemplatePipeline']

src/pipelines/fillmask.py
@@ -0,0 +1,95 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config


class FillMaskAnalyzer:
    """Fill-mask analyzer using transformers"""

    def __init__(self, model_name: Optional[str] = None):
        """
        Initialize the fill-mask pipeline

        Args:
            model_name: Name of the model to use (optional)
        """
        self.model_name = model_name or Config.get_model("fillmask")
        print(f"Loading fill-mask model: {self.model_name}")
        self.pipeline = pipeline("fill-mask", model=self.model_name)
        print("Model loaded successfully!")

    def predict(self, text: str, top_k: int = 5) -> Dict:
        """
        Predict masked tokens in text

        Args:
            text: Text with [MASK] token(s) to predict
            top_k: Number of top predictions to return

        Returns:
            Dictionary with predictions and scores
        """
        if not text.strip():
            return {"error": "Empty text"}

        if "[MASK]" not in text:
            return {"error": "Text must contain [MASK] token"}

        try:
            results = self.pipeline(text, top_k=top_k)

            # Handle single mask vs multiple masks
            if isinstance(results, list) and isinstance(results[0], list):
                # Multiple masks: one list of predictions per [MASK] position
                predictions = []
                for i, mask_results in enumerate(results):
                    mask_predictions = [
                        {
                            "token": pred["token_str"],
                            "score": round(pred["score"], 4),
                            "sequence": pred["sequence"]
                        }
                        for pred in mask_results
                    ]
                    predictions.append({
                        "mask_position": i + 1,
                        "predictions": mask_predictions
                    })

                return {
                    "original_text": text,
                    "masks_count": len(results),
                    "predictions": predictions
                }
            else:
                # Single mask
                predictions = [
                    {
                        "token": pred["token_str"],
                        "score": round(pred["score"], 4),
                        "sequence": pred["sequence"]
                    }
                    for pred in results
                ]

                return {
                    "original_text": text,
                    "masks_count": 1,
                    "predictions": predictions
                }

        except Exception as e:
            return {"error": f"Prediction error: {str(e)}"}

    def predict_batch(self, texts: List[str], top_k: int = 5) -> List[Dict]:
        """
        Predict masked tokens for multiple texts

        Args:
            texts: List of texts with [MASK] tokens
            top_k: Number of top predictions to return

        Returns:
            List of prediction results
        """
        return [self.predict(text, top_k) for text in texts]

src/pipelines/moderation.py
@@ -0,0 +1,174 @@
from transformers import pipeline
from typing import Dict, List, Optional
import re
from src.config import Config


class ContentModerator:
    """Content moderator that detects and replaces inappropriate content"""

    def __init__(self, model_name: Optional[str] = None):
        """
        Initialize the content moderation pipeline

        Args:
            model_name: Name of the model to use (optional)
        """
        self.model_name = model_name or Config.get_model("moderation")
        print(f"Loading moderation model: {self.model_name}")
        self.classifier = pipeline("text-classification", model=self.model_name)
        print("Moderation model loaded successfully!")

        # Threshold for considering content as toxic
        self.toxicity_threshold = 0.5

    def moderate(self, text: str, replacement: str = "***") -> Dict:
        """
        Moderate content by detecting and replacing inappropriate words

        Args:
            text: Text to moderate
            replacement: String to replace inappropriate content with

        Returns:
            Dictionary with original text, moderated text, and detection info
        """
        if not text.strip():
            return {"error": "Empty text"}

        try:
            # First, check overall toxicity
            result = self.classifier(text)

            # Handle different model output formats
            if isinstance(result, list):
                predictions = result
            else:
                predictions = [result]

            # Find toxicity score
            toxic_score = 0.0
            is_toxic = False

            for pred in predictions:
                label = pred["label"].upper()
                score = pred["score"]

                # Check different possible toxic labels
                if label in ["TOXIC", "TOXICITY", "HARMFUL", "1"]:
                    toxic_score = max(toxic_score, score)
                    if score > self.toxicity_threshold:
                        is_toxic = True
                elif label in ["NOT_TOXIC", "CLEAN", "0"]:
                    # For models where a high score means NOT toxic
                    toxic_score = max(toxic_score, 1.0 - score)
                    if (1.0 - score) > self.toxicity_threshold:
                        is_toxic = True

            if not is_toxic:
                return {
                    "original_text": text,
                    "moderated_text": text,
                    "is_modified": False,
                    "toxic_score": toxic_score,
                    "words_replaced": 0
                }

            # If toxic, analyze word by word to find problematic parts
            moderated_text, words_replaced = self._moderate_by_words(text, replacement)

            return {
                "original_text": text,
                "moderated_text": moderated_text,
                "is_modified": True,
                "toxic_score": toxic_score,
                "words_replaced": words_replaced
            }

        except Exception as e:
            return {"error": f"Moderation error: {str(e)}"}

    def _moderate_by_words(self, text: str, replacement: str) -> tuple[str, int]:
        """
        Moderate text by analyzing individual words and phrases

        Args:
            text: Original text
            replacement: Replacement string

        Returns:
            Tuple of (moderated_text, words_replaced_count)
        """
        words = text.split()
        moderated_words = []
        words_replaced = 0

        # Check individual words
        for word in words:
            # Clean word for analysis (remove punctuation)
            clean_word = re.sub(r'[^\w]', '', word)
            if not clean_word:
                moderated_words.append(word)
                continue

            try:
                word_result = self.classifier(clean_word)

                # Handle different model output formats
                if isinstance(word_result, list):
                    predictions = word_result
                else:
                    predictions = [word_result]

                is_word_toxic = False
                for pred in predictions:
                    label = pred["label"].upper()
                    score = pred["score"]

                    if label in ["TOXIC", "TOXICITY", "HARMFUL", "1"]:
                        if score > self.toxicity_threshold:
                            is_word_toxic = True
                            break
                    elif label in ["NOT_TOXIC", "CLEAN", "0"]:
                        if (1.0 - score) > self.toxicity_threshold:
                            is_word_toxic = True
                            break

                if is_word_toxic:
                    # Replace the clean part with asterisks, keep punctuation
                    moderated_word = re.sub(r'\w+', replacement, word)
                    moderated_words.append(moderated_word)
                    words_replaced += 1
                else:
                    moderated_words.append(word)

            except Exception:
                # If analysis fails for a word, keep it as is
                moderated_words.append(word)

        return " ".join(moderated_words), words_replaced

    def moderate_batch(self, texts: List[str], replacement: str = "***") -> List[Dict]:
        """
        Moderate multiple texts

        Args:
            texts: List of texts to moderate
            replacement: String to replace inappropriate content with

        Returns:
            List of moderation results
        """
        return [self.moderate(text, replacement) for text in texts]

    def set_threshold(self, threshold: float):
        """
        Set the toxicity threshold

        Args:
            threshold: Threshold between 0 and 1
        """
        if 0 <= threshold <= 1:
            self.toxicity_threshold = threshold
        else:
            raise ValueError("Threshold must be between 0 and 1")

src/pipelines/ner.py
@@ -0,0 +1,179 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config


class NamedEntityRecognizer:
    """Named Entity Recognition using transformers"""

    def __init__(self, model_name: Optional[str] = None):
        """
        Initialize the NER pipeline

        Args:
            model_name: Name of the model to use (optional)
        """
        self.model_name = model_name or Config.get_model("ner")
        print(f"Loading NER model: {self.model_name}")
        self.pipeline = pipeline("ner", model=self.model_name, aggregation_strategy="simple")
        print("NER model loaded successfully!")

        # Entity type mappings for better display
        self.entity_colors = {
            "PER": "👤",      # Person
            "ORG": "🏢",      # Organization
            "LOC": "📍",      # Location
            "MISC": "🏷️",     # Miscellaneous
            "DATE": "📅",     # Date
            "TIME": "⏰",     # Time
            "MONEY": "💰",    # Money
            "PERCENT": "📊",  # Percentage
        }

    def recognize(self, text: str, confidence_threshold: float = 0.9) -> Dict:
        """
        Recognize named entities in text

        Args:
            text: Text to analyze
            confidence_threshold: Minimum confidence score for entities

        Returns:
            Dictionary with entities and their information
        """
        if not text.strip():
            return {"error": "Empty text"}

        try:
            entities = self.pipeline(text)

            # Filter by confidence and process entities
            filtered_entities = []
            entity_stats = {}

            for entity in entities:
                if entity["score"] >= confidence_threshold:
                    entity_type = entity["entity_group"]

                    processed_entity = {
                        "text": entity["word"],
                        "label": entity_type,
                        "confidence": round(entity["score"], 4),
                        "start": entity["start"],
                        "end": entity["end"],
                        "emoji": self.entity_colors.get(entity_type, "🏷️")
                    }

                    filtered_entities.append(processed_entity)

                    # Update statistics
                    if entity_type not in entity_stats:
                        entity_stats[entity_type] = {"count": 0, "entities": []}
                    entity_stats[entity_type]["count"] += 1
                    entity_stats[entity_type]["entities"].append(entity["word"])

            # Create highlighted text
            highlighted_text = self._highlight_entities(text, filtered_entities)

            return {
                "original_text": text,
                "highlighted_text": highlighted_text,
                "entities": filtered_entities,
                "entity_stats": entity_stats,
                "total_entities": len(filtered_entities),
                "confidence_threshold": confidence_threshold
            }

        except Exception as e:
            return {"error": f"NER processing error: {str(e)}"}

    def _highlight_entities(self, text: str, entities: List[Dict]) -> str:
        """
        Create highlighted version of text with entity markers

        Args:
            text: Original text
            entities: List of detected entities

        Returns:
            Text with highlighted entities
        """
        if not entities:
            return text

        # Sort entities by start position (reverse order for replacement)
        sorted_entities = sorted(entities, key=lambda x: x["start"], reverse=True)

        highlighted = text
        for entity in sorted_entities:
            start, end = entity["start"], entity["end"]
            entity_text = entity["text"]
            emoji = entity["emoji"]
            label = entity["label"]
            confidence = entity["confidence"]

            # Create highlighted version
            highlight = f"{emoji}[{entity_text}]({label}:{confidence:.2f})"
            highlighted = highlighted[:start] + highlight + highlighted[end:]

        return highlighted

    def analyze_document(self, text: str, confidence_threshold: float = 0.9) -> Dict:
        """
        Perform comprehensive document analysis with entity extraction

        Args:
            text: Document text to analyze
            confidence_threshold: Minimum confidence for entities

        Returns:
            Comprehensive analysis results
        """
        result = self.recognize(text, confidence_threshold)

        if "error" in result:
            return result

        # Additional analysis
        analysis = {
            **result,
            "document_stats": {
                "word_count": len(text.split()),
                "char_count": len(text),
                "sentence_count": len([s for s in text.split('.') if s.strip()]),
                "entity_density": len(result["entities"]) / len(text.split()) if text.split() else 0
            }
        }

        # Find most common entity types
        if result["entity_stats"]:
            most_common_type = max(result["entity_stats"].items(), key=lambda x: x[1]["count"])
            analysis["most_common_entity_type"] = {
                "type": most_common_type[0],
                "count": most_common_type[1]["count"],
                "emoji": self.entity_colors.get(most_common_type[0], "🏷️")
            }

        return analysis

    def recognize_batch(self, texts: List[str], confidence_threshold: float = 0.9) -> List[Dict]:
        """
        Recognize entities in multiple texts

        Args:
            texts: List of texts to analyze
            confidence_threshold: Minimum confidence for entities

        Returns:
            List of NER results
        """
        return [self.recognize(text, confidence_threshold) for text in texts]

    def get_entity_types(self) -> Dict[str, str]:
        """
        Get available entity types with their emojis

        Returns:
            Dictionary mapping entity types to emojis
        """
        return self.entity_colors.copy()

src/pipelines/sentiment.py
@@ -0,0 +1,54 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config


class SentimentAnalyzer:
    """Sentiment analyzer using transformers"""

    def __init__(self, model_name: Optional[str] = None):
        """
        Initialize the sentiment-analysis pipeline

        Args:
            model_name: Name of the model to use (optional)
        """
        self.model_name = model_name or Config.get_model("sentiment")
        print(f"Loading sentiment model: {self.model_name}")
        self.pipeline = pipeline("sentiment-analysis", model=self.model_name)
        print("Model loaded successfully!")

    def analyze(self, text: str) -> Dict:
        """
        Analyze the sentiment of a text

        Args:
            text: Text to analyze

        Returns:
            Dictionary with label and confidence score
        """
        if not text.strip():
            return {"error": "Empty text"}

        try:
            result = self.pipeline(text)[0]
            return {
                "text": text,
                "sentiment": result["label"],
                "confidence": round(result["score"], 4)
            }
        except Exception as e:
            return {"error": f"Analysis error: {str(e)}"}

    def analyze_batch(self, texts: List[str]) -> List[Dict]:
        """
        Analyze the sentiment of multiple texts

        Args:
            texts: List of texts to analyze

        Returns:
            List of analysis results
        """
        return [self.analyze(text) for text in texts]

src/pipelines/template.py
@@ -0,0 +1,59 @@
"""
Template for creating new pipelines
Copy this file and adapt it according to your needs
"""
from transformers import pipeline
from typing import Dict, List, Optional


class TemplatePipeline:
    """Template for a new pipeline"""

    def __init__(self, model_name: Optional[str] = None):
        """
        Initialize the pipeline

        Args:
            model_name: Name of the model to use (optional)
        """
        self.model_name = model_name or "distilbert-base-uncased"
        print(f"Loading model {self.model_name}...")

        # Replace "text-classification" with your task
        self.pipeline = pipeline("text-classification", model=self.model_name)
        print("Model loaded successfully!")

    def process(self, text: str) -> Dict:
        """
        Process a text

        Args:
            text: Text to process

        Returns:
            Dictionary with results
        """
        if not text.strip():
            return {"error": "Empty text"}

        try:
            result = self.pipeline(text)
            return {
                "text": text,
                "result": result,
                # Add other fields according to your needs
            }
        except Exception as e:
            return {"error": f"Processing error: {str(e)}"}

    def process_batch(self, texts: List[str]) -> List[Dict]:
        """
        Process multiple texts

        Args:
            texts: List of texts to process

        Returns:
            List of results
        """
        return [self.process(text) for text in texts]

src/pipelines/textgen.py
@@ -0,0 +1,82 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config


class TextGenerator:
    """Text generator using transformers"""

    def __init__(self, model_name: Optional[str] = None):
        """
        Initialize the text-generation pipeline

        Args:
            model_name: Name of the model to use (optional)
        """
        self.model_name = model_name or Config.get_model("textgen")
        print(f"Loading text generation model: {self.model_name}")
        self.pipeline = pipeline("text-generation", model=self.model_name)
        print("Model loaded successfully!")

    def generate(self, prompt: str, max_length: int = 100, num_return_sequences: int = 1,
                 temperature: float = 1.0, do_sample: bool = True) -> Dict:
        """
        Generate text from a prompt

        Args:
            prompt: Input text prompt
            max_length: Maximum length of generated text
            num_return_sequences: Number of sequences to generate
            temperature: Sampling temperature (higher = more random)
            do_sample: Whether to use sampling

        Returns:
            Dictionary with generated texts
        """
        if not prompt.strip():
            return {"error": "Empty prompt"}

        try:
            results = self.pipeline(
                prompt,
                max_length=max_length,
                num_return_sequences=num_return_sequences,
                temperature=temperature,
                do_sample=do_sample,
                pad_token_id=self.pipeline.tokenizer.eos_token_id
            )

            generations = [
                {
                    "text": result["generated_text"],
                    "continuation": result["generated_text"][len(prompt):].strip()
                }
                for result in results
            ]

            return {
                "prompt": prompt,
                "parameters": {
                    "max_length": max_length,
                    "num_sequences": num_return_sequences,
                    "temperature": temperature,
                    "do_sample": do_sample
                },
                "generations": generations
            }

        except Exception as e:
            return {"error": f"Generation error: {str(e)}"}

    def generate_batch(self, prompts: List[str], **kwargs) -> List[Dict]:
        """
        Generate text for multiple prompts

        Args:
            prompts: List of input prompts
            **kwargs: Generation parameters

        Returns:
            List of generation results
        """
        return [self.generate(prompt, **kwargs) for prompt in prompts]