Init commit

This commit is contained in:
Cyril 2025-10-11 13:26:06 +02:00
commit 9b2a5497d9
25 changed files with 3343 additions and 0 deletions

79
.gitignore vendored Normal file

@@ -0,0 +1,79 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
ENV/
env/
.venv/
.env/
# PyInstaller
*.manifest
*.spec
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# VS Code
.vscode/
# macOS
.DS_Store
# Logs
*.log
# dotenv
.env
.env.*
# Local settings
local_settings.py
# System files
Thumbs.db
ehthumbs.db
Desktop.ini

314
README.md Normal file

@@ -0,0 +1,314 @@
# 🧠 AI Lab Transformers CLI Playground
> A **pedagogical and technical project** designed for AI practitioners and students to experiment with Hugging Face Transformers through an **interactive Command-Line Interface (CLI)**.
> This playground provides ready-to-use NLP pipelines (Sentiment Analysis, Named Entity Recognition, Text Generation, Fill-Mask, Moderation, etc.) in a modular, extensible, and educational codebase.
---
## 📚 Overview
The **AI Lab Transformers CLI Playground** allows you to explore multiple natural language processing tasks directly from the terminal.
Each task (e.g., sentiment, NER, text generation) is implemented as a **Command Module**, which interacts with a **Pipeline Module** built on top of the `transformers` library.
The lab is intentionally structured to demonstrate **clean software design for ML codebases** — with strict separation between configuration, pipelines, CLI logic, and display formatting.
---
## 🗂️ Project Structure
```text
src/
├── __init__.py
├── main.py                # CLI entry point
├── cli/
│   ├── __init__.py
│   ├── base.py            # CLICommand base class & interactive shell handler
│   └── display.py         # Console formatting utilities (tables, colors, results)
├── commands/              # User-facing commands wrapping pipeline logic
│   ├── __init__.py
│   ├── sentiment.py       # Sentiment analysis command
│   ├── fillmask.py        # Masked token prediction command
│   ├── textgen.py         # Text generation command
│   ├── ner.py             # Named Entity Recognition command
│   └── moderation.py      # Toxicity / content moderation command
├── pipelines/             # Machine learning logic (Hugging Face Transformers)
│   ├── __init__.py
│   ├── template.py        # Blueprint for creating new pipelines
│   ├── sentiment.py
│   ├── fillmask.py
│   ├── textgen.py
│   ├── ner.py
│   └── moderation.py
└── config/
    ├── __init__.py
    └── settings.py        # Global configuration (default models, parameters)
```
---
## ⚙️ Installation
### 🧾 Option 1: Using Poetry (Recommended)
> Poetry is used as the main dependency manager.
```bash
# 1. Install dependencies
poetry install
# 2. Activate the virtual environment
#    (on Poetry >= 2.0, `poetry shell` requires the shell plugin; `poetry env activate` also works)
poetry shell
```
This will automatically install all dependencies declared in `pyproject.toml`, including **transformers** and **torch**.
To run the CLI inside the Poetry environment:
```bash
poetry run python src/main.py
```
---
### 📦 Option 2: Using pip and requirements.txt
If you prefer using `requirements.txt` manually:
```bash
# 1. Create a virtual environment
python -m venv .venv
# 2. Activate it
# Linux/macOS
source .venv/bin/activate
# Windows PowerShell
.venv\Scripts\Activate.ps1
# 3. Install dependencies
pip install -r requirements.txt
```
---
## ▶️ Usage
Once installed, launch the CLI with:
```bash
python -m src.main
# or, if using Poetry
poetry run python src/main.py
```
You'll see an interactive menu listing the available commands:
```
Welcome to AI Lab - Transformers CLI Playground
Available commands:
• sentiment Analyze the sentiment of a text
• fillmask Predict masked words in a sentence
• textgen Generate text from a prompt
• ner Extract named entities from text
• moderation Detect toxic or unsafe content
```
### Example Sessions
#### 🔹 Sentiment Analysis
```text
💬 Enter text: I absolutely love this project!
→ Sentiment: POSITIVE (score: 0.998)
```
#### 🔹 Fill-Mask
```text
💬 Enter text: The capital of France is [MASK].
→ Predictions:
1) Paris score: 0.87
2) Lyon score: 0.04
3) London score: 0.02
```
#### 🔹 Text Generation
```text
💬 Prompt: Once upon a time
→ Output: Once upon a time there was a young AI learning to code...
```
#### 🔹 NER (Named Entity Recognition)
```text
💬 Enter text: Elon Musk founded SpaceX in California.
→ Entities:
- Elon Musk (PERSON)
- SpaceX (ORG)
- California (LOC)
```
#### 🔹 Moderation
```text
💬 Enter text: I hate everything!
→ Result: FLAGGED (toxic content detected)
```
---
## 🧠 Architecture Overview
The internal structure follows a clean **Command ↔ Pipeline ↔ Display** pattern:
```text
┌──────────────────────┐
│    InteractiveCLI    │
│  (src/cli/base.py)   │
└──────────┬───────────┘
           │
  ┌────────▼────────┐
  │  Command Layer  │  ← e.g. commands/sentiment.py
  │ (user commands) │
  └────────┬────────┘
           │
  ┌────────▼────────┐
  │  Pipeline Layer │  ← e.g. pipelines/sentiment.py
  │    (ML logic)   │
  └────────┬────────┘
           │
  ┌────────▼────────┐
  │  Display Layer  │  ← cli/display.py
  │ (format output) │
  └─────────────────┘
```
### Key Concepts
| Layer | Description |
|-------|--------------|
| **CLI** | Manages user input/output, help menus, and navigation between commands. |
| **Command** | Encapsulates a single user-facing operation (e.g., run sentiment). |
| **Pipeline** | Wraps Hugging Face's `transformers.pipeline()` to perform inference. |
| **Display** | Handles clean console rendering (colored output, tables, JSON formatting). |
| **Config** | Centralizes model names, limits, and global constants. |
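
The layering above can be sketched in a few lines of plain Python. This is only an illustration: a stub stands in for `transformers.pipeline()`, and the class names mirror (but are not) the project's real classes.

```python
# Minimal sketch of the Command -> Pipeline -> Display flow.
# EchoSentimentPipeline is a stand-in so the layering is visible
# without downloading a model.

class EchoSentimentPipeline:
    """Pipeline layer: would normally wrap transformers.pipeline()."""
    def analyze(self, text: str) -> dict:
        label = "POSITIVE" if "love" in text.lower() else "NEGATIVE"
        return {"sentiment": label, "confidence": 0.99}

class Display:
    """Display layer: turns raw results into console-friendly text."""
    @staticmethod
    def render(result: dict) -> str:
        return f"Sentiment: {result['sentiment']} ({result['confidence']:.0%})"

class SentimentCommandSketch:
    """Command layer: user-facing glue, no ML logic of its own."""
    def __init__(self, pipeline):
        self.pipeline = pipeline

    def run(self, text: str) -> str:
        return Display.render(self.pipeline.analyze(text))

cmd = SentimentCommandSketch(EchoSentimentPipeline())
print(cmd.run("I love this project!"))  # Sentiment: POSITIVE (99%)
```

Because the command only glues layers together, swapping the stub for a real Hugging Face pipeline changes nothing else in the flow.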
---
## ⚙️ Configuration
All configuration is centralized in `src/config/settings.py`.
Example:
```python
class Config:
    DEFAULT_MODELS = {
        "sentiment": "distilbert-base-uncased-finetuned-sst-2-english",
        "fillmask": "bert-base-uncased",
        "textgen": "gpt2",
        "ner": "dslim/bert-base-NER",
        "moderation": "unitary/toxic-bert",
    }
    MAX_LENGTH = 512
    BATCH_SIZE = 8
```
You can easily modify model names to experiment with different checkpoints.
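
For instance, redirecting a task to another checkpoint is a one-line change. The sketch below is illustrative only: `get_model` mirrors the helper in `src/config/settings.py`, and the multilingual model id is just one example of a compatible checkpoint.

```python
# Hypothetical sketch: pointing a task at a different checkpoint.
DEFAULT_MODELS = {
    "sentiment": "distilbert-base-uncased-finetuned-sst-2-english",
    "textgen": "gpt2",
}

# Swap in a multilingual sentiment checkpoint:
DEFAULT_MODELS["sentiment"] = "nlptown/bert-base-multilingual-uncased-sentiment"

def get_model(task: str) -> str:
    """Mirrors Config.get_model: empty string for unknown tasks."""
    return DEFAULT_MODELS.get(task, "")

print(get_model("sentiment"))
```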
---
## 🧩 Extending the Playground
To create a new experiment (e.g., keyword extraction):
1. **Duplicate** `src/pipelines/template.py` → `src/pipelines/keywords.py` and implement the `run()` or `analyze()` logic using a new Hugging Face pipeline.
2. **Create a Command** in `src/commands/keywords.py` to interact with users.
3. **Register the command** inside `src/main.py`:
```python
from src.commands.keywords import KeywordsCommand
cli.register_command(KeywordsCommand())
```
4. Optionally, add a model name in `Config.DEFAULT_MODELS`.
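
A hypothetical `KeywordsCommand` following these steps might look like the sketch below. Everything here is illustrative: the extractor is injected (keeping the command ML-free, per the development guidelines), and `DummyExtractor` stands in for the real pipeline that would live in `src/pipelines/keywords.py`.

```python
# Hypothetical KeywordsCommand following the project's CLICommand shape.

class KeywordsCommand:
    def __init__(self, extractor=None):
        self.extractor = extractor  # lazy: the real class loads a model on first run

    @property
    def name(self) -> str:
        return "keywords"

    @property
    def description(self) -> str:
        return "Extract keywords from text"

    def run_once(self, text: str) -> list[str]:
        if self.extractor is None:
            raise RuntimeError("extractor not initialized")
        return self.extractor.extract(text)

class DummyExtractor:
    """Stand-in for a Hugging Face pipeline: keeps long-ish words."""
    def extract(self, text: str) -> list[str]:
        return sorted({w.strip(".,!").lower() for w in text.split() if len(w) > 6})

cmd = KeywordsCommand(DummyExtractor())
print(cmd.run_once("Transformers pipelines simplify keyword extraction."))
```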
---
## 🧪 Testing
You can use `pytest` for lightweight validation:
```bash
pip install pytest
pytest -q
```
Recommended structure:
```
tests/
├── test_sentiment.py
├── test_textgen.py
└── ...
```
---
## 🧰 Troubleshooting
| Issue | Cause / Solution |
|-------|------------------|
| **`transformers` not found** | Check virtual environment activation. |
| **Torch fails to install** | Install the CPU-only wheel: `pip install torch --index-url https://download.pytorch.org/whl/cpu`. |
| **Models download slowly** | Hugging Face caches them after first run. |
| **Unicode / accents broken** | Ensure your terminal encoding is UTF-8. |
---
## 🧭 Development Guidelines
- Keep **Command** classes lightweight — no ML logic inside them.
- Reuse the **Pipeline Template** for new experiments.
- Format outputs consistently via the `DisplayFormatter`.
- Document all new models or commands in `README.md` and `settings.py`.
---
## 🧱 Roadmap
- [ ] Add non-interactive CLI flags (`--text`, `--task`)
- [ ] Add multilingual model options
- [ ] Add automatic test coverage
- [ ] Add logging and profiling utilities
- [ ] Add export to JSON/CSV results
---
## 🪪 License
You can include a standard open-source license such as **MIT** or **Apache 2.0** depending on your use case.
---
## 🤝 Contributing
This repository is meant as an **educational sandbox** for experimenting with Transformers.
Pull requests are welcome for new models, better CLI UX, or educational improvements.
---
### ✨ Key Takeaways
- Modular and pedagogical design for training environments
- Clean separation between **I/O**, **ML logic**, and **UX**
- Easily extensible architecture for adding custom pipelines
- Perfect sandbox for students, researchers, and developers to learn modern NLP tools
---
> 🧩 Built for experimentation. Learn, break, and rebuild.

1444
poetry.lock generated Normal file

File diff suppressed because it is too large

27
pyproject.toml Normal file

@@ -0,0 +1,27 @@
[project]
name = "ai-lab"
version = "0.1.0"
description = "Lab for testing different uses of transformers"
authors = [{ name = "Cyril", email = "decostanzicyril@gmail.com" }]
[tool.poetry]
name = "ai-lab"
version = "0.1.0"
description = "Lab for testing different uses of transformers"
authors = ["Cyril"]
packages = [{ include = "src" }]
[tool.poetry.dependencies]
python = ">=3.12,<3.14"
torch = "^2.0.0"
transformers = "^4.30.0"
tokenizers = "^0.13.0"
numpy = "^1.24.0"
accelerate = "^0.20.0"
[tool.poetry.scripts]
ai-lab = "src.main:main"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

4
requirements.txt Normal file

@@ -0,0 +1,4 @@
torch>=2.0.0
transformers>=4.30.0
tokenizers>=0.13.0
numpy>=1.24.0

4
src/__init__.py Normal file

@@ -0,0 +1,4 @@
"""
AI Lab - Transformers Experimentation
"""
__version__ = "0.1.0"

7
src/cli/__init__.py Normal file

@@ -0,0 +1,7 @@
"""
CLI utilities for AI Lab
"""
from .base import CLICommand, InteractiveCLI
from .display import DisplayFormatter
__all__ = ['CLICommand', 'InteractiveCLI', 'DisplayFormatter']

87
src/cli/base.py Normal file

@@ -0,0 +1,87 @@
from abc import ABC, abstractmethod
from typing import Dict

from src.config import Config


class CLICommand(ABC):
    """Base class for CLI commands"""

    @property
    @abstractmethod
    def name(self) -> str:
        """Command name"""
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        """Command description"""
        pass

    @abstractmethod
    def run(self) -> None:
        """Execute the command"""
        pass


class InteractiveCLI:
    """Interactive CLI handler"""

    def __init__(self):
        self.commands: Dict[str, CLICommand] = {}

    def register_command(self, command: CLICommand):
        """Register a new command"""
        self.commands[command.name] = command

    def show_menu(self):
        """Display available commands"""
        print(Config.CLI_BANNER)
        print(Config.CLI_SEPARATOR)
        print("Available commands:")
        for name, cmd in self.commands.items():
            print(f"  📌 {name}: {cmd.description}")
        print("  📌 quit: Exit application")
        print("  📌 help: Show this help")
        print("-" * 50)

    def show_help(self):
        """Show detailed help"""
        print("\n📚 Detailed Help")
        print("-" * 30)
        print("Navigation:")
        print("  - Type a command name to execute it")
        print("  - Type 'back' in a command to return to menu")
        print("  - Type 'quit' or Ctrl+C to exit")
        print("\nAvailable commands:")
        for name, cmd in self.commands.items():
            print(f"  {name}: {cmd.description}")

    def run(self):
        """Run the interactive CLI"""
        self.show_menu()
        while True:
            try:
                choice = input("\n💬 Choose a command: ").strip().lower()
                if choice in ['quit', 'exit', 'q']:
                    print("👋 Goodbye!")
                    break
                if choice in ['help', 'h', '?']:
                    self.show_help()
                    continue
                if choice in self.commands:
                    print()  # Empty line for readability
                    self.commands[choice].run()
                    print()  # Empty line after command
                else:
                    print("❌ Unknown command. Type 'help' to see available commands.")
            except KeyboardInterrupt:
                print("\n👋 Stopping program")
                break
            except Exception as e:
                print(f"❌ Error: {e}")

192
src/cli/display.py Normal file

@@ -0,0 +1,192 @@
from typing import Dict, Any


class DisplayFormatter:
    """Utility class for formatting display output"""

    @staticmethod
    def format_sentiment_result(result: Dict[str, Any]) -> str:
        """Format sentiment analysis result for display"""
        if "error" in result:
            return f"❌ {result['error']}"
        sentiment = result["sentiment"]
        confidence = result["confidence"]
        emoji = "😊" if sentiment == "POSITIVE" else "😞"
        return f"{emoji} Sentiment: {sentiment}\n📊 Confidence: {confidence:.2%}"

    @staticmethod
    def show_loading(message: str = "Analysis in progress..."):
        """Show loading message"""
        print(f"\n🔍 {message}")

    @staticmethod
    def show_warning(message: str):
        """Show warning message"""
        print(f"⚠️ {message}")

    @staticmethod
    def show_error(message: str):
        """Show error message"""
        print(f"❌ {message}")

    @staticmethod
    def show_success(message: str):
        """Show success message"""
        print(f"✅ {message}")

    @staticmethod
    def format_fillmask_result(result: Dict[str, Any]) -> str:
        """Format fill-mask prediction result for display"""
        if "error" in result:
            return f"❌ {result['error']}"
        output = []
        output.append(f"📝 Original: {result['original_text']}")
        output.append(f"🎭 Masks found: {result['masks_count']}")
        output.append("")
        if result['masks_count'] == 1:
            # Single mask
            output.append("🔮 Predictions:")
            for i, pred in enumerate(result['predictions'], 1):
                confidence_bar = "█" * int(pred['score'] * 10)
                output.append(f"  {i}. '{pred['token']}' ({pred['score']:.1%}) {confidence_bar}")
                output.append(f"     {pred['sequence']}")
        else:
            # Multiple masks
            for mask_info in result['predictions']:
                output.append(f"🔮 Mask #{mask_info['mask_position']} predictions:")
                for i, pred in enumerate(mask_info['predictions'], 1):
                    confidence_bar = "█" * int(pred['score'] * 10)
                    output.append(f"  {i}. '{pred['token']}' ({pred['score']:.1%}) {confidence_bar}")
                output.append("")
        return "\n".join(output)

    @staticmethod
    def format_textgen_result(result: Dict[str, Any]) -> str:
        """Format text generation result for display"""
        if "error" in result:
            return f"❌ {result['error']}"
        output = []
        output.append(f"📝 Prompt: {result['prompt']}")
        output.append(f"⚙️ Parameters: max_length={result['parameters']['max_length']}, "
                      f"temperature={result['parameters']['temperature']}")
        output.append("-" * 50)
        for i, gen in enumerate(result['generations'], 1):
            if len(result['generations']) > 1:
                output.append(f"🎯 Generation {i}:")
            output.append(f"📄 Full text: {gen['text']}")
            if gen['continuation']:
                output.append(f"✨ Continuation: {gen['continuation']}")
            if i < len(result['generations']):
                output.append("-" * 30)
        return "\n".join(output)

    @staticmethod
    def format_moderation_result(result: Dict[str, Any]) -> str:
        """Format content moderation result for display"""
        if "error" in result:
            return f"❌ {result['error']}"
        output = []
        output.append(f"📝 Original: {result['original_text']}")
        if result['is_modified']:
            output.append(f"🛡️ Moderated: {result['moderated_text']}")
            output.append(f"⚠️ Status: Content modified ({result['words_replaced']} words replaced)")
            status_emoji = "🔴"
        else:
            output.append("✅ Status: Content approved (no modifications needed)")
            status_emoji = "🟢"
        # Toxicity score bar
        score = result['toxic_score']
        score_bar = "█" * int(score * 10)
        output.append(f"{status_emoji} Toxicity Score: {score:.1%} {score_bar}")
        return "\n".join(output)

    @staticmethod
    def format_ner_result(result: Dict[str, Any]) -> str:
        """Format NER result for display"""
        if "error" in result:
            return f"❌ {result['error']}"
        output = []
        output.append(f"📝 Original: {result['original_text']}")
        output.append(f"✨ Highlighted: {result['highlighted_text']}")
        output.append(f"🎯 Found {result['total_entities']} entities (threshold: {result['confidence_threshold']:.2f})")
        if result['entities']:
            output.append("\n📋 Detected Entities:")
            for entity in result['entities']:
                confidence_bar = "█" * int(entity['confidence'] * 10)
                output.append(f"  {entity['emoji']} {entity['text']} → {entity['label']} "
                              f"({entity['confidence']:.1%}) {confidence_bar}")
        if result['entity_stats']:
            output.append("\n📊 Entity Statistics:")
            for entity_type, stats in result['entity_stats'].items():
                unique_entities = list(set(stats['entities']))
                emoji = result['entities'][0]['emoji'] if result['entities'] else "🏷️"
                for ent in result['entities']:
                    if ent['label'] == entity_type:
                        emoji = ent['emoji']
                        break
                output.append(f"  {emoji} {entity_type}: {stats['count']} occurrences")
                if len(unique_entities) <= 3:
                    output.append(f"     {', '.join(unique_entities)}")
                else:
                    output.append(f"     {', '.join(unique_entities[:3])}... (+{len(unique_entities)-3} more)")
        return "\n".join(output)

    @staticmethod
    def format_ner_analysis(result: Dict[str, Any]) -> str:
        """Format comprehensive NER document analysis"""
        if "error" in result:
            return f"❌ {result['error']}"
        output = []
        output.append("📊 Document Analysis Results")
        output.append("=" * 50)
        # Document statistics
        stats = result['document_stats']
        output.append(f"📄 Document: {stats['word_count']} words, {stats['char_count']} characters")
        output.append(f"📝 Structure: ~{stats['sentence_count']} sentences")
        output.append(f"🎯 Entity Density: {stats['entity_density']:.2%} (entities per word)")
        # Most common entity type
        if 'most_common_entity_type' in result:
            common = result['most_common_entity_type']
            output.append(f"🏆 Most Common: {common['emoji']} {common['type']} ({common['count']} occurrences)")
        output.append("\n✨ Highlighted Text:")
        output.append(result['highlighted_text'])
        if result['entity_stats']:
            output.append("\n📈 Detailed Statistics:")
            for entity_type, stats in result['entity_stats'].items():
                unique_entities = list(set(stats['entities']))
                emoji = "🏷️"
                for ent in result['entities']:
                    if ent['label'] == entity_type:
                        emoji = ent['emoji']
                        break
                output.append(f"\n{emoji} {entity_type} ({stats['count']} total):")
                for entity in unique_entities:
                    count = stats['entities'].count(entity)
                    output.append(f"  - {entity} ({count}x)")
        return "\n".join(output)

10
src/commands/__init__.py Normal file

@@ -0,0 +1,10 @@
"""
AI Lab commands
"""
from .sentiment import SentimentCommand
from .fillmask import FillMaskCommand
from .textgen import TextGenCommand
from .moderation import ModerationCommand
from .ner import NERCommand
__all__ = ['SentimentCommand', 'FillMaskCommand', 'TextGenCommand', 'ModerationCommand', 'NERCommand']

84
src/commands/fillmask.py Normal file

@@ -0,0 +1,84 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.fillmask import FillMaskAnalyzer


class FillMaskCommand(CLICommand):
    """Interactive fill-mask prediction command"""

    def __init__(self):
        self.analyzer = None

    @property
    def name(self) -> str:
        return "fillmask"

    @property
    def description(self) -> str:
        return "Interactive fill-mask token prediction"

    def _initialize_analyzer(self):
        """Lazy initialization of the analyzer"""
        if self.analyzer is None:
            print("🔄 Loading fill-mask model...")
            self.analyzer = FillMaskAnalyzer()
            DisplayFormatter.show_success("Model loaded!")

    def _show_instructions(self):
        """Show usage instructions"""
        print("\n📝 Fill-Mask Prediction")
        print("Replace words with [MASK] token and get predictions")
        print("\nExamples:")
        print("  - The weather today is [MASK]")
        print("  - I love to [MASK] music")
        print("  - Paris is the capital of [MASK]")
        print("\nType 'back' to return to main menu")
        print("Type 'help' to see these instructions again")
        print("-" * 50)

    def _get_top_k(self) -> int:
        """Get number of predictions from user"""
        while True:
            try:
                top_k_input = input("📊 Number of predictions (1-10, default=5): ").strip()
                if not top_k_input:
                    return 5
                top_k = int(top_k_input)
                if 1 <= top_k <= 10:
                    return top_k
                else:
                    DisplayFormatter.show_warning("Please enter a number between 1 and 10")
            except ValueError:
                DisplayFormatter.show_warning("Please enter a valid number")

    def run(self):
        """Run interactive fill-mask prediction"""
        self._initialize_analyzer()
        self._show_instructions()
        while True:
            text = input("\n💬 Enter text with [MASK]: ").strip()
            if text.lower() in ['back', 'return']:
                break
            if text.lower() == 'help':
                self._show_instructions()
                continue
            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue
            if "[MASK]" not in text:
                DisplayFormatter.show_warning("Text must contain [MASK] token")
                continue
            # Get number of predictions
            top_k = self._get_top_k()
            DisplayFormatter.show_loading("Predicting tokens...")
            result = self.analyzer.predict(text, top_k=top_k)
            formatted_result = DisplayFormatter.format_fillmask_result(result)
            print(formatted_result)

73
src/commands/moderation.py Normal file

@@ -0,0 +1,73 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.moderation import ContentModerator


class ModerationCommand(CLICommand):
    """Interactive content moderation command"""

    def __init__(self):
        self.moderator = None

    @property
    def name(self) -> str:
        return "moderation"

    @property
    def description(self) -> str:
        return "Content moderation and filtering"

    def _initialize_moderator(self):
        """Lazy initialization of the moderator"""
        if self.moderator is None:
            print("🔄 Loading content moderation model...")
            self.moderator = ContentModerator()
            DisplayFormatter.show_success("Moderation model loaded!")

    def run(self):
        """Run interactive content moderation"""
        self._initialize_moderator()
        print("\n🛡️ Content Moderation")
        print("Type 'back' to return to main menu")
        print("Type 'settings' to adjust moderation sensitivity")
        print("-" * 40)
        while True:
            text = input("\n📝 Enter text to moderate: ").strip()
            if text.lower() in ['back', 'return']:
                break
            if text.lower() == 'settings':
                self._show_settings()
                continue
            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue
            DisplayFormatter.show_loading("Analyzing content...")
            result = self.moderator.moderate(text)
            formatted_result = DisplayFormatter.format_moderation_result(result)
            print(formatted_result)

    def _show_settings(self):
        """Show and allow modification of moderation settings"""
        print("\n⚙️ Current Settings:")
        print(f"Toxicity threshold: {self.moderator.toxicity_threshold:.2f}")
        print("\nOptions:")
        print("1. Change threshold (0.0 = very strict, 1.0 = very permissive)")
        print("2. Back to moderation")
        choice = input("\nChoose option (1-2): ").strip()
        if choice == "1":
            try:
                new_threshold = float(input("Enter new threshold (0.0-1.0): "))
                self.moderator.set_threshold(new_threshold)
                DisplayFormatter.show_success(f"Threshold set to {new_threshold:.2f}")
            except ValueError:
                DisplayFormatter.show_error("Invalid threshold value")
        elif choice != "2":
            DisplayFormatter.show_warning("Invalid option")

137
src/commands/ner.py Normal file

@@ -0,0 +1,137 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.ner import NamedEntityRecognizer


class NERCommand(CLICommand):
    """Interactive Named Entity Recognition command"""

    def __init__(self):
        self.recognizer = None
        self.confidence_threshold = 0.9

    @property
    def name(self) -> str:
        return "ner"

    @property
    def description(self) -> str:
        return "Named Entity Recognition - Extract people, places, organizations"

    def _initialize_recognizer(self):
        """Lazy initialization of the recognizer"""
        if self.recognizer is None:
            print("🔄 Loading NER model...")
            self.recognizer = NamedEntityRecognizer()
            DisplayFormatter.show_success("NER model loaded!")

    def _show_instructions(self):
        """Show usage instructions and examples"""
        print("\n🎯 Named Entity Recognition")
        print("Extract and classify entities like people, organizations, locations, etc.")
        print("\n📝 Examples to try:")
        print("  - Apple Inc. was founded by Steve Jobs in Cupertino, California.")
        print("  - Barack Obama visited Paris in 2015 to meet Emmanuel Macron.")
        print("  - Microsoft acquired GitHub for $7.5 billion in June 2018.")
        print("\n🎛️ Commands:")
        print("  'back' - Return to main menu")
        print("  'help' - Show these instructions")
        print("  'settings' - Adjust confidence threshold")
        print("  'types' - Show entity types")
        print("  'analyze' - Detailed document analysis mode")
        print("-" * 60)

    def _show_entity_types(self):
        """Show available entity types"""
        entity_types = self.recognizer.get_entity_types()
        print("\n🏷️ Entity Types:")
        type_descriptions = {
            "PER": "Person names",
            "ORG": "Organizations, companies",
            "LOC": "Locations, places",
            "MISC": "Miscellaneous entities",
            "DATE": "Dates and time periods",
            "TIME": "Specific times",
            "MONEY": "Monetary amounts",
            "PERCENT": "Percentages"
        }
        for entity_type, emoji in entity_types.items():
            description = type_descriptions.get(entity_type, "Other entities")
            print(f"  {emoji} {entity_type}: {description}")

    def _adjust_settings(self):
        """Allow user to adjust confidence threshold"""
        print(f"\n⚙️ Current confidence threshold: {self.confidence_threshold:.2f}")
        print("Lower values = more entities detected (but less accurate)")
        print("Higher values = fewer entities detected (but more accurate)")
        try:
            new_threshold = input(f"Enter new threshold (0.1-1.0, current: {self.confidence_threshold}): ").strip()
            if new_threshold:
                threshold = float(new_threshold)
                if 0.1 <= threshold <= 1.0:
                    self.confidence_threshold = threshold
                    DisplayFormatter.show_success(f"Threshold set to {threshold:.2f}")
                else:
                    DisplayFormatter.show_warning("Threshold must be between 0.1 and 1.0")
        except ValueError:
            DisplayFormatter.show_error("Invalid threshold value")

    def _analyze_mode(self):
        """Document analysis mode with detailed statistics"""
        print("\n📊 Document Analysis Mode")
        print("Enter longer text for comprehensive entity analysis")
        print("Type 'done' when finished")
        print("-" * 40)
        lines = []
        while True:
            line = input("📝 ").strip()
            if line.lower() == 'done':
                break
            if line:
                lines.append(line)
        if not lines:
            DisplayFormatter.show_warning("No text entered")
            return
        document = " ".join(lines)
        DisplayFormatter.show_loading("Analyzing document...")
        result = self.recognizer.analyze_document(document, self.confidence_threshold)
        formatted_result = DisplayFormatter.format_ner_analysis(result)
        print(formatted_result)

    def run(self):
        """Run interactive NER"""
        self._initialize_recognizer()
        self._show_instructions()
        while True:
            text = input("\n💬 Enter text to analyze: ").strip()
            if text.lower() == 'back':
                break
            elif text.lower() == 'help':
                self._show_instructions()
                continue
            elif text.lower() == 'settings':
                self._adjust_settings()
                continue
            elif text.lower() == 'types':
                self._show_entity_types()
                continue
            elif text.lower() == 'analyze':
                self._analyze_mode()
                continue
            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue
            DisplayFormatter.show_loading("Extracting entities...")
            result = self.recognizer.recognize(text, self.confidence_threshold)
            formatted_result = DisplayFormatter.format_ner_result(result)
            print(formatted_result)

48
src/commands/sentiment.py Normal file

@@ -0,0 +1,48 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.sentiment import SentimentAnalyzer


class SentimentCommand(CLICommand):
    """Interactive sentiment analysis command"""

    def __init__(self):
        self.analyzer = None

    @property
    def name(self) -> str:
        return "sentiment"

    @property
    def description(self) -> str:
        return "Interactive sentiment analysis"

    def _initialize_analyzer(self):
        """Lazy initialization of the analyzer"""
        if self.analyzer is None:
            print("🔄 Loading sentiment model...")
            self.analyzer = SentimentAnalyzer()
            DisplayFormatter.show_success("Model loaded!")

    def run(self):
        """Run interactive sentiment analysis"""
        self._initialize_analyzer()
        print("\n📝 Sentiment Analysis")
        print("Type 'back' to return to main menu")
        print("-" * 30)
        while True:
            text = input("\n💬 Enter your text: ").strip()
            if text.lower() in ['back', 'return']:
                break
            if not text:
                DisplayFormatter.show_warning("Please enter some text")
                continue
            DisplayFormatter.show_loading()
            result = self.analyzer.analyze(text)
            formatted_result = DisplayFormatter.format_sentiment_result(result)
            print(formatted_result)

95
src/commands/textgen.py Normal file

@@ -0,0 +1,95 @@
from src.cli.base import CLICommand
from src.cli.display import DisplayFormatter
from src.pipelines.textgen import TextGenerator


class TextGenCommand(CLICommand):
    """Interactive text generation command"""

    def __init__(self):
        self.generator = None
        self.default_params = {
            'max_length': 100,
            'num_return_sequences': 1,
            'temperature': 1.0,
            'do_sample': True
        }

    @property
    def name(self) -> str:
        return "textgen"

    @property
    def description(self) -> str:
        return "Interactive text generation"

    def _initialize_generator(self):
        """Lazy initialization of the generator"""
        if self.generator is None:
            print("🔄 Loading text generation model...")
            self.generator = TextGenerator()
            DisplayFormatter.show_success("Model loaded!")

    def _show_parameters(self):
        """Show current generation parameters"""
        print("\n⚙️ Current parameters:")
        for key, value in self.default_params.items():
            print(f"  {key}: {value}")

    def _update_parameters(self):
        """Allow user to update generation parameters"""
        print("\n🔧 Update parameters (press Enter to keep current value):")
        try:
            max_length = input(f"Max length ({self.default_params['max_length']}): ").strip()
            if max_length:
                self.default_params['max_length'] = int(max_length)
            num_sequences = input(f"Number of sequences ({self.default_params['num_return_sequences']}): ").strip()
            if num_sequences:
                self.default_params['num_return_sequences'] = int(num_sequences)
            temperature = input(f"Temperature ({self.default_params['temperature']}): ").strip()
            if temperature:
                self.default_params['temperature'] = float(temperature)
            do_sample = input(f"Use sampling ({self.default_params['do_sample']}): ").strip().lower()
            if do_sample in ['true', 'false']:
                self.default_params['do_sample'] = do_sample == 'true'
            DisplayFormatter.show_success("Parameters updated!")
        except ValueError as e:
            DisplayFormatter.show_error(f"Invalid parameter value: {e}")

    def run(self):
        """Run interactive text generation"""
        self._initialize_generator()
        print("\n📝 Text Generation")
        print("Commands:")
        print("  'back' - Return to main menu")
        print("  'params' - Show current parameters")
        print("  'config' - Update parameters")
        print("-" * 40)
        while True:
            prompt = input("\n💬 Enter your prompt: ").strip()
            if prompt.lower() == 'back':
                break
            elif prompt.lower() == 'params':
                self._show_parameters()
                continue
            elif prompt.lower() == 'config':
                self._update_parameters()
                continue
            if not prompt:
                DisplayFormatter.show_warning("Please enter a prompt")
                continue
            DisplayFormatter.show_loading("Generating text...")
            result = self.generator.generate(prompt, **self.default_params)
            formatted_result = DisplayFormatter.format_textgen_result(result)
            print(formatted_result)

6
src/config/__init__.py Normal file

@@ -0,0 +1,6 @@
"""
Project configuration
"""
from .settings import Config
__all__ = ['Config']

40
src/config/settings.py Normal file

@@ -0,0 +1,40 @@
"""
Global project configuration
"""
from pathlib import Path
from typing import Dict


class Config:
    """Global application configuration"""

    # Paths
    PROJECT_ROOT = Path(__file__).parent.parent.parent
    SRC_DIR = PROJECT_ROOT / "src"

    # Default models
    DEFAULT_MODELS = {
        "sentiment": "cardiffnlp/twitter-roberta-base-sentiment-latest",
        "fillmask": "distilbert-base-uncased",
        "textgen": "gpt2",
        "moderation": "unitary/toxic-bert",
        "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    }

    # Interface
    CLI_BANNER = "🤖 AI Lab - Transformers Experimentation"
    CLI_SEPARATOR = "=" * 50

    # Performance
    MAX_BATCH_SIZE = 32
    DEFAULT_MAX_LENGTH = 512

    @classmethod
    def get_model(cls, pipeline_name: str) -> str:
        """Get default model for a pipeline"""
        return cls.DEFAULT_MODELS.get(pipeline_name, "")

    @classmethod
    def get_all_models(cls) -> Dict[str, str]:
        """Get all configured models"""
        return cls.DEFAULT_MODELS.copy()

38
src/main.py Normal file

@ -0,0 +1,38 @@
#!/usr/bin/env python3
"""
CLI entry point for AI Lab
"""
import sys
from pathlib import Path
# Add parent directory to PYTHONPATH
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.cli import InteractiveCLI
from src.commands import SentimentCommand, FillMaskCommand, TextGenCommand, ModerationCommand, NERCommand
def main():
"""Main CLI function"""
try:
# Create CLI interface
cli = InteractiveCLI()
# Register available commands
cli.register_command(SentimentCommand())
cli.register_command(FillMaskCommand())
cli.register_command(TextGenCommand())
cli.register_command(ModerationCommand())
cli.register_command(NERCommand())
# Launch interactive interface
cli.run()
except KeyboardInterrupt:
print("\n👋 Stopping program")
except Exception as e:
print(f"❌ Error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()
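A hypothetical minimal sketch of the register/dispatch pattern `main()` relies on: each command object exposes a name and a run hook, and the CLI keeps a name-to-command registry. (`InteractiveCLI` and the real command classes live in `src/cli` and `src/commands`, which are not shown in this excerpt; `EchoCommand` below is invented for illustration.)

```python
class EchoCommand:
    """Toy stand-in for a command module (hypothetical, not part of the repo)."""
    name = "echo"

    def run(self, arg: str) -> str:
        return arg

registry = {}

def register_command(cmd) -> None:
    # Mirrors cli.register_command: index commands by their name.
    registry[cmd.name] = cmd

register_command(EchoCommand())
print(registry["echo"].run("hi"))  # → hi
```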

11
src/pipelines/__init__.py Normal file
View File

@ -0,0 +1,11 @@
"""
Experimentation pipelines with transformers
"""
from .sentiment import SentimentAnalyzer
from .fillmask import FillMaskAnalyzer
from .textgen import TextGenerator
from .moderation import ContentModerator
from .ner import NamedEntityRecognizer
from .template import TemplatePipeline
__all__ = ['SentimentAnalyzer', 'FillMaskAnalyzer', 'TextGenerator', 'ContentModerator', 'NamedEntityRecognizer', 'TemplatePipeline']

95
src/pipelines/fillmask.py Normal file
View File

@ -0,0 +1,95 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config
class FillMaskAnalyzer:
"""Fill-mask analyzer using transformers"""
def __init__(self, model_name: Optional[str] = None):
"""
Initialize the fill-mask pipeline
Args:
model_name: Name of the model to use (optional)
"""
self.model_name = model_name or Config.get_model("fillmask")
print(f"Loading fill-mask model: {self.model_name}")
self.pipeline = pipeline("fill-mask", model=self.model_name)
print("Model loaded successfully!")
def predict(self, text: str, top_k: int = 5) -> Dict:
"""
Predict masked tokens in text
Args:
text: Text with [MASK] token(s) to predict
top_k: Number of top predictions to return
Returns:
Dictionary with predictions and scores
"""
if not text.strip():
return {"error": "Empty text"}
if "[MASK]" not in text:
return {"error": "Text must contain [MASK] token"}
try:
results = self.pipeline(text, top_k=top_k)
# Handle single mask vs multiple masks
if isinstance(results, list) and results and isinstance(results[0], list):
# Multiple masks
predictions = []
for i, mask_results in enumerate(results):
mask_predictions = [
{
"token": pred["token_str"],
"score": round(pred["score"], 4),
"sequence": pred["sequence"]
}
for pred in mask_results
]
predictions.append({
"mask_position": i + 1,
"predictions": mask_predictions
})
return {
"original_text": text,
"masks_count": len(results),
"predictions": predictions
}
else:
# Single mask
predictions = [
{
"token": pred["token_str"],
"score": round(pred["score"], 4),
"sequence": pred["sequence"]
}
for pred in results
]
return {
"original_text": text,
"masks_count": 1,
"predictions": predictions
}
except Exception as e:
return {"error": f"Prediction error: {str(e)}"}
def predict_batch(self, texts: List[str], top_k: int = 5) -> List[Dict]:
"""
Predict masked tokens for multiple texts
Args:
texts: List of texts with [MASK] tokens
top_k: Number of top predictions to return
Returns:
List of prediction results
"""
return [self.predict(text, top_k) for text in texts]
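The single- vs multi-mask branching in `predict` can be sketched without loading a model: the `fill-mask` pipeline returns a flat list of predictions for one mask and a list of lists for several. The toy data below stands in for pipeline output:

```python
# Toy stand-ins for transformers fill-mask output (no model download needed).
single = [{"token_str": "paris", "score": 0.91,
           "sequence": "the capital is paris"}]
multi = [single,
         [{"token_str": "france", "score": 0.80, "sequence": "..."}]]

def masks_count(results) -> int:
    # A list of lists means one sub-list of predictions per mask position.
    if isinstance(results, list) and results and isinstance(results[0], list):
        return len(results)
    return 1

print(masks_count(single))  # → 1
print(masks_count(multi))   # → 2
```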

174
src/pipelines/moderation.py Normal file
View File

@ -0,0 +1,174 @@
from transformers import pipeline
from typing import Dict, List, Optional
import re
from src.config import Config
class ContentModerator:
"""Content moderator that detects and replaces inappropriate content"""
def __init__(self, model_name: Optional[str] = None):
"""
Initialize the content moderation pipeline
Args:
model_name: Name of the model to use (optional)
"""
self.model_name = model_name or Config.get_model("moderation")
print(f"Loading moderation model: {self.model_name}")
self.classifier = pipeline("text-classification", model=self.model_name)
print("Moderation model loaded successfully!")
# Threshold for considering content as toxic
self.toxicity_threshold = 0.5
def moderate(self, text: str, replacement: str = "***") -> Dict:
"""
Moderate content by detecting and replacing inappropriate words
Args:
text: Text to moderate
replacement: String to replace inappropriate content with
Returns:
Dictionary with original text, moderated text, and detection info
"""
if not text.strip():
return {"error": "Empty text"}
try:
# First, check overall toxicity
result = self.classifier(text)
# Handle different model output formats
if isinstance(result, list):
predictions = result
else:
predictions = [result]
# Find toxicity score
toxic_score = 0.0
is_toxic = False
for pred in predictions:
label = pred["label"].upper()
score = pred["score"]
# Check different possible toxic labels
if label in ["TOXIC", "TOXICITY", "HARMFUL", "1"]:
toxic_score = max(toxic_score, score)
if score > self.toxicity_threshold:
is_toxic = True
elif label in ["NOT_TOXIC", "CLEAN", "0"]:
# For models where high score means NOT toxic
toxic_score = max(toxic_score, 1.0 - score)
if (1.0 - score) > self.toxicity_threshold:
is_toxic = True
if not is_toxic:
return {
"original_text": text,
"moderated_text": text,
"is_modified": False,
"toxic_score": toxic_score,
"words_replaced": 0
}
# If toxic, analyze word by word to find problematic parts
moderated_text, words_replaced = self._moderate_by_words(text, replacement)
return {
"original_text": text,
"moderated_text": moderated_text,
"is_modified": True,
"toxic_score": toxic_score,
"words_replaced": words_replaced
}
except Exception as e:
return {"error": f"Moderation error: {str(e)}"}
def _moderate_by_words(self, text: str, replacement: str) -> tuple[str, int]:
"""
Moderate text by analyzing individual words and phrases
Args:
text: Original text
replacement: Replacement string
Returns:
Tuple of (moderated_text, words_replaced_count)
"""
words = text.split()
moderated_words = []
words_replaced = 0
# Check individual words
for word in words:
# Clean word for analysis (remove punctuation)
clean_word = re.sub(r'[^\w]', '', word)
if not clean_word:
moderated_words.append(word)
continue
try:
word_result = self.classifier(clean_word)
# Handle different model output formats
if isinstance(word_result, list):
predictions = word_result
else:
predictions = [word_result]
is_word_toxic = False
for pred in predictions:
label = pred["label"].upper()
score = pred["score"]
if label in ["TOXIC", "TOXICITY", "HARMFUL", "1"]:
if score > self.toxicity_threshold:
is_word_toxic = True
break
elif label in ["NOT_TOXIC", "CLEAN", "0"]:
if (1.0 - score) > self.toxicity_threshold:
is_word_toxic = True
break
if is_word_toxic:
# Replace the clean part with asterisks, keep punctuation
moderated_word = re.sub(r'\w+', replacement, word)
moderated_words.append(moderated_word)
words_replaced += 1
else:
moderated_words.append(word)
except Exception:
# If analysis fails for a word, keep it as is
moderated_words.append(word)
return " ".join(moderated_words), words_replaced
def moderate_batch(self, texts: List[str], replacement: str = "***") -> List[Dict]:
"""
Moderate multiple texts
Args:
texts: List of texts to moderate
replacement: String to replace inappropriate content with
Returns:
List of moderation results
"""
return [self.moderate(text, replacement) for text in texts]
def set_threshold(self, threshold: float):
"""
Set the toxicity threshold
Args:
threshold: Threshold between 0 and 1
"""
if 0 <= threshold <= 1:
self.toxicity_threshold = threshold
else:
raise ValueError("Threshold must be between 0 and 1")
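The punctuation-preserving replacement at the heart of `_moderate_by_words` can be demonstrated in isolation: `\w+` matches only the word characters, so surrounding punctuation survives the substitution.

```python
import re

def mask_word(word: str, replacement: str = "***") -> str:
    # Replace the word characters, keep punctuation — same regex as
    # ContentModerator._moderate_by_words.
    return re.sub(r"\w+", replacement, word)

print(mask_word("idiot!"))   # → ***!
print(mask_word("(fool)"))   # → (***)
```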

179
src/pipelines/ner.py Normal file
View File

@ -0,0 +1,179 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config
class NamedEntityRecognizer:
"""Named Entity Recognition using transformers"""
def __init__(self, model_name: Optional[str] = None):
"""
Initialize the NER pipeline
Args:
model_name: Name of the model to use (optional)
"""
self.model_name = model_name or Config.get_model("ner")
print(f"Loading NER model: {self.model_name}")
self.pipeline = pipeline("ner", model=self.model_name, aggregation_strategy="simple")
print("NER model loaded successfully!")
# Entity type mappings for better display
self.entity_colors = {
"PER": "👤", # Person
"ORG": "🏢", # Organization
"LOC": "📍", # Location
"MISC": "🏷️", # Miscellaneous
"DATE": "📅", # Date
"TIME": "", # Time
"MONEY": "💰", # Money
"PERCENT": "📊", # Percentage
}
def recognize(self, text: str, confidence_threshold: float = 0.9) -> Dict:
"""
Recognize named entities in text
Args:
text: Text to analyze
confidence_threshold: Minimum confidence score for entities
Returns:
Dictionary with entities and their information
"""
if not text.strip():
return {"error": "Empty text"}
try:
entities = self.pipeline(text)
# Filter by confidence and process entities
filtered_entities = []
entity_stats = {}
for entity in entities:
if entity["score"] >= confidence_threshold:
entity_type = entity["entity_group"]
processed_entity = {
"text": entity["word"],
"label": entity_type,
"confidence": round(entity["score"], 4),
"start": entity["start"],
"end": entity["end"],
"emoji": self.entity_colors.get(entity_type, "🏷️")
}
filtered_entities.append(processed_entity)
# Update statistics
if entity_type not in entity_stats:
entity_stats[entity_type] = {"count": 0, "entities": []}
entity_stats[entity_type]["count"] += 1
entity_stats[entity_type]["entities"].append(entity["word"])
# Create highlighted text
highlighted_text = self._highlight_entities(text, filtered_entities)
return {
"original_text": text,
"highlighted_text": highlighted_text,
"entities": filtered_entities,
"entity_stats": entity_stats,
"total_entities": len(filtered_entities),
"confidence_threshold": confidence_threshold
}
except Exception as e:
return {"error": f"NER processing error: {str(e)}"}
def _highlight_entities(self, text: str, entities: List[Dict]) -> str:
"""
Create highlighted version of text with entity markers
Args:
text: Original text
entities: List of detected entities
Returns:
Text with highlighted entities
"""
if not entities:
return text
# Sort entities by start position (reverse order for replacement)
sorted_entities = sorted(entities, key=lambda x: x["start"], reverse=True)
highlighted = text
for entity in sorted_entities:
start, end = entity["start"], entity["end"]
entity_text = entity["text"]
emoji = entity["emoji"]
label = entity["label"]
confidence = entity["confidence"]
# Create highlighted version
highlight = f"{emoji}[{entity_text}]({label}:{confidence:.2f})"
highlighted = highlighted[:start] + highlight + highlighted[end:]
return highlighted
def analyze_document(self, text: str, confidence_threshold: float = 0.9) -> Dict:
"""
Perform comprehensive document analysis with entity extraction
Args:
text: Document text to analyze
confidence_threshold: Minimum confidence for entities
Returns:
Comprehensive analysis results
"""
result = self.recognize(text, confidence_threshold)
if "error" in result:
return result
# Additional analysis
analysis = {
**result,
"document_stats": {
"word_count": len(text.split()),
"char_count": len(text),
"sentence_count": len([s for s in text.split('.') if s.strip()]),
"entity_density": len(result["entities"]) / len(text.split()) if text.split() else 0
}
}
# Find most common entity types
if result["entity_stats"]:
most_common_type = max(result["entity_stats"].items(), key=lambda x: x[1]["count"])
analysis["most_common_entity_type"] = {
"type": most_common_type[0],
"count": most_common_type[1]["count"],
"emoji": self.entity_colors.get(most_common_type[0], "🏷️")
}
return analysis
def recognize_batch(self, texts: List[str], confidence_threshold: float = 0.9) -> List[Dict]:
"""
Recognize entities in multiple texts
Args:
texts: List of texts to analyze
confidence_threshold: Minimum confidence for entities
Returns:
List of NER results
"""
return [self.recognize(text, confidence_threshold) for text in texts]
def get_entity_types(self) -> Dict[str, str]:
"""
Get available entity types with their emojis
Returns:
Dictionary mapping entity types to emojis
"""
return self.entity_colors.copy()
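The reverse-order substitution in `_highlight_entities` is worth seeing on toy data: entities are replaced from the end of the string backwards so that earlier character offsets stay valid after each insertion. (Toy entities below; no model involved.)

```python
text = "Marie works at Acme"
entities = [
    {"start": 0, "end": 5, "text": "Marie", "label": "PER",
     "emoji": "👤", "confidence": 0.99},
    {"start": 15, "end": 19, "text": "Acme", "label": "ORG",
     "emoji": "🏢", "confidence": 0.97},
]

highlighted = text
# Sort by start position in reverse so each splice leaves earlier offsets intact.
for e in sorted(entities, key=lambda x: x["start"], reverse=True):
    mark = f'{e["emoji"]}[{e["text"]}]({e["label"]}:{e["confidence"]:.2f})'
    highlighted = highlighted[:e["start"]] + mark + highlighted[e["end"]:]

print(highlighted)  # → 👤[Marie](PER:0.99) works at 🏢[Acme](ORG:0.97)
```

Replacing front-to-back instead would shift every later `start`/`end` offset and corrupt the markup.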

54
src/pipelines/sentiment.py Normal file
View File

@ -0,0 +1,54 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config
class SentimentAnalyzer:
"""Sentiment analyzer using transformers"""
def __init__(self, model_name: Optional[str] = None):
"""
Initialize the sentiment-analysis pipeline
Args:
model_name: Name of the model to use (optional)
"""
self.model_name = model_name or Config.get_model("sentiment")
print(f"Loading sentiment model: {self.model_name}")
self.pipeline = pipeline("sentiment-analysis", model=self.model_name)
print("Model loaded successfully!")
def analyze(self, text: str) -> Dict:
"""
Analyze the sentiment of a text
Args:
text: Text to analyze
Returns:
Dictionary with label and confidence score
"""
if not text.strip():
return {"error": "Empty text"}
try:
result = self.pipeline(text)[0]
return {
"text": text,
"sentiment": result["label"],
"confidence": round(result["score"], 4)
}
except Exception as e:
return {"error": f"Analysis error: {str(e)}"}
def analyze_batch(self, texts: List[str]) -> List[Dict]:
"""
Analyze the sentiment of multiple texts
Args:
texts: List of texts to analyze
Returns:
List of analysis results
"""
return [self.analyze(text) for text in texts]
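All pipelines in this project share an error-dict convention: invalid input yields `{"error": ...}` instead of raising, so a batch call never aborts midway. A standalone sketch with the model call stubbed out:

```python
def analyze(text: str) -> dict:
    if not text.strip():
        return {"error": "Empty text"}
    # Stubbed model call — the real SentimentAnalyzer runs a transformers
    # pipeline here; the fixed result is for illustration only.
    return {"text": text, "sentiment": "POSITIVE", "confidence": 0.99}

results = [analyze(t) for t in ["great!", "   "]]
print(results[1])  # the blank input becomes an error dict, not an exception
```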

59
src/pipelines/template.py Normal file
View File

@ -0,0 +1,59 @@
"""
Template for creating new pipelines
Copy this file and adapt it according to your needs
"""
from transformers import pipeline
from typing import Dict, List, Optional
class TemplatePipeline:
"""Template for a new pipeline"""
def __init__(self, model_name: Optional[str] = None):
"""
Initialize the pipeline
Args:
model_name: Name of the model to use (optional)
"""
self.model_name = model_name or "distilbert-base-uncased"
print(f"Loading model {self.model_name}...")
# Replace "text-classification" with your task
self.pipeline = pipeline("text-classification", model=self.model_name)
print("Model loaded successfully!")
def process(self, text: str) -> Dict:
"""
Process a text
Args:
text: Text to process
Returns:
Dictionary with results
"""
if not text.strip():
return {"error": "Empty text"}
try:
result = self.pipeline(text)
return {
"text": text,
"result": result,
# Add other fields according to your needs
}
except Exception as e:
return {"error": f"Processing error: {str(e)}"}
def process_batch(self, texts: List[str]) -> List[Dict]:
"""
Process multiple texts
Args:
texts: List of texts to process
Returns:
List of results
"""
return [self.process(text) for text in texts]

82
src/pipelines/textgen.py Normal file
View File

@ -0,0 +1,82 @@
from transformers import pipeline
from typing import Dict, List, Optional
from src.config import Config
class TextGenerator:
"""Text generator using transformers"""
def __init__(self, model_name: Optional[str] = None):
"""
Initialize the text-generation pipeline
Args:
model_name: Name of the model to use (optional)
"""
self.model_name = model_name or Config.get_model("textgen")
print(f"Loading text generation model: {self.model_name}")
self.pipeline = pipeline("text-generation", model=self.model_name)
print("Model loaded successfully!")
def generate(self, prompt: str, max_length: int = 100, num_return_sequences: int = 1,
temperature: float = 1.0, do_sample: bool = True) -> Dict:
"""
Generate text from a prompt
Args:
prompt: Input text prompt
max_length: Maximum total length in tokens (prompt plus generated text)
num_return_sequences: Number of sequences to generate
temperature: Sampling temperature (higher = more random)
do_sample: Whether to use sampling
Returns:
Dictionary with generated texts
"""
if not prompt.strip():
return {"error": "Empty prompt"}
try:
results = self.pipeline(
prompt,
max_length=max_length,
num_return_sequences=num_return_sequences,
temperature=temperature,
do_sample=do_sample,
pad_token_id=self.pipeline.tokenizer.eos_token_id
)
generations = [
{
"text": result["generated_text"],
"continuation": result["generated_text"][len(prompt):].strip()
}
for result in results
]
return {
"prompt": prompt,
"parameters": {
"max_length": max_length,
"num_sequences": num_return_sequences,
"temperature": temperature,
"do_sample": do_sample
},
"generations": generations
}
except Exception as e:
return {"error": f"Generation error: {str(e)}"}
def generate_batch(self, prompts: List[str], **kwargs) -> List[Dict]:
"""
Generate text for multiple prompts
Args:
prompts: List of input prompts
**kwargs: Generation parameters
Returns:
List of generation results
"""
return [self.generate(prompt, **kwargs) for prompt in prompts]
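The `continuation` field in `generate` relies on a property of text-generation pipelines: the returned string echoes the prompt followed by the new tokens, so slicing off `len(prompt)` characters isolates what the model added. A standalone sketch (toy output string, no model):

```python
prompt = "Once upon a time"
# Stand-in for result["generated_text"]: prompt echoed, then the new tokens.
generated_text = "Once upon a time there was a fox."

continuation = generated_text[len(prompt):].strip()
print(continuation)  # → there was a fox.
```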