Engineering · 3 min read

Deploying a Free LLM on Modal.com with the Hermes OpenCode Agent

A guide to deploying an LLM on Modal.com's free T4 GPU tier using the Hermes OpenCode Agent.

In this guide, I will show you how to deploy an LLM on Modal.com with a free T4 GPU, using the Hermes OpenCode Agent.


What is Modal.com?

Modal.com is a serverless platform for running your Python code on GPUs, optimized especially for ML/AI workloads.

| Feature | Value               |
|---------|---------------------|
| GPU     | T4 (free tier)      |
| RAM     | 16GB                |
| Storage | 256GB               |
| Price   | 40 hours/month free |

Setup

1. Install the Modal CLI

pip install modal
modal setup

2. Get an API Key

Create an account at modal.com and obtain an API key.
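As an aside, Modal also reads credentials from the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables, which is handy in CI where the interactive `modal setup` flow is not available. A minimal stdlib check (the helper name is my own):

```python
import os

def modal_credentials_present() -> bool:
    """Return True when Modal API credentials are exported in the environment."""
    return bool(os.environ.get("MODAL_TOKEN_ID")) and bool(os.environ.get("MODAL_TOKEN_SECRET"))

# When these are unset, the CLI falls back to the ~/.modal.toml file
# written by `modal setup`.
```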


Hermes OpenCode Agent Deployment Script

#!/usr/bin/env python3
"""
Hermes OpenCode Agent - Modal.com Deployment
"""

import modal

app = modal.App("hermes-opencode-agent")

# Custom container image
image = modal.Image.debian_slim(python_version="3.11").pip_install([
    "fastapi==0.110.3",
    "uvicorn==0.29.0", 
    "pydantic==2.10.0",
    "requests==2.32.3",
    "transformers==4.45.0",
    "torch==2.3.0",
    "sentencepiece==0.2.0",
    "protobuf==3.20.3",
    "accelerate==1.1.0"
])

@app.function(
    image=image,
    gpu="T4",  # Free-tier GPU
    timeout=3600
)
@modal.asgi_app()  # expose the returned FastAPI app as a web endpoint
def serve_llm():
    """Serve the Phi-3 mini LLM with FastAPI"""
    from fastapi import FastAPI
    from pydantic import BaseModel
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    # Phi-3 mini model
    MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
    
    print("Loading Phi-3 mini model...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    # Add padding token if not exists
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )
    
    print(f"Model loaded on {model.device}!")
    
    # FastAPI app
    app = FastAPI(title="Hermes OpenCode LLM Agent")
    
    class CodeRequest(BaseModel):
        prompt: str
        max_tokens: int = 256
        
    class CodeResponse(BaseModel):
        code: str
        model: str
        tokens: int
    
    @app.get("/")
    async def root():
        return {
            "message": "Hermes OpenCode LLM Agent",
            "model": MODEL_NAME,
            "status": "ready",
            "gpu": "T4"
        }
    
    @app.post("/generate")
    async def generate_code(request: CodeRequest):
        """Generate code with Phi-3 model"""
        try:
            # Prepare prompt
            prompt = f"""<|system|>
You are a helpful AI assistant that generates high-quality code. Always respond with complete, runnable code blocks.
<|end|>
<|user|>
{request.prompt}
<|end|>
<|assistant|>
"""
            
            # Tokenize
            inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
            
            # Generate
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=request.max_tokens,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=tokenizer.eos_token_id,
                    eos_token_id=tokenizer.eos_token_id
                )
            
            # Decode only the newly generated tokens. The prompt is excluded,
            # and because skip_special_tokens=True strips the <|assistant|>
            # marker, splitting on it afterwards would not work.
            new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
            assistant_response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
            
            return CodeResponse(
                code=assistant_response,
                model=MODEL_NAME,
                tokens=len(new_tokens)
            )
            
        except Exception as e:
            # Return a proper HTTP error instead of a 200 with an error body
            from fastapi import HTTPException
            raise HTTPException(status_code=500, detail=str(e))
    
    @app.get("/health")
    async def health():
        return {
            "status": "healthy",
            "model": MODEL_NAME,
            "device": str(model.device),
            "dtype": str(model.dtype)
        }
    
    return app

@app.local_entrypoint()
def deploy():
    """Print deployment info; the web endpoint itself is published with `modal deploy`"""
    print("🚀 Hermes OpenCode LLM Agent on Modal.com")
    print("📦 Model: microsoft/Phi-3-mini-4k-instruct")
    print("🎯 GPU: T4 (free tier)")
    print("🔗 App URL: https://your-username--hermes-opencode-agent.modal.run")
    print("📊 Health endpoint: /health")
    print("⚡ Generate endpoint: /generate")
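For reference, the prompt string the /generate handler assembles follows Phi-3's chat markup (the <|system|>, <|user|>, <|end|>, and <|assistant|> tags). A standalone sketch of the same formatting, with a helper name of my own choosing:

```python
def build_phi3_prompt(user_prompt: str) -> str:
    """Build a Phi-3-style chat prompt matching the template used in serve_llm()."""
    system_msg = (
        "You are a helpful AI assistant that generates high-quality code. "
        "Always respond with complete, runnable code blocks."
    )
    return (
        f"<|system|>\n{system_msg}\n<|end|>\n"
        f"<|user|>\n{user_prompt}\n<|end|>\n"
        "<|assistant|>\n"  # generation continues from here
    )
```

In practice, `tokenizer.apply_chat_template` from transformers produces the same markup from a list of role/content messages and is less error-prone than hand-built f-strings.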

Deployment

Deploy the app (this publishes the web endpoint and prints its URL):

modal deploy hermes_modal_deploy.py

API Endpoints

| Endpoint  | Method | Description          |
|-----------|--------|----------------------|
| /         | GET    | Status page          |
| /health   | GET    | System health status |
| /generate | POST   | Code generation      |

Example Usage

# Health check
curl https://your-username--hermes-opencode-agent.modal.run/health

# Code generation
curl -X POST https://your-username--hermes-opencode-agent.modal.run/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Python ile Fibonacci serisi oluştur",
    "max_tokens": 256
  }'
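The same call can be made from Python with only the standard library. BASE_URL below is the placeholder hostname from the script above, so substitute your own deployment URL:

```python
import json
from urllib import request

BASE_URL = "https://your-username--hermes-opencode-agent.modal.run"  # placeholder

def build_generate_payload(prompt: str, max_tokens: int = 256) -> bytes:
    """Serialize a CodeRequest body as UTF-8 JSON bytes."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")

def generate_code(prompt: str, max_tokens: int = 256) -> dict:
    """POST to /generate and return the parsed CodeResponse as a dict."""
    req = request.Request(
        BASE_URL + "/generate",
        data=build_generate_payload(prompt, max_tokens),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    result = generate_code("Generate a Fibonacci sequence in Python")
    print(result["code"])
```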

Cost Analysis

| Resource  | Cost | Limit          |
|-----------|------|----------------|
| GPU hours | $0   | 40 hours/month |
| Bandwidth | $0   | 10GB/month     |
| Storage   | $0   | 256GB          |

Total cost: $0 🎉


Advantages

  1. Free: 40 hours/month of T4 GPU at no cost
  2. Fast: high performance on NVIDIA GPUs
  3. Easy: one-command deployment from a Python script
  4. Scalable: automatic scaling
  5. Open source: support for open-source models

Conclusion

The Modal.com + Hermes OpenCode Agent combination offers:

  • ✅ Free GPU access
  • ✅ Easy deployment
  • ✅ A production-ready API
  • ✅ Open-source model support
  • ✅ Automatic scaling

A great starting point for all your AI projects! 🚀

