| Model | Modality | Use cases |
|-------|----------|-----------|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |

| Task | Approach |
|------|----------|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |
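
The CLIP tasks above all reduce to comparing embeddings in a shared space. The following is a minimal sketch of that comparison step only, assuming the image and text embeddings have already been computed by a CLIP model. The function names (`zero_shot_classify`, `rank_images`), the toy 3-dimensional vectors, and the `logit_scale` default of 100 are illustrative assumptions, not part of any library API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Zero-shot classification: softmax over scaled image-to-label similarities.

    label_embeddings maps each text label to its (hypothetical) text embedding.
    logit_scale is an assumed temperature; adjust it for your model.
    """
    labels = list(label_embeddings)
    logits = [
        logit_scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


def rank_images(text_embedding, image_embeddings):
    """Image search: sort images by similarity to a text query, best match first."""
    scored = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real CLIP outputs (real embeddings have hundreds of dimensions).
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)

image_embeddings = {"cat.jpg": [0.8, 0.2, 0.1], "dog.jpg": [0.2, 0.8, 0.1]}
ranking = rank_images([0.1, 0.9, 0.0], image_embeddings)
```

Content moderation follows the same pattern as `zero_shot_classify` with safety categories as the labels, and image similarity is `cosine_similarity` applied to two image embeddings.
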

| CLIP variant | Parameters | Notes |
|--------------|------------|-------|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

Use when the request mentions "CLIP", "Whisper", "Stable Diffusion", "SDXL", "speech to text", "text to image", "image generation", "transcription", "zero-shot classification", "image-text similarity", "inpainting", or "ControlNet". Source: eyadsibai/ltk.