Vectoring AI
A hands-on comparison of greedy search, beam search, sampling, top-k, top-p, and contrastive search using small pretrained models
End-to-end guide: deploy and serve LLMs locally and at scale with llama.cpp for efficient CPU/GPU inference
End-to-end guide: deploy and serve LLMs at scale with vLLM for high-throughput, low-latency inference
From prompt caching to model routing: a practical guide to cutting LLM inference costs by 10x with semantic caching, continuous batching, quantization, prompt optimization, and cost-aware architecture
End-to-end guide: fine-tune a small model on Hugging Face with Unsloth and deploy locally with Ollama
Securing LLM applications in production: policy-based content screening, jailbreak detection, PII filtering, groundedness checks, and custom guardrails using Giskard Guards and Giskard Open Source
End-to-end guide: trace, monitor, and debug multi-turn agentic conversations with LangChain, LangGraph, and LangSmith — covering threads, runs, tool use, token cost, latency, and error tracking
A practical comparison of SFT, RLHF, DPO, ORPO, KTO, and GRPO for aligning pretrained language models with human preferences
End-to-end guide: from web scraping and data collection to preprocessing, tokenization, and pretraining a small language model with PyTorch and Unsloth
From crafting single prompts to designing dynamic systems that give LLMs everything they need to succeed
A practical comparison of GPTQ, AWQ, GGUF, and bitsandbytes for compressing pretrained language models
From setup to deployment: run and serve local LLMs easily with Ollama
From a single GPU to millions of requests: hardware foundations, serving engines, parallelism strategies, load balancing, Kubernetes orchestration, and production monitoring for on-premise LLM deployment
A practical guide to building reasoning capabilities in language models using RL, distillation, and chain-of-thought training
From dense to sparse: understanding MoE architecture, routing strategies, expert specialization, sparse upcycling, and fine-tuning MoE models with PyTorch and Unsloth