
ERA V5 Registration
ERA V5 is a course designed to train students to actually train Large Language Models from scratch — pre-training, post-training, alignment, and serving — at frontier scale. It is built for beginners willing to commit fully. The pace is fast and the learning curve is steep.
Join only if you can promise yourself ~6 months of disciplined immersion.
ERA V5 is a hands-on, lab-run course where students are taught to build, train, and release a frontier-scale Mixture-of-Experts language model. We do not teach LLM training as theory. We run the training, and YOU are the lab.
What we built in ERA V4
V4 produced a real, public, working frontier-scale model.
- LightningLM 0.1V: a 118.67B-parameter Mixture-of-Experts model grown through four stages (1.78B dense → 4.96B MoE → 9.36B MoE → 118.67B MoE). Trained on GPU nodes including A100 → H200 → B200, bf16 throughout, for around 40–50 days of pure GPU training. Public checkpoints (available after the technical paper release in the first week of June) are on HuggingFace under theschoolofai/LightningLM-0.1V-*.
- BrahmicTokenizer-131K: an Indic tokenizer that beats all other tokenizers in its class, with first-class support for Devanagari, Telugu, Odia, Bengali, Tamil, Kannada, Gujarati, Malayalam, Gurmukhi, Assamese, and more. Published on arXiv.
- Kronecker Embeddings: a factorized byte-level embedding construction that eliminates 91–94% of input-side parameters compared to a standard embedding table. Published on arXiv. (Link will be added in the first week of June.)
- Systems paper: the full training methodology, growth strategy, and infrastructure paper covering the LightningLM 0.1V run, now being finalized for public release.
In V4, the training run was largely executed as the last stage of the course, the Capstone.
V5 is built to change that: the whole course IS the training and the Capstone.
What V5 will train
A large Mixture-of-Experts model at frontier scale. Current SOTA at every layer of the stack (tokenizer, architecture, optimizer, kernels, parallelism, precision), assembled together rather than each studied in isolation. Pre-training, supervised fine-tuning, preference alignment, and inference serving are all covered before the run launches. The scaffold (the training framework itself) and the final model are released openly. The cohort writes and submits research papers on the contributions. The Capstone sub-projects from ERA V4 are now our syllabus!
How V5 is different from any other LLM course
There really is no other course like it. Because others aim to add more students or simply to earn money, nobody else will focus on actual research, on publishing something, or on creating something that India can be proud of. None of these objectives would satisfy their investors, nor would they have the stomach for them.
TSAI exists to build this capability in India and challenge the world. PERIOD. This is also why we offer a 15-day, no-questions-asked full refund, and charge 1/5th to 1/10th of what others charge. And how do we do this?
- We actually train the model, as part of the course. Session 20 is not a write-up of someone else's training run; it is the kickoff of ours. Training continues past the formal calendar with students staffed into running roles.
- You contribute to a public research artifact. The scaffold is released open-source. Real contribution earns named authorship on the systems paper.
- No history lessons. We teach the current best — optimizer, attention, alignment — not the evolution that led there. Time is too short for archaeology.
- Beginners welcome. Coasting impossible. We start from foundations, but you will be in deep technical territory by week six. You WILL fall behind if you do not dedicate your next 4-6 months. One missed class will set you back by weeks! Remember, you are doing what others take years to learn, with hundreds of professional staff to support them. Here, it is you and your instructor!
Course structure
| Item | Details |
|---|---|
| Duration | ~6 months, including the training run that continues past the formal calendar |
| Sessions | 20 classes, each up to 3 hours, live |
| Schedule | Every Saturday, 7:00 AM IST |
| Format | Live coding + weekly assignments + ongoing lab contributions |
| Assignments | After every class; minimum 70% to qualify for the completion certificate |
| Capstone | The actual training run itself, starting around the 22nd week; students are staffed into running roles (training tracking, evaluation, alignment, operations, ablation, narrative) and continue contributing past the formal calendar |
Syllabus
| # | Class | Focus |
|---|---|---|
| 1 | Transformer Foundations | Attention, multi-head attention, positional encodings; build a minimal transformer block from scratch |
| 2 | Tokenization & Vocabulary Design | BPE, WordPiece, SentencePiece; vocab size, merges, frequency sorting; Indic and multilingual |
| 3 | Data Collection & Sourcing | Sourcing across the full lifecycle: pre-training corpora, SFT, preference, safety, evaluation |
| 4 | Data Cleaning & Deduplication | Quality filters, MinHash/LSH dedup, toxicity/PII, contamination scans; reproducible at scale |
| 5 | Data Mixtures & Curriculum | Domain weighting, upsampling, mixture-shift effects on loss |
| 6 | Building the Training Dataset | Sharding, packing, streaming dataloaders, tokenized binary formats; resumable data ordering |
| 7 | Embeddings & Model Internals | Token, positional, factorized (Kronecker) embeddings; weight tying |
| 8 | Modern Attention Variants | RoPE, ALiBi, GQA/MQA, sliding-window, linear-attention families; long-context extension |
| 9 | Loss Functions & Output Heads | Cross-entropy, adaptive softmax, fused linear CE kernels, multi-token prediction |
| 10 | Training Loop Fundamentals | Forward/backward, gradient accumulation, mixed precision, gradient clipping |
| 11 | Optimizers & Learning-Rate Schedules | AdamW, weight decay, warmup, cosine vs WSD, EMA; linear scaling rule |
| 12 | Distributed Training I: Data Parallel & ZeRO | DDP, ZeRO 1/2/3; memory math for multi-GPU |
| 13 | Distributed Training II: Model & Pipeline Parallel | Tensor, pipeline, sequence parallelism; communication overhead, topology-aware placement |
| 14 | Mixture-of-Experts | Routing, load balancing, expert sharding, active-vs-total params |
| 15 | Stability, Debugging & Live Monitoring | Divergence detection, frozen-layer constraints, live training dashboards |
| 16 | Scaling Laws & Compute Planning | Chinchilla-style token/param trade-offs, compute budgeting, run sizing |
| 17 | Supervised Fine-Tuning | Current best SFT recipes; instruction datasets; LoRA/QLoRA; instruction-following benchmarks |
| 18 | Preference Alignment & Inference Serving | Current SOTA preference learning (GRPO/DPO family); vLLM serving, throughput/latency |
| 19 | Infrastructure, Checkpointing & Quantization | Cloud provisioning, fault tolerance, QAT; provisioning the actual cluster the run launches on |
| 20 | Training Run Kickoff & Ongoing Lab Operations | Launching the lab's flagship training run; ongoing roles continue past the formal calendar |
This is the 15th time this course has been written from scratch, starting from EVA V1!
What you need to bring
- A laptop. No local GPU required.
- Python comfort helps; we cover it. PyTorch is covered.
- A working internet connection for live Saturday classes.
- The willingness to write code every week and not skip assignments. Use of tools such as Claude or AntiGravity is allowed.
What we will not do
- We will not teach historical methods that are no longer SOTA.
- We will not pre-build the scaffold and hand it to you. You build it with us, live in class.
- We will not assign roles. Natural selection plays its role. Students who put in work get the contribution credit.
Pricing and refund
The course is priced significantly below comparable IIT/IIIT programs while covering work those programs do not attempt. Pricing differs for returning TSAI students. AWS credits are included; you do not pay separately for compute.
15-day, no-questions-asked refund — equivalent to the first 2 classes. After that window your seat is committed. There are no transfers to future cohorts.
Pricing details will be shared with your enrollment form.
Schedule
| Milestone | Date |
|---|---|
| Registration opens | 29th May, 11:59 PM |
| Registration closes | 18th June (or until capacity is met) |
| Enrolment opens | 9th June |
| Enrolment closes | 19th June (or until capacity is met) |
| First class | 21st, 7:00 AM IST (Saturday) |
Seats are limited. In every prior cohort, registration has closed before the deadline.
This is not a course you finish by watching videos. ERA V5 demands real weekly work. The training run we are building is the kind of work that frontier labs do with 100+ experienced staff — we are doing it with 300 motivated students plus an instructor who has done it once before, solo. Once you cross the 15-day refund window, you are committed. Buckle up.
Public links to LightningLM 0.1V, BrahmicTokenizer-131K, and Kronecker Embeddings are available at theschoolof.ai. Read them before you enrol — they show exactly the kind of work this course produces.
Register for ERA V5
Drop your details below. We'll email you when enrolment opens — no spam, ever.
