Ryujin 3.5 (2025)

Works best with vLLM for production (supports MoE expert parallelism) or llama.cpp (with MoE kernels) for CPU inference.

Ryujin 3.5 vs. The Competition

| Feature | Ryujin 3.5 | Mixtral 8x7B | DeepSeek-V2 |
| :--- | :--- | :--- | :--- |
| Active Params | 6B | 12B | 21B |
| Total Params | 35B | 47B | 236B |
| Expert Count | 16 | 8 | 160 |
| Context Window | 256k | 32k | 128k |
| License | Apache 2.0 | Apache 2.0 | MIT |
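The gap between active and total parameters in the table comes from top-k routing: a gating network scores every expert for each token, but only the top few actually run. Here is a minimal numpy sketch of top-2 gating over 16 experts (the expert count is from the table; the hidden size and weights are made-up toy values, not Ryujin's actual gating):

```python
import numpy as np

def top2_route(x, gate_w, k=2):
    """Score all experts for one token, keep the top-k, renormalize their weights."""
    logits = x @ gate_w                       # one score per expert
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                   # expert ids + mixture weights

rng = np.random.default_rng(0)
num_experts, d = 16, 64                       # 16 experts (per the table), toy hidden size
gate_w = rng.standard_normal((d, num_experts))
token = rng.standard_normal(d)

experts, weights = top2_route(token, gate_w)
print(len(experts))                           # only 2 of 16 experts fire for this token
```

Only the selected experts' feed-forward weights participate in the forward pass, which is why per-token compute tracks the 6B active figure rather than the 35B total.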

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name as defined above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    load_in_4bit=True,  # critical for MoE memory savings
)

prompt = "Explain the significance of the Dragon God in Shinto mythology."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
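To see why 4-bit loading matters for an MoE model: even though only ~6B parameters are active per token, all 35B must be resident in memory, since any expert can be routed to. A back-of-envelope footprint comparison (weights only; ignores KV cache and quantization overhead):

```python
TOTAL_PARAMS = 35e9  # Ryujin 3.5 total parameter count, from the comparison table

def weight_gib(params, bits):
    """Approximate weight memory in GiB at a given precision."""
    return params * bits / 8 / 2**30

print(f"fp16:  {weight_gib(TOTAL_PARAMS, 16):.1f} GiB")  # ~65 GiB
print(f"4-bit: {weight_gib(TOTAL_PARAMS, 4):.1f} GiB")   # ~16 GiB
```

At fp16 the full expert set overflows a single 24 GB or 48 GB consumer/workstation GPU; at 4-bit it fits comfortably on one card.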

For developers, the lesson is clear: the era of dense LLMs is sunsetting. Have you run an MoE model locally? How does your experience compare to dense models like LLaMA? Share your benchmarks in the comments below.