
Scientific Multimodal Large Model Intern-S1 (Codename: Scholar)

WAIC 2025 Conference: Intern-S1 (Codename "Scholar") Released#


Shanghai Artificial Intelligence Laboratory released and open-sourced its latest scientific multimodal large model, Intern-S1 (codenamed "Scholar"), at the WAIC 2025 conference.

Model Architecture#

Intern-S1 is built on a Mixture-of-Experts (MoE) architecture and comprises:

  • A language model with 235 billion parameters (Qwen3)
  • A vision encoder with 6 billion parameters (InternViT)
  • Total scale: 241 billion parameters
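For readers who want a feel for what using the open-source release looks like, the sketch below loads it with Hugging Face transformers. The repo id `internlm/Intern-S1`, the `trust_remote_code` flag, and the multi-GPU sharding setup are assumptions for illustration; the official model card is the authoritative reference.

```python
# Minimal sketch: loading the open-source Intern-S1 weights with Hugging Face
# transformers. The repo id "internlm/Intern-S1" is an assumption, and a
# 241B-parameter MoE model needs a multi-GPU node, so device_map="auto" is used
# to shard the weights across the available devices.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "internlm/Intern-S1"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to keep memory manageable
    device_map="auto",           # shard the MoE experts across GPUs
    trust_remote_code=True,
)
```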

Training Data and Capabilities#

  • A 5-trillion-token training corpus, more than half of which consists of specialized domain knowledge.
  • A 128K-token context window, enough to take in several top-conference papers at once and analyze them sequentially (see the rough estimate below).
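As a rough sanity check on that claim (the per-paper token count below is an assumption, not a figure from the post):

```python
# Back-of-the-envelope estimate of how many full papers fit into a 128K-token window.
# The ~15,000 tokens per paper is an assumed average for an 8-10 page conference
# paper; real counts depend on the tokenizer and the paper's length.
context_window = 128_000
tokens_per_paper = 15_000          # assumed average per top-conference paper
prompt_and_answer_budget = 8_000   # room reserved for instructions and the reply

papers_that_fit = (context_window - prompt_and_answer_budget) // tokens_per_paper
print(papers_that_fit)  # -> 8
```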

Data Analysis Capability#

  • Reads data trends in scientific charts and explains the underlying logic in conjunction with the accompanying text.
  • Understands what a figure shows, interprets the physical process it represents, and proposes the next experimental steps (see the prompting sketch below).
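In practice, this chart-analysis workflow reduces to a multimodal prompt: a figure image plus a question about the trend it shows. The sketch below reuses the `processor` and `model` from the loading example above and follows the generic Hugging Face vision-language chat convention; the exact Intern-S1 message format may differ, and the file name is made up.

```python
# Sketch: asking the model to read a trend from a scientific chart.
from PIL import Image

image = Image.open("figure3_conductivity_vs_temperature.png")  # hypothetical figure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": "Describe the trend in this plot, explain the physical process "
                     "behind it, and suggest the next experiment to run."},
        ],
    }
]

# apply_chat_template / generate follow the common HF vision-language pattern;
# the real Intern-S1 interface should be checked against its model card.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```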

Innovative Features#

Intern-S1 pioneered a cross-modal scientific analysis engine that adaptively tokenizes data from different modalities. Special sequences such as chemical formulas and protein sequences receive more efficient encoded representations, achieving over 70% compression and enabling the model to understand specialized, complex data.
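The post does not describe the tokenizer itself, so the snippet below is only a toy illustration of the idea: a vocabulary that recognizes whole chemical fragments emits far fewer tokens for a SMILES string than a near character-level baseline, which is where compression figures like the quoted 70%+ come from. The fragment list and the aspirin SMILES string are invented for the example.

```python
# Toy illustration (not the Intern-S1 tokenizer): why domain-aware tokenization
# compresses sequences such as SMILES chemical formulas.
import re

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, used only as an example

# "Generic" baseline: near character-level splitting, one token per character.
generic_tokens = list(smiles)

# "Domain-aware" toy vocabulary: frequent multi-character fragments matched first,
# falling back to single characters.
fragment_pattern = re.compile(r"C\(=O\)O|C1=CC=CC=C1|=O|Cl|Br|\(|\)|.")
domain_tokens = fragment_pattern.findall(smiles)

compression = 1 - len(domain_tokens) / len(generic_tokens)
print(len(generic_tokens), len(domain_tokens), f"{compression:.0%}")  # 24 4 83%
```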

Training Paradigm#

To balance generality and specialization, Intern-S1 adopts a training paradigm that fuses general and specialized learning:

  • Use vast amounts of general scientific data to broaden the model's knowledge.
  • Train numerous domain expert models to generate highly readable, logically clear professional data, whose quality is verified by customized domain agents.

Through this closed loop that feeds verified data back into pre-training, Intern-S1 combines strong general reasoning with multiple top-tier specialized abilities, a breakthrough in which one model handles a wide range of specialized tasks.
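Read concretely, the loop is: expert models draft domain data, customized domain agents verify it, and only verified samples flow back into the pre-training mixture. The sketch below is hypothetical pseudocode of that loop; none of the names (`expert_models`, `domain_verifier`, `build_specialized_corpus`) come from the Intern-S1 report.

```python
# Hypothetical sketch of the closed "general-specialist" data loop described above.
def build_specialized_corpus(expert_models, domain_verifier, prompts):
    """Expert models draft professional data; a domain agent keeps verified samples."""
    accepted = []
    for prompt in prompts:
        for expert in expert_models:
            candidate = expert.generate(prompt)    # draft readable, logical domain data
            if domain_verifier.check(candidate):   # quality and logic verification
                accepted.append(candidate)
    return accepted

# The accepted samples are then merged back into the general pre-training corpus,
# closing the loop between broad scientific data and expert-generated data.
```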

Training Efficiency#

A large-scale multi-task reinforcement learning framework was introduced in the later stages of training. The algorithm centers on "mixed rewards": tasks whose outputs can be verified receive rewards from rules and validators. This system keeps training energy consumption at only about 1% of Grok 4's while performance remains competitive.
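The "mixed rewards" idea can be pictured as a reward router: tasks with checkable answers are scored by rules or validators. The fallback to a learned reward model for non-verifiable tasks in the sketch below is an assumption about how such a mixture is usually wired, not something stated in the post, and all names are hypothetical.

```python
# Sketch of a "mixed rewards" router for multi-task RL. Verifiable tasks (math,
# code, chemistry with known answers) get rule/validator rewards; the reward-model
# fallback for open-ended tasks is an assumption, not from the post.
from typing import Callable, Dict

def make_mixed_reward(
    validators: Dict[str, Callable[[str, str], float]],
    reward_model: Callable[[str, str], float],
) -> Callable[[str, str, str], float]:
    """Return a reward function that routes each sample by its task type."""
    def reward(task_type: str, prompt: str, response: str) -> float:
        if task_type in validators:            # verifiable: rule/validator reward
            return validators[task_type](prompt, response)
        return reward_model(prompt, response)  # assumed fallback: learned reward
    return reward

# Example: an exact-match validator for tasks with a known ground-truth answer.
answers = {"What is 12 * 12?": "144"}
validators = {"math": lambda p, r: 1.0 if r.strip() == answers.get(p) else 0.0}
mixed = make_mixed_reward(validators, reward_model=lambda p, r: 0.5)
print(mixed("math", "What is 12 * 12?", "144"))  # -> 1.0
```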

Conclusion#

Will Intern-S1 become the standard answer for scientific multimodality? It is too early to say, but it points to another path: not simply building ever-larger models to compete on parameter counts, but starting from real needs and tackling application scenarios that are genuinely hard yet valuable. This direction differs from the pursuit of general capability in recent years. Models such as GPT, Gemini, and Claude are mature at dialogue and code generation, yet they often produce unstable, illogical results when analyzing scientific plots or assisting with experimental design, and complex formulas read as gibberish to them.

Intern-S1, by contrast, takes on exactly these hard problems in scientific research, applying multimodality to literature analysis, experimental assistance, and other demanding, high-value scenarios, and opening a path toward the possibility of "specialized AI."
