InternRobotics

InternDataEngine: Pioneering High-Fidelity Synthetic Data Generator for Robotic Manipulation

100% credibility
Found Mar 18, 2026 at 29 stars.
AI Analysis
Python
AI Summary

InternDataEngine generates massive synthetic datasets of realistic robot manipulation using physics simulation for training embodied AI models.

How It Works

1
🔍 Discover robot practice data maker

You hear about a tool that creates pretend robot videos and movements for training AI without real hardware.

2
📥 Get the simulator ready

Download the free software and open it up on your powerful computer with a good graphics card.

3
🎮 Pick a robot job

Choose simple tasks like picking objects or sorting items that your robot needs to learn.

4
๐Ÿ—๏ธ Build practice scenes

Arrange tables, toys, and lights in different ways to make varied training situations.

5
⚡ Generate tons of data

Hit go and watch robots practice endlessly, creating videos and motion records super fast.

6
📂 Collect your dataset

Grab folders full of realistic robot videos and actions ready for AI training.

🚀 Train smarter robots

Your AI learns from endless perfect practice, getting ready for the real world.

AI-Generated Review

What is InternDataEngine?

InternDataEngine is a Python-based synthetic data generator that creates high-fidelity datasets for robotic manipulation tasks. Built on NVIDIA Isaac Sim, it simulates realistic physics across rigid, deformable, and fluid objects for single-arm, dual-arm, or humanoid robots, while applying domain randomization for diverse scenes, textures, and lighting. Users get scalable, multimodal data with ground-truth annotations like bounding boxes and keypoints, ready for training embodied AI models with strong sim-to-real transfer.
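Domain randomization, as described above, amounts to drawing scene parameters from fixed ranges before each simulated episode. The following is a minimal sketch of that idea in plain Python. It does not use Isaac Sim or InternDataEngine's own API; `SceneParams`, `sample_scene`, `sample_scenes`, and all value ranges are hypothetical names and numbers chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical per-episode scene settings (not the project's schema)."""
    light_intensity: float          # arbitrary units, assumed range
    texture_id: int                 # index into an assumed texture library
    object_xy: tuple[float, float]  # object position on the table, in metres


def sample_scene(rng: random.Random, num_textures: int = 50) -> SceneParams:
    """Draw one randomized scene from fixed, illustrative ranges."""
    return SceneParams(
        light_intensity=rng.uniform(200.0, 1500.0),
        texture_id=rng.randrange(num_textures),
        object_xy=(rng.uniform(-0.3, 0.3), rng.uniform(-0.2, 0.2)),
    )


def sample_scenes(seed: int, count: int) -> list[SceneParams]:
    """Draw `count` scenes from one seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(count)]


if __name__ == "__main__":
    for params in sample_scenes(seed=0, count=3):
        print(params)
```

Seeding a dedicated `random.Random` instance keeps each batch reproducible, which helps when a generated episode needs to be regenerated or debugged later.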

Why is it gaining traction?

It stands out by decoupling planning, rendering, and storage into async pipelines for 2-3x higher throughput on clusters, enabling billion-scale data generation without bottlenecks. The YAML-configurable workflows support long-horizon, multi-skill tasks that generic sim tools struggle with, plus easy export to formats like LeRobot datasets. Developers appreciate the fault-tolerant scheduling and precise annotations that cut manual labeling needs.
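The decoupling of planning, rendering, and storage can be pictured as three stages connected by bounded queues, so that a slow stage does not block the others until a queue fills up. The sketch below uses only Python's standard `asyncio` module. It is an illustration of the general pattern, not the project's implementation: the stage functions, the queue sizes, and the dummy trajectory data are all assumptions.

```python
import asyncio


async def plan(num_episodes: int, plan_q: asyncio.Queue) -> None:
    """Stage 1: produce one (dummy) trajectory per episode."""
    for episode_id in range(num_episodes):
        await asyncio.sleep(0)  # stand-in for real motion planning
        await plan_q.put({"episode": episode_id, "trajectory": [0.0, 0.1, 0.2]})
    await plan_q.put(None)  # sentinel: no more work


async def render(plan_q: asyncio.Queue, frame_q: asyncio.Queue) -> None:
    """Stage 2: turn each planned trajectory into (dummy) rendered frames."""
    while (item := await plan_q.get()) is not None:
        await asyncio.sleep(0)  # stand-in for real rendering
        item["frames"] = len(item["trajectory"])
        await frame_q.put(item)
    await frame_q.put(None)  # pass the sentinel downstream


async def store(frame_q: asyncio.Queue, sink: list) -> None:
    """Stage 3: persist results; here they are appended to an in-memory list."""
    while (item := await frame_q.get()) is not None:
        sink.append(item)


async def run_pipeline(num_episodes: int) -> list:
    # Bounded queues apply backpressure when a downstream stage falls behind.
    plan_q: asyncio.Queue = asyncio.Queue(maxsize=4)
    frame_q: asyncio.Queue = asyncio.Queue(maxsize=4)
    sink: list = []
    await asyncio.gather(
        plan(num_episodes, plan_q),
        render(plan_q, frame_q),
        store(frame_q, sink),
    )
    return sink


def generate(num_episodes: int) -> list:
    """Synchronous entry point that runs all three stages concurrently."""
    return asyncio.run(run_pipeline(num_episodes))


if __name__ == "__main__":
    print(generate(3))
```

In a real cluster deployment the stages would more likely run in separate processes or on separate machines; the single-process version here only shows how the hand-off between stages can be structured.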

Who should use this?

Robotics engineers training vision-language-action models on manipulation tasks such as pick-and-place or assembly. It also suits teams at research labs or startups that are scaling simulated data for sim-to-real policies, especially those already using Isaac Sim and needing diverse, physics-accurate trajectories. It is ideal if you're iterating on humanoid or dual-arm setups and want to avoid real-world data collection costs.

Verdict

Worth trying for Isaac Sim users needing custom synthetic data pipelines, but at 29 stars and 1.0% credibility, it's early-stage; expect some setup tweaks despite solid docs and linked arXiv papers. Pair with their Hugging Face datasets for quick wins.
