SAM Audio is the first unified multimodal AI model for audio separation. It combines a transformer-based architecture with the Perception Encoder Audiovisual (PE-AV) engine to achieve state-of-the-art results in isolating sounds from complex audio mixtures. The model runs faster than real time and comes in versions from 500M to 3B parameters, so users can trade speed against quality.
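"Faster than real time" is usually expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1 means the audio is processed more quickly than it plays back. The short Python sketch below shows only that arithmetic. The function name and the timings are illustrative assumptions, not measured SAM Audio figures.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers: a 10-second clip separated in 6 seconds.
rtf = real_time_factor(processing_seconds=6.0, audio_seconds=10.0)
is_faster_than_real_time = rtf < 1.0
print(is_faster_than_real_time)  # True
```

Under this definition, a larger model would typically show a higher RTF in exchange for better separation quality, which is the trade-off the parameter range is meant to offer.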
The model's core strength is its flexible prompting. Users can isolate audio with natural-language text prompts (e.g., 'lead guitar'), with visual prompts by clicking on a source in a video, or with span prompts by marking a sound's first appearance so that it is tracked and removed throughout the file. These methods can be combined for more precise control over the separation.
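To make the idea of combining prompt types concrete, here is a minimal Python sketch of a container that holds any mix of the three prompts described above. It is not the SAM Audio API: the class `SeparationPrompt`, its field names, and the tuple layouts for clicks and spans are all hypothetical, chosen only to illustrate how one request might carry several prompts at once.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SeparationPrompt:
    """Hypothetical container for the three prompt types (not a real API)."""

    text: Optional[str] = None                      # e.g. "lead guitar"
    click: Optional[Tuple[float, int, int]] = None  # assumed layout: (time in s, x, y)
    span: Optional[Tuple[float, float]] = None      # assumed layout: (start s, end s)

    def __post_init__(self) -> None:
        # A request with no prompt at all gives the model nothing to target.
        if self.text is None and self.click is None and self.span is None:
            raise ValueError("at least one prompt type is required")
        if self.span is not None and self.span[0] >= self.span[1]:
            raise ValueError("span start must precede span end")

    def kinds(self) -> List[str]:
        """Names of the prompt types that are set, in a fixed order."""
        return [n for n in ("text", "click", "span") if getattr(self, n) is not None]


# Combine a text prompt with a span marking where the sound first appears.
combined = SeparationPrompt(text="lead guitar", span=(12.0, 15.5))
print(combined.kinds())  # ['text', 'span']
```

Keeping every field optional while rejecting an empty request mirrors the description: each prompt works alone, and supplying more than one narrows the target.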
It has a wide range of practical applications: music producers can create stems and remixes, podcast creators can clean up dialogue and remove noise, and film professionals can isolate dialogue and sound effects in post-production. The technology also supports accessibility applications by enhancing speech clarity for users with hearing difficulties.




