Help:AI video dubbing

From Wikimedia Commons, the free media repository

This page is under construction. Please help build this help page. It is very incomplete.

Videos can be redubbed into other languages using modern AI tools. This page helps you create high-quality versions of videos in other languages and serves as a place to discuss and organize related work. ffmpeg commands will be added later.

Video made after some experimentation (with minor timing changes to add more time between sentences in about three places): English; German (original); Italian; Spanish; Vietnamese; French; Hindi; Chinese; Croatian; Czech
Long video made after experimentation, without any edits to the transcript: English; Spanish; French; Japanese; German (original)

Tutorials


SoniTranslate


SoniTranslate is free and open-source software (GitHub repo).

  • If you only want to create a transcript, enter the URL of the video's Opus export (found at the bottom of the video's file page on Commons) in the field at the top. This creates transcripts more quickly than providing the full video.
  • You need to change a few settings and select the language of the source audio in the dropdown.
  • Then you can enter the result into the TimedText page and check it for errors or other issues (for example, it often writes years as words instead of numerals).
  • Here you need ffmpeg to add the audio to the video.
  • After correcting the transcript, you can translate it with DeepL, but it is usually better to let the tool produce the translated transcript as well and adjust it afterwards if necessary. The best approach is: generate a transcript (possibly a translated one), correct it, and when done upload the SRT file to the tool so that it uses the SRT instead of the ogg audio.
  • To create a redubbed video, you can either let the tool generate a new video with translated audio, or create only the audio as an ogg file and add it to the video yourself. The latter is preferred because it is much quicker, reduces the risk of video quality loss, and lets you keep the original audio as an additional audio track (see here and here).
  • For this you need ffmpeg, either to add the audio to the video (very quick) or to convert the mp4/mkv video to WebM (can take a while).
  • After installing, start the tool by navigating in the console to the folder you installed it in and running python app_rvc.py. Append --cpu_mode if you do not have a graphics card (GPU); that is not a problem.
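Converting an mp4/mkv video to WebM, as mentioned above, could look like the following minimal sketch. It assumes an ffmpeg build with libvpx-vp9 and libopus; the function name to_webm and the quality value -crf 32 are illustrative choices, not settings recommended by this page. VP9 encoding is slow, which is why this step can take a while.

```shell
#!/bin/sh
# to_webm INPUT OUTPUT
# Hypothetical helper: re-encodes an mp4/mkv video to WebM (VP9 video, Opus audio).
# Assumes ffmpeg was built with libvpx-vp9 and libopus.
to_webm() {
  # -crf 32 together with -b:v 0 selects VP9 constant-quality mode;
  # a lower CRF gives higher quality and a larger file.
  ffmpeg -i "$1" -c:v libvpx-vp9 -crf 32 -b:v 0 -c:a libopus "$2"
}
```

Usage, for example: to_webm "/directory/redub.mp4" "/directory/redub.webm"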
Steps
  1. Select the source language English (the language of the closed captions), show Advanced settings, turn "Volume original audio" down to 0, turn max acceleration down to 0, check the two boxes "Acceleration Rate Regulation" and "Overlap Reduction", and select the target language. If you want to create an audio file, also configure TTS Speaker 1 accordingly (max speakers should usually be 1). Then either upload an SRT file in the SRT file box or submit a video or URL. In "Output type", select subtitle if you only want the transcript, or ogg for the audio.
  2. You can also create a custom voice from training data. When selecting the speaker, note that some voices fit the video better or sound better than others. Don't forget to change the speaker when changing the target language.
  3. Press Translate and wait. (Again, it is best to install the tool locally, which is much faster than the Hugging Face web UI for testing; sometimes you need to press Translate a few times.)
  4. Download the file. If it is an SRT file, you can open it from the Downloads window and copy and paste its contents into the TimedText page after changing the language code in the URL.
  5. Mute the original WebM video with ffmpeg (which needs to be installed): ffmpeg -i "/directory/name.webm" -c copy -an "/directory/name2.webm"
  6. Add the new audio track to this muted video with ffmpeg: ffmpeg -i "/directory/name2.webm" -i "/directory/audio1.ogg" […] TBA
  7. Optional: with ffmpeg, also add the audio track of the original video, and perhaps other audio tracks. However, doing this manually is a hassle, and these extra tracks can't be used when playing the file on WMC; they can only be played after downloading the video and switching the audio track. Also, every added track increases the file size (with 200 audio tracks the file would be much larger).
  8. When uploading to WMC, the old upload form may work better, but it only works for files under 100 MB. Consider also manually adding the redubbed version to the other_versions= field of the respective files.
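Steps 5 to 7 could be scripted roughly as follows. This is a minimal sketch, not the page's own command (step 6 is still marked TBA). It assumes the dubbed .ogg contains Vorbis or Opus audio, which WebM can hold without re-encoding, and uses -map to pick the streams explicitly. The function names and the ISO 639-2 language tags (for example spa, ger) are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helpers for steps 5-7; all streams are copied without re-encoding (-c copy).

# Step 5: mute_video INPUT.webm OUTPUT.webm
# Drops all audio streams (-an).
mute_video() {
  ffmpeg -i "$1" -c copy -an "$2"
}

# Step 6: add_audio MUTED.webm DUBBED.ogg OUTPUT.webm LANG
# Video from input 0, audio from input 1; -shortest ends the output with the shorter stream.
add_audio() {
  ffmpeg -i "$1" -i "$2" -map 0:v:0 -map 1:a:0 -c copy -shortest \
    -metadata:s:a:0 language="$4" "$3"
}

# Step 7 (optional): add_audio_keep_original ORIGINAL.webm DUBBED.ogg OUTPUT.webm LANG ORIG_LANG
# The dubbed audio becomes the first audio track, the original audio the second.
add_audio_keep_original() {
  ffmpeg -i "$1" -i "$2" -map 0:v:0 -map 1:a:0 -map 0:a:0 -c copy -shortest \
    -metadata:s:a:0 language="$4" -metadata:s:a:1 language="$5" "$3"
}
```

For example: mute_video "/directory/name.webm" "/directory/name2.webm", then add_audio "/directory/name2.webm" "/directory/audio1.ogg" "/directory/name_es.webm" spa.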

A tool like the image annotation tool could make all of this much easier and faster.

Until a better tool exists (preferably accessible on the Web to registered active Wikimedia contributors), it's probably best to install this software and experiment with it. You don't need a graphics card, and you can generate subtitles as well as audio fairly quickly. After that it's just two ffmpeg commands, plus renaming the files to the same names each time.
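Because only the audio changes between languages, the two ffmpeg commands can be repeated for each language in a loop. The sketch below assumes a hypothetical naming scheme in which the tool's audio for each language has been renamed to audio_CODE.ogg (for example audio_es.ogg) in the same folder as the original video; the function name and file names are illustrative only.

```shell
#!/bin/sh
# redub_all VIDEO.webm CODE...
# Hypothetical batch helper: for each language code, expects audio_<code>.ogg in the
# video's folder and writes <video name>_<code>.webm next to it.
redub_all() {
  video=$1
  shift
  dir=$(dirname "$video")
  base=$(basename "$video" .webm)
  muted="$dir/${base}_muted.webm"
  # Mute the original once; every language reuses the same silent video.
  ffmpeg -i "$video" -c copy -an "$muted" || return 1
  for lang in "$@"; do
    # Add the renamed audio for this language to the muted video.
    ffmpeg -i "$muted" -i "$dir/audio_$lang.ogg" -map 0:v:0 -map 1:a:0 -c copy -shortest \
      "$dir/${base}_$lang.webm" || return 1
  done
}
```

For example, redub_all "/directory/name.webm" es fr it would produce name_es.webm, name_fr.webm and name_it.webm.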

Instead of entering the transcript into DeepL one by one, you can also use the tool and configure another output language. For best results, use both, compare the two translations in a diff, and manually select the better translation for each part. It may also work better to use the transcript generated by the tool rather than one written by WMC users.
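One way to get such a diff, sketched under the assumption that both translations are saved as SRT files with the same cue boundaries (otherwise the lines will not align): strip cue numbers and timings, put each cue's text on one line, and run diff. The function names are hypothetical.

```shell
#!/bin/sh
# srt_text FILE.srt
# Hypothetical helper: prints the text of each SRT cue on a single line,
# dropping cue numbers and timings (Windows line endings are removed first).
srt_text() {
  tr -d '\r' < "$1" | awk '
    BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one cue per record, one line per field
    {
      text = $3                    # fields 1 and 2 are the cue number and the timing line
      for (i = 4; i <= NF; i++) text = text " " $i
      print text
    }'
}

# compare_translations A.srt B.srt
# Hypothetical helper: unified diff of the cue texts of two translations,
# e.g. one from DeepL and one from the tool; "-" lines come from A, "+" lines from B.
compare_translations() {
  a=$(mktemp)
  b=$(mktemp)
  srt_text "$1" > "$a"
  srt_text "$2" > "$b"
  diff -u "$a" "$b"
  rm -f "$a" "$b"
}
```

For example: compare_translations deepl.es.srt tool.es.srt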

If you copy the TimedText into an SRT file instead of letting the tool create the transcript, you need to edit the SRT to remove line breaks within sentences, because the tool adds a period at the end of each line.
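Removing the line breaks inside each cue can be done with a small awk script. This is a minimal sketch, assuming a plain SRT file in which each cue is a number line, a timing line and one or more text lines, with cues separated by blank lines; the function name is hypothetical. It joins the text lines of each cue with spaces, but it does not merge sentences that span several cues.

```shell
#!/bin/sh
# join_srt_lines INPUT.srt OUTPUT.srt
# Hypothetical helper: joins the text lines of every SRT cue into one line,
# so that a line break in the middle of a sentence no longer ends a "line".
join_srt_lines() {
  # tr removes Windows carriage returns; otherwise "blank" lines would contain \r
  # and the paragraph mode of awk would not see the cue boundaries.
  tr -d '\r' < "$1" | awk '
    BEGIN { RS = ""; FS = "\n" }     # paragraph mode: one cue per record, one line per field
    {
      if (NR > 1) printf "\n"        # blank line between cues
      print $1                       # cue number
      print $2                       # timing line, e.g. 00:00:01,000 --> 00:00:03,500
      text = $3
      for (i = 4; i <= NF; i++) text = text " " $i   # join the remaining text lines
      print text
    }' > "$2"
}
```

For example: join_srt_lines name.de.srt name.de.joined.srt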

Examples of videos made with this tool are in Category:AI-generated voice (all videos there as of early August except the "Decay phase..." video).

SpeechGen

Example of a redub made with this tool (voice EN Iron); transcript here
  • https://speechgen.io/en/subs/
  • You need ffmpeg to add the audio to the video after muting the original video, just like above.
  • One difference: remove the -shortest option from the ffmpeg command that adds the audio. The audio file ends before the video, so with -shortest the video would be cut off.
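The choice can also be made automatically by comparing the durations with ffprobe, which ships with ffmpeg. This is a sketch under that assumption; the function names are hypothetical, and like the other examples it copies the streams without re-encoding.

```shell
#!/bin/sh
# duration FILE
# Prints the container duration in seconds, as reported by ffprobe.
duration() {
  ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$1"
}

# add_audio_auto MUTED.webm AUDIO.ogg OUTPUT.webm
# Hypothetical helper: uses -shortest only when the audio is at least as long as the
# video, so that a shorter audio file never cuts the video off.
add_audio_auto() {
  video=$1 audio=$2 out=$3
  vd=$(duration "$video") || return 1
  ad=$(duration "$audio") || return 1
  # awk compares the two durations numerically; exit status 0 means audio < video.
  if awk -v a="$ad" -v v="$vd" 'BEGIN { exit !((a + 0) < (v + 0)) }'; then
    # Audio ends first: keep the full video length (no -shortest).
    ffmpeg -i "$video" -i "$audio" -map 0:v:0 -map 1:a:0 -c copy "$out"
  else
    ffmpeg -i "$video" -i "$audio" -map 0:v:0 -map 1:a:0 -c copy -shortest "$out"
  fi
}
```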

Known problems

Video as input; the tool creates the transcript and translates it
SRT subtitles as input; the tool creates audio to be added with ffmpeg (improved versions of both videos may be uploaded to fix issues)
  • Speech can overlap, but when acceleration is above 1 it can sound bad. The timings should probably be adjusted depending on the text, or one should only use transcripts generated with the tool rather than those written by WMC users (it may work much better with those).
  • Transcription often does not work as well in some languages other than English, so the tool more easily gets a few words wrong. If an English version of the video you want to transcribe exists, it may make sense to transcribe that and have the transcript translated, rather than transcribing the video in the other language.

[…]

Issues in SoniTranslate

Issues will be created

  • When soft-writing the subtitles, it writes them twice.
  • It adds a period at the end of each line of SRT files.
  • It's not clear from the UI how to convert subtitles to audio; the UI for that should be easier and clearer.
  • It can't export the video as WebM.
  • It's unclear how to prevent speech from being slowed down, and maybe it's not possible: accelerated speech can be prevented, but overly slow speech should be preventable too. A workaround could be to shorten the timing of the slowed-down part so the AI voice has to talk faster (this is especially needed at the end of the video).
  • It's unclear to users how to provide just plain text and let the tool create the audio or a TimedText; see Help:Spoken Wikipedia using AI for how that can be done.
  • It's unclear how to prevent near-overlaps even when "Overlap Reduction" is checked; it often sounds unnatural when the same voice starts a new sentence immediately after (or shortly before?) finishing the previous one.
  • It would be nice if the TTS speaker changed automatically when the target language is changed; this would speed things up until there is some UI using APIs.
  • Needs investigation: rarely, it drops phrases like "any inconsistency with the future" (only case) or duplicates things, as in "a.C. antes de Cristo" (only case).
  • Buttons that apply all the setting changes described above for redubbing, transcription, text-to-voice/Spoken Wikipedia, etc. would be useful (quickconfig buttons); then one would only have to start the tool and click a button instead of going through the settings anew every time.
Example of Spoken Wikipedia with an AI-generated voice made using SoniTranslate (ST)

See also
