Help:Spoken Wikipedia using AI
Wikipedia articles can be spoken using modern AI tools. This makes them more accessible to blind people and to people who would like to listen to articles on the go, similar to podcasts. This page helps you create high-quality audio narrations in the original language or of translations, and serves as a place to discuss and organize related work.
Tutorials
SoniTranslate
SoniTranslate is free and open source software (GitHub repo)
- Set it up on your local machine as described in the readme
- Launch it and wait until you can open the localhost page in your browser. Run `conda activate sonitr`, then `python app_rvc.py` (append `--cpu_mode` if you do not have a graphics card (GPU)).
- There (in the SoniTranslate Web UI), specify the source language, set Translate audio to to the same language (if you don't intend to create a translated spoken article), and select Max speakers 1 and a voice that matches the target language.
- Under Advanced settings, turn max acceleration down to none/1, enable Acceleration Rate Regulation, select output type "audio (ogg)" or "audio (mp3)", disable Subtitle type, and set Translation process (setting near the bottom) to disable_translation
- Install the Firefox addon Stylus
- Go to the Stylus preferences, create a new style, give it a name, select URLs starting with, enter e.g. `https://en.wikipedia.org/wiki/`, then paste this CSS and save:
.noprint, .reference, .reflist, .refbegin, .infobox, .IPA, .hatnote, .wikitable, .gallery, .sidebar, .tsingle, .trow, .timeline-wrapper, .mw-file-element, .side-box-flex, .box-Expand_section, .metadata, .ambox, .thumb, .mod-gallery, .mwe-math-element, .mw-highlight, .toc, .hauptartikel, .sieheauch, .rellink, .mw-default-size, .excerpt-hat, figure, table { display: none !important; }
- The above hides media files/captions, IPA pronunciations, tables, citation markers like [3], math equations, software code, see also links, hatnotes, etc. Don't forget to turn this style off when done. For other language Wikipedias you need to add further CSS classes, like `.sieheauch` for German-language Wikipedia (use the Inspect feature of Firefox, e.g. press F12, to find out which CSS classes to hide).
- Now copy the article's contents from the start to the beginning of the See also or the References section (if you know how to do this automatically let us know here)
- Paste the text into a text file that you save as an .srt file (e.g. "articlename.srt")
- At the top above the text enter something like:
1
00:00:00,000 --> 00:03:00,000
- At the bottom below your text enter something like "This was the English Wikipedia article about ARTICLENAME as of CURRENTDATE narrated by AI-generated voice via the open source software SoniTranslate." and save the file. You could also create files with just these contents so that you only need to paste the article text in between. For short articles, the narration seems to get slowed down unless you replace `00:03:00,000` with `00:00:10,000` (in SRT's hours:minutes:seconds,milliseconds notation this shortens the cue from 3 minutes to 10 seconds), so do that as well in these cases.
- Now remove things like International Phonetic Alphabet (IPA) pronunciation leftovers that are usually at the very beginning of the article (e.g. `()`), original-language terms that are difficult to pronounce and only disrupt the speech flow and listenability, section headers of sections whose contents were removed (usually just "Gallery"), table headers, and long mathematical equations. Check whether there is anything left in the article that should also be removed (one can do this while copying the contents). Note that you may want to include some information from tables (you can still see them on the mobile Wikipedia site or when disabling the Stylus style).
- Optional: If substantial content was removed, you may want to add a note to the audio about what was removed, e.g. "The table Bibliography is not included in the audio" or "For mathematical equations, see the Wikipedia article".
- Optional: Spell out abbreviations, for example i.e. as "that is", e.g. as "for example", i.a. as "inter alia" or "among other things", M often as "million", c. as "circa", Prof. as "Professor", Jr. as "Junior", and so on.
- In the Web UI under "Upload an SRT subtitle file" upload the srt file then click Translate at the top. It can take up to ~5 minutes.
- When you upload the file, please name it Wikipedia - article name (spoken by AI voice) and put it into the category Category:Spoken Wikipedia articles using English-language speech synthesis, or the corresponding category for whatever language you created the audio in. The language is known from the categories, from the context where the file is used, or from the Wikidata language qualifier, and can usually be inferred from the language of the title, so use the language of the audio in the filename, e.g. (prononcé par l'IA) or similar for French audio.
- If the audio is of good quality (and you probably shouldn't upload it otherwise), also add it to the relevant Wikidata item: on the Wikidata item connected to the Wikipedia article, click "add statement", enter "spoken text audio", enter the filename, then set the language as a qualifier by clicking "add qualifier", selecting "language of work or name" and entering the language on the right (then click publish). If you uploaded a batch of many audio files you can use OpenRefine to do this semi-automatically. For this, download and install version 3.7.9 (it doesn't work with the latest version), then copy the results of a query like this (adjust as needed) and paste them into OpenRefine's Clipboard to create a new project. Click on the column headers to split into multiple columns (uncheck remove column) to create a column that shows just the article name; if you named the files properly you can split by "Wikipedia - " and " (spoken by AI voice).mp3". Use the same function again to get a column with the filename without "File:". Then reconcile the column that shows the Wikipedia article title against Wikidata labels by clicking Reconcile against in the column header, selecting the Wikidata service, and selecting "against no particular type", which sadly seems to be the current best way to match against item labels, and manually match any unmatched items. (This could maybe be automated at some point so it works at scale, once there is a better spoken audio creation system that e.g. automatically replaces abbreviations and does most of the steps above automatically. Note that these Wikidata statements are currently not synced to their Wikipedia articles, so they aren't shown there, which is another issue.)
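The SRT preparation steps above (header, optional abbreviation expansion, closing attribution) could be scripted roughly as follows. This is a minimal sketch and not part of SoniTranslate: the function names `expand_abbreviations` and `build_srt` and the `short_article` flag are hypothetical, and the abbreviation table only covers a few of the examples listed above.

```python
import re

# A few abbreviations from the optional step above; extend as needed.
ABBREVIATIONS = {
    "i.e.": "that is",
    "e.g.": "for example",
    "i.a.": "among other things",
    "Prof.": "Professor",
    "Jr.": "Junior",
}


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its spelled-out form."""
    for short, spoken in ABBREVIATIONS.items():
        # (?<!\w) prevents a match that starts in the middle of a word.
        text = re.sub(r"(?<!\w)" + re.escape(short), spoken, text)
    return text


def build_srt(article_text: str, article_name: str, as_of: str,
              short_article: bool = False) -> str:
    """Wrap the cleaned article text in a single SRT cue with attribution."""
    # Shorter cue end for short articles, as described in the steps above.
    end = "00:00:10,000" if short_article else "00:03:00,000"
    closing = (
        f"This was the English Wikipedia article about {article_name} "
        f"as of {as_of} narrated by AI-generated voice via the open source "
        "software SoniTranslate."
    )
    body = expand_abbreviations(article_text.strip())
    return f"1\n00:00:00,000 --> {end}\n{body}\n{closing}\n"


if __name__ == "__main__":
    print(build_srt("Prof. Doe wrote it, i.e. alone.", "Example", "1 January 2025"))
```

Note that an abbreviation at the very end of a sentence loses its final period with this simple replacement, so the result should still be checked by hand.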
Known solutions to problems
- Usually, if it gets an error, you can simply try again
- If it hits the 5000 characters limit error (`Text length need to be between 0 and 5000 characters` in the console), you need to reduce the text size and create one audio file at a time if you want to have it translated (if not translating, just make sure Translation process is set to disable_translation as described above). Each srt file should be up to about 5 kB in size; after you have created audio files for each part you can combine them with ffmpeg or with the command `mp3wrap ./out.mp3 ./first.mp3 ./second.mp3 ./third.mp3` (there may be a way to increase the GT max length or to let SoniTranslate do the file splitting and merging).
- Long numbers, as in "Wikipedia en español tiene 7 274 112 usuarios", should probably be changed to "más de 7,2 millones" ("over 7.2 million") in the srt (this currently needs to be done entirely manually).
- Once it has finished you can convert the file to opus and/or trim the end of the file, if the audio file is longer than its contents, with this ffmpeg command:
ffmpeg -copyts -ss 00:00:00.0 -to 00:08:50.0 -i ./input.mpga -c:a libopus -b:a 192000 ./output.opus
(adjust the file paths and the second timestamp). You can replace 192000 with another audio bitrate, or remove `-copyts -ss 00:00:00.0 -to 00:08:50.0` if no trimming is needed.
- Most of the above steps would not be necessary if there were an audio export view like a print preview. A tool that creates these audios at scale, with many adjustments to improve quality, is needed, so it is probably better to build such a tool than to create many of these files semi-manually as described above.
- SoniTranslate doesn't yet have a proper config file, but you can modify app_rvc.py as in the example below (you can also modify it differently) after copying the unmodified file somewhere as a backup (line numbers are between @@):
app_rvc.py adjustment
|
---|
@@ -389,10 +389,10 @@
target_language="English (en)",
min_speakers=1,
max_speakers=1,
- tts_voice00="en-US-EmmaMultilingualNeural-Female",
+ tts_voice00="en-US-BrianMultilingualNeural-Male",
tts_voice01="en-US-AndrewMultilingualNeural-Male",
tts_voice02="en-US-AvaMultilingualNeural-Female",
- tts_voice03="en-US-BrianMultilingualNeural-Male",
+ tts_voice03="en-US-EmmaMultilingualNeural-Female",
tts_voice04="de-DE-SeraphinaMultilingualNeural-Female",
tts_voice05="de-DE-FlorianMultilingualNeural-Male",
tts_voice06="fr-FR-VivienneMultilingualNeural-Female",
@@ -405,7 +405,7 @@
mix_method_audio="Adjusting volumes and mixing audio",
max_accelerate_audio=2.1,
acceleration_rate_regulation=False,
- volume_original_audio=0.25,
+ volume_original_audio=0.0,
volume_translated_audio=1.80,
output_format_subtitle="srt",
get_translated_text=False,
@@ -416,7 +416,7 @@
literalize_numbers=True,
segment_duration_limit=15,
diarization_model="pyannote_2.1",
- translate_process="google_translator_batch",
+ translate_process="disable_translation",
subtitle_file=None,
output_type="video (mp4)",
voiceless_track=False,
@@ -1096,7 +1096,7 @@
output = media_out(
media_file,
TRANSLATE_AUDIO_TO,
- video_output_name,
+ subtitle_file.name,
"wav" if "wav" in output_type else (
"ogg" if "ogg" in output_type else "mp3"
),
@@ -1498,7 +1498,7 @@
SOURCE_LANGUAGE = gr.Dropdown(
LANGUAGES_LIST,
- value=LANGUAGES_LIST[0],
+ value='English (en)',
label=lg_conf["sl_label"],
info=lg_conf["sl_info"],
)
@@ -1524,7 +1524,7 @@
max_speakers = gr.Slider(
1,
MAX_TTS,
- value=2,
+ value=1,
step=1,
label=lg_conf["max_sk"],
)
@@ -1539,7 +1539,7 @@
tts_voice00 = gr.Dropdown(
SoniTr.tts_info.tts_list(),
- value="en-US-EmmaMultilingualNeural-Female",
+ value="en-US-BrianMultilingualNeural-Male",
label=lg_conf["sk1"],
visible=True,
interactive=True,
@@ -1548,7 +1548,7 @@
SoniTr.tts_info.tts_list(),
value="en-US-AndrewMultilingualNeural-Male",
label=lg_conf["sk2"],
- visible=True,
+ visible=False,
interactive=True,
)
tts_voice02 = gr.Dropdown(
@@ -1740,7 +1740,7 @@
):
audio_accelerate = gr.Slider(
label=lg_conf["acc_max_label"],
- value=1.9,
+ value=1.0,
step=0.1,
minimum=1.0,
maximum=2.5,
@@ -1749,12 +1749,12 @@
info=lg_conf["acc_max_info"],
)
acceleration_rate_regulation_gui = gr.Checkbox(
- False,
+ True,
label=lg_conf["acc_rate_label"],
info=lg_conf["acc_rate_info"],
)
avoid_overlap_gui = gr.Checkbox(
- False,
+ True,
label=lg_conf["or_label"],
info=lg_conf["or_info"],
)
@@ -1774,7 +1774,7 @@
volume_original_mix = gr.Slider(
label=lg_conf["vol_ori"],
info="for Adjusting volumes and mixing audio",
- value=0.25,
+ value=0.0,
step=0.05,
minimum=0.0,
maximum=2.50,
@@ -1909,14 +1909,14 @@
)
translate_process_dropdown = gr.Dropdown(
TRANSLATION_PROCESS_OPTIONS,
- value=TRANSLATION_PROCESS_OPTIONS[0],
+ value="disable_translation",
label=lg_conf["tr_process_label"],
)
gr.HTML("<hr></h2>")
main_output_type = gr.Dropdown(
OUTPUT_TYPE_OPTIONS,
- value=OUTPUT_TYPE_OPTIONS[0],
+ value="audio (mp3)",
label=lg_conf["out_type_label"],
)
VIDEO_OUTPUT_NAME = gr.Textbox(
@@ -2124,7 +2124,7 @@
SoniTr.tts_info.tts_list(),
)
),
- value="en-US-EmmaMultilingualNeural-Female",
+ value="en-US-BrianNeural-Male",
label="TTS",
visible=True,
interactive=True,
@@ -2151,9 +2151,7 @@
):
docs_translate_process_dropdown = gr.Dropdown(
DOCS_TRANSLATION_PROCESS_OPTIONS,
- value=DOCS_TRANSLATION_PROCESS_OPTIONS[
- 0
- ],
+ value="disable_translation",
label="Translation process",
)
@@ -2349,7 +2347,7 @@
minimum=0,
maximum=1,
label="Envelope ratio",
- value=0.25,
+ value=0.0,
interactive=True,
|
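For the 5000-character limit described in the list above, splitting the article text into parts could be sketched as follows. This is an illustration under assumptions, not something SoniTranslate provides: `split_text` is a hypothetical helper, and it simply packs whole lines greedily and falls back to Python's `textwrap.wrap` for a single line that is longer than the limit.

```python
import textwrap


def split_text(text: str, limit: int = 5000) -> list[str]:
    """Greedily pack whole lines into chunks of at most `limit` characters."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines():
        # A single over-long line is broken at word boundaries instead.
        pieces = [line] if len(line) <= limit else textwrap.wrap(line, width=limit)
        for piece in pieces:
            candidate = piece if not current else current + "\n" + piece
            if len(candidate) <= limit:
                current = candidate
            else:
                # The current chunk is full: store it and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\nSecond paragraph.\nThird paragraph."
    for number, chunk in enumerate(split_text(sample, limit=40), start=1):
        print(f"--- part {number} ({len(chunk)} characters) ---")
        print(chunk)
```

Each resulting part would still need its own SRT header before upload, and the audio files of the parts would be combined afterwards as described above. A limit somewhat below 5000 leaves room for the header and closing text.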
Problems
So far there is also no proper audio player here with features like skipping back 10 seconds. There is a request to add functionality for buttons on the file description page (and later in the proper audio player) that jump to the timestamps where the article's sections start in the audio, so that one can also jump around the audio. It would also be nice if it played a distinctive sound for every section header and subsection header and/or said "Section:" before the header (see below).
- Trying to let it say "Section" before every section header by prepending "Section:" to all section headers with the Stylus CSS below doesn't work, because text generated via the CSS content property is not selected when copying the text.
.mw-heading2:before { content: 'Section: '; } .mw-heading3:before { content: 'Subsection: '; }
- Likewise, tables also can't be converted
- This copy-pasting still needs to be done manually – it would be great if this could be automated more, see: VP/T How to copy texts from Wikipedia articles via browser automatically?
- There can be issues with the pronunciation of words of other languages as well as numbers; this could be solved by setting it to 2 speakers and modifying the numbers
- It should know common abbreviations and either always spell them out or, if possible, infer what they refer to; in some cases this could be done if the tool knew the wikilink set on the text
- The voice volume can quickly turn noticeably louder for unknown reasons – Issue here
- Sometimes (in roughly every 4th audio), the voice and language are changed for one word (e.g. "formation" spoken in French instead of English) – Issue here
- A way to just specify a directory and let it create audio files for every srt file in it one after another would speed things up a lot – Issue here
- It would be useful if one could configure it to use the same name as the srt file for the resulting audio (mainly needed for the above point) – Issue here
- A changed favicon when the job is done would be useful if the above two points are not done (or until then)
- It would be useful if quotes were spoken by a second speaker
- For nested lists / indentation there would need to be some solution like it enumerating the lists with 1. and 1.1. etc.
- Various issues described already in Help:AI video dubbing#Known problems, such as missing quickconfig buttons (or a way of changing the default settings when starting it with some parameter like `--config:spokenWP`) or automatically selecting a voice that matches the output language
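The nested-list idea from the list above (enumerating items as 1., 1.1. and so on before narration) could look like the following sketch. It is only an illustration: `number_nested_list` is a hypothetical helper, and it assumes the list has already been extracted as (depth, text) pairs, which is not something the steps on this page produce.

```python
def number_nested_list(items: list[tuple[int, str]]) -> list[str]:
    """Prefix each (depth, text) item with a hierarchical number like 1.2."""
    counters: list[int] = []
    lines = []
    for depth, text in items:
        # Forget the counters of deeper levels that have just ended.
        del counters[depth:]
        # Open levels that were not seen yet (a skipped level starts at 0).
        while len(counters) < depth:
            counters.append(0)
        counters[depth - 1] += 1
        lines.append(".".join(str(n) for n in counters) + ". " + text)
    return lines


if __name__ == "__main__":
    outline = [(1, "Fruit"), (2, "Apples"), (2, "Pears"), (1, "Vegetables")]
    for line in number_nested_list(outline):
        print(line)
```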
Theoretically it may be possible to link to the different sections of the audio from the file description via temporal media fragments, but I haven't tried this. Please add information like this that is useful for improving the quality of these AI-enabled audio files here or on the talk page.
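As a rough illustration of the temporal media fragments idea, the sketch below builds one link per section using the `#t=start,end` syntax (in seconds) from the W3C Media Fragments URI specification. The helper name `section_fragments`, the example URL, and the timestamps are hypothetical, and whether the file description page or its player honours such fragments is, as said above, untested.

```python
def section_fragments(file_url: str,
                      sections: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Return (section title, URL with temporal fragment) pairs.

    `sections` holds (title, start in seconds) pairs in playback order.
    """
    links = []
    for index, (title, start) in enumerate(sections):
        if index + 1 < len(sections):
            # The section ends where the next one starts.
            end = sections[index + 1][1]
            fragment = f"#t={start},{end}"
        else:
            # The last section runs to the end of the file.
            fragment = f"#t={start}"
        links.append((title, file_url + fragment))
    return links


if __name__ == "__main__":
    # Placeholder URL and made-up timestamps, for illustration only.
    url = "https://example.org/Wikipedia_-_Example_(spoken_by_AI_voice).opus"
    for title, link in section_fragments(url, [("Lead", 0), ("History", 95)]):
        print(title, link)
```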
Bark
Bark by Suno AI is also open source. You could experiment with it using Wikipedia text as described above, but the voices don't seem as good as those of SoniTranslate, and it is currently not well suited for long texts or for texts that should be narrated exactly 1:1 without any additions, hallucinations or slight alterations by the AI. See "Advanced Long-Form Generation" here.
Missing integrated Wikimedia tool & audio-player
See the proposal linked in See also.
There needs to be a proper Wikipedia article to AI-spoken audio integrated tool, if possible also available as Web UI to active users, that does things automatically like removing all things that shouldn't be narrated, resolving abbreviations, adding attribution text at the end of the text, adding a standardized description with wikilink and categories, and so on (see above).
Likewise, there needs to be a proper audio player for spoken Wikipedia where it's possible to:
- add timestamps for each section so one can jump to the section one would like to listen to or skip a section (audio chapters)
- have a wider player so if skipping around with a click one doesn't miss the intended timestamp by minutes
- listen to the audio without the phone turned on (without having to download the file)
- have a next and previous track functionality (e.g. for listening to lists or many bookmarked items and this is also critical for the usefulness of audio files of music on WMC)
- jump back by 10 seconds (maybe also a few more jump options like that) as all audiobook and podcast players as well as YouTube have it
One could develop a separate app dedicated to just listening to Wikipedia articles that uses the Commons category for spoken audios, but it would be much better to have a proper audio player that is also suited for spoken articles in the Wikipedia and/or Commons app, because (1) that would make these apps more popular in general as people would have another major use case for them, (2) one would like to read the Wikipedia article alongside listening to the audio (for example to also see the media in it), and (3) it would be more accessible (many people already have these apps installed and aren't randomly searching for this functionality or seeing such an app in the PlayStore results).