Help:Spoken Wikipedia using AI

Wikipedia articles can be narrated using modern AI tools. This makes them more accessible to blind people and to people who would like to listen to articles on the go, much like podcasts. This page helps you create high-quality audio narrations, either in the article's original language or as translations, and serves as a place to discuss and organize related work. See also Wikipedia:WikiProject Wikipedia spoken by AI voice.

Tutorials

Earth

Two-level utilitarianism

Arch Mission Foundation

Linux

Elephant communication

Heraclitus

2022 in science#August (still some issues with abbreviations and brackets)

Potential cultural impact of extraterrestrial contact

Pig intelligence

SoniTranslate

SoniTranslate is free and open source software (GitHub repo)

Set it up on your local machine as described in the readme
Launch it and wait until you can open the localhost page in your browser: run conda activate sonitr, then python app_rvc.py (append --cpu_mode if you do not have a graphics card (GPU)).
In the SoniTranslate Web UI, specify the source language and set "Translate audio to" to the same language (unless you intend to create a translated spoken article). Select Max speakers 1 and a voice that matches the target language.
Under Advanced settings, turn max acceleration down to none/1, enable Acceleration Rate Regulation, select the output type "audio (ogg)" or "audio (mp3)", disable Subtitle type, and set Translation process (near the bottom) to disable_translation.
Install the Firefox addon Stylus

Go to the Stylus preferences, create a new style, give it a name, select URLs starting with, enter e.g. https://en.wikipedia.org/wiki/, then paste this CSS and save:

.noprint,
.reference,
.reflist,
.refbegin,
.infobox,
.IPA,
.hatnote,
.wikitable,
.gallery,
.sidebar,
.tsingle,
.trow,
.timeline-wrapper,
.mw-file-element,
.side-box-flex,
.box-Expand_section,
.metadata,
.ambox,
.thumb,
.mod-gallery,
.mwe-math-element,
.mw-highlight,
.toc,
.hauptartikel,
.sieheauch,
.rellink,
.mw-default-size,
.excerpt-hat,
figure,
table
{
    display: none !important;
}

The above hides media files and captions, IPA pronunciations, tables, citation markers like [3], math equations, software code, see-also links, hatnotes, etc. Don't forget to turn this style off when done. For other language Wikipedias you need to add further CSS classes, like .sieheauch for German-language Wikipedia (use the Inspect feature of Firefox, e.g. press F12, to find out which CSS classes to hide).
Now copy the article's contents from the start up to the beginning of the See also or References section (if you know how to do this automatically, let us know here).
Paste the text into a text file that you save as an .srt file (e.g. "articlename.srt")
At the top above the text enter something like:
1
00:00:00,000 --> 00:03:00,000
At the bottom below your text enter something like "This was the English Wikipedia article about ARTICLENAME as of CURRENTDATE narrated by AI-generated voice via the open source software SoniTranslate." and save the file. You could also keep template files with just these contents so you only need to paste the article text in between. For short articles, the narration seems to get slowed down unless you replace 00:03:00,000 with 00:00:10,000, so do that in these cases (SRT timestamps are hours:minutes:seconds,milliseconds, so this lowers the cue's end time from 3 minutes to 10 seconds).
Now remove leftovers of International Phonetic Alphabet (IPA) pronunciations, which are usually at the very beginning of the article (e.g. ()), original-language terms that are difficult to pronounce and disrupt the speech flow, section headers of sections whose contents were removed (usually just "Gallery"), table headers, and long mathematical equations. Check whether anything else is left in the article that should be removed (you can do this while copying the contents). Note that you may want to include some information from tables (you can still see them on the mobile Wikipedia site or when the Stylus style is disabled).
Optional: If substantial content was removed, you may want to mention this in the audio, e.g. "The table Bibliography is not included in the audio" or "For mathematical equations, see the Wikipedia article".
Optional: Spell out abbreviations: replace i.e. with "that is", e.g. with "for example", i.a. with "inter alia" or "among other things", M (often) with "million", c. (before numbers) with "circa", Prof. with "Professor", Jr. with "Junior", and "WHO" with "World Health Organization" or "W.H.O.". You may also want to make further adjustments as suggested in "Known solutions to problems", like replacing names such as "J. K. Rowling" with "JK Rowling" so the voice doesn't make strange long pauses, or replacing km2 / km² with "square kilometers".
In the Web UI under "Upload an SRT subtitle file" upload the srt file then click Translate at the top. It can take up to ~5 minutes.
When you upload the file, please name it Wikipedia - article name (spoken by AI voice) and put it into the category Category:Spoken Wikipedia articles using English-language speech synthesis, or the corresponding category for the language of the audio. Use the language of the audio in the title as well, e.g. (prononcé par l'IA) or similar for French audio. The language is otherwise known from the categories, the context where the file is used, or the Wikidata language qualifier, and can usually be inferred from the title language.
If the audio is of good quality (and you probably shouldn't upload it otherwise), also add it to the relevant Wikidata item: on the Wikipedia article's connected Wikidata item, click "add statement", enter "spoken text audio", enter the filename, then set the language as a qualifier by clicking "add qualifier", selecting "language of work or name", and entering the language on the right (then click publish). If you uploaded a batch of many audio files you can use OpenRefine to do this semi-automatically. For this, download and install version 3.7.9 (it doesn't work with the latest version), then copy the results of a query like this (adjust as needed) and paste them into OpenRefine's Clipboard to create a new project. Click on the column headers to split into multiple columns (uncheck remove column) to create a column that shows just the article name; if you named the files properly you can split by "Wikipedia - " and " (spoken by AI voice).mp3". Do this another time to get a column with the filename without "File:". Then reconcile the column that shows the Wikipedia article title against Wikidata labels by clicking Reconcile in the column header, selecting the Wikidata service, and selecting "against no particular type", which sadly seems to be the current best way to match against item labels. Finally, manually match any unmatched items. (This could maybe be automated at some point so it works at scale, once there is a better spoken audio creation system that e.g. automatically replaces abbreviations and does most of the steps above automatically. Note that these Wikidata statements are currently not synced to their Wikipedia articles, so they aren't shown there, which is another issue.)
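
The file-preparation steps above (cue header, article text, credit sentence) could also be scripted. The following is a minimal Python sketch, not part of SoniTranslate: the function name build_srt and its parameters are made up for illustration, and the caller decides which articles count as "short", since no threshold is given on this page.

```python
# Cue end times as used in the steps above.
DEFAULT_END = "00:03:00,000"
SHORT_END = "00:00:10,000"


def build_srt(article_text, article_name, as_of, short_article=False,
              wiki="English Wikipedia"):
    """Wrap cleaned article text in one SRT cue and append the credit sentence.

    build_srt is a hypothetical helper; it only reproduces the manual steps
    (header lines, pasted text, credit sentence at the bottom).
    """
    end = SHORT_END if short_article else DEFAULT_END
    credit = (
        f"This was the {wiki} article about {article_name} as of {as_of} "
        "narrated by AI-generated voice via the open source software "
        "SoniTranslate."
    )
    body = article_text.strip()
    return f"1\n00:00:00,000 --> {end}\n{body}\n{credit}\n"
```

The returned string can be saved as e.g. "Earth.srt" (UTF-8) and uploaded in the Web UI like a hand-made file.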

Known solutions to problems

If you get an error, simply try clicking the Translate button again. Deleting /tmp/gradio and re-uploading the srt file may also help.
If you hit the 5000-character limit error (Text length need to be between 0 and 5000 characters in the console) and want the text translated, you need to reduce the text size and create one audio file at a time; if you are not translating, just make sure Translation process is set to disable_translation as described above. Each srt file should be up to about 5 kB in size. After you have created audio files for each part, you can combine them with the command cat ./first.mp3 ./second.mp3 > out.mp3, with ffmpeg, or with the command mp3wrap ./out.mp3 ./first.mp3 ./second.mp3 ./third.mp3 (there may be a way to increase the GT max length or to let SoniTranslate do the file splitting and merging).
For longer articles (srt file larger than about 50 kB), the audio is often cut off – Issue here. Always check the end of the audio file, which should contain the Wikipedia crediting sentence. If it's cut off, copy the srt file and remove all the text that was already narrated to let it narrate the rest, then merge the two audio files with one of the commands in the point above.
The voice volume can suddenly turn noticeably louder for unknown reasons – Issue here. This can be fixed with the ffmpeg command ffmpeg -i ./v1.mp3 -filter:a "speechnorm=e=50:r=0.01:l=1" ./v2.mp3 (as described here; it may reduce the audio quality a bit; one can also use r=0.0001 instead; other command-line normalization tools didn't work).
- In practice, one would cd into the Downloads directory into which both mp3s are downloaded, rename them to 1.mp3 and 2.mp3, and then always run this command: cat ./1.mp3 ./2.mp3 > ./v2.mp3 && ffmpeg -i ./v2.mp3 -filter:a "speechnorm=e=25:r=0.0001:l=1" ./v3.mp3. A drawback of this approach is that the article name is lost from the file name, which is a problem when many files are created in a row, so rename v3.mp3 manually to match the article title.
Long numbers like Wikipedia en español tiene 7 274 112 usuarios in es:Wikipedia en español should probably be changed to "más de 7,2 millones" ("over 7.2 million") in the srt (this currently needs to be done entirely manually).
There can be issues with the pronunciation of names and abbreviations, e.g. here the HG of "H. G. Wells" should be pronounced more quickly, without the two long pauses. This could be solved by modifying such abbreviations (e.g. from "H.G." to "HG"). It would be best if names in articles were always wrapped in some CSS class, which would make it easier to adjust them; the same would be useful for things like x km² (y sq mi). The tool can also leave out things in square brackets inside quotes, so this needs to be adjusted as well, and it would be best if it pronounced such text in some distinguishable way. Use the search-and-replace functionality of your text editor to spot and replace such things.
Once it has finished, you can convert the file to Opus and/or trim the end of the file if the audio is longer than its contents (which may happen rarely for very short articles) with this ffmpeg command: ffmpeg -copyts -ss 00:00:00.0 -to 00:08:50.0 -i ./input.mpga -c:a libopus -b:a 192000 ./output.opus (adjust the file paths and the second timestamp). You can replace 192000 with another audio bitrate, or remove -copyts -ss 00:00:00.0 -to 00:08:50.0 if no trimming is needed.
Most of the above steps would not be necessary if there was an audio export view like a print preview. A tool to create these audios at scale and with lots of adjustments to improve quality is needed, so it's probably better to build such a tool than to create many audios semi-manually as described above.
SoniTranslate doesn't yet have a proper config file, but you can modify app_rvc.py like so after copying the unmodified file somewhere as a backup (the below is an example; you can also modify it differently; line numbers are between @@):

app_rvc.py adjustment

@@ -389,10 +389,10 @@
         target_language="English (en)",
         min_speakers=1,
         max_speakers=1,
-        tts_voice00="en-US-EmmaMultilingualNeural-Female",
+        tts_voice00="en-US-BrianMultilingualNeural-Male",
         tts_voice01="en-US-AndrewMultilingualNeural-Male",
         tts_voice02="en-US-AvaMultilingualNeural-Female",
-        tts_voice03="en-US-BrianMultilingualNeural-Male",
+        tts_voice03="en-US-EmmaMultilingualNeural-Female",
         tts_voice04="de-DE-SeraphinaMultilingualNeural-Female",
         tts_voice05="de-DE-FlorianMultilingualNeural-Male",
         tts_voice06="fr-FR-VivienneMultilingualNeural-Female",
@@ -405,7 +405,7 @@
         mix_method_audio="Adjusting volumes and mixing audio",
         max_accelerate_audio=2.1,
         acceleration_rate_regulation=False,
-        volume_original_audio=0.25,
+        volume_original_audio=0.0,
         volume_translated_audio=1.80,
         output_format_subtitle="srt",
         get_translated_text=False,
@@ -416,7 +416,7 @@
         literalize_numbers=True,
         segment_duration_limit=15,
         diarization_model="pyannote_2.1",
-        translate_process="google_translator_batch",
+        translate_process="disable_translation",
         subtitle_file=None,
         output_type="video (mp4)",
         voiceless_track=False,
@@ -1096,7 +1096,7 @@
             output = media_out(
                 media_file,
                 TRANSLATE_AUDIO_TO,
-                video_output_name,
+                subtitle_file.name,
                 "wav" if "wav" in output_type else (
                     "ogg" if "ogg" in output_type else "mp3"
                 ),
@@ -1498,7 +1498,7 @@
 
                     SOURCE_LANGUAGE = gr.Dropdown(
                         LANGUAGES_LIST,
-                        value=LANGUAGES_LIST[0],
+                        value='English (en)',
                         label=lg_conf["sl_label"],
                         info=lg_conf["sl_info"],
                     )
@@ -1524,7 +1524,7 @@
                     max_speakers = gr.Slider(
                         1,
                         MAX_TTS,
-                        value=2,
+                        value=1,
                         step=1,
                         label=lg_conf["max_sk"],
                     )
@@ -1539,7 +1539,7 @@
 
                     tts_voice00 = gr.Dropdown(
                         SoniTr.tts_info.tts_list(),
-                        value="en-US-EmmaMultilingualNeural-Female",
+                        value="en-US-BrianMultilingualNeural-Male",
                         label=lg_conf["sk1"],
                         visible=True,
                         interactive=True,
@@ -1548,7 +1548,7 @@
                         SoniTr.tts_info.tts_list(),
                         value="en-US-AndrewMultilingualNeural-Male",
                         label=lg_conf["sk2"],
-                        visible=True,
+                        visible=False,
                         interactive=True,
                     )
                     tts_voice02 = gr.Dropdown(
@@ -1740,7 +1740,7 @@
                         ):
                             audio_accelerate = gr.Slider(
                                 label=lg_conf["acc_max_label"],
-                                value=1.9,
+                                value=1.0,
                                 step=0.1,
                                 minimum=1.0,
                                 maximum=2.5,
@@ -1749,12 +1749,12 @@
                                 info=lg_conf["acc_max_info"],
                             )
                             acceleration_rate_regulation_gui = gr.Checkbox(
-                                False,
+                                True,
                                 label=lg_conf["acc_rate_label"],
                                 info=lg_conf["acc_rate_info"],
                             )
                             avoid_overlap_gui = gr.Checkbox(
-                                False,
+                                True,
                                 label=lg_conf["or_label"],
                                 info=lg_conf["or_info"],
                             )
@@ -1774,7 +1774,7 @@
                             volume_original_mix = gr.Slider(
                                 label=lg_conf["vol_ori"],
                                 info="for Adjusting volumes and mixing audio",
-                                value=0.25,
+                                value=0.0,
                                 step=0.05,
                                 minimum=0.0,
                                 maximum=2.50,
@@ -1909,14 +1909,14 @@
                             )
                             translate_process_dropdown = gr.Dropdown(
                                 TRANSLATION_PROCESS_OPTIONS,
-                                value=TRANSLATION_PROCESS_OPTIONS[0],
+                                value="disable_translation",
                                 label=lg_conf["tr_process_label"],
                             )
 
                             gr.HTML("<hr></h2>")
                             main_output_type = gr.Dropdown(
                                 OUTPUT_TYPE_OPTIONS,
-                                value=OUTPUT_TYPE_OPTIONS[0],
+                                value="audio (mp3)",
                                 label=lg_conf["out_type_label"],
                             )
                             VIDEO_OUTPUT_NAME = gr.Textbox(
@@ -2124,7 +2124,7 @@
                                         SoniTr.tts_info.tts_list(),
                                     )
                                 ),
-                                value="en-US-EmmaMultilingualNeural-Female",
+                                value="en-US-BrianNeural-Male",
                                 label="TTS",
                                 visible=True,
                                 interactive=True,
@@ -2151,9 +2151,7 @@
                                 ):
                                     docs_translate_process_dropdown = gr.Dropdown(
                                         DOCS_TRANSLATION_PROCESS_OPTIONS,
-                                        value=DOCS_TRANSLATION_PROCESS_OPTIONS[
-                                            0
-                                        ],
+                                        value="disable_translation",
                                         label="Translation process",
                                     )
 
@@ -2349,7 +2347,7 @@
                                     minimum=0,
                                     maximum=1,
                                     label="Envelope ratio",
-                                    value=0.25,
+                                    value=0.0,
                                     interactive=True,
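
The splitting mentioned under "Known solutions to problems" (for the 5000-character limit and for long articles whose audio gets cut off) can be done by hand, or roughly as in the Python sketch below. It is only an illustration: split_text and the default of 4500 characters (some headroom below the limit for the credit sentence) are assumptions, not SoniTranslate settings.

```python
def split_text(text, max_chars=4500):
    """Split text into parts of at most max_chars, breaking only at line ends.

    Hypothetical helper: each returned part can be pasted into its own
    srt file and narrated separately.
    """
    parts, current = [], ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue  # skip empty lines between paragraphs
        if len(para) > max_chars:
            raise ValueError("paragraph longer than max_chars; split it by hand")
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The paragraph does not fit any more: close this part, start a new one.
            parts.append(current)
            current = para
    if current:
        parts.append(current)
    return parts
```

The resulting audio files then still need to be merged with one of the commands listed above.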

Problems

Problems in the audios can be reported (and fixed) with {{Suggested improvements AI voice}}

There is no proper audio player here so far with features like skipping back 10 seconds. There is a request to enable adding buttons on the file description page (and later in the proper audio player) that jump to the timestamps where the article's sections start in the audio, so that one can also jump around the audio. It would also be nice if it played a distinctive sound for every section header and subsection header and/or said "Section:" before the header (see below).

Trying to make it say "Section" before every section header by prepending "Section:" to all section headers with the Stylus CSS below doesn't work, because the generated text is not selected when copying the text.
```
.mw-heading2:before {
  content: 'Section: ';
}
.mw-heading3:before {
  content: 'Subsection: ';
}
```
Likewise, tables can't be automatically converted to copyable notes like "Table xyz is not included in this audio."
This copy-pasting still needs to be done manually – it would be great if this could be automated more, see: VP/T How to copy texts from Wikipedia articles via browser automatically?
There can be issues with the pronunciation of words from other languages. Maybe this could be solved by setting it to 2 speakers, with one being of the other language if it's just one other language, or by using a multilingual voice for that (untested).
It should know common abbreviations and either always spell them out or, if possible, infer what they refer to; in some cases this could be done if the tool knew the wikilink set on the text.
Math equations / formulas can be difficult to exclude or include.
Sometimes (about once in every four audios), the voice and language are changed for one word (e.g. "formation" in French instead of English) – Issue here
A way to just specify a directory and let it create audio files for every srt file in it one after another would speed things up a lot – Issue here
It would be useful if one could configure it to use the same name as the srt file for the resulting audio (mainly needed for the above point) – Issue here
A changed favicon when the job is done would be useful if the above two points are not done (or until then)
It would be useful if quotes were spoken by a second speaker or otherwise in a clearly distinguishable way
For nested lists / indentation there would need to be some solution like it enumerating the lists with 1. and 1.1. etc.
Various issues already described in Help:AI video dubbing#Known problems, such as missing quickconfig buttons (or a way to change the default settings when starting it with some parameter like --config:spokenWP) or automatically selecting a voice that matches the output language
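
As a possible starting point for automating the copy-pasting, the plain text of an article can be requested from the MediaWiki API via the TextExtracts query (prop=extracts with explaintext). The Python sketch below has not been tried in this workflow; the function names, the User-Agent string, and the assumption that section headings arrive as lines like == History == are illustrative. It also prepends "Section:" / "Subsection:" to headings, which the CSS approach above cannot do, and stops at See also or References.

```python
import json
import re
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
STOP_HEADINGS = {"See also", "References"}
HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1\s*$")


def fetch_extract(title, api=API):
    """Fetch the plain-text extract of one article (needs network access)."""
    params = urllib.parse.urlencode({
        "action": "query", "prop": "extracts", "explaintext": 1,
        "exsectionformat": "wiki", "titles": title,
        "format": "json", "formatversion": 2,
    })
    req = urllib.request.Request(
        f"{api}?{params}",
        headers={"User-Agent": "spoken-wp-sketch/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Raises KeyError if the page does not exist or has no extract.
    return data["query"]["pages"][0]["extract"]


def prepare_text(extract, stop_headings=STOP_HEADINGS):
    """Announce headings and cut the text at the first stop heading."""
    out = []
    for line in extract.splitlines():
        match = HEADING.match(line)
        if match:
            level, name = len(match.group(1)), match.group(2)
            if level == 2 and name in stop_headings:
                break
            prefix = "Section" if level == 2 else "Subsection"
            out.append(f"{prefix}: {name}")
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out)
```

The extract may differ from what the Stylus CSS leaves visible on the page, so the text should still be checked before narrating it.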

Theoretically it may be possible to link to the different sections of the audio from the file description via Temporal media fragments, but this hasn't been tried. Please add information like that which is useful for improving the quality of these AI-enabled audio files here or on the talk page.
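
Until the tool handles abbreviations itself, some of the replacements suggested on this page (abbreviations, initials in names, km²) can be applied with a script instead of a text editor. The Python patterns below are a small illustrative subset written for this sketch; they will miss cases and may produce false positives, so review the result before narrating.

```python
import re

# (pattern, replacement) pairs; the selection follows the examples on this
# page, the regular expressions themselves are assumptions of this sketch.
REPLACEMENTS = [
    # "J. K. Rowling" / "H.G. Wells" -> "JK Rowling" / "HG Wells"
    (re.compile(r"\b([A-Z])\.\s?([A-Z])\.(?=\s+[A-Z][a-z])"), r"\1\2"),
    # "510 km²" / "510 km2" -> "510 square kilometers"
    (re.compile(r"(\d)\s*km(?:2|²)(?!\w)"), r"\1 square kilometers"),
    # "c. 1900" -> "circa 1900" (only directly before a number)
    (re.compile(r"\bc\.\s*(?=\d)"), "circa "),
    (re.compile(r"\bi\.e\."), "that is"),
    (re.compile(r"\be\.g\."), "for example"),
    (re.compile(r"\bi\.a\."), "among other things"),
    (re.compile(r"\bProf\."), "Professor"),
    (re.compile(r"\bJr\."), "Junior"),
]


def normalize(text):
    """Apply all replacements in order and return the adjusted text."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text
```

Language-specific rules (for example for other Wikipedias) would need their own list of patterns.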

Bark

Bark by Suno AI is also open source. You could experiment with it using Wikipedia text as described above, but its voices don't seem as good as those of SoniTranslate, and it's currently not well suited for long texts or for texts that should be narrated exactly 1:1 without any additions, hallucinations, or slight alterations by the AI. See "Advanced Long-Form Generation" here.

Missing integrated Wikimedia tool & audio-player

See the proposal linked in See also.

There needs to be a proper integrated tool that turns Wikipedia articles into AI-spoken audio, if possible also available as a Web UI to active users, and that automatically does things like removing everything that shouldn't be narrated, resolving abbreviations, adding the attribution text at the end, adding a standardized description with wikilink and categories, and so on (see above).

Likewise, there needs to be a proper audio player for spoken Wikipedia where it's possible to:

add timestamps for each section so one could jump to the one they would like to listen to or skip a section (audio chapters)
have a wider player so if skipping around with a click one doesn't miss the intended timestamp by minutes
listen to the audio without the phone turned on (without having to download the file)
have a next and previous track functionality (e.g. for listening to lists or many bookmarked items; this includes playlist-building features)
jump back by 10 seconds (maybe also a few more jump options like that) as all audiobook and podcast players as well as YouTube have it (this should also work via a button on the lockscreen and the buttons on the headphones)

One could develop a separate app dedicated to just listening to Wikipedia articles that uses the Commons category for spoken audios, but it would be much better to have a proper audio player that is also suited for spoken articles in the Wikipedia and/or Commons app, because 1. that would make these apps more popular in general, as people would have another major use case for them, 2. one would like to read the Wikipedia article alongside listening to the audio (also to see the media in there, for example), and 3. it would be more accessible (many people already have these apps installed and aren't randomly searching for this functionality or seeing such an app in the Play Store results).