Has anyone considered using a FOSS TTS engine for digitising the public domain texts on Marxists Internet Archive?

No Más@lemmygrad.ml · edit-2 11 days ago

So I tried converting “White Empire” by Indrajit Samarajiva, and the TTS engine made a whole 12-hour audiobook of all 70 chapters in an hour or so on my laptop! I also tried an alternative to epub2tts called Pandrator. I think it has more features, but for some reason I couldn’t get it to work (yet). I can’t share the audiobook here for copyright reasons, obviously, but I think I’ll give Lenin’s What Is to Be Done? a try next.

Also, so far, the few places where I ran into problems with epub2tts-kokoro are the reading of Roman numerals, some non-English pronunciations, and other such intricacies, which I assume are common in older public domain texts. That said, I think it made a good enough attempt at dividing the chapter names autonomously.
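One way to work around the Roman numeral problem is to rewrite the numerals before the text reaches the TTS engine. Below is a minimal sketch, assuming the numerals mostly appear after heading words such as "Chapter" or "Part". The function names and the list of heading words are hypothetical and are not part of epub2tts-kokoro.

```python
import re

_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Accepts only well-formed numerals (1 to 3999), so "IIII" or "VX" are rejected.
_VALID_ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

# Assumption: only convert numerals that follow a heading word, so the
# pronoun "I" and words like "Mix" are left alone.
_HEADING = re.compile(r"\b(Chapter|Part|Book|Section|Volume)\s+([IVXLCDM]+)\b")


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral string to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = _ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (IV = 4, XC = 90).
        if i + 1 < len(numeral) and value < _ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def expand_roman_numerals(text):
    """Replace 'Chapter IV' style headings with 'Chapter 4'."""
    def _replace(match):
        word, numeral = match.group(1), match.group(2)
        if not _VALID_ROMAN.match(numeral):
            return match.group(0)  # leave malformed numerals untouched
        return f"{word} {roman_to_int(numeral)}"

    return _HEADING.sub(_replace, text)
```

For example, `expand_roman_numerals("Chapter IV: I went home")` returns `"Chapter 4: I went home"`. Numerals that appear elsewhere, such as "Louis XIV", are not touched by this sketch and would need their own rule.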

CriticalResist8@lemmygrad.ml · 11 days ago

The few TTS models that I know are piper, kokoro, parler and dia. But I haven’t always found them to be interesting. piper and kokoro use presets (kokoro has them baked in, piper needs them to be downloaded as jsons and there’s not that many). parler and dia if I’m not mistaken can create a voice from a prompt, but it never really worked right for me.

But these are a few years old already and in this domain it’s already a lifetime ago, there’s probably Chinese models now that surpass them lol.

You could also look into cloning models to clone the authors’ own voices for some books (probably public domain books rather than living authors lol). Could be interesting. Someone else proposed that idea here iirc.

TTS is a solved problem, don’t get me wrong, but in the open source models above there was nothing that really made me go “oh yeah, this is it” yet :( I found the voices could slowly morph over long files, or sometimes come out completely different (as if they used another preset), or they don’t get the inflection just right in some cases. for audiobook usage it can easily take you out of it imo if it starts glitching.

What I would do (because I did that for my speech-to-text engine) is some more research to find SOTA TTS models on huggingface, and especially look at how they handle longer texts. you can use deepseek web with search enabled for this (“find SOTA open-weights TTS models 2026, make a comparison table of benchmarks”). Then in opencode I would send deepseek the huggingface pages of these models, and tell it to build my own software suite to leverage those models specifically. It really doesn’t take that many tokens, I had my v1 speech-to-text built in a single 256k context. Only problem is deepseek fucked up cache hit and miss right now on API and you will pay 5x what you should, so I would wait until they fix it. It should cost around 75 cents at most to build this, and the upside is: you have something that works for your needs specifically, you can keep using it for years to come, you can easily switch out the models later when new ones come out, and you can easily add more features as you need them. Fully custom software.

Once you have an engine built, you can just let it run 24/7. Add a batch-processing argument, put all your books in a folder, and just let it work overnight. Add a graceful quit+save ctrl+c command so it saves progress in an sqlite database, and add a batch setting so it can break down the book into batches. That way, instead of asking it to TTS 72 chapters in one go, you have it TTS 1 chapter at a time and then collate them automatically into a full mp3/wav/flac file. This should prevent some of the glitching over long generations, while also allowing you to multithread several chapters at the same time, so you could TTS 3 chapters at once instead of going linearly through the book. I assume epub2tts already does this, it’s a common technique.

I suggest python with a self-contained venv in which the entire project lives, that way it’s easily portable and editable later. It’s simple to use from the CLI too. I don’t know if you’re comfortable with the CLI, but you would just run “tts-epub --folder "path/to/folder" --batch-size 3” in the terminal, for example, and the engine will take care of the rest. If you’re on linux: add an alias to activate the venv in .bashrc. If you’re on windows I’m honestly not sure lol.
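The chapter-at-a-time idea with saved progress could be sketched roughly like this. It is only an illustration under stated assumptions: the table layout and every function name are hypothetical, and `synthesize` stands in for whatever callable actually drives the TTS model. Collating the chapter files into one audiobook is left out.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def open_progress_db(path):
    """Open (or create) the SQLite file that tracks finished chapters."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS chapters (
               book TEXT NOT NULL,
               idx  INTEGER NOT NULL,
               done INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (book, idx))"""
    )
    db.commit()
    return db


def pending_chapters(db, book):
    """Return the indices of chapters that still need to be synthesized."""
    rows = db.execute(
        "SELECT idx FROM chapters WHERE book = ? AND done = 0 ORDER BY idx",
        (book,),
    )
    return [row[0] for row in rows]


def process_book(db, book, chapters, synthesize, batch_size=3):
    """Synthesize unfinished chapters, batch_size at a time.

    `synthesize(book, index, text)` is a placeholder for the real TTS call.
    Returns True when the book is complete, False if interrupted by ctrl+c.
    """
    # Register every chapter once; rows that already exist are left alone.
    db.executemany(
        "INSERT OR IGNORE INTO chapters (book, idx) VALUES (?, ?)",
        [(book, i) for i in range(len(chapters))],
    )
    db.commit()
    todo = pending_chapters(db, book)
    try:
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            for start in range(0, len(todo), batch_size):
                batch = todo[start:start + batch_size]
                # Run the batch in parallel; list() waits for all of it.
                list(pool.map(lambda i: synthesize(book, i, chapters[i]), batch))
                # The database is only touched from the calling thread.
                db.executemany(
                    "UPDATE chapters SET done = 1 WHERE book = ? AND idx = ?",
                    [(book, i) for i in batch],
                )
                db.commit()
    except KeyboardInterrupt:
        print("Interrupted; finished batches are saved, rerun to resume.")
        return False
    return True
```

Because progress is committed after each batch, a ctrl+c loses at most the batch that was in flight, and the next run picks up from `pending_chapters`.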

With these three features you can let it run overnight on an ever-growing folder of books you want to TTS, then ctrl+c to stop in the morning when you want to use your computer, then restart the script at night. With tracking it will scan your ‘books to TTS’ directory when it starts, add new books to the queue, and continue the process right where it left off. It’s as automated as automated gets. Could also imagine moving finished book files to a FINISHED subdirectory automatically just for tidiness. TTS books are generated into a generated_audiobooks subdirectory and take on the name of the book file.
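The folder handling described here might look something like the sketch below. The accepted file extensions, the `.mp3` default, and the function names are assumptions made for illustration. The `generated_audiobooks` and `FINISHED` directory names come from the description above.

```python
import shutil
from pathlib import Path

# Assumption: these are the input formats the engine accepts.
BOOK_SUFFIXES = {".epub", ".txt"}


def scan_for_books(inbox):
    """List book files sitting directly in the 'books to TTS' directory."""
    return sorted(
        p for p in Path(inbox).iterdir()
        if p.is_file() and p.suffix.lower() in BOOK_SUFFIXES
    )


def audiobook_path(book, inbox, extension=".mp3"):
    """Build the output path: generated_audiobooks/<book name><extension>."""
    out_dir = Path(inbox) / "generated_audiobooks"
    out_dir.mkdir(exist_ok=True)
    return out_dir / (Path(book).stem + extension)


def move_to_finished(book, inbox):
    """Move a processed book file into the FINISHED subdirectory."""
    done_dir = Path(inbox) / "FINISHED"
    done_dir.mkdir(exist_ok=True)
    return Path(shutil.move(str(book), str(done_dir / Path(book).name)))
```

Since finished books are moved out of the inbox, each start-up scan only sees books that still need work, which keeps the queue logic simple.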

if you ever want to use another model you can just make a copy of your engine and ask the agent to migrate the code to use Y model instead (or handle both).

I don’t know if this speaks to you or not haha, but if it doesn’t just copy-paste my comment over to deepseek on opencode and it’ll figure it out - I actually wrote it in a way you could just send it to an agent and it would build this for you, if you trust my prompts that is lol.

No Más@lemmygrad.ml · 10 days ago

Thanks for the input, I’m a bit of a vibe coder myself but largely self taught so I appreciate the advice. I think if it works out as a custom batch suite it’d be great, I’ll probably put it out as a github (or perhaps gitlab?) repo but with the major disclaimer that the code would come without any warranty - on a P2P license.

I don’t think voice cloning is possible for the texts that I have in mind (Marx, Engels, Lenin) because there is a danger of misrepresenting their personalities without sufficient speech recordings available.

There’s this HF space called TTS-Spaces-Arena and Kokoro has the most votes. Even the audio sounds good to me (I’m 2 hours into the audiobook I generated on the fly), so unless there’s a reason to go elsewhere, I’ll be going ahead with Kokoro tentatively.

CriticalResist8@lemmygrad.ml · 10 days ago

no problem totally understand haha. I thought of it afterwards but you can also have deepseek install or even fork pandrator for you, that way you can develop features for your own needs on a working base.

No Más@lemmygrad.ml · 10 days ago

Of course, I’ll give it a try and revert.

PS: I can’t view the comment by me that you’re replying to. Is it just me?

CriticalResist8@lemmygrad.ml · 10 days ago

oh, I was wondering why you had a tag with “हिन्दी” written in it next to your name on that comment. You probably picked the language (Hindi) for that comment. You can change languages in your profile settings on the website directly, if you ctrl+a the list you will see all languages on lemmy.