Old 04-12-2026, 05:23 PM   #1
UriF
Converting from PDF to DOCX & EPUB - Suggestion

This is a suggestion!

Sometimes I download PDF files that are perfectly readable, but when you convert them to DOCX or EPUB you get garbage, because these PDFs were created WITHOUT SPACES. As a result you get a solid block of characters. Yes, you can split the words manually, but that takes a lot of time. Two applications handle this problem correctly: Adobe Acrobat and the online service online2pdf.com. All the other converters I tried (paid and free) FAILED. By the way, I spent a couple of days on this and found an open source library that solves the problem. It is SymSpell, which has been ported to various languages including C++, C#, Rust, Python...

I wrote a simple application for myself: I paste in the problem text and get the fixed text back. It is not OCR at all. I have no idea why such a simple solution is not a standard feature of ALL CONVERTERS.
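
To show the kind of fix I mean, here is a minimal sketch in Python of dictionary-based word segmentation using dynamic programming. This is not SymSpell's code or API, just a simplified stand-in for the same idea. The tiny WORD_COUNTS table, the UNKNOWN_CHAR_COST penalty and the function names are all made up for illustration; a real tool would load a large word-frequency dictionary instead.

```python
import math

# Hypothetical toy word counts (illustration only); a real tool would load
# a large frequency dictionary with tens of thousands of entries.
WORD_COUNTS = {
    "the": 500, "quick": 40, "brown": 30, "fox": 20, "jumps": 15,
    "over": 100, "lazy": 10, "dog": 25, "a": 400, "in": 300,
    "is": 350, "it": 320,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)
UNKNOWN_CHAR_COST = 20.0  # assumed penalty for a character not covered by any word


def word_cost(word):
    """Return -log(probability) of a known word, or None if it is unknown."""
    count = WORD_COUNTS.get(word.lower())
    if count is None:
        return None
    return -math.log(count / TOTAL)


def segment(text):
    """Insert spaces into text by choosing the cheapest split into words."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i] = lowest cost to split text[:i]
    back = [0] * (n + 1)            # back[i] = start of the last piece in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost = word_cost(text[start:end])
            if cost is None:
                if end - start != 1:
                    continue        # unknown pieces are only allowed one char at a time
                cost = UNKNOWN_CHAR_COST
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    # Walk the back-pointers to recover the pieces in order.
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    pieces.reverse()

    # Glue runs of unknown characters back together into one token.
    words = []
    prev_unknown = False
    for piece in pieces:
        unknown = word_cost(piece) is None
        if unknown and prev_unknown:
            words[-1] += piece
        else:
            words.append(piece)
        prev_unknown = unknown
    return " ".join(words)


print(segment("thequickbrownfoxjumpsoverthelazydog"))
# prints: the quick brown fox jumps over the lazy dog
```

With a toy dictionary like this, anything outside the word list just stays glued together as one unknown token, so the quality depends entirely on the dictionary you load.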

And a second suggestion. Tesseract is an open source project. Why not include it in the conversion from PDF?

P.S. This forum doesn't allow me to attach these problem files. If the developers want to see samples of such files, please reply to this post.