|
Converting from PDF to DOCX & EPUB - Suggestion
This is a suggestion!
Sometimes I download PDF files that are perfectly readable, but when you convert them to DOCX or EPUB you get garbage, because these PDFs were created... WITHOUT SPACES. As a result, you get a solid block of characters. Yes, you can split the words manually, but that takes a lot of time. Only two applications handle this correctly: Adobe Acrobat and the online service online2pdf.com. All the other converters I tried (paid and free) FAILED. By the way, I spent a couple of days on this and found an open source library that solves the problem: SymSpell, which has been ported to many languages, including C++, C#, Rust, Python...
I wrote a simple application for myself: I paste in the problem text and get the fixed text back. It is not OCR at all. I have no idea why such an absolutely simple solution is not a standard feature of ALL CONVERTERS.
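To show how little is needed, here is a minimal sketch of dictionary-based word segmentation in plain Python. It is not SymSpell's code and not the application mentioned above; it only illustrates the general idea of re-inserting spaces by picking the split with the most probable words. The tiny `WORD_COUNTS` table, the penalty of 20 per unknown character, and the function names are all assumptions made up for this example. A real tool would load a full frequency dictionary, and SymSpell's own word segmentation is more elaborate.

```python
import math

# Hypothetical toy word-frequency table (assumption for this sketch).
# A real tool would load a full frequency dictionary instead.
WORD_COUNTS = {
    "the": 500, "quick": 40, "brown": 30, "fox": 20, "jumps": 10,
    "over": 60, "lazy": 15, "dog": 25, "a": 400, "he": 100,
    "row": 5, "own": 10,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def word_cost(word):
    """Negative log probability of a known word.

    Unknown chunks get a heavy per-character penalty (arbitrary value).
    """
    count = WORD_COUNTS.get(word)
    if count is None:
        return 20.0 * len(word)
    return -math.log(count / TOTAL)


def segment(text):
    """Insert spaces into `text` by minimizing the total word cost."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i]: cheapest cost to split text[:i]
    back = [0] * (n + 1)           # back[i]: start of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost = best[start] + word_cost(text[start:end].lower())
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))


print(segment("thequickbrownfoxjumpsoverthelazydog"))
# prints: the quick brown fox jumps over the lazy dog
```

The dictionary lookup is case-insensitive but the original casing is kept in the output, so `segment("TheLazyDog")` gives `The Lazy Dog`. With a realistic dictionary the same loop works on whole paragraphs, one run of characters at a time.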
And a second suggestion: Tesseract is an open source project. Why not include it in the conversion from PDF?
P.S. This forum doesn't allow me to attach the problem files. If the developers want to see samples of such files, please reply to this post.
|