04-12-2026, 05:23 PM   #1
UriF
Junior Member
UriF began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Apr 2026
Device: none
Converting from pdf to docx & epub - Suggestion

This is a suggestion.

Sometimes I download PDF files that are readable, but when I convert them to DOCX or EPUB the output is garbage because the PDF files were created WITHOUT SPACES. As a result, you get a solid block of characters. Yes, you can split the words manually, but that takes a lot of time. Two applications handle this problem correctly: Adobe Acrobat and the online service online2pdf.com. All the other converters I tried (paid and free) FAILED. By the way, I spent a couple of days on this and found an open source library that solves the problem. It is SymSpell, which has been ported to various languages including C++, C#, Rust, Python...

I wrote a simple application for myself: I paste in the problem text and get back the fixed text. It is not OCR at all. I have no idea why this absolutely simple solution is not a standard feature of ALL CONVERTERS.
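For readers who want to see the idea behind this kind of fix, here is a minimal sketch of dictionary-based word segmentation. It is not SymSpell itself and not the application described above: the tiny WORD_FREQ table, the unknown-word penalty and the function names are all made up for illustration, and a real tool would load a large frequency dictionary and also handle capitals, punctuation and misspellings.

```python
import math

# Hypothetical toy frequency table; a real tool would load a large dictionary.
WORD_FREQ = {
    "the": 500, "quick": 20, "brown": 15, "fox": 10,
    "jumps": 8, "over": 60, "lazy": 5, "dog": 12, "a": 400,
}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(word) for word in WORD_FREQ)


def word_cost(word):
    """Return the cost of a chunk: lower means more plausible."""
    if word in WORD_FREQ:
        # Negative log-probability of a known word.
        return -math.log(WORD_FREQ[word] / TOTAL)
    # Assumed heavy per-character penalty for chunks not in the dictionary.
    return 25.0 * len(word)


def segment(text):
    """Split a lowercase string without spaces into the cheapest word sequence."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i] = cheapest cost of text[:i]
    back = [0] * (n + 1)            # back[i] = start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost = best[start] + word_cost(text[start:end])
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


print(segment("thequickbrownfox"))
```

With this toy table the call prints "the quick brown fox". Strings that can be split in more than one valid way are decided purely by word frequency, which is one reason a tool like this can still get a small share of the text wrong.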

And a second suggestion: Tesseract is an open source project. Why not include it in the conversion from PDF?
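As a rough sketch of what calling Tesseract from a converter could look like, the snippet below runs the tesseract command-line program on one page image. It assumes that tesseract is installed and on the PATH, and that the PDF pages have already been rendered to image files by some other tool, since Tesseract works on images rather than on PDF files. The function names are hypothetical and this is not how any existing converter does it.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG txt"""
    return ["tesseract", str(image), str(output_base), "-l", lang, "txt"]


def ocr_image(image, lang="eng"):
    """OCR one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    image = Path(image)
    # Tesseract appends ".txt" to the output base name by itself.
    output_base = image.with_suffix("")
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Show the command that would be run for a hypothetical page image.
    print(" ".join(tesseract_command("page-001.png", "page-001")))
```

Calling ocr_image("page-001.png") would write page-001.txt next to the image and return its contents; the file name here is only an example.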

P.S. This forum doesn't allow me to attach the problem files. If the developers want to see samples of such files, please reply to this post.
04-12-2026, 10:35 PM   #2
UriF
Sorry, it is not that simple. About 95% of the text is recognized correctly, but about 5% comes out wrong.
04-26-2026, 10:14 PM   #3
UriF
I asked Gemini and at last resolved this problem. The problem is that some PDF files are CORRUPTED and can be fixed using Ghostscript. It gave me a sample script:

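@echo off
:: Default Ghostscript location; edit this line if your version or folder differs
set "GS_PATH=C:\Program Files\gs\gs10.07.0\bin\gswin64c.exe"

:: If that doesn't exist, fall back to a custom path (edit this one line)
if not exist "%GS_PATH%" set "GS_PATH=C:\Your\Custom\Path\gswin64c.exe"

:: Create the output folder only if it is not already there
if not exist cleaned_files mkdir cleaned_files

:: Rewrite every PDF in the current folder into cleaned_files
for %%f in (*.pdf) do (
echo Cleaning "%%f"...
"%GS_PATH%" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="cleaned_files\%%f" "%%f"
)
pause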

Of course, this depends on Ghostscript being installed on the user's system. AFTER you run this script, converting the PDF to any format (DOCX, EPUB and so on) will produce a good output file. Why not add this code to your application? Of course, you can adapt the script to whatever operating system and programming language you prefer.
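As one possible adaptation, here is a sketch of the same batch job in Python, which also runs on Linux and macOS. It passes the same Ghostscript options as the batch script. The function names and the executable lookup (trying gs, gswin64c and gswin32c on the PATH) are my own assumptions, not something taken from any application.

```python
import shutil
import subprocess
from pathlib import Path


def find_ghostscript():
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def gs_command(gs, src, dst):
    """Build the Ghostscript call that rewrites src into a fresh PDF at dst."""
    return [
        str(gs),
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        str(src),
    ]


def clean_folder(folder=".", out_name="cleaned_files"):
    """Rewrite every PDF in folder into a subfolder called out_name."""
    gs = find_ghostscript()
    if gs is None:
        raise RuntimeError("Ghostscript executable not found on PATH")
    folder = Path(folder)
    out_dir = folder / out_name
    out_dir.mkdir(exist_ok=True)
    for src in sorted(folder.glob("*.pdf")):
        print(f"Cleaning {src.name}...")
        subprocess.run(gs_command(gs, src, out_dir / src.name), check=True)


if __name__ == "__main__":
    # Show the command for a hypothetical file without running Ghostscript.
    print(" ".join(gs_command("gs", "in.pdf", "cleaned_files/in.pdf")))
```

Calling clean_folder() in a folder of PDF files does the actual work. Keep in mind that the /ebook preset also downsamples images, so the cleaned files may look different from the originals.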

All times are GMT -4. The time now is 06:15 AM.


MobileRead.com is a privately owned, operated and funded community.