#1 |
|
Junior Member
Posts: 3
Karma: 10
Join Date: Apr 2026
Device: none
|
Converting from pdf to docx & epub - Suggestion
This is a suggestion.
Sometimes I download PDF files that read fine, but when I convert them to docx or epub I get garbage, because these PDF files were created... WITHOUT SPACES. As a result, you get a solid block of characters. Yes, you can split the words manually, but that takes a lot of time. Only two applications handled this correctly for me: Adobe Acrobat and the online service online2pdf.com. All the other converters I tried (paid and free) FAILED.
By the way, I spent a couple of days on this and found an open source library that solved the problem: SymSpell, which has been ported to several languages including C++, C#, Rust, and Python. I wrote a simple application for myself where I paste the problem text and get fixed text back. It is not OCR at all. I have no idea why this simple solution is not a standard feature of ALL CONVERTERS.
And the second suggestion: Tesseract is an open source project. Why not include it in conversion from PDF?
P.S. This forum doesn't allow me to attach the problem files. If the developers want to see samples of such files, please reply to this post.
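For anyone curious how re-inserting the spaces can work in principle, here is a minimal Python sketch. It is not SymSpell and does not use its API; it is a plain dictionary-based dynamic-programming segmenter, and the function name `segment` and the tiny word list `VOCAB` are made up for illustration only.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split a run of characters into known words; return None if no split exists."""
    n = len(text)
    # best[i] holds a list of words that exactly covers text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if best[start] is None:
                continue
            word = text[start:end]
            if word.lower() in vocabulary:
                candidate = best[start] + [word]
                # Among valid splits, prefer the one with the fewest words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary; a real tool would load a large word list.
VOCAB = {"the", "quick", "brown", "fox", "a", "an", "them", "there"}

print(" ".join(segment("thequickbrownfox", VOCAB)))  # prints "the quick brown fox"
```

A sketch like this returns nothing at all as soon as one word is missing from the vocabulary. A practical segmenter would presumably also need word frequencies and some tolerance for unknown words and typos.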
|
#2 |
|
Junior Member
Posts: 3
Karma: 10
Join Date: Apr 2026
Device: none
|
Sorry, it's not that simple. About 95% is recognized correctly, but about 5% comes out wrong.
|
#3 |
|
Junior Member
Posts: 3
Karma: 10
Join Date: Apr 2026
Device: none
|
I asked Gemini and at last resolved this problem. The problem is that some PDF files are CORRUPTED and can be fixed using Ghostscript. It gave me a sample script:
@echo off
:: Try the standard install path first
set GS_PATH="C:\Program Files\gs\gs10.07.0\bin\gswin64c.exe"
:: If that doesn't exist, edit this one line to point at your own install
if not exist %GS_PATH% set GS_PATH="C:\Your\Custom\Path\gswin64c.exe"
:: Create the output folder only if it is not already there
if not exist cleaned_files mkdir cleaned_files
:: Rewrite every PDF in the current folder through Ghostscript's pdfwrite device
for %%f in (*.pdf) do (
echo Cleaning "%%f"...
%GS_PATH% -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="cleaned_files\%%f" "%%f"
)
pause
Of course, it depends on Ghostscript being installed on the user's system. AFTER you run this script, converting the PDF to any format (docx, epub, and so on) gives a good output file. Why not add this code to your application? Of course, you can adapt the script to your operating system and preferred programming language.
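Since the script can be adapted to other systems and languages, here is a rough Python equivalent as a sketch. It assumes a Ghostscript executable is on PATH (usually `gswin64c` on Windows, `gs` on Linux/macOS). The helper names `build_gs_command` and `clean_pdfs` are hypothetical; the Ghostscript options are the same ones used in the batch script.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(gs_exe, src, dst):
    """Return the Ghostscript argument list that rewrites src into dst."""
    return [
        gs_exe,
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        str(src),
    ]


def clean_pdfs(folder=".", out_name="cleaned_files"):
    """Rewrite every PDF in folder into folder/out_name using Ghostscript."""
    # Assumption: the executable is on PATH under one of these two names.
    gs_exe = shutil.which("gswin64c") or shutil.which("gs")
    if gs_exe is None:
        raise FileNotFoundError("Ghostscript executable not found on PATH")
    folder = Path(folder)
    out_dir = folder / out_name
    out_dir.mkdir(exist_ok=True)  # no error if the folder already exists
    for src in sorted(folder.glob("*.pdf")):
        print(f"Cleaning {src.name}...")
        # check=True raises CalledProcessError if Ghostscript reports a failure.
        subprocess.run(build_gs_command(gs_exe, src, out_dir / src.name), check=True)
```

Calling `clean_pdfs()` from the folder that holds the PDFs should behave like the batch script. Passing the arguments as a list, not as one shell string, means file names with spaces need no extra quoting.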