#1 |
Enthusiast
Posts: 25
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
|
Need text-extraction engine for editable PDF
Hi all,
Please suggest the best text-extraction engine for getting the exact text out of a good-quality (editable) PDF. We have already tried several versions of ABBYY FineReader. It is useful when the PDF is image-based, but we need the exact text from the PDF. Could you kindly give me any suggestions?
qsipl
#2 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Copy/Paste? Standard free PDF conversion tools?
|
#3 |
Enthusiast
Posts: 25
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
|
Need text-extraction engine to extract text from editable PDF
Hi Toxaris,
Can you give me more details about that? It would be helpful for me. Thanks in advance.
Regards, qsipl
#4 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.
|
#5 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards. The goal at the time also matters: make it exact, or merely make it searchable? Searchable has a lot more tolerance.
If it must be exactly the same, then you need to proofread it all word for word, which is very time consuming. If it is current, produced by a word processor and then converted to a PDF, then the odds of good text are much higher. But if that were the case, you might be able to get hold of the original.
#6 | |||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
You should be able to use pdf2txt.py to extract the text directly: http://www.unixuser.org/~euske/python/pdfminer/
Hopefully, the person who originally created the PDF created it as a "tagged PDF". You should then be able to use the "-t tag" option to pull the text out relatively cleanly (I am not too sure if tagged PDFs also carry the formatting in the tags as well).
There is also xpdf: http://www.foolabs.com/xpdf/download.html and Poppler (I believe this was built to expand upon xpdf): http://poppler.freedesktop.org/
You could also try your hand at feeding it into Calibre and seeing what happens (I believe it uses Poppler on the backend?).
Quote:
Also, I was just taking a gander at Adobe Acrobat's site, and they have this as a feature in their Pro version: https://www.adobe.com/products/acrob...converter.html
I doubt it works anywhere close to how they make it seem... and probably only works for documents created with Adobe's own tools. Feed it a file made from something else, and these PDF -> XYZ programs usually explode.
Quote:
Saving as plain text or copying/pasting out of the PDF is going to cause a bunch more headaches.
Last edited by Tex2002ans; 05-16-2014 at 03:29 PM.
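For anyone who wants to script the xpdf/Poppler route mentioned above, here is a minimal sketch. It assumes the `pdftotext` command-line tool (shipped with both xpdf and Poppler) is installed and on the PATH. The helper names `build_pdftotext_cmd` and `extract_text` are made up for this illustration. It only reads the text layer that is already in the PDF; it does no OCR.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, out_path="-", keep_layout=True):
    """Build the argument list for the pdftotext tool (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout as far as it can
        cmd.append("-layout")
    # An output name of "-" tells pdftotext to write to stdout
    cmd += [pdf_path, out_path]
    return cmd


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the embedded text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example (assumes a file called book.pdf exists):
# text = extract_text("book.pdf")
```

Whether `-layout` helps or hurts depends on the book: it tends to suit tables and columns, while plain running text is often easier to clean up without it.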
|||
#7 |
Bibliophile
Posts: 59
Karma: 2017058
Join Date: Mar 2014
Location: Somewhere in the middle of the desert.
Device: Kobo Aura H2O
|
Though Acrobat Professional is expensive, it has some very good conversion features.
Acrobat Professional will allow you to save a good PDF into HTM, DOC or RTF format, along with TXT and JPG formats, using the SAVE AS command. I have tried a few large PDFs with formatted text and images, saved them to HTM format (HTML 4.01 with CSS 1.0), and it gave me an almost exact replica of the PDF. Using Sigil, I could make corrections to the HTM file and create an epub file.
The PDF-to-text conversion utilities are useless, as they lose the images and page formatting. The best option would be to convert the PDF to HTML format, which retains the formatting and the images. Try to look for some free PDF-to-HTML utilities on Google and experiment.
One word of caution while trying out these free utilities: they come bundled with unnecessary programs. Select custom install, read the instructions carefully screen after screen while installing, and opt out of any extra programs the installer tries to put on your system by un-clicking the check-marks. Don't keep pressing the Next button repeatedly.
Good luck!
#8 |
Enthusiast
Posts: 25
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
|
Need text-extraction engine to extract text from editable PDF
Hi all,
Thanks a lot for the responses. I will get back to you soon, after going through the links you referred to. Once again, thanks for spending your valuable time on me.
Thanks, qsipl
#9 | |||
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Quote:
Quote:
For anyone who thinks that even cutting & pasting works, take a nice big page in PDF--a high-quality, good PDF. Make sure you get some nice question marks, quotation marks, etc., in the selection. Then paste that, NOT into Word, but into Word's "SEARCH FOR" box--and look at what you get. That's what's really being pasted, or exported in the "Save as Word" or "Save as RTF" file options. It's garbage. Can it be cleaned up, with a lot of time by hand and eye? Yes. But it's not "exact," by ANY means.
ABBYY, in my experience, is still the best solution, and the worse the PDFs get, the better a solution it is.
(OP: you may safely rely on anything Tex2002ans tells you about scanning, OCR and clean-up; he's a steely-eyed ePUB pilot. Ditto anything Tox tells you about his tools--they are excellent.)
Just my $.02. Take it for what it's worth--but we've done well over a thousand PDF-->ePUB & MOBI conversions.
Hitch
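The "paste it into the search box" check above can also be done in a script. This is only a sketch, using nothing but the Python standard library: it lists every character in a pasted string that is not plain ASCII, along with its Unicode name, so curly quotes, soft hyphens and ligatures show up by name. The function name `suspicious_chars` and the sample string are invented for the illustration.

```python
import unicodedata


def suspicious_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII or control character."""
    found = []
    for ch in text:
        is_control = ord(ch) < 32 and ch not in "\n\t"
        if ord(ch) > 127 or is_control:
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((ch, f"U+{ord(ch):04X}", name))
    return found


# Invented sample: curly quotes, a soft hyphen and an "fi" ligature
sample = "\u201cWhy?\u201d she asked\u00ad \ufb01nally."
for ch, code, name in suspicious_chars(sample):
    print(code, name)
```

Anything the script reports is not necessarily wrong (curly quotes may be exactly what you want), but it shows what really came out of the PDF instead of what the page looks like.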
|||
#10 |
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
|
#11 | |
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Hitch |
|
#12 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.
|
#13 | |
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I guess I should clarify whether we're all talking about the same thing. roger64, do you mean fully-linked footnotes/endnotes, or...? With us, we tend to end up having to do a large amount of renumbering, because we tend to get a lot of works (not sure why this is) in which the author used an asterisk for items on pages, not numbers. That's a nice PITA. ;-)
Hitch
|
#14 | |||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I showed off an example of JPG haloing that made me pull my hair out: https://www.mobileread.com/forums/sho...3&postcount=30
Quote:
I hunted down a few videos/information trying to see how well the conversion ACTUALLY works, but they were not as technically in-depth as I would like... or they were just the typical generic marketing/useless fluff that didn't say anything of substance. I wish I knew of some trustworthy technically-minded review sites.
Quote:
FineReader tries to create links back/forth, but it:
Here is a real-life example of a book I worked on earlier this month: These two pages get morphed into this on EPUB export:
EPUB/HTML exported from FineReader: Spoiler:
If you export a large book, the footnote situation only gets much worse because of FineReader's horrible chapter splitting, so the missing footnotes + FineReader's auto-numbering create a huge mess.
My current method is just to go through the book and do a manual pass of all of the footnotes. While I am double-checking that all of the text is there, I also just do all of the formatting (blockquotes).
Anyway, from what I gather, the DOC/ODT export doesn't have much text that magically goes poof, but those two formats come along with their own host of problems/bloat (and I don't have much experience with those formats, since my workflow is OCR -> EPUB/HTML -> Sigil -> completed EPUB).
This is what the text from the two pages looks like in the completed EPUB: Spoiler:
Anyway, as you can see, PDFs cause a whole host of formatting problems when trying to get from PDF -> XYZ (particularly with split paragraphs, hard/soft hyphens, footnotes, headers/footers, numbered lists, tables, captions, etc. etc.).
Last edited by Tex2002ans; 05-22-2014 at 08:48 PM. Reason: Added some Spoiler Tags for the code.
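Two of the problems listed above, split paragraphs and hard/soft hyphens, can be partly handled by a script before the manual pass. The following is a rough sketch under simple assumptions, not anyone's actual workflow from this thread: paragraphs are assumed to be separated by blank lines, and every hyphen at the end of a line is assumed to be a line-break hyphen. That second assumption is wrong for real compounds ("well-\nknown" becomes "wellknown"), so the result still needs proofreading. The function name `clean_extracted` is made up.

```python
import re

SOFT_HYPHEN = "\u00ad"


def clean_extracted(text):
    """Rejoin hyphenated words and hard-wrapped lines in extracted PDF text."""
    # Soft hyphens are invisible break hints; drop them entirely
    text = text.replace(SOFT_HYPHEN, "")
    # Join a word split across a line break: "conver-\nsion" -> "conversion"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces; blank lines (paragraph breaks) are kept
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


raw = "The conver-\nsion of the\nfile failed.\n\nNext paragraph."
print(clean_extracted(raw))
```

Footnotes, headers/footers, lists and tables are not touched by this at all; those still need the manual pass described above.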
|||
#15 | |
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Quote:
|
Thread | Thread Starter | Forum | Replies | Last Post |
no text extraction for pdf with images and OCR | fxp33 | Conversion | 7 | 12-15-2015 07:22 AM |
Generate epub using text-recognized text in PDF not Pictures. | lordofazeroth | Conversion | 0 | 09-19-2013 04:16 PM |
Creating a standard editable format | ebooks-love | Calibre | 9 | 01-15-2012 06:52 PM |
User-Editable HTML in Templates? | marcot | Calibre | 0 | 06-15-2010 09:19 AM |
PDF extraction – what is the best tool? | Prospect | 21 | 09-27-2009 01:34 AM |