Need text extraction engine for editable PDF

qsipl · 05-16-2014, 03:02 AM

Hi all,

Please suggest the best text extraction engine for getting an exact text extract from the good quality(editable) PDF. We have already tried several versions of ABBYY FineReader. It is useful if it's an image PDF.

But we need the exact text from the PDF. Could you kindly give me any suggestions?

qsipl

Toxaris · 05-16-2014, 03:39 AM

Copy/Paste? Standard free PDF conversion tools?

qsipl · 05-16-2014, 08:20 AM

Hi Toxaris,

Can you give me more details about that? It would be helpful for me.

Thanks in advance.

Regards,
qsipl

Toxaris · 05-16-2014, 08:51 AM

Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.

mrmikel · 05-16-2014, 11:55 AM

Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards. The goal at the time also matters. Make it exact or merely make it searchable? Searchable has a lot more tolerance.

If it must be exactly the same, then you need to proofread it all word for word... very time consuming.

If it is current, produced by a word processor then converted to a PDF, then the odds of good text are much higher. But if that were the case, you might be able to get a hold of the original.

Tex2002ans · 05-16-2014, 04:26 PM

Quote:

Originally Posted by mrmikel

Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards.

I believe the original post stated "the good quality(editable) PDF"... I am thinking perhaps that this is just a digitally generated PDF (for example, directly out of LaTeX/InDesign/Word/LibreOffice/etc.).

You should be able to use pdf2txt.py to extract the text directly: http://www.unixuser.org/~euske/python/pdfminer/

Hopefully, the person who originally created the PDF created it as a "tagged PDF". You should then be able to use the "-t tag" option to pull the text out relatively cleanly (I am not too sure whether tagged PDFs also carry the formatting in the tags).

There is also xpdf: http://www.foolabs.com/xpdf/download.html

and Poppler (I believe this was built to expand upon xpdf): http://poppler.freedesktop.org/

You could also try your hand at feeding it into Calibre and seeing what happens (I believe it uses Poppler on the backend?).

Quote:

Originally Posted by Toxaris

Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.

Saving as Plain Text:

Won't save any formatting information.
Likely get hard line breaks
Likely get missing things like ligatures + unicode characters + dropcaps
Potentially get odd spacing issues introduced
Lose all slightly more complex objects (tables, formulas, etc. etc.)
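Some of the damage listed above (ligature code points, hard line breaks, words hyphenated across lines) can be patched mechanically after the export. Here is a minimal Python sketch, assuming the text was saved with blank lines between paragraphs; the function name and the ligature table are made up for illustration, not taken from any tool mentioned in this thread:

```python
import re

# Presentation-form ligatures that a PDF text layer may carry instead of plain letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(raw: str) -> str:
    """Expand ligatures, rejoin hyphenated line ends, and unwrap hard line breaks."""
    for ligature, plain in LIGATURES.items():
        raw = raw.replace(ligature, plain)

    # Rejoin a word split as "man-" + newline + "ufacturing".
    # Caveat: this also glues genuine compounds such as "power-" + "loom",
    # so the result still needs proofreading.
    raw = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", raw)

    # Blank lines are assumed to be paragraph breaks; single newlines are
    # assumed to be hard wraps and are replaced by spaces.
    paragraphs = re.split(r"\n\s*\n", raw)
    unwrapped = [
        " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        for paragraph in paragraphs
    ]
    return "\n\n".join(paragraph for paragraph in unwrapped if paragraph)


sample = "The \ufb01rst man-\nufacturing villages\nwere founded.\n\nNext paragraph."
print(clean_extracted_text(sample))
```

This does nothing for lost formatting, tables, or formulas; it only covers the purely textual problems.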

Also, I was just taking a gander at Adobe Acrobat's site, and they have this as a feature in their Pro version:

https://www.adobe.com/products/acrob...converter.html

I doubt it works anywhere close to how they make it seem... and probably only works for documents created with Adobe's own tools. Feed it a file made from something else, and these PDF -> XYZ programs usually explode.

Quote:

Originally Posted by mrmikel

If it must be exactly the same, then you need to proofread it all word for word... very time consuming.

Indeed indeed. PDF = horrendous input format, avoid it whenever possible.

Saving as plain text or copying/pasting out of the PDF is going to cause a bunch more headaches.

rraod · 05-16-2014, 06:42 PM

Though the Acrobat Professional program is expensive, it has some very good conversion features.

Acrobat Professional will allow you to save a good PDF in HTM, DOC or RTF format along with TXT and JPG formats using the SAVE AS command.

I have tried a few large PDFs with formatted text and images, saved them to HTM format (HTML 4.01 with CSS 1.0), and it gave me an almost exact replica of the PDF. Using Sigil, I could make corrections to the HTM file and create an epub file.

The PDF to Text conversion utilities are useless as they lose the images and page formatting. The best option would be to convert the PDF to HTML format, which retains the formatting and the images. Try to look for some free PDF to HTML utilities on Google and experiment.

One word of caution while trying out these free utilities. They come bundled with unnecessary programs. Select custom install and read the instructions carefully screen after screen while installing these utilities and opt out of any other extra programs the installer tries to put on your system by un-clicking the check-marks. Don't keep pressing the next button repeatedly.

Good Luck!

qsipl · 05-17-2014, 02:23 AM

Hi all,

Thanks a lot for the responses. I will get back to you soon after going through the links you referred to.

Once again, thanks for spending your valuable time on me.

Thanks,
qsipl

Hitch · 05-20-2014, 09:48 PM

Quote:

Originally Posted by rraod

Though the Acrobat Professional program is expensive, it has some very good conversion features.

Acrobat Professional will allow you to save a good PDF in HTM, DOC or RTF format along with TXT and JPG formats using the SAVE AS command.

I have tried a few large PDFs with formatted text and images, saved them to HTM format (HTML 4.01 with CSS 1.0), and it gave me an almost exact replica of the PDF. Using Sigil, I could make corrections to the HTM file and create an epub file.

You must have been extraordinarily fortunate, or you don't mind expending a LOT of time doing clean-up in HTML. I wouldn't use Acrobat Pro's export to ANYTHING feature for anything. The HTML it outputs is filthy. The Word files are just as bad. We have the entire suite of Adobe programs--everything from InDesign to Acrobat Pro, etc., and nothing in Acrobat exports to HTML, Word, etc., worth a damn, in my fairly experienced opinion.

Quote:

The PDF to Text conversion utilities are useless as they lose the images and page formatting. The best option would be to convert the PDF to HTML format, which retains the formatting and the images. Try to look for some free PDF to HTML utilities on Google and experiment.

Again, if someone is very experienced with regex, this can work, but a TON of cleanup is required.

Quote:

One word of caution while trying out these free utilities. They come bundled with unnecessary programs. Select custom install and read the instructions carefully screen after screen while installing these utilities and opt out of any other extra programs the installer tries to put on your system by un-clicking the check-marks. Don't keep pressing the next button repeatedly.

Good Luck!

I have yet to see any "PDF-->Word" or "PDF-->Anything" converters on the web, whether tools or websites, that work better than ABBYY FineReader. We do this for a living, and if there were ANYTHING out there that captured text and everything else better than ABBYY, regardless of price, we'd use it. The fact that the OP doesn't think that ABBYY does a good enough job tells me that either a) they expect some type of perfect export from the PDF, which is, literally, impossible (as the image layer and the text layer are absolutely, positively, ALWAYS different), or b) they haven't worked with ABBYY very much.

For anyone who thinks that even cutting & pasting works, take a nice big page in PDF--a high-quality, good PDF. Make sure you get some nice question marks, quotation marks, etc., in the selection. Then paste that, NOT into Word, but into Word's "SEARCH FOR" box--and look at what you get. That's what's really being pasted, or exported in the "Save as Word" or "Save as RTF" file options. It's garbage. Can it be cleaned up, with a lot of time by hand and eye? Yes. But it's not "exact," by ANY means. ABBYY, in my experience, is still the best solution, and the worse the PDFs get, the better a solution it is.
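Another way to see what is really being pasted, without going through Word's search box, is to list the code points of the pasted string. A minimal Python sketch; the function name and the sample string are made up for illustration, and in practice you would feed it text copied out of your own PDF:

```python
import unicodedata


def describe_non_ascii(text: str) -> list[str]:
    """List each distinct non-ASCII character as 'U+XXXX NAME', in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    # Characters without a Unicode name (e.g. private-use glyphs) are reported as UNNAMED.
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in seen]


# Hypothetical paste: curly quotes plus an "fi" ligature hiding inside a word.
pasted = "\u201cIs it \ufb01ne?\u201d"
for line in describe_non_ascii(pasted):
    print(line)
```

Anything that shows up here as a ligature, a private-use character, or UNNAMED is a spot where the pasted text differs from what the page appears to say.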

(OP: you may safely rely on anything Tex2002ans tells you about scanning, OCR and clean-up; he's a steely-eyed ePUB pilot. Ditto anything Tox tells you about his tools--they are excellent.)

Just my $.02. Take it for what it's worth--but we've done well over a thousand PDF-->ePUB & MOBI conversions.

Hitch

roger64 · 05-21-2014, 03:20 AM

Quote:

Originally Posted by Hitch

.../... ABBYY, in my experience, is still the best solution, and the worse the PDFs get, the better a solution it is.

I concur.

With ABBYY, do you manage to make it produce real endnotes (and not bookmarks that I must redo)? I may have missed something.

Hitch · 05-22-2014, 03:18 PM

Quote:

Originally Posted by roger64

I concur.

With ABBYY, do you manage to make it produce real endnotes (and not bookmarks that I must redo)? I may have missed something.

In short? No. ;-) I could go into a long discussion of it, but...no. We end up redoing them by hand, or at least, ensuring that they are right, by hand. There's just no footnote substitute yet for hand-coding.

Hitch

Toxaris · 05-22-2014, 05:02 PM

Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.

Hitch · 05-22-2014, 05:55 PM

Quote:

Originally Posted by Toxaris

Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.

Well...I suppose I think of it as "not," because we have to hand-check them, anyway.

I guess I should clarify if we're all talking about the same thing? roger64, do you mean fully-linked footnotes/endnotes, or...? With us, we tend to end up having to do a large amount of renumbering, because we tend to get a lot of works (not sure why this is), in which the author used an asterisk for items on pages, not numbers. That's a nice PITA. ;-)

Hitch

Tex2002ans · 05-22-2014, 09:31 PM

Quote:

Originally Posted by rraod

Acrobat Professional will allow you to save a good PDF in [...] JPG formats using the SAVE AS command.

Ugh... just don't save images of TEXT DOCUMENTS as JPG. (This is one of my huge pet peeves)

I showed off an example of JPG haloing that made me pull my hair out:

https://www.mobileread.com/forums/sho...3&postcount=30

Quote:

Originally Posted by Hitch

You must have been extraordinarily fortunate, or you don't mind expending a LOT of time doing clean-up in HTML. I wouldn't use Acrobat Pro's export to ANYTHING feature for anything. The HTML it outputs is filthy. The Word files are just as bad. We have the entire suite of Adobe programs--everything from InDesign to Acrobat Pro, etc., and nothing in Acrobat exports to HTML, Word, etc., worth a damn, in my fairly experienced opinion.

Thanks for the info... I am ALWAYS leery about these programs that convert (ESPECIALLY Adobe's programs, I know they love their bloat, and design their programs to work in THEIR ecosystem, and not play nice with others).

I hunted down a few videos/information trying to see how well the conversion ACTUALLY works, but they were not as technically in-depth as I would like.... or they were just the typical generic marketing/useless fluff that didn't say anything of substance.

I wish I knew of some trustworthy technically-minded review sites.

Quote:

Originally Posted by Toxaris

Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.

The HTML/EPUB export MANGLES footnotes.

Finereader tries to create links back/forth, but it:

May/may not toss out the actual footnote numbers (no rhyme or reason that I can figure out).
- I believe it is based on some sort of heuristics of a superscript number/symbol + if it is marked as a "footnote" style by Finereader
May or may not "combine" two footnotes into "one".
- So Finereader sticks 1 auto-number/link, but includes the text for footnotes 1+2 as an endnote.
Whole footnote paragraphs may just go poof (again, no rhyme or reason that I can figure out).
- This is especially true if the footnote is split across pages.
Finereader 12 has a very annoying bug that 11 did not have.
- In certain books, let us say there are 5 footnotes on a page, it will insert five links at the END of the page, instead of where the superscripts actually are in the text.

Here is a real life example of a book I worked on earlier this month:

[Attached images: pg031.png, pg032.png]

These two pages get morphed into this on EPUB export:

Marked in BLUE, you can see, Finereader tries to auto-insert endnotes + renumber, but mangles it completely.
Marked in RED are footnotes that Finereader missed (Footnote 1 on Page 31 + Footnote 1 on page 32 just went poof).
Marked in GREEN is where you can see, the second half of Footnote 2 on Page 31 just went poof into thin air.
Marked in ORANGE, you can see the superscript went into thin air (because the link in blue = Finereader's auto-numbering).
- Most of the time the superscript number is removed, but other times, it is STILL left there.

EPUB/HTML Exported from Finereader:

Spoiler:

Quote:

During the years immediately after the war, the aid given in the tariff of 1816 was not sufficient to prevent severe depression in the cotton manufacture. Reference has already been made to the disadvantages which, under the circumstances of the years 1815-18, existed for all manufacturers who had to meet competition from abroad. But when the crisis of 1818-19 had brought about a rearrangement of prices more advantageous for manufacturers, matters began to mend. The minimum duty became more effective in handicapping foreign competitors. At the same time the power-loom was generally introduced. Looms made after an English model were introduced in the factories of Rhode Island, the first going into operation in 1817; while in Massachusetts and New Hampshire the loom invented by Lowell was generally adopted after 1816.1 From these various causes the manufacture soon became profitable. There is abundant evidence to show that shortly after the crisis the cotton manufacture had fully recovered from the depression that followed the war.<a id="footnote1"></a><a href="#bookmark0">1</a> The profits made were such as to cause a rapid extension of the industry. The beginning of those man-ufacturing villages which now form the characteristic economic feature of New England falls in this period. Nashua was founded in 1823. Fall River, which had grown into some importance during the war of 1814, grew rapidly from 1820 to 1830.1 By far the most important and the best known of the new ventures in cotton manufacturing was the foundation of the town of Lowell, which was undertaken by the same persons who had been engaged in the establishment of the first power-loom factory at Waltham. The new town was named after the inventor of the power-loom. The scheme of utilizing the falls of the Merrimac, at the point where Lowell now stands, had been suggested as early as 1821, and in the following year the Merrimac Manufacturing Company was incorporated. 
In 1823 manufacturing began, and was profitable from the beginning; and in 1824 the future growth of Lowell was clearly foreseen.<a id="footnote2"></a><a href="#bookmark1">2</a>

<a id="bookmark0"></a><a href="#footnote1">1</a>

 The following passage, referring to the general revival of manufactures, may be quoted: “The manufacture of cotton now yields a moderate profit to those who conduct the business with the requisite skill and economy. The extensive factories at Pawtucket are still in operation. ... In Philadelphia it is said that about 4,000 looms have been put in operation within the last six months, which are chiefly engaged in making cotton goods, and that in all probability they will, within six months more, be increased to four times that number. In Paterson, N. J., where, two years ago, only three out of sixteen of its extensive factories were in operation ... all are now in vigorous employment.”—“Niles’s Register,” XXI., 39 (1821). Com-

<a id="bookmark1"></a><a href="#footnote2">2</a>

 See the account in Appleton, pp. 17-25. One of the originators of the enterprise said in 1824: “If our business succeeds, as we have reason to expect, we shall have here [at Lowell] as large a population in twenty

years from this time as there was in Boston twenty years ago.”—Batchel-

der, p. 69.

In Bishop, II., 309, is a list of the manufacturing villages of 1826. in which some twenty places are enumerated.

If you export a large book, the footnote situation only gets much worse because of Finereader's horrible Chapter splitting, so the missing footnotes + Finereader's auto-numbering creates a huge mess.

My current method is to just go through the book and do a manual pass of all of the footnotes. While I am double-checking that all of the text is there, I also just do all of the formatting (blockquotes).

Anyway, from what I gather, the DOC/ODT export doesn't have much text that magically goes poof, but those two formats come along with their own host of problems/bloat (and I don't have much experience with those formats, since my workflow is OCR -> EPUB/HTML -> Sigil -> completed EPUB).

This is what the text from the two pages looks like in the completed EPUB:

Spoiler:

Quote:

During the years immediately after the war, the aid given in the tariff of 1816 was not sufficient to prevent severe depression in the cotton manufacture. Reference has already been made to the disadvantages which, under the circumstances of the years 1815–18, existed for all manufacturers who had to meet competition from abroad. But when the crisis of 1818–19 had brought about a rearrangement of prices more advantageous for manufacturers, matters began to mend. The minimum duty became more effective in handicapping foreign competitors. At the same time the power-loom was generally introduced. Looms made after an English model were introduced in the factories of Rhode Island, the first going into operation in 1817; while in Massachusetts and New Hampshire the loom invented by Lowell was generally adopted after 1816.<a href="#fn22" id="ft22">[22]</a> From these various causes the manufacture soon became profitable. There is abundant evidence to show that shortly after the crisis the cotton manufacture had fully recovered from the depression that followed the war.<a href="#fn23" id="ft23">[23]</a> The profits made were such as to cause a rapid extension of the industry. The beginning of those manufacturing villages which now form the characteristic economic feature of New England falls in this period. Nashua was founded in 1823. Fall River, which had grown into some importance during the war of 1814, grew rapidly from 1820 to 1830.<a href="#fn24" id="ft24">[24]</a> By far the most important and the best known of the new ventures in cotton manufacturing was the foundation of the town of Lowell, which was undertaken by the same persons who had been engaged in the establishment of the first power-loom factory at Waltham. The new town was named after the inventor of the power-loom. 
The scheme of utilizing the falls of the Merrimac, at the point where Lowell now stands, had been suggested as early as 1821, and in the following year the Merrimac Manufacturing Company was incorporated. In 1823 manufacturing began, and was profitable from the beginning; and in 1824 the future growth of Lowell was clearly foreseen.<a href="#fn25" id="ft25">[25]</a>

[...]

<a href="#ft22" id="fn22">[22]</a> Appleton, p. 13; Batchelder, pp. 70–73.

<a href="#ft23" id="fn23">[23]</a> The following passage, referring to the general revival of manufactures, may be quoted: “The manufacture of cotton now yields a moderate profit to those who conduct the business with the requisite skill and economy. The extensive factories at Pawtucket are still in operation. . . . In Philadelphia it is said that about 4,000 looms have been put in operation within the last six months, which are chiefly engaged in making cotton goods, and that in all probability they will, within six months more, be increased to four times that number. In Paterson, N.J., where, two years ago, only three out of sixteen of its extensive factories were in operation ... all are now in vigorous employment.”—“Niles’s Register,” XXI., 39 (1821). Compare Ibid., XXII., 225, 250 (1822); XXIII., 35, 88 (1823); and passim. In Woodbury’s cotton report, cited above, it is said (p. 57) that “there was a great increase [in cotton manufacturing] in 1806 and 1807; again during the war of 1812; again from 1820 to 1825; and in 1831–32.”

<a href="#ft24" id="fn24">[24]</a> Fox’s “History of Dunstable”; Earl’s “History of Fall River.” p. 20 seq.

<a href="#ft25" id="fn25">[25]</a> See the account in Appleton, pp. 17–25. One of the originators of the enterprise said in 1824: “If our business succeeds, as we have reason to expect, we shall have here [at Lowell] as large a population in twenty years from this time as there was in Boston twenty years ago.”—Batchelder, p. 69.

In Bishop, II., 309, is a list of the manufacturing villages of 1826. in which some twenty places are enumerated.

Anyway, as you can see, PDFs cause a whole host of formatting problems when trying to get it from PDF -> XYZ (particularly with split paragraphs, hard/soft hyphens, footnotes, headers/footers, numbered lists, tables, captions, etc. etc.).
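The paired links in the completed EPUB above are mechanical once the markers are correct, so that part can be scripted. Here is a rough Python sketch, assuming the footnotes have already been checked by hand and typed as plain `[22]`-style markers in both the body text and at the start of each note; the function name and the `fn`/`ft` id prefixes simply mirror the sample above and are not something Finereader or Sigil produces:

```python
import re


def link_footnotes(body: str, notes: str) -> tuple[str, str]:
    """Turn plain [n] markers into paired anchors: body -> note and note -> body."""
    # In the body, every [n] becomes a link to the note with a return target.
    linked_body = re.sub(
        r"\[(\d+)\]",
        r'<a href="#fn\1" id="ft\1">[\1]</a>',
        body,
    )
    # In the notes, only a marker at the start of a line is treated as the note number,
    # so bracketed numbers quoted inside a note are left alone.
    linked_notes = re.sub(
        r"^\[(\d+)\]",
        r'<a href="#ft\1" id="fn\1">[\1]</a>',
        notes,
        flags=re.MULTILINE,
    )
    return linked_body, linked_notes


body = "generally adopted after 1816.[22] From these various causes"
notes = "[22] Appleton, p. 13; Batchelder, pp. 70-73."
linked_body, linked_notes = link_footnotes(body, notes)
print(linked_body)
print(linked_notes)
```

It only handles the markup. Finding the footnotes that went poof, and fixing the numbering, is still the manual pass.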

roger64 · 05-23-2014, 01:59 AM

Quote:

Originally Posted by Hitch

Well...I suppose I think of it as "not," because we have to hand-check them, anyway.

I guess I should clarify if we're all talking about the same thing? roger64, do you mean fully-linked footnotes/endnotes, or...? With us, we tend to end up having to do a large amount of renumbering, because we tend to get a lot of works (not sure why this is), in which the author used an asterisk for items on pages, not numbers. That's a nice PITA. ;-)

Hitch

Yes, I meant exactly that: fully-linked footnotes/endnotes, and it's the same situation for me. I wish I could avoid hand-checking but...
