Old 05-16-2014, 02:02 AM   #1
qsipl
Member
Posts: 23
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
Need text extraction engine for editable PDF

Hi all,

Please suggest the best text extraction engine for getting the exact text out of a good-quality (editable) PDF. We have already tried several versions of ABBYY FineReader. It is useful when the PDF is an image PDF.

But we need the exact text from the PDF. Could you kindly give me any suggestions?


qsipl
Old 05-16-2014, 02:39 AM   #2
Toxaris
Wizard
Posts: 2,959
Karma: 3363559
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
Copy/Paste? Standard free PDF conversion tools?
Old 05-16-2014, 07:20 AM   #3
qsipl
Member
Posts: 23
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
Need text-extraction engine to extract text from editable PDF

Hi Toxaris,

Can you give me more details about that? It would be helpful for me.

Thanks in advance.

Regards,
qsipl
Old 05-16-2014, 07:51 AM   #4
Toxaris
Wizard
Posts: 2,959
Karma: 3363559
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.
Old 05-16-2014, 10:55 AM   #5
mrmikel
Book Twiddler
Posts: 2,028
Karma: 1424487
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards. The goal at the time also matters. Make it exact or merely make it searchable? Searchable has a lot more tolerance.

If it must be exactly the same, then you need to proofread it all word for word... very time-consuming.

If it is current, produced by a word processor then converted to a PDF, then the odds of good text are much higher. But if that were the case, you might be able to get a hold of the original.
Old 05-16-2014, 03:26 PM   #6
Tex2002ans
Evangelist
Posts: 498
Karma: 379915
Join Date: Jul 2012
Device: Nook
Quote:
Originally Posted by mrmikel View Post
Beware! The words do not always match the text. They are only as good as the person running the OCR program and editing the result afterwards.
I believe the original post stated "the good quality(editable) PDF"... I am thinking perhaps that this is just a digitally generated PDF (for example, directly out of LaTeX/InDesign/Word/LibreOffice/etc.).

You should be able to use pdf2txt.py to extract the text directly: http://www.unixuser.org/~euske/python/pdfminer/

Hopefully, the person who originally created the PDF created it as a "tagged PDF". You should then be able to use the "-t tag" to pull the text out relatively cleanly (I am not too sure if tagged PDFs also carry the formatting in the tags as well).

There is also xpdf: http://www.foolabs.com/xpdf/download.html

and Poppler (I believe this was built to expand upon xpdf): http://poppler.freedesktop.org/

You could also try your hand at feeding it into Calibre and seeing what happens (I believe it uses Poppler on the backend?).
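For anyone who would rather script these tools than run them by hand, here is a minimal Python sketch that only builds the command lines for pdf2txt.py and pdftotext, and runs them if the tool is found on the PATH. The helper names (`pdf2txt_cmd`, `pdftotext_cmd`, `run_if_installed`) and the file names are placeholders of mine, and I am assuming the usual `-t`/`-o` options of pdf2txt.py and the `-layout` option of pdftotext; check each tool's help output for your version.

```python
import shutil
import subprocess


def pdf2txt_cmd(pdf_path, out_path, output_type="text"):
    """Build a pdf2txt.py command line.

    output_type="tag" corresponds to the "-t tag" mode mentioned above.
    """
    return ["pdf2txt.py", "-t", output_type, "-o", out_path, pdf_path]


def pdftotext_cmd(pdf_path, out_path, keep_layout=True):
    """Build a pdftotext (xpdf / Poppler) command line."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Ask pdftotext to keep the physical layout of the page as best it can.
        cmd.append("-layout")
    return cmd + [pdf_path, out_path]


def run_if_installed(cmd):
    """Run cmd only if its executable is on PATH.

    Returns True if the tool ran and exited with status 0, otherwise False.
    """
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd).returncode == 0
```

Something like `run_if_installed(pdf2txt_cmd("book.pdf", "book.tagged.txt", "tag"))` would then try the tagged extraction, and you can compare its output against the pdftotext result for the same file. On Windows the script name may need to be adjusted to however pdf2txt.py was installed.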

Quote:
Originally Posted by Toxaris View Post
Select text, copy, paste. Within Adobe Reader (and a lot of other PDF readers) you can also do a save as text.
Saving as Plain Text:
  • Won't save any formatting information.
  • Likely get hard line breaks
  • Likely get missing things like ligatures + unicode characters + dropcaps
  • Potentially get odd spacing issues introduced
  • Lose all slightly more complex objects (tables, formulas, etc. etc.)
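Some of the items in that list (hard line breaks, ligatures, odd spacing) can be partly repaired after the fact. The following is only a rough sketch using the Python standard library; the function name `clean_extracted_text` and the ligature table are my own, and it assumes paragraphs are separated by blank lines in the saved text, which is not always true.

```python
import re

# Common Latin ligature code points that show up in text saved from PDFs.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(text):
    """Undo some typical damage in plain text saved or pasted from a PDF."""
    # Expand ligature characters into plain letters.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # Drop soft hyphens (U+00AD), which are invisible but break searches.
    text = text.replace("\u00ad", "")
    # Rejoin words that were hyphenated across a line break.
    # Caveat: this also joins real compounds such as "power-\nloom".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks and unwrap single newlines.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        paragraph = re.sub(r"\s*\n\s*", " ", paragraph).strip()
        # Collapse runs of spaces or tabs into a single space.
        paragraph = re.sub(r"[ \t]{2,}", " ", paragraph)
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)
```

This does nothing for lost formatting, dropcaps, tables, or formulas; those are simply gone from the plain text and have to come from somewhere else.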

Also, I was just taking a gander at Adobe Acrobat's site, and they have this as a feature in their Pro version:

https://www.adobe.com/products/acrob...converter.html

I doubt it works anywhere close to how they make it seem... and probably only works for documents created with Adobe's own tools. Feed it a file made from something else, and these PDF -> XYZ programs usually explode.

Quote:
Originally Posted by mrmikel View Post
If it must be exactly the same, then you need to proofread it all word for word... very time-consuming.
Indeed indeed. PDF = horrendous input format, avoid it whenever possible.

Saving as plain text or copying/pasting out of the PDF is going to cause a bunch more headaches.

Last edited by Tex2002ans; 05-16-2014 at 03:29 PM.
Old 05-16-2014, 05:42 PM   #7
rraod
Bibliophile
Posts: 11
Karma: 10
Join Date: Mar 2014
Location: Riyadh, KSA
Device: Kindle paperwhite, Nook Simple Touch
Though the Acrobat Professional program is expensive, it has some very good conversion features.

Acrobat Professional will allow you to save a good PDF into HTM, DOC, or RTF format, along with TXT and JPG formats, using the SAVE AS command.

I have tried a few large PDFs with formatted text and images, saved them to HTM format (HTML 4.01 with CSS 1.0), and it gave me an almost exact replica of the PDF. Using Sigil, I could make corrections to the HTM file and create an EPUB file.

The PDF-to-text conversion utilities are useless, as they lose the images and page formatting. The best option would be to convert the PDF to HTML format, which retains the formatting and the images. Try to look for some free PDF-to-HTML utilities on Google and experiment.

One word of caution while trying out these free utilities. They come bundled with unnecessary programs. Select custom install and read the instructions carefully screen after screen while installing these utilities and opt out of any other extra programs the installer tries to put on your system by un-clicking the check-marks. Don't keep pressing the next button repeatedly.

Good Luck!
Old 05-17-2014, 01:23 AM   #8
qsipl
Member
Posts: 23
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
Need text-extraction engine to extract text from editable PDF

Hi all,

Thanks a lot for the responses. I will get back to you soon, after collecting the information from the links you referred to.

Once again, thanks for spending your valuable time on me.

Thanks,
qsipl
Old 05-20-2014, 08:48 PM   #9
Hitch
Bookmaker & Cat Slave
Posts: 2,377
Karma: 12871193
Join Date: Apr 2010
Location: Phoenix, AZ
Device: Kindle2, iPad, KindleFire and NookColor
Quote:
Originally Posted by rraod View Post
Though the Acrobat Professional program is expensive, it has some very good conversion features.

Acrobat Professional will allow you to save a good PDF into HTM, DOC, or RTF format, along with TXT and JPG formats, using the SAVE AS command.

I have tried a few large PDFs with formatted text and images, saved them to HTM format (HTML 4.01 with CSS 1.0), and it gave me an almost exact replica of the PDF. Using Sigil, I could make corrections to the HTM file and create an EPUB file.
You must have been extraordinarily fortunate, or don't mind expending a LOT of time doing clean-up in HTML. I wouldn't use Acrobat Pro's export to ANYTHING feature for anything. The HTML it outputs is filthy. The Word files are just as bad. We have the entire suite of Adobe programs--everything from InDesign to Acrobat Pro, etc., and nothing in Acrobat exports to HTML, Word, etc., worth a damn, in my fairly experienced opinion.

Quote:
The PDF-to-text conversion utilities are useless, as they lose the images and page formatting. The best option would be to convert the PDF to HTML format, which retains the formatting and the images. Try to look for some free PDF-to-HTML utilities on Google and experiment.
Again, if someone is very experienced with regex, this can work, but a TON of cleanup is required.

Quote:
One word of caution while trying out these free utilities. They come bundled with unnecessary programs. Select custom install and read the instructions carefully screen after screen while installing these utilities and opt out of any other extra programs the installer tries to put on your system by un-clicking the check-marks. Don't keep pressing the next button repeatedly.

Good Luck!
I have yet to see any "PDF-->Word" or "PDF-->Anything" converters on the web, whether tools or websites, that work better than ABBYY FineReader. We do this for a living, and if there were ANYTHING out there that captured text and everything else better than ABBYY, regardless of price, we'd use it. The fact that the OP doesn't think that ABBYY does a good enough job tells me that either a) they expect some type of perfect export from the PDF, which is, literally, impossible (as the image layer and the text layer are absolutely, positively, ALWAYS different), or b) they haven't worked with ABBYY very much.

For anyone who thinks that even cutting & pasting works, take a nice big page in PDF--a high-quality, good PDF. Make sure you get some nice question marks, quotation marks, etc., in the selection. Then paste that, NOT into Word, but into Word's "SEARCH FOR" box--and look at what you get. That's what's really being pasted, or exported in the "Save as Word" or "Save as RTF" file options. It's garbage. Can it be cleaned up, with a lot of time by hand and eye? Yes. But it's not "exact," by ANY means. ABBYY, in my experience, is still the best solution, and the worse the PDFs get, the better a solution it is.
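If you don't have Word handy, the same "what is really in the paste" check can be approximated with a few lines of Python. This is only a sketch under my own assumptions: the function name `reveal` is made up, and it simply reports every character outside printable ASCII along with its Unicode name, so you can see curly quotes, no-break spaces, soft hyphens, and control characters that look fine on screen.

```python
import unicodedata


def reveal(text):
    """List every control or non-ASCII character in pasted text.

    Returns a list of (character, code point, Unicode name) tuples,
    in the order the characters appear.
    """
    report = []
    for ch in text:
        if ord(ch) < 32 or ord(ch) > 126:
            # Control characters have no Unicode name, hence the fallback.
            name = unicodedata.name(ch, "<unnamed>")
            report.append((ch, f"U+{ord(ch):04X}", name))
    return report
```

Paste a selection from the PDF into a string and print `reveal(pasted)` line by line; anything in that list is a character you would have to account for before calling the extraction "exact".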

(OP: you may safely rely on anything Tex2002ans tells you about scanning, OCR and clean-up; he's a steely-eyed ePUB pilot. Ditto anything Tox tells you about his tools--they are excellent.)

Just my $.02. Take it for what it's worth--but we've done well over a thousand PDF-->ePUB & MOBI conversions.

Hitch
Old 05-21-2014, 02:20 AM   #10
roger64
Wizard
Posts: 1,424
Karma: 846401
Join Date: Jan 2009
Device: KoboGlo
Quote:
Originally Posted by Hitch View Post
.../... ABBYY, in my experience, is still the best solution, and the worse the PDFs get, the better a solution it is.
I concur.

With ABBYY, do you manage to make it produce real endnotes (and not bookmarks that I must redo)? I may have missed something.
Old 05-22-2014, 02:18 PM   #11
Hitch
Bookmaker & Cat Slave
Posts: 2,377
Karma: 12871193
Join Date: Apr 2010
Location: Phoenix, AZ
Device: Kindle2, iPad, KindleFire and NookColor
Quote:
Originally Posted by roger64 View Post
I concur.

With ABBYY, do you manage to make it produce real endnotes (and not bookmarks that I must redo)? I may have missed something.
In short? No. ;-) I could go into a long discussion of it, but...no. We end up redoing them by hand, or at least, ensuring that they are right, by hand. There's just no footnote substitute yet for hand-coding.

Hitch
Old 05-22-2014, 04:02 PM   #12
Toxaris
Wizard
Posts: 2,959
Karma: 3363559
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.
Old 05-22-2014, 04:55 PM   #13
Hitch
Bookmaker & Cat Slave
Posts: 2,377
Karma: 12871193
Join Date: Apr 2010
Location: Phoenix, AZ
Device: Kindle2, iPad, KindleFire and NookColor
Quote:
Originally Posted by Toxaris View Post
Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.
Well...I suppose I think of it as "not," because we have to hand-check them, anyway.

I guess I should clarify if we're all talking about the same thing? roger64, do you mean fully-linked footnotes/endnotes, or...? With us, we tend to end up having to do a large amount of renumbering, because we tend to get a lot of works (not sure why this is), in which the author used an asterisk for items on pages, not numbers. That's a nice PITA. ;-)

Hitch
Old 05-22-2014, 08:31 PM   #14
Tex2002ans
Evangelist
Posts: 498
Karma: 379915
Join Date: Jul 2012
Device: Nook
Quote:
Originally Posted by rraod View Post
Acrobat Professional will allow you to Save a good pdf in [...] JPG formats using the SAVE AS command.
Ugh... just don't save images of TEXT DOCUMENTS as JPG. (This is one of my huge pet peeves)

I showed off an example of JPG haloing that made me pull my hair out:

http://www.mobileread.com/forums/sho...3&postcount=30

Quote:
Originally Posted by Hitch View Post
You must have been extraordinarily fortunate, or don't mind expending a LOT of time doing clean-up in HTML. I wouldn't use Acrobat Pro's export to ANYTHING feature for anything. The HTML it outputs is filthy. The Word files are just as bad. We have the entire suite of Adobe programs--everything from InDesign to Acrobat Pro, etc., and nothing in Acrobat exports to HTML, Word, etc., worth a damn, in my fairly experienced opinion.
Thanks for the info... I am ALWAYS leery about these programs that convert (ESPECIALLY Adobe's programs, I know they love their bloat, and design their programs to work in THEIR ecosystem, and not play nice with others).

I hunted down a few videos/information trying to see how well the conversion ACTUALLY works, but they were not as technically in-depth as I would like.... or they were just the typical generic marketing/useless fluff that didn't say anything of substance.

I wish I knew of some trustworthy technically-minded review sites.

Quote:
Originally Posted by Toxaris View Post
Well, the Word export of ABBYY gets most footnotes right... It misses some, but that is actually rare.
The HTML/EPUB export MANGLES footnotes.

Finereader tries to create links back/forth, but it:
  • May/may not toss out the actual footnote numbers (no rhyme or reason that I can figure out).
    • I believe it is based on some sort of heuristics of a superscript number/symbol + if it is marked as a "footnote" style by Finereader
  • May or may not "combine" two footnotes into "one".
    • So Finereader sticks 1 auto-number/link, but includes the text for footnotes 1+2 as an endnote.
  • Whole footnote paragraphs may just go poof (again, no rhyme or reason that I can figure out).
    • This is especially true if the footnote is split across pages.
  • Finereader 12 has a very annoying bug that 11 did not have.
    • In certain books, let us say there are 5 footnotes on a page, it will insert five links at the END of the page, instead of where the superscripts actually are in the text.

Here is a real life example of a book I worked on earlier this month:

[Attachments: pg031.png, pg032.png (the two original pages, 31 and 32)]

These two pages get morphed into this on EPUB export:
  • Marked in BLUE, you can see, Finereader tries to auto-insert endnotes + renumber, but mangles it completely.
  • Marked in RED are footnotes that Finereader missed (Footnote 1 on Page 31 + Footnote 1 on page 32 just went poof).
  • Marked in GREEN is where you can see, the second half of Footnote 2 on Page 31 just went poof into thin air.
  • Marked in ORANGE, you can see the superscript went into thin air (because the link in blue = Finereader's auto-numbering).
    • Most of the time the superscript number is removed, but other times, it is STILL left there.

EPUB/HTML Exported from Finereader:

<p>During the years immediately after the war, the aid given in the tariff of 1816 was not sufficient to prevent severe depression in the cotton manufacture. Reference has already been made to the disadvantages which, under the circumstances of the years 1815-18, existed for all manufacturers who had to meet competition from abroad. But when the crisis of 1818-19 had brought about a rearrangement of prices more advantageous for manufacturers, matters began to mend. The minimum duty became more effective in handicapping foreign competitors. At the same time the power-loom was generally introduced. Looms made after an English model were introduced in the factories of Rhode Island, the first going into operation in 1817; while in Massachusetts and New Hampshire the loom invented by Lowell was generally adopted after 1816.<sup>1</sup> From these various causes the manufacture soon became profitable. There is abundant evidence to show that shortly after the crisis the cotton manufacture had fully recovered from the depression that followed the war.<a id="footnote1"></a><sup><a href="#bookmark0">1</a></sup> The profits made were such as to cause a rapid extension of the industry. The beginning of those man-ufacturing villages which now form the characteristic economic feature of New England falls in this period. Nashua was founded in 1823. Fall River, which had grown into some importance during the war of 1814, grew rapidly from 1820 to 1830.<sup>1</sup> By far the most important and the best known of the new ventures in cotton manufacturing was the foundation of the town of Lowell, which was undertaken by the same persons who had been engaged in the establishment of the first power-loom factory at Waltham. The new town was named after the inventor of the power-loom. The scheme of utilizing the falls of the Merrimac, at the point where Lowell now stands, had been suggested as early as 1821, and in the following year the Merrimac Manufacturing Company was incorporated. 
In 1823 manufacturing began, and was profitable from the beginning; and in 1824 the future growth of Lowell was clearly foreseen.<a id="footnote2"></a><sup><a href="#bookmark1">2</a></sup></p>

<p><a id="bookmark0"></a><a href="#footnote1">1</a></p>

<p> The following passage, referring to the general revival of manufactures, may be quoted: “The manufacture of cotton now yields a moderate profit to those who conduct the business with the requisite skill and economy. The extensive factories at Pawtucket are still in operation. ... In Philadelphia it is said that about 4,000 looms have been put in operation within the last six months, which are chiefly engaged in making cotton goods, and that in all probability they will, within six months more, be increased to four times that number. In Paterson, N. J., where, two years ago, only three out of sixteen of its extensive factories were in operation ... all are now in vigorous employment.”—“Niles’s Register,” XXI., 39 (1821). Com-</p>

<p><a id="bookmark1"></a><a href="#footnote2">2</a></p>

<p> See the account in Appleton, pp. 17-25. One of the originators of the enterprise said in 1824: “If our business succeeds, as we have reason to expect, we shall have here [at Lowell] as large a population in twenty</p>

<p>years from this time as there was in Boston twenty years ago.”—Batchel-</p>

<p>der, p. 69.</p>

<p>In Bishop, II., 309, is a list of the manufacturing villages of 1826. in which some twenty places are enumerated.</p>


If you export a large book, the footnote situation only gets much worse because of Finereader's horrible Chapter splitting, so the missing footnotes + Finereader's auto-numbering creates a huge mess.

My current method is just to go through the book and do a manual pass of all of the footnotes. While I am double-checking that all of the text is there, I also just do all of the formatting (blockquotes).
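A small script can at least point the manual pass at the likely trouble spots. The sketch below is not part of Finereader or Sigil; `audit_footnotes` is a hypothetical helper of mine that uses plain regular expressions on the exported HTML. It assumes the export looks like the sample above: real links are written as `href="#..."` pointing at `id="..."` anchors, and missed footnotes are left behind as a bare `<sup>` containing only a number or asterisks.

```python
import re


def audit_footnotes(html):
    """Flag suspicious footnote markup in an exported HTML/EPUB chapter.

    Returns a dict with:
      broken_links      - link targets (#name) that have no matching id
      bare_superscripts - <sup> markers that were never turned into links
    """
    ids = set(re.findall(r'\bid="([^"]+)"', html))
    targets = re.findall(r'href="#([^"]+)"', html)
    broken = sorted({target for target in targets if target not in ids})
    # A superscript whose whole content is digits or asterisks has no link
    # inside it, so the footnote it belongs to was probably missed.
    bare = re.findall(r"<sup>\s*([\d*]+)\s*</sup>", html)
    return {"broken_links": broken, "bare_superscripts": bare}
```

It cannot tell you that half of a footnote went missing or that two notes were merged into one, so it only narrows down where to look; the word-for-word check against the page images still has to be done by eye.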

Anyway, from what I gather, the DOC/ODT export doesn't have much text that magically goes poof, but those two formats come along with their own host of problems/bloat (and I don't have much experience with those formats, since my workflow is OCR -> EPUB/HTML -> Sigil -> completed EPUB).

This is what the text from the two pages looks like in the completed EPUB:

<p>During the years immediately after the war, the aid given in the tariff of 1816 was not sufficient to prevent severe depression in the cotton manufacture. Reference has already been made to the disadvantages which, under the circumstances of the years 1815–18, existed for all manufacturers who had to meet competition from abroad. But when the crisis of 1818–19 had brought about a rearrangement of prices more advantageous for manufacturers, matters began to mend. The minimum duty became more effective in handicapping foreign competitors. At the same time the power-loom was generally introduced. Looms made after an English model were introduced in the factories of Rhode Island, the first going into operation in 1817; while in Massachusetts and New Hampshire the loom invented by Lowell was generally adopted after 1816.<a href="#fn22" id="ft22">[22]</a> From these various causes the manufacture soon became profitable. There is abundant evidence to show that shortly after the crisis the cotton manufacture had fully recovered from the depression that followed the war.<a href="#fn23" id="ft23">[23]</a> The profits made were such as to cause a rapid extension of the industry. The beginning of those manufacturing villages which now form the characteristic economic feature of New England falls in this period. Nashua was founded in 1823. Fall River, which had grown into some importance during the war of 1814, grew rapidly from 1820 to 1830.<a href="#fn24" id="ft24">[24]</a> By far the most important and the best known of the new ventures in cotton manufacturing was the foundation of the town of Lowell, which was undertaken by the same persons who had been engaged in the establishment of the first power-loom factory at Waltham. The new town was named after the inventor of the power-loom. 
The scheme of utilizing the falls of the Merrimac, at the point where Lowell now stands, had been suggested as early as 1821, and in the following year the Merrimac Manufacturing Company was incorporated. In 1823 manufacturing began, and was profitable from the beginning; and in 1824 the future growth of Lowell was clearly foreseen.<a href="#fn25" id="ft25">[25]</a></p>

[...]

<p><a href="#ft22" id="fn22">[22]</a> Appleton, p. 13; Batchelder, pp. 70–73.</p>

<p><a href="#ft23" id="fn23">[23]</a> The following passage, referring to the general revival of manufactures, may be quoted: “The manufacture of cotton now yields a moderate profit to those who conduct the business with the requisite skill and economy. The extensive factories at Pawtucket are still in operation. . . . In Philadelphia it is said that about 4,000 looms have been put in operation within the last six months, which are chiefly engaged in making cotton goods, and that in all probability they will, within six months more, be increased to four times that number. In Paterson, N.J., where, two years ago, only three out of sixteen of its extensive factories were in operation ... all are now in vigorous employment.”—“Niles’s Register,” XXI., 39 (1821). Compare <i>Ibid</i>., XXII., 225, 250 (1822); XXIII., 35, 88 (1823); and <i>passim</i>. In Woodbury’s cotton report, cited above, it is said (p. 57) that “there was a great increase [in cotton manufacturing] in 1806 and 1807; again during the war of 1812; again from 1820 to 1825; and in 1831–32.”</p>

<p><a href="#ft24" id="fn24">[24]</a> Fox’s “History of Dunstable”; Earl’s “History of Fall River.” p. 20 <i>seq</i>.</p>

<p><a href="#ft25" id="fn25">[25]</a> See the account in Appleton, pp. 17–25. One of the originators of the enterprise said in 1824: “If our business succeeds, as we have reason to expect, we shall have here [at Lowell] as large a population in twenty years from this time as there was in Boston twenty years ago.”—Batchelder, p. 69.</p>

<p>In Bishop, II., 309, is a list of the manufacturing villages of 1826. in which some twenty places are enumerated.</p>


Anyway, as you can see, PDFs cause a whole host of formatting problems when trying to get it from PDF -> XYZ (particularly with split paragraphs, hard/soft hyphens, footnotes, headers/footers, numbered lists, tables, captions, etc. etc.).
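For the renumbering part (per-page markers such as 1, 2 or * turned into book-wide linked markers like [22], [23]), a regex substitution can do the mechanical half once the text itself has been verified. This is a sketch under assumptions of my own: `link_markers` is a made-up helper, it expects every in-text marker to survive as a bare `<sup>` holding digits or asterisks, and it follows the `ft`/`fn` id pattern from my completed EPUB above. The footnote bodies still have to be matched up and given their `fn` ids separately.

```python
import re


def link_markers(html, start=1):
    """Replace bare <sup> footnote markers with sequentially numbered links.

    Returns (new_html, last_number_used). Markers are numbered in document
    order, beginning at start, regardless of what number or symbol they had.
    """
    counter = start - 1

    def replace_marker(match):
        nonlocal counter
        counter += 1
        return f'<a href="#fn{counter}" id="ft{counter}">[{counter}]</a>'

    # Match superscripts that contain only digits or only asterisks.
    new_html = re.sub(r"<sup>\s*(?:\d+|\*+)\s*</sup>", replace_marker, html)
    return new_html, counter
```

Passing the returned last number plus one as `start` for the next chapter keeps the numbering continuous across files. If the OCR dropped a superscript, every later number will be off by one, which is another reason to audit first and renumber last.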

Last edited by Tex2002ans; 05-22-2014 at 08:48 PM. Reason: Added some Spoiler Tags for the code.
Old 05-23-2014, 12:59 AM   #15
roger64
Wizard
Posts: 1,424
Karma: 846401
Join Date: Jan 2009
Device: KoboGlo
Quote:
Originally Posted by Hitch View Post
Well...I suppose I think of it as "not," because we have to hand-check them, anyway.

I guess I should clarify if we're all talking about the same thing? roger64, do you mean fully-linked footnotes/endnotes, or...? With us, we tend to end up having to do a large amount of renumbering, because we tend to get a lot of works (not sure why this is), in which the author used an asterisk for items on pages, not numbers. That's a nice PITA. ;-)

Hitch
Yes. I meant exactly fully-linked footnotes/endnotes, and it's the same situation for me. I wish I could avoid hand-checking but...