Quote:
Originally Posted by famfam
My problem is to extract the footnotes from pdf or even from epub:
The footnotes at the end of a page are disturbing the ebook production from pdf or even from FR. Is there any solution how to find and extract the footnotes and collect them in a word document?
|
For PDF/Scanned Footnotes?
ABBYY Finereader.
Then a whole lot of elbow grease, tools, and Regular Expressions.
I've written about the process I use many times over the years, for example:
I've used those methods to digitize over 700 books (dense, heavily footnoted, mostly Non-Fiction).
For more info, also see many other topics describing the footnote/endnote problems and steps on how to mitigate the issues:
For EPUB Footnotes?
Every single one is going to have uniquely messy code, which you have to reverse-engineer and then write Regular Expressions to correct.
- - -
If you want even more fun, there was:
where I described:
- - -
Quote:
Originally Posted by famfam
And after this operation put the collected footnotes as a whole list at the end of the book?
I did that many times manually and successfully.
|
Yes, this is the method I ultimately settled on, and first hinted at in 2020 and 2021:
Quote:
Originally Posted by Tex2002ans
A while back, I created a little Python program to move all footnotes to the bottom of the HTML file.
It's very helpful when dealing with footnotes mixed right in the middle of the text (mostly due to OCR).
* * *
And on Footnotes vs. Endnotes:
I think it's best to have Footnotes at the bottom of each HTML file. This ensures:
- each chapter is "standalone"
- Easily posted on a website as a single article, etc.
[...]
|
See the original topic for even more methods/discussion:
In it, I also described the common OCR problem of:
- Half the footnotes get "auto-detected"
- and moved to the back of the book.
- Half the footnotes don't "get detected"
- and are left in the middle of the text.
This is super common with multi-page footnotes—but can randomly happen ANYWHERE in ANY book!
... and there's absolutely no way to correct this stuff without a keen eye and elbow grease.
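A script can't fix this for you, but it can at least point your eye at the suspicious spots. Here's a minimal sketch (not my actual tooling) that walks the superscript numbers in a file and flags every place the 1, 2, 3... run breaks. The function name and the bare `<sup>N</sup>` markup are assumptions for illustration; real books will also trip it on things like math exponents, so treat the output as a list of places to look, nothing more.

```python
import re

# Assumption: footnote markers are plain <sup>N</sup> (no attributes).
SUP_NUM_RE = re.compile(r"<sup>(\d+)</sup>")

def find_numbering_breaks(html: str) -> list:
    """Return (position, expected, found) for each marker that breaks the sequence."""
    breaks = []
    expected = 1
    for match in SUP_NUM_RE.finditer(html):
        found = int(match.group(1))
        if found != expected:
            breaks.append((match.start(), expected, found))
        # Carry on from whatever number was actually found.
        expected = found + 1
    return breaks
```

Each tuple tells you where in the file to look, which number should have been there, and which number actually was. You still have to open the file and decide what really happened.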
Partial Solution? Rip and Pull + Renumber Footnotes!
To solve this problem and massively speed-up my workflow...
I've since created 2 separate footnote helper programs for myself:
- Program 1 detects an HTML class.
- Rips it out and pulls it to the end of the file.
- Program 2 detects <sup>+number OR an HTML class+number.
- Renumbers footnotes starting from 1.
- Linkifies everything back/forth.
Program 1's general steps are:
- Find "footnote code" somewhere in text.
- Move to end of file.
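To make the Program 1 idea concrete, here's a rough sketch. It is not my actual program, only an illustration under assumptions: every footnote is a single, non-nested `<p class="footnote">` paragraph (the class name is hypothetical, use whatever your OCR/conversion tool spits out), and the function name is made up. It rips each footnote out of the text and re-appends them, in their original order, after an `<hr/>` at the end of the file.

```python
import re

# Assumption: each footnote is one non-nested <p class="footnote">...</p>.
FOOTNOTE_RE = re.compile(r'[ \t]*<p class="footnote">.*?</p>[ \t]*\n?', re.DOTALL)

def pull_footnotes_to_end(html: str) -> str:
    """Remove every footnote paragraph and re-append them, in order, at the end."""
    notes = [m.group(0).strip() for m in FOOTNOTE_RE.finditer(html)]
    if not notes:
        return html
    body = FOOTNOTE_RE.sub("", html)
    # The <hr/> marks where the footnote section starts.
    block = "<hr/>\n" + "\n".join(notes) + "\n"
    # Insert before </body> if present, otherwise at the very end of the text.
    idx = body.rfind("</body>")
    if idx == -1:
        return body.rstrip("\n") + "\n" + block
    return body[:idx] + block + body[idx:]
```

A regex like this falls over on nested tags or multi-paragraph footnotes, and running it twice on the same file adds a second `<hr/>`. For messy input you would probably want a real HTML parser instead.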
Program 2's general steps are:
- Find "superscript numbers" in text.
- Begin renumbering from scratch.
- Make sure to generate links pointing ONE WAY.
- Hit a flipping point.
- Example: I place a <hr/> before all footnote sections.
- Reset numbering.
- Find "superscripts", and begin renumbering THOSE.
- Make sure to generate links pointing OTHER WAY.
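The Program 2 steps above can be sketched roughly like this. Again, this is an illustration and not my real code: it assumes every marker is a bare `<sup>N</sup>`, that each note after the `<hr/>` starts with its own `<sup>N</sup>`, and the `ref1`/`fn1` id scheme and function name are invented for the example.

```python
import re

# Assumption: markers are bare <sup>N</sup> with no existing links around them.
SUP_RE = re.compile(r"<sup>\d+</sup>")

def renumber_and_link(html: str, flip: str = "<hr/>") -> str:
    """Renumber markers from 1 on each side of the flip point and cross-link them."""
    text, sep, notes = html.partition(flip)

    def numbered(template: str):
        counter = 0  # a fresh counter per call, so numbering resets at the flip
        def repl(_match):
            nonlocal counter
            counter += 1
            return template.format(n=counter)
        return repl

    # Before the flip point: links point DOWN to the notes.
    text = SUP_RE.sub(numbered('<a id="ref{n}" href="#fn{n}"><sup>{n}</sup></a>'), text)
    # After the flip point: links point back UP to the text.
    notes = SUP_RE.sub(numbered('<a id="fn{n}" href="#ref{n}"><sup>{n}</sup></a>'), notes)
    return text + sep + notes
```

Note that it pairs markers purely by order: the 3rd superscript in the text gets linked to the 3rd superscript after the `<hr/>`. If a footnote is missing or duplicated, everything after it will be linked to the wrong note, so the counts on both sides need to match before you run something like this.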
Program 1 will help:
- When footnotes get split and mixed all up in your text.
- Correct PDF / OCR / DOCX -> EPUB errors.
- (Or ugly EPUB->EPUB errors.)
- Gather a chapter's/book's notes in a single location (at the end of the files!)
- (Making it easier for a human to correct next stages!)
Program 2 will help:
- Correct all the busted up numbering
- + Linkify your footnotes back/forth!
because, depending on which tools you use, the original footnote's numbering will disappear (or get mangled)!
So you'll get stuff like:
- <sup>1</sup>
- <a href=""><sup>1</sup></a>
- <a href=""><sup>2</sup></a>
- <sup>4</sup>
where:
- Footnotes 1 and 4 WERE NOT auto-detected
- so got left behind in the middle of the text.
- Footnotes 2 and 3 WERE auto-detected
- and auto-linked for you!
- ... But the conversion tool thought there were only 2 footnotes in the book, so they accidentally got renumbered to "1" and "2"!!!
So you'll have to rearrange/renumber your footnotes/endnotes to:
- <sup>1</sup>
- <sup>2</sup>
- <sup>3</sup>
- <sup>4</sup>
and then relink everything from scratch.
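As a small illustration of that cleanup step (a sketch only, with a made-up function name): strip whatever stale `<a>` wrappers the conversion tool left around the superscripts, then number every marker in the order it appears. It assumes the link wraps the `<sup>` directly, exactly like the mangled example above.

```python
import re

# Matches a linked marker (<a ...><sup>N</sup></a>) or a bare one (<sup>N</sup>).
MARKER_RE = re.compile(r"<a\b[^>]*><sup>\d+</sup></a>|<sup>\d+</sup>")

def flatten_and_renumber(html: str) -> str:
    """Drop stale link wrappers and renumber all markers sequentially from 1."""
    counter = 0
    def repl(_match):
        nonlocal counter
        counter += 1
        return f"<sup>{counter}</sup>"
    return MARKER_RE.sub(repl, html)
```

Fed the four mangled markers from the example, it gives back plain 1, 2, 3, 4. Run it on the text and on the notes separately, otherwise the notes just carry on counting from where the text left off.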
- - -
Side Note: For more Program 1 and 2 info/code examples, and common footnote problems/patterns I've noticed across books, see:
- - -
Quote:
Originally Posted by famfam
But if the number of footnotes is very big it's exhausting and not practical.
Any idea what can one do?
|
Pray to the ebook gods!!!
- - -
Side Note: Another ultimate footnote topic to read is:
which described the different messes you'll run across, and how to code footnotes cleanly/properly to save yourself mountains of pain/headaches in the future.
Side Note #2: Anyway, if you want even more information, I highly recommend typing this into your favorite search engine:
- footnotes superscript Tex2002ans site:mobileread.com
- footnotes Tex2002ans site:mobileread.com
- endnotes Tex2002ans site:mobileread.com
I've written over 200 topics over the years discussing every single aspect of this digitizing-footnotes-in-ebooks problem.