Old 06-20-2023, 02:57 PM   #1
famfam
Connoisseur
Posts: 77
Karma: 2178856
Join Date: Oct 2013
Device: Kobo Clara HD
How to extract footnotes from pdf?

My problem is extracting the footnotes from a PDF, or even from an EPUB:
The footnotes at the bottom of each page get in the way when producing an ebook from a PDF, or even from FR. Is there any way to find and extract the footnotes and collect them in a Word document? And, after this operation, to put the collected footnotes as a single list at the end of the book?
I have done that manually many times, successfully. But if the number of footnotes is very large, it's exhausting and not practical.
Any idea what one can do?

(I put the thread here because I found no better place.)
Old 06-25-2023, 10:22 AM   #2
famfam
Connoisseur
09-04-2021, 10:47 PM
Trenchant Edges [New Plugin Development Plan] Extracting footnotes/endnotes and Indexing dates
Link to this Thread of Trenchant Edges:
https://www.mobileread.com/forums/sh...ract+footnotes

Does anybody know what happened with that project?
Was there any success or solution?
Old 07-07-2023, 02:14 AM   #3
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by famfam View Post
My problem is extracting the footnotes from a PDF, or even from an EPUB:
The footnotes at the bottom of each page get in the way when producing an ebook from a PDF, or even from FR. Is there any way to find and extract the footnotes and collect them in a Word document?
For PDF/Scanned Footnotes?

ABBYY Finereader.

Then a whole lot of elbow grease, tools, and Regular Expressions.

I've written about the process I use many times over the years, for example:

I've used those methods to digitize over 700 books (dense, heavily footnoted, mostly Non-Fiction).

For more info, also see many other topics describing the footnote/endnote problems and steps on how to mitigate the issues:

For EPUB Footnotes?

Every single one is going to have uniquely messy code, which you have to reverse-engineer before you can come up with the Regular Expressions to correct it.

- - -

If you want even more fun, there was:

where I described:

- - -

Quote:
Originally Posted by famfam View Post
And, after this operation, to put the collected footnotes as a single list at the end of the book?

I did that many times manually and successfully.
Yes, this is the method I ultimately settled on and first hinted at in 2020 + 2021:

Quote:
Originally Posted by Tex2002ans View Post
A while back, I created a little Python program to move all footnotes to the bottom of the HTML file.

It's very helpful when dealing with footnotes mixed right in the middle of the text (mostly due to OCR).


* * *

And on Footnotes vs. Endnotes:

I think it's best to have Footnotes at the bottom of each HTML file. This ensures:
  • each chapter is "standalone"
    • Easily posted on a website as a single article, etc.

[...]
See the original topic for even more methods/discussion:

In it, I also described the common OCR problem of:
  • Half the footnotes get "auto-detected"
    • and moved to the back of the book.
  • Half the footnotes don't "get detected"
    • and are left in the middle of the text.

This is super common with multi-page footnotes—but can randomly happen ANYWHERE in ANY book!

... and there's absolutely no way to correct this stuff without a keen eye and elbow grease.

Partial Solution? Rip and Pull + Renumber Footnotes!

To solve this problem and massively speed-up my workflow...

I've since created 2 separate footnote helper programs for myself:
  • Program 1 detects an HTML class.
    • Rips it out and pulls it to the end of the file.
  • Program 2 detects <sup>+number OR an HTML class+number.
    • Renumbers footnotes starting from 1.
    • Linkifies everything back/forth.

Program 1's general steps are:
  • Find "footnote code" somewhere in text.
  • Move to end of file.
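
Program 1's two steps can be sketched roughly like this. This is only a minimal illustration, not the actual program: it assumes every footnote is a non-nested <p class="footnote"> paragraph (the class name is hypothetical, and real files will need their own pattern), and it drops an <hr/> in front of the collected notes as a separator.

```python
import re

# Assumption: footnotes are non-nested <p class="footnote">...</p> paragraphs.
# The class name is hypothetical; adjust the pattern to the file at hand.
FOOTNOTE_RE = re.compile(r'<p class="footnote">.*?</p>\s*', re.DOTALL)


def pull_footnotes_to_end(html: str) -> str:
    """Rip footnote paragraphs out of the text and re-append them at the end."""
    notes = [m.group(0).strip() for m in FOOTNOTE_RE.finditer(html)]
    if not notes:
        return html  # nothing detected, leave the file untouched

    body = FOOTNOTE_RE.sub("", html)
    block = "<hr/>\n" + "\n".join(notes) + "\n"

    # Put the notes just before </body> when present, else at the very end.
    if "</body>" in body:
        return body.replace("</body>", block + "</body>", 1)
    return body + "\n" + block


sample = (
    "<body>\n<p>One.</p>\n"
    '<p class="footnote">1 Note A.</p>\n'
    "<p>Two.</p>\n"
    '<p class="footnote">2 Note B.</p>\n'
    "</body>"
)
moved = pull_footnotes_to_end(sample)
print(moved)
```

The notes keep their original order, so a human can still check them against the text before any renumbering happens.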

Program 2's general steps are:
  • Find "superscript numbers" in text.
  • Begin renumbering from scratch.
    • Make sure to generate links pointing ONE WAY.
  • Hit a flipping point.
    • Example: I place a <hr/> before all footnote sections.
  • Reset numbering.
  • Find "superscripts", and begin renumbering THOSE.
    • Make sure to generate links pointing OTHER WAY.
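
As a rough sketch of Program 2's steps (again, not the actual program), the following assumes all markers are already bare <sup>N</sup> tags, that the flipping point is the first <hr/> in the file, and that the ids "refN" / "fnN" are free to use. Those id names are made up for the example.

```python
import re

SUP_RE = re.compile(r"<sup>\d+</sup>")


def _numbered(template: str):
    """Build a replacement function that counts up from 1 on each call."""
    counter = 0

    def repl(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return template.format(n=counter)

    return repl


def renumber_and_link(html: str, flip: str = "<hr/>") -> str:
    """Renumber <sup> markers on each side of the flipping point and cross-link them."""
    text, sep, notes = html.partition(flip)

    # Before the flipping point: links point ONE WAY (reference -> note).
    text = SUP_RE.sub(_numbered('<a id="ref{n}" href="#fn{n}"><sup>{n}</sup></a>'), text)
    # After the flipping point: numbering resets, links point the OTHER WAY.
    notes = SUP_RE.sub(_numbered('<a id="fn{n}" href="#ref{n}"><sup>{n}</sup></a>'), notes)
    return text + sep + notes


sample = (
    "<p>A<sup>1</sup> B<sup>1</sup> C<sup>4</sup></p>"
    "<hr/>"
    "<p><sup>7</sup> x</p><p><sup>7</sup> y</p><p><sup>9</sup> z</p>"
)
linked = renumber_and_link(sample)
print(linked)
```

Note that this pairs markers purely by order of appearance. If a footnote is missing on either side of the <hr/>, every later link will be off by one, so the counts on both sides should be checked by eye first.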

Program 1 will help:
  • When footnotes get split and mixed all up in your text.
  • Correct PDF / OCR / DOCX -> EPUB errors.
    • (Or ugly EPUB->EPUB errors.)
  • Gather a chapter's/book's notes in a single location (at the end of the files!)
    • (Making it easier for a human to correct next stages!)

Program 2 will help:
  • Correct all the busted up numbering
  • + Linkify your footnotes back/forth!

because, depending on which tools you use, the original footnote's numbering will disappear (or get mangled)!

So you'll get stuff like:
  • <sup>1</sup>
  • <a href=""><sup>1</sup></a>
  • <a href=""><sup>2</sup></a>
  • <sup>4</sup>

where:
  • Footnotes 1 and 4 WERE NOT auto-detected
    • so got left behind in the middle of the text.
  • Footnotes 2 and 3 WERE auto-detected
    • and auto-linked for you!
    • ... But the conversion tool only thought there were 2 footnotes in the book, so they accidentally got renumbered to "1" and "2"!!!

So you'll have to rearrange/renumber your footnotes/endnotes to:
  • <sup>1</sup>
  • <sup>2</sup>
  • <sup>3</sup>
  • <sup>4</sup>

and then relink everything from scratch.
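
To make the 1, 1, 2, 4 example concrete, here is a small sketch of just the "throw away the stale links and renumber" step. It assumes a linked marker is always exactly an <a ...> wrapped directly around a <sup>N</sup>; anything messier will need its own Regular Expression.

```python
import re

# Either a stale auto-generated link around a marker, or a bare marker.
MARKER_RE = re.compile(r"<a\b[^>]*><sup>\d+</sup></a>|<sup>\d+</sup>")


def normalize_markers(html: str) -> str:
    """Strip link wrappers from footnote markers and renumber them from 1."""
    counter = 0

    def repl(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"<sup>{counter}</sup>"

    return MARKER_RE.sub(repl, html)


messy = (
    'a<sup>1</sup> b<a href=""><sup>1</sup></a> '
    'c<a href=""><sup>2</sup></a> d<sup>4</sup>'
)
clean = normalize_markers(messy)
print(clean)
```

After this pass every marker is a plain <sup>N</sup> again, which is the starting point the relinking step needs.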

- - -

Side Note: For more Program 1 and 2 info/code examples, and common footnote problems/patterns I've noticed across books, see:

- - -

Quote:
Originally Posted by famfam View Post
But if the number of footnotes is very large, it's exhausting and not practical.

Any idea what can one do?
Pray to the ebook gods!!!

- - -

Side Note: Another ultimate footnote topic to read is:

which described the different messes you'll run across, and how to code footnotes cleanly/properly to save yourself mountains of pain/headaches in the future.

Side Note #2: Anyway, if you want even more information, I highly recommend typing this into your favorite search engine:
  • footnotes superscript Tex2002ans site:mobileread.com
  • footnotes Tex2002ans site:mobileread.com
  • endnotes Tex2002ans site:mobileread.com

I've written over 200 topics over the years discussing every single aspect of this digitizing-footnotes-in-ebooks problem.

Old 07-10-2023, 03:33 PM   #4
famfam
Connoisseur
That's very interesting and, I hope, very helpful for me.
I apologize for not reading your reply until today.
I will try to understand it and work with it, step by step. We will see what kind of result comes out of that.
Thank you so much for your fantastic dedication to EPUBs. I value that very highly.