MobileRead Forums - View Single Post - Fully Automated ebook file parsing, ISBN extraction, Title Extraction and metadata

isbnread · 02-20-2017, 10:20 AM

Why is there no software that goes through a directory, converts each PDF, EPUB, or other format to text, then aggressively searches the text for the ISBN, title, etc., and corrects the ebook's metadata? It could also extract the images and run Tesseract OCR on them to check whether the title can be deduced. Library of Congress entries are also a good source.
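To make the "aggressively search the text for an ISBN" step concrete, here is a minimal sketch using only the Python standard library. The function names (`find_isbns`, `is_valid_isbn10`, `is_valid_isbn13`) and the regular expression are my own illustration, not taken from any existing tool. Every candidate is checked against the standard ISBN-10 and ISBN-13 check digits so that phone numbers and similar digit runs are thrown away.

```python
import re

# Candidate: 10 digits (last may be X), optionally preceded by 978/979,
# with single hyphens or spaces allowed between digits.
CANDIDATE = re.compile(
    r"(?<![\dXx])(?:97[89][\- ]?)?(?:\d[\- ]?){9}[\dXx](?![\dXx])",
    re.ASCII,
)


def is_valid_isbn10(s):
    """Check digit rule: weighted sum (weights 10..1) must be divisible by 11."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check digit rule: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return valid ISBNs found in text, normalized, in order of appearance."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[\- ]", "", match.group()).upper()
        if is_valid_isbn10(digits) or is_valid_isbn13(digits):
            if digits not in found:
                found.append(digits)
    return found
```

Calling `find_isbns(text)` on the text of a copyright page would return the normalized ISBNs, which could then be used to look up the title and author in a catalog.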

Parsing PDFs can also be done with Python modules, for even more effective automatic library cleaning.
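As a rough sketch of the directory walk and text conversion, something like the following might work. Assumptions: the EPUB part uses only the standard library and simply reads every HTML/XHTML file inside the archive in name order (not the reading order from the OPF spine); the PDF part assumes the third-party `pypdf` package is installed; `scan_library`, `epub_to_text`, and `pdf_to_text` are hypothetical names chosen for illustration.

```python
import zipfile
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML/XHTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def epub_to_text(path):
    """An EPUB is a ZIP archive; pull the text out of its (X)HTML files."""
    parts = []
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                collector.feed(archive.read(name).decode("utf-8", errors="replace"))
                collector.close()
                parts.append(" ".join(collector.chunks))
    return "\n".join(parts)


def pdf_to_text(path):
    """Assumes the third-party pypdf package; imported lazily so EPUBs work without it."""
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


EXTRACTORS = {".epub": epub_to_text, ".pdf": pdf_to_text}


def scan_library(root):
    """Map each supported ebook under root to its text, or None if extraction failed."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        extractor = EXTRACTORS.get(path.suffix.lower())
        if extractor is None or not path.is_file():
            continue
        try:
            results[path] = extractor(path)
        except Exception:
            # Broken file or missing pypdf: record the failure and move on.
            results[path] = None
    return results
```

The text returned for each file could then be searched for an ISBN or title. Scanned PDFs with no text layer would come back empty, which is where the OCR idea would have to take over.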

02-20-2017, 10:20 AM	#1
isbnread Junior Member Posts: 1 Karma: 10 Join Date: Feb 2017 Device: kindle