04-26-2017, 07:41 PM   #1
Neck Beard
Member
[Request] extract/conversion/save tool for machine learning

Wanted: Something that will process books from my library and return a book object with the following attributes:

Code:
class Book(object):

    def __init__(self, file):
        # Should file be epub, htmlz, or txt?
        self.read(file)

    def read(self, file):
        self.author = 'Author Name'
        self.title = 'Book Title'
        # List of lists, where each inner list is the text of one chapter,
        # including prologues and epilogues.
        # Desired result, in pseudocode (book and chapters are not defined yet):
        #   [[p.text for p in ch] for ch in book.find_all(chapters)]
        self.text = []

[Considerations]
Does something like this exist already?
What is the best way of doing this?

[Options]
Extract from the "Calibre Library" folder directly? That would mean extracting straight from epubs, since most of my ebooks are in epub. Downsides: I don't know how to parse epub files, so I don't know how to define the read function to get them into self.text. Working in the Calibre library folder could also screw up the library if I'm not careful. Upside: no unnecessary conversion or saving steps.

Convert, then extract? Convert all books to htmlz, then extract. Downsides: conversion time, and parsing. Upside: I could probably define a read function with BeautifulSoup that would parse the books.

Something else? Converting and saving to txt files in a new folder would be much slower but wouldn't risk screwing up the library folder. A read function could easily get the author and title info, but how would I strip away unwanted text and save the rest to a list of chapters?
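
For the first option, here is a minimal sketch of reading an epub directly with only the Python standard library. It is not calibre's own code, and the name read_epub is made up for illustration. It assumes a DRM-free epub whose content files are well-formed XHTML without undeclared entities such as &nbsp;, and whose manifest hrefs are not percent-encoded. It returns one list of paragraphs per spine document, which is often but not always one chapter. The file is only opened for reading.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    'c': 'urn:oasis:names:tc:opendocument:xmlns:container',
    'opf': 'http://www.idpf.org/2007/opf',
    'dc': 'http://purl.org/dc/elements/1.1/',
}
XHTML_P = '{http://www.w3.org/1999/xhtml}p'


def read_epub(path):
    """Return (title, author, spine_texts) for a DRM-free epub.

    read_epub is a hypothetical helper, not part of calibre.
    spine_texts holds one list of paragraph strings per spine document.
    """
    with zipfile.ZipFile(path) as zf:  # default mode is read-only
        # META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(zf.read('META-INF/container.xml'))
        opf_path = container.find('.//c:rootfile', NS).get('full-path')
        opf = ET.fromstring(zf.read(opf_path))

        title = opf.findtext('.//dc:title', default='', namespaces=NS)
        author = opf.findtext('.//dc:creator', default='', namespaces=NS)

        # The manifest maps item ids to hrefs, relative to the OPF file.
        base = posixpath.dirname(opf_path)
        hrefs = {item.get('id'): item.get('href')
                 for item in opf.iterfind('.//opf:manifest/opf:item', NS)}

        # The spine lists the content documents in reading order.
        spine_texts = []
        for ref in opf.iterfind('.//opf:spine/opf:itemref', NS):
            name = posixpath.join(base, hrefs[ref.get('idref')])
            doc = ET.fromstring(zf.read(name))
            paragraphs = [''.join(p.itertext()).strip()
                          for p in doc.iter(XHTML_P)]
            spine_texts.append([p for p in paragraphs if p])
    return title, author, spine_texts
```

In the Book class above, read could then do something like self.title, self.author, self.text = read_epub(file). Books that split one chapter across several files, or put several chapters in one file, would need extra handling.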
04-26-2017, 11:33 PM   #2
kovidgoyal
creator of calibre
If you want to extract text, the easiest way to do it is to convert to txt.
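
A rough sketch of that route, under some assumptions: calibre's ebook-convert command-line tool is on the PATH, the txt output separates paragraphs with blank lines, and chapter headings sit on their own lines as "Prologue", "Epilogue", or "Chapter" followed by a number or word. The helper names (convert_command, convert_to_txt, split_chapters) and the heading pattern are made up for illustration and will need tuning per library.

```python
import re
import subprocess

# Assumed heading style; real books vary a lot, so adjust this pattern.
HEADING = re.compile(r'^(?:prologue|epilogue|chapter\s+\w+)\b.*$',
                     re.IGNORECASE | re.MULTILINE)


def convert_command(src, dst):
    """Build the ebook-convert call; the output format follows dst's extension."""
    return ['ebook-convert', src, dst]


def convert_to_txt(src, dst):
    """Run the conversion; needs calibre's command-line tools on the PATH."""
    subprocess.run(convert_command(src, dst), check=True)


def split_chapters(text):
    """Split plain text into chapters, each a list of paragraph strings.

    Anything before the first heading (title page, copyright notice)
    is dropped. With no headings at all, the whole text is one chapter.
    """
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        starts = [0]
    chapters = []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        # Paragraphs are assumed to be separated by blank lines.
        paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text[begin:end])]
        chapters.append([p for p in paragraphs if p])
    return chapters
```

Usage would be along the lines of convert_to_txt('book.epub', 'out/book.txt'), then split_chapters(open('out/book.txt', encoding='utf-8').read()). Writing the txt files to a separate folder keeps the library folder untouched.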

If you want to do it using calibre APIs, then you will need to spend the time to familiarize yourself with them. The "Setting up a calibre development environment" section in the user manual tells you how to get started.

You basically need to run the input format plugin on the file; then you can use a calibre.ebooks.oeb.polish.container.Container object to access the contents of the result of running the input format plugin.
All times are GMT -4.