Old 04-26-2017, 06:41 PM   #1
Neck Beard
[Request] extract/conversion/save tool for machine learning

Wanted: Something that will process books from my library and return a book object with the following attributes:

Code:
class Book(object):

    def __init__(self, file):
        # Should file be epub, htmlz, or txt?
        self.read(file)

    def read(self, file):
        self.author = 'Author Name'
        self.title = 'Book Title'
        # List of lists: one inner list of paragraph texts per chapter,
        # including prologues and epilogues. Pseudocode for the goal, where
        # `book` is the parsed file and `chapters` selects chapter elements:
        # self.text = [[p.text for p in ch] for ch in book.find_all(chapters)]
        self.text = []
[Considerations]
Does something like this exist already?
What is the best way of doing this?

[Options]
Extract from the "Calibre Library" folder directly? That would mean extracting straight from EPUBs, since most of my ebooks are in EPUB. Downsides: I don't know how to parse EPUB files, so I don't know how to define the read function to get them into self.text. Working in the Calibre library folder could also screw up the library if I'm not careful. Upside: no unnecessary extra conversion or saving steps.

Convert, then extract? Convert all books to HTMLZ, then extract. Downsides: conversion time and parsing. Upside: I could probably define a read function with BeautifulSoup that would parse the books.

Something else? Converting and saving to txt files in a new folder would be much slower but wouldn't risk screwing up the library folder. A read function could easily get the author and title info, but how would I strip away unwanted text and save the rest as a list of chapters?
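
For the first option, here is a minimal sketch of reading an EPUB directly using only the Python standard library. An EPUB is a zip archive whose META-INF/container.xml points at an OPF package file holding the metadata, manifest, and spine. The sketch assumes well-formed XHTML and treats each spine document as one "chapter", which will not hold for every book. The archive is opened read-only, so the library copy is not modified.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XHTML_P = "{http://www.w3.org/1999/xhtml}p"


class Book(object):

    def __init__(self, file):
        self.read(file)

    def read(self, file):
        # Mode "r" opens the archive read-only.
        with zipfile.ZipFile(file, "r") as zf:
            # META-INF/container.xml names the OPF package document.
            container = ET.fromstring(zf.read("META-INF/container.xml"))
            opf_path = container.find(".//c:rootfile", NS).get("full-path")
            opf = ET.fromstring(zf.read(opf_path))

            self.title = opf.findtext(".//dc:title", default="", namespaces=NS)
            self.author = opf.findtext(".//dc:creator", default="", namespaces=NS)

            # The manifest maps ids to files; the spine lists ids in reading order.
            hrefs = {item.get("id"): item.get("href")
                     for item in opf.iterfind(".//opf:manifest/opf:item", NS)}
            base = posixpath.dirname(opf_path)

            # Assumption: one spine document is treated as one chapter.
            self.text = []
            for ref in opf.iterfind(".//opf:spine/opf:itemref", NS):
                href = hrefs.get(ref.get("idref"))
                if href is None:
                    continue
                doc = ET.fromstring(zf.read(posixpath.join(base, unquote(href))))
                paragraphs = ["".join(p.itertext()).strip() for p in doc.iter(XHTML_P)]
                self.text.append([p for p in paragraphs if p])
```

Real books often contain HTML entities or sloppy markup that a strict XML parser rejects, so a tolerant parser such as BeautifulSoup would probably be needed in place of ElementTree for the chapter documents.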
Old 04-26-2017, 10:33 PM   #2
kovidgoyal
creator of calibre
If you want to extract text, the easiest way to do it is to convert to txt.
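
A rough sketch of that route: calibre ships an ebook-convert command-line tool that takes an input file and an output file, so a script can write TXT copies into a separate folder and leave the library untouched. The chapter splitting below is only a heuristic. The heading pattern in CHAPTER_RE and the assumption that paragraphs are separated by blank lines are guesses that would need adjusting per book, and convert_to_txt and split_chapters are illustrative names.

```python
import re
import subprocess
from pathlib import Path

# Assumed heading pattern (hypothetical); adjust it to the books at hand.
CHAPTER_RE = re.compile(
    r"^(?:chapter\s+\w+|prologue|epilogue)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def convert_to_txt(src, out_dir):
    """Write a TXT copy of src into out_dir using calibre's ebook-convert."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / (Path(src).stem + ".txt")
    # ebook-convert must be on PATH; check=True raises if conversion fails.
    subprocess.run(["ebook-convert", str(src), str(dest)], check=True)
    return dest


def split_chapters(text):
    """Return a list of chapters, each a list of non-empty paragraphs."""
    starts = [m.start() for m in CHAPTER_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any front matter as its own chunk
    chapters = []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        # Assumption: paragraphs are separated by one or more blank lines.
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text[begin:end])]
        paragraphs = [p for p in paragraphs if p]
        if paragraphs:
            chapters.append(paragraphs)
    return chapters
```

Usage would look something like `split_chapters(convert_to_txt("book.epub", "txt_out").read_text(encoding="utf-8"))`, where both paths are placeholders.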

If you want to do it using calibre APIs, then you will need to spend the time to familiarize yourself with them. The "Setting up a calibre development environment" section in the user manual tells you how to get started.

You basically need to run the input format plugin on the file, then you can use the calibre.ebooks.oeb.polish.container.Container object to access the contents of the result of running the input format plugin.
All times are GMT -4. The time now is 07:30 PM.