[Request] extract/conversion/save tool for machine learning

Neck Beard · 04-26-2017, 06:41 PM

Wanted: Something that will process books from my library and return a book object with the following attributes:

Code:

class Book(object):

    def __init__(self, file):
        # should file be epub, htmlz, or txt?
        self.read(file)

    def read(self, file):
        self.author = 'Author Name'
        self.title = 'Book Title'
        # list of lists, one per chapter (including prologues and epilogues);
        # `book` and `chapters` are placeholders for the parsed file and a chapter selector
        self.text = [[p.text for p in ch] for ch in book.find_all(chapters)]

[Considerations]
Does something like this exist already?
What is the best way of doing this?

[Options]
Extract from the "Calibre Library" folder directly? This would mean extracting straight from epubs, since most of my ebooks are in epub. Downsides: I don't know how to parse epub files, so I don't know how to define the read function to get them into self.text. Working in the Calibre library folder could screw up the library if I'm not careful. Upside: no unnecessary extra conversion or saving steps.

Convert then extract? Convert all books to htmlz, then extract. Downsides: conversion time and parsing. Upside: I could probably define a read function with BeautifulSoup that would parse the books.

Something else? Converting and saving to txt files in a new folder would be much slower but wouldn't risk screwing up the library folder. In the read function it would be easy to get author and title info, but how do I strip away unwanted text and save the rest to a list of chapters?
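On the "don't know how to parse epub files" point: an EPUB is a zip archive whose META-INF/container.xml points to an OPF package file, and that file carries the metadata, manifest, and spine (reading order). The following is only a minimal standard-library sketch of the first option, not a calibre API. It assumes each spine document is one chapter, that content files are UTF-8, and that paragraphs are marked up as <p> elements; real books often break all three. The names read_epub and ParagraphCollector are made up for illustration. The archive is opened read-only, so nothing in the library folder is modified.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element in one content document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._parts).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


def read_epub(path):
    """Return (title, author, chapters); chapters is a list of paragraph lists."""
    with zipfile.ZipFile(path) as zf:  # default mode is read-only
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        title = opf.findtext("opf:metadata/dc:title", default="", namespaces=NS)
        author = opf.findtext("opf:metadata/dc:creator", default="", namespaces=NS)
        manifest = {item.get("id"): item.get("href")
                    for item in opf.findall("opf:manifest/opf:item", NS)}
        base = posixpath.dirname(opf_path)
        chapters = []
        # Assumption: one spine document == one chapter.
        for ref in opf.findall("opf:spine/opf:itemref", NS):
            href = unquote(manifest[ref.get("idref")])
            name = posixpath.normpath(posixpath.join(base, href))
            parser = ParagraphCollector()
            parser.feed(zf.read(name).decode("utf-8"))
            parser.close()
            if parser.paragraphs:
                chapters.append(parser.paragraphs)
    return title, author, chapters
```

Inside Book.read this could be used as `self.title, self.author, self.text = read_epub(file)`. Front matter, tables of contents, and copyright pages will show up as "chapters" too, so some filtering is still needed.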

kovidgoyal · 04-26-2017, 10:33 PM

If you want to extract text the easiest way to do it is to convert to txt.
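A rough sketch of the convert-to-txt route, assuming calibre's ebook-convert command-line tool is on the PATH (it takes an input file and an output file). The function names and folder layout here are hypothetical. Output goes to a separate folder, so the library itself is only read.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir):
    """Build the ebook-convert command that writes <out_dir>/<stem>.txt."""
    src = Path(src)
    dst = Path(out_dir) / (src.stem + ".txt")
    return ["ebook-convert", str(src), str(dst)]


def convert_library(library_dir, out_dir):
    """Convert every EPUB found under library_dir to txt in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for epub in sorted(Path(library_dir).rglob("*.epub")):
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(build_convert_command(epub, out_dir), check=True)
```

Two books with the same file name would overwrite each other in out_dir, so a real script would need a naming scheme that avoids collisions.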

If you want to do it using calibre APIs, then you will need to spend the time to familiarize yourself with them. The "Setting up a calibre development environment" section in the user manual tells you how to get started.

You basically need to run the input format plugin on the file; then you can use a calibre.ebooks.oeb.polish.container.Container object to access the contents of the result.
