04-26-2017, 06:41 PM | #1 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2012
Device: none
|
[Request] extract/conversion/save tool for machine learning
Wanted: Something that will process books from my library and return a book object with the following attributes:
Code:
    class Book(object):
        def __init__(self, file):
            # should file be epub, htmlz, or txt?
            self.read(file)

        def read(self, file):
            self.author = 'Author Name'
            self.title = 'Book Title'
            # list of lists, one per chapter (including prologues and epilogues),
            # each holding the text of that chapter's paragraphs;
            # book and chapters are placeholders, not defined yet
            self.text = [[p.text for p in ch] for ch in book.find_all(chapters)]

Does something like this exist already? What is the best way of doing this?

[Options]

Extract from the "Calibre Library" folder directly? This would mean extracting straight from EPUBs, since most of my ebooks are in EPUB. Downsides: I don't know how to parse EPUB files, so I don't know how to define a read function that gets them into self.text. Working in the Calibre library folder could also screw up the library if I'm not careful. Upside: no unnecessary extra conversion or saving steps.

Convert then extract? Convert all books to HTMLZ, then extract. Downsides: conversion time and parsing. Upside: I could probably define a read function with BeautifulSoup that would parse the books.

Something else? Converting and saving to txt files in a new folder would be much slower but wouldn't risk screwing up the library folder. A read function could easily get the author and title info, but how would I strip away unwanted text and save the rest as a list of chapters?
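For the first option, one possible starting point is sketched below. It relies only on the EPUB layout itself: an EPUB is a zip archive whose META-INF/container.xml points to an OPF package file, and that file holds the metadata and the reading order (the spine). The sketch uses the Python standard library instead of BeautifulSoup or calibre, opens the file read-only so the library folder is not modified, and assumes that each spine item is one chapter and that body text lives in p elements. Neither assumption holds for every book (covers and tables of contents are spine items too), so treat read_epub and ParagraphCollector as hypothetical helpers, not a finished tool.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

# XML namespaces used by the EPUB container and OPF package files
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <p> in one XHTML file."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0      # > 0 while inside a <p>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def read_epub(path):
    """Return (title, author, chapters) from an EPUB without modifying it."""
    with zipfile.ZipFile(path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))

        title = opf.findtext(".//dc:title", default="", namespaces=NS)
        author = opf.findtext(".//dc:creator", default="", namespaces=NS)

        # manifest maps item ids to file paths relative to the OPF file
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.iterfind(".//opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)

        chapters = []
        for itemref in opf.iterfind(".//opf:spine/opf:itemref", NS):
            href = manifest.get(itemref.get("idref"))
            if href is None:
                continue
            name = posixpath.normpath(posixpath.join(base, unquote(href)))
            collector = ParagraphCollector()
            collector.feed(zf.read(name).decode("utf-8", errors="replace"))
            collector.close()
            if collector.paragraphs:  # skip spine items with no paragraphs
                chapters.append(collector.paragraphs)
    return title, author, chapters


class Book:
    def __init__(self, path):
        self.title, self.author, self.text = read_epub(path)
```

With this, Book('some.epub').text would be a list of chapters, each a list of paragraph strings. Filtering out front matter such as the cover or table of contents would still need per-book rules.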
04-26-2017, 10:33 PM | #2 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you want to extract text, the easiest way to do it is to convert to txt.
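As a rough sketch of that route: calibre ships a command-line converter, ebook-convert, which takes an input file and an output file and picks the formats from the extensions. The script below walks a folder for .epub files and writes one .txt per book into a separate folder, so the library itself is only read. The helper names (build_convert_command, convert_library) and the folder layout are assumptions for illustration, and ebook-convert is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir):
    """Build the ebook-convert call that writes <out_dir>/<name>.txt."""
    src = Path(src)
    dst = Path(out_dir) / (src.stem + ".txt")
    return ["ebook-convert", str(src), str(dst)]


def convert_library(library_dir, out_dir):
    """Convert every .epub under library_dir; return the .txt paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(Path(library_dir).rglob("*.epub")):
        cmd = build_convert_command(src, out_dir)
        # check=True raises CalledProcessError if a conversion fails
        subprocess.run(cmd, check=True)
        converted.append(Path(cmd[2]))
    return converted
```

Note that two books with the same file name would overwrite each other in the output folder, so a real script would want a more unique naming scheme. Splitting the resulting text into chapters is a separate problem that this sketch does not address.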
If you want to do it using calibre APIs, then you will need to spend the time to familiarize yourself with them. The "Setting up a calibre development environment" section of the user manual tells you how to get started. You basically need to run the input format plugin on the file; then you can use a calibre.ebooks.oeb.polish.container.Container object to access the contents of the result.