MobileRead Forums - View Single Post - [Request] extract/conversion/save tool for machine learning

Neck Beard · 04-26-2017, 07:41 PM

Wanted: Something that will process books from my library and return a book object with the following attributes:

Code:

class Book(object):

    def __init__(self, file):
        # Should file be epub, htmlz, or txt?
        self.read(file)

    def read(self, file):
        self.author = 'Author Name'
        self.title = 'Book Title'
        # List of lists: one inner list of paragraph strings per chapter,
        # including prologues and epilogues.
        # `book` and `chapters` are placeholders for the parsed file and
        # whatever selector identifies a chapter; neither is defined yet.
        self.text = [[p.text for p in ch] for ch in book.find_all(chapters)]

[Considerations]
Does something like this exist already?
What is the best way of doing this?

[Options]
Extract from the "Calibre Library" folder directly? That would mean extracting straight from epubs, since most of my ebooks are in epub. Downsides: I don't know how to parse epub files, so I don't know how to define a read function that gets them into self.text. Working inside the Calibre library folder could also screw up the library if I'm not careful. Upside: no unnecessary extra conversion or saving steps.
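
For reference, a rough sketch of what direct extraction might look like, using only the standard library and opening the file read-only so nothing in the library folder is modified. It relies on the EPUB layout (META-INF/container.xml pointing at an OPF file with metadata, manifest and spine). The function name read_epub is made up for illustration, one spine document is treated as one chapter (an assumption that many books break), and strict XML parsing will fail on files that use HTML-only entities such as &nbsp; or percent-encoded hrefs.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XHTML_P = "{http://www.w3.org/1999/xhtml}p"


def read_epub(path):
    """Hypothetical helper: return (author, title, text).

    text is a list of paragraph lists, one per spine document.
    The archive is opened read-only, so the source file is not changed.
    """
    with zipfile.ZipFile(path, "r") as zf:
        # container.xml names the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))

        title = opf.findtext("opf:metadata/dc:title", default="", namespaces=NS)
        author = opf.findtext("opf:metadata/dc:creator", default="", namespaces=NS)

        # The manifest maps item ids to file paths relative to the OPF file.
        hrefs = {item.get("id"): item.get("href")
                 for item in opf.findall("opf:manifest/opf:item", NS)}
        base = posixpath.dirname(opf_path)

        text = []
        # The spine lists the content documents in reading order.
        for ref in opf.findall("opf:spine/opf:itemref", NS):
            name = posixpath.join(base, hrefs[ref.get("idref")])
            doc = ET.fromstring(zf.read(name))
            paras = ["".join(p.itertext()).strip() for p in doc.iter(XHTML_P)]
            text.append([p for p in paras if p])
    return author, title, text
```

Inside Book.read this could be used as self.author, self.title, self.text = read_epub(file).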

Convert then extract? Convert all books to htmlz, then extract. Downsides: conversion time, and the parsing still has to be written. Upside: I could probably define a read function with BeautifulSoup that would parse the books.
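
A minimal sketch of the parsing half of this option, assuming the conversion has already been done (for example with Calibre's ebook-convert command line tool) and that the htmlz archive stores the book as index.html, which is what Calibre's HTMLZ output has used as far as I know. To stay dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup. ChapterParser and chapters_from_htmlz are hypothetical names, and starting a new chapter at every h1 or h2 is a guess that will not hold for every book.

```python
import zipfile
from html.parser import HTMLParser


class ChapterParser(HTMLParser):
    """Group <p> text into chapters, starting a new chapter at each heading.

    Assumption: chapters begin at <h1> or <h2>; real books vary.
    """

    def __init__(self, heading_tags=("h1", "h2")):
        super().__init__(convert_charrefs=True)
        self.heading_tags = heading_tags
        self.chapters = []
        self._buf = None  # collects text while inside a <p>

    def _flush(self):
        # Store the paragraph collected so far, if any.
        if self._buf is None:
            return
        para = "".join(self._buf).strip()
        self._buf = None
        if para:
            if not self.chapters:  # text before the first heading
                self.chapters.append([])
            self.chapters[-1].append(para)

    def handle_starttag(self, tag, attrs):
        if tag in self.heading_tags:
            self._flush()
            self.chapters.append([])
        elif tag == "p":
            self._flush()
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()


def chapters_from_htmlz(path):
    """Hypothetical helper: return a list of chapters, each a list of paragraphs."""
    with zipfile.ZipFile(path, "r") as zf:
        html = zf.read("index.html").decode("utf-8")
    parser = ChapterParser()
    parser.feed(html)
    parser.close()
    # Headings with no paragraphs under them produce empty chapters; drop them.
    return [ch for ch in parser.chapters if ch]
```

Heading text itself is skipped, and anything before the first heading ends up in its own leading chapter, so front matter would still need to be filtered out.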

Something else? Converting and saving to txt files in a new folder would be much slower but wouldn't risk screwing up the library folder. A read function could easily get the author and title info, but how would it strip away unwanted text and save the rest as a list of chapters?
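
One possible way to handle the txt case, sketched under heavy assumptions: chapter headings sit on their own line and start with "Prologue", "Epilogue" or "Chapter <something>", and paragraphs are separated by blank lines. Everything before the first heading is treated as unwanted front matter and dropped. HEADING and split_chapters are made-up names, and the pattern will also match an ordinary line that happens to begin with one of those words.

```python
import re

# Assumed heading style: a line beginning with Prologue, Epilogue or "Chapter N".
HEADING = re.compile(
    r"^[ \t]*(?:prologue|epilogue|chapter[ \t]+\w+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(raw):
    """Hypothetical helper: split plain text into a list of paragraph lists."""
    matches = list(HEADING.finditer(raw))
    chapters = []
    for i, m in enumerate(matches):
        # A chapter body runs from the end of its heading line
        # to the start of the next heading (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        body = raw[m.end():end]
        # Blank lines separate paragraphs; collapse hard-wrapped lines.
        paras = [" ".join(p.split()) for p in re.split(r"\n\s*\n", body)]
        chapters.append([p for p in paras if p])
    return chapters
```

If a book has no matching headings the result is an empty list, so a fallback (for example treating the whole file as one chapter) would be needed.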
