MobileRead Forums - View Single Post

jorm · 03-13-2012, 12:39 PM

Do you think we could identify a paragraph in a book by looking for several sentences that are placed together and end with punctuation? We could use simple rules for the number of words per sentence and counts of proper nouns. I might try to devise an algorithm, run it on a sample of books, and see whether it extracts a real sentence. I can do PDF, EPUB, HTML and text, since I can read those directly. I know calibre can read MOBI, but I have not figured out how to read it programmatically yet.
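To make the idea concrete, here is a rough Python sketch of the first option, assuming the book has already been converted to plain text with blank lines between paragraphs. The thresholds (MIN_SENTENCES, MIN_WORDS, MAX_WORDS) and all the function names are made up for illustration and would need tuning on real books. The proper-noun count is just "capitalised word that is not the first word of the sentence", which is crude but cheap.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
MIN_SENTENCES = 3   # a "real" paragraph needs at least this many good sentences
MIN_WORDS = 5       # shortest sentence we accept
MAX_WORDS = 60      # longest sentence we accept

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # split after ., ! or ?
WORD = re.compile(r"[A-Za-z']+")


def split_sentences(paragraph):
    """Split a paragraph into sentences on terminal punctuation."""
    return [s.strip() for s in SENTENCE_END.split(paragraph.strip()) if s.strip()]


def count_proper_nouns(sentence):
    """Count capitalised words, skipping the first word of the sentence."""
    words = WORD.findall(sentence)
    return sum(1 for w in words[1:] if w[0].isupper())


def looks_like_prose(paragraph):
    """True if enough sentences end in punctuation and have a sane word count."""
    good = [
        s for s in split_sentences(paragraph)
        if s[-1] in ".!?" and MIN_WORDS <= len(WORD.findall(s)) <= MAX_WORDS
    ]
    return len(good) >= MIN_SENTENCES


def first_prose_paragraph(text):
    """Return the first blank-line-separated block that looks like prose, or None."""
    for block in re.split(r"\n\s*\n", text):
        paragraph = " ".join(block.split())  # collapse line wraps and extra spaces
        if paragraph and looks_like_prose(paragraph):
            return paragraph
    return None
```

Title pages, chapter headings and short copyright lines fail the sentence-count and word-count rules, so the function skips ahead to the first block of ordinary prose. The proper-noun count is not used for the decision here; it could be stored alongside the paragraph as one of the metrics.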

If not, we can move more into the fuzzy logic of word frequencies and proper names. This approach would be more interesting but would require a lot more design and programming. With this approach, do we try to process the header info as well, or do we still try to make our way to the content?
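One possible reading of the word-frequency approach, sketched in Python. Cosine similarity is my own pick here as a cheap way to compare two word-count profiles; it is only one of several measures that could work. The function names and the 0.9 threshold are hypothetical, and the threshold is exactly the "how much variance do we allow" knob that would have to come from real data.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")


def word_frequencies(text):
    """Lower-case the text and count how often each word occurs."""
    return Counter(WORD.findall(text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word-count profiles, from 0.0 to 1.0."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


def probably_same_book(text_a, text_b, threshold=0.9):
    """Hypothetical match rule: profiles closer than the threshold count as the same book."""
    return cosine_similarity(word_frequencies(text_a), word_frequencies(text_b)) >= threshold
```

Because it only compares counts, a single OCR slip or a slightly different edition lowers the score a little instead of breaking the match outright, which is the point of going fuzzy.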

Perhaps if we get the first option working, we can capture some metrics and then set up the service where we store them. We can then do some analysis on the stored metrics to determine how much fuzzy variance to allow.

I do want it to be computationally cheap, because I want to encourage people to help populate the service with the data they have already sorted and tagged, so others can benefit.

I can do the backend service and database. If someone is familiar with Python and plugin development and is willing to help, that would be great.
