Old 10-09-2007, 10:24 PM   #16
kovidgoyal
creator of calibre
Posts: 45,404
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Since I'm on a roll, here's one for HTML. It might need a little adjustment.

Code:
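import re
from hyphenate import hyphenate_word as hyphenate

def process_text(match):
    # Insert a soft hyphen (U+00AD) at each hyphenation point of every word.
    text = re.sub(r'\S+', lambda m: '\u00ad'.join(hyphenate(m.group())), match.group(1))
    # The match includes the surrounding '>' and '<', so put them back.
    return '>' + text + '<'

# Assumes the file is UTF-8; change the encoding to match your HTML.
with open('file', encoding='utf-8') as f:
    src = f.read()
# Only text between a '>' and the next '<' is touched, never the tags.
result = re.sub(r'>([^><]+)<', process_text, src)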
Old 10-09-2007, 11:29 PM   #17
Goshzilla
Zealot
Posts: 104
Karma: 346
Join Date: Oct 2007
Device: Rocket Ebook 1150
Okay, I can definitely see the advantages of hyphenating from the HTML file, because I have to deal with Gutenmark, a program that was written specifically under the assumption that Gutenberg text is formatted in a specific way.
Old 10-10-2007, 01:26 AM   #18
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by Goshzilla
Okay, I can definitely see the advantages of hyphenating from the HTML file, because I have to deal with Gutenmark, a program that was written specifically under the assumption that Gutenberg text is formatted in a specific way.
The simplest way to add soft hyphens doesn't even require any programming. You could use a sed script to do a global substitution of each word with its soft-hyphenated equivalent. This would be a quick and dirty way to accomplish the goal, particularly if you really only want to hyphenate longer words.
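For anyone who wants to try the substitution idea without sed, here is a minimal sketch of the same thing in Python. It is only an illustration, not something posted in this thread: the `SOFT_HYPHENATED` table and its two entries are hypothetical, and a real table would have to be written by hand or generated. Only words listed in the table are touched, which is one way to limit the changes to longer words.

```python
import re

SOFT_HYPHEN = '\u00ad'

# Hypothetical lookup table: each word maps to its pre-hyphenated form.
SOFT_HYPHENATED = {
    'hyphenation': SOFT_HYPHEN.join(['hy', 'phen', 'ation']),
    'substitution': SOFT_HYPHEN.join(['sub', 'sti', 'tu', 'tion']),
}

# One alternation of all listed words, matched as whole words only.
WORD_RE = re.compile(r'\b(' + '|'.join(map(re.escape, SOFT_HYPHENATED)) + r')\b')

def substitute(text):
    """Replace every listed word with its soft-hyphenated equivalent."""
    return WORD_RE.sub(lambda m: SOFT_HYPHENATED[m.group(1)], text)

sample = 'A global substitution adds hyphenation points.'
# Show the soft hyphens as visible '-' characters for inspection.
print(substitute(sample).replace(SOFT_HYPHEN, '-'))
```

The match is case-sensitive, so capitalised forms would need their own entries.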
Old 10-10-2007, 11:22 PM   #19
Goshzilla
Zealot
I've noticed some bizarre hyphenations coming out of the program. For instance, words like "your" sometimes come out as "y-our"; the same goes for "s-mall" and "u-sual", and in some instances plural forms come out as "word-s" or "travel-s". When I run the hyphenation Python script on individual words like that, it doesn't return erroneous results, but when I take an entire line and use

a = f.readline()
b = a.split()
for word in b:
    a = re.sub(word, '^'.join(hyphenate(word)), a)

the line hyphenates words incorrectly. I think this has something to do with the way .sub works by finding a "pattern" in the string, so once a line has been altered, sub doesn't quite work as well because I'm looking at a new string.

Update: It's the dang punctuation that messes it up. If I don't remove the punctuation marks before calling the hyphenation method, the word gets hyphenated incorrectly. I could either alter the simple script I wrote or alter the hyphenate code.

I could use a line like a=re.sub("[,;:.!]", '', a) to remove the punctuation marks, but when there is an apostrophe I would like to extract only the word preceding the apostrophe, so that something like "Washington's" becomes "Washington".
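One way around both the punctuation and the apostrophe problem is to hyphenate only runs of letters, in a single pass over the line, rather than stripping punctuation first and substituting word by word. The sketch below is an assumption about how that could look, not the code used in this thread. `toy_hyphenate` and its table are hypothetical stand-ins for `hyphenate_word`, included only so the example runs on its own.

```python
import re

# Stand-in for hyphenate_word from the hyphenate module, so the sketch
# runs on its own. The table entries are illustrative only.
_TOY_SPLITS = {'Washington': ['Wash', 'ing', 'ton'], 'usual': ['usu', 'al']}

def toy_hyphenate(word):
    return _TOY_SPLITS.get(word, [word])

# A run of letters only: no digits, underscores, punctuation or apostrophes.
LETTERS_RE = re.compile(r'[^\W\d_]+')

def hyphenate_line(line, hyphenate=toy_hyphenate, sep='^'):
    """Hyphenate each run of letters in a single pass over the line."""
    return LETTERS_RE.sub(lambda m: sep.join(hyphenate(m.group())), line)

print(hyphenate_line("Washington's usual reply, your small word."))
# prints: Wash^ing^ton's usu^al reply, your small word.
```

Because the hyphenator only ever sees "Washington" and "s" separately, the apostrophe and the trailing punctuation stay exactly where they were, and text that has already been altered is never matched a second time.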

Last edited by Goshzilla; 10-10-2007 at 11:33 PM.
Old 10-10-2007, 11:49 PM   #20
wallcraft
reader
Posts: 6,977
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
Quote:
Originally Posted by NatCh
Do the em dashes in question have spaces on either side of them?

I.e. "word — word" vs. "word—word"

The reason I ask is that the absence of spaces may make the parser think it's supposed to be a single word — I think the first is the "technically" proper usage, anyway ....
Spaces are usually included (or they used to be) in Britain, but in the US the spaces are usually omitted. See the Wikipedia article on Dash.
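Since both conventions turn up in real texts, a script that splits on whitespace alone will see "word\u2014word" as one token. A small sketch of splitting on em dashes as well, so both styles give the same words; `split_words` is a hypothetical helper, not part of any script in this thread:

```python
import re

EM_DASH = '\u2014'

def split_words(line):
    """Split on whitespace and em dashes, keeping the separators."""
    # The capturing group makes re.split return the separators too, so
    # ''.join(parts) reproduces the original line exactly.
    return [p for p in re.split(r'(\s+|' + EM_DASH + r')', line) if p]

for sample in ('word \u2014 word', 'word\u2014word'):
    # Keep only the real words: drop whitespace runs and the dash itself.
    words = [p for p in split_words(sample) if p.strip() and p != EM_DASH]
    print(words)
```

Keeping the separators in the list means the line can be put back together unchanged after each word has been processed.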
Old 10-11-2007, 10:35 AM   #21
NatCh
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
Thanks for the primer, wallcraft — all this time I thought I was being non-standard, and I just learned I've been developing my own "house style!"
Old 10-11-2007, 12:28 PM   #22
kovidgoyal
creator of calibre
If you do modify hyphenate, send me the modified code.

Quote:
Originally Posted by Goshzilla
I've noticed some bizarre hyphenations coming out of the program. For instance, words like "your" sometimes come out as "y-our"; the same goes for "s-mall" and "u-sual", and in some instances plural forms come out as "word-s" or "travel-s". When I run the hyphenation Python script on individual words like that, it doesn't return erroneous results, but when I take an entire line and use

a = f.readline()
b = a.split()
for word in b:
    a = re.sub(word, '^'.join(hyphenate(word)), a)

the line hyphenates words incorrectly. I think this has something to do with the way .sub works by finding a "pattern" in the string, so once a line has been altered, sub doesn't quite work as well because I'm looking at a new string.

Update: It's the dang punctuation that messes it up. If I don't remove the punctuation marks before calling the hyphenation method, the word gets hyphenated incorrectly. I could either alter the simple script I wrote or alter the hyphenate code.

I could use a line like a=re.sub("[,;:.!]", '', a) to remove the punctuation marks, but when there is an apostrophe I would like to extract only the word preceding the apostrophe, so that something like "Washington's" becomes "Washington".
Old 10-13-2007, 08:18 PM   #23
Goshzilla
Zealot
I never did modify the existing hyphenation algorithm. I only know basic coding and basic objects; I am still learning about writing my own search trees and hash tables, and I still haven't figured out all the little nuances in Python just yet.

So here is what I wrote:
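# process.py
import re
from hyphenate import hyphenate_word as hyphenate

def process(y):
    if y == '\n':
        return y
    # Strip punctuation from a copy so hyphenate() only ever sees bare words.
    b = re.sub(r'[,()*;:!?."\[\]{}<>]', '', y)
    for word in b.split():
        # Match the word literally and as a whole word only, so text that
        # an earlier pass already altered is not hyphenated a second time.
        pattern = r'(?<![\w^])' + re.escape(word) + r'(?![\w^])'
        hyphenated = '^'.join(hyphenate(word))
        y = re.sub(pattern, lambda m: hyphenated, y)
    return y


# main script
from process import process

# Assumes a Latin-1 (or plain ASCII) text file; adjust the encoding if needed.
with open('gltrv10.txt', encoding='latin-1') as f, \
        open('r23.txt', 'w', encoding='latin-1') as g:
    for line in f:
        g.write(process(line))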
Old 10-13-2007, 10:12 PM   #24
kovidgoyal
creator of calibre
Cool, glad you could fix your problem. Keep learning Python; it's a rewarding experience.
Old 04-08-2010, 12:47 AM   #25
Goshzilla
Zealot
I just thought I would revive this really old thread, because I discovered that the Nook does support soft hyphens. Only sort of, though: it breaks the word at the point where the soft hyphen is, but it does not insert a "-" at the break. Still, that's better than nothing.

Now I would like to be able to edit the contents of epub files to include hyphenations for every word found within the body of the text. I imagine this "BeautifulSoup" program will be what I need to accomplish such a task?
Old 04-08-2010, 01:05 AM   #26
kovidgoyal
creator of calibre
Yes, BeautifulSoup can be used for that.
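BeautifulSoup only covers the HTML side of the job. The EPUB itself is a ZIP archive, so the HTML files inside have to be read out, changed and written back. Here is a minimal sketch of that outer step using only the standard library's `zipfile` module. The names `rewrite_epub` and `transform` are hypothetical, and the sketch assumes the HTML members are UTF-8.

```python
import io
import zipfile

def rewrite_epub(src, dst, transform):
    """Copy an EPUB, passing each (X)HTML member through transform(text)."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, 'w') as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.lower().endswith(('.html', '.htm', '.xhtml')):
                # Assumes UTF-8 content; other encodings need decoding first.
                data = transform(data.decode('utf-8')).encode('utf-8')
            # Members are written in their original order and with their
            # original compression, so 'mimetype' stays first and stored.
            zout.writestr(info.filename, data, compress_type=info.compress_type)

# Build a tiny in-memory archive to try it out.
src = io.BytesIO()
with zipfile.ZipFile(src, 'w') as z:
    z.writestr('mimetype', 'application/epub+zip', compress_type=zipfile.ZIP_STORED)
    z.writestr('ch1.html', '<p>hyphenation</p>', compress_type=zipfile.ZIP_DEFLATED)

dst = io.BytesIO()
rewrite_epub(src, dst, lambda text: text.replace('hyphenation', 'hy\u00adphen\u00adation'))
print(zipfile.ZipFile(dst).namelist())
```

In practice `transform` would be whatever function hyphenates the text of one HTML file, and `src` and `dst` would be file paths rather than in-memory buffers.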
Old 04-08-2010, 05:56 PM   #27
Goshzilla
Zealot
So I've been working on this all morning and I have hit a stumbling block.

I can extract text from anything within the <p> tags, but I have a problem writing a loop to hyphenate a sentence. The issue is that there are character entities like "&rdquo;", which is the entity for a right double quotation mark. I can strip out the text from a paragraph, but it still leaves entities like that intact.
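One way to keep entities like "&rdquo;" out of the hyphenator is to let an HTML parser report them as separate events and copy them through untouched. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup, purely to stay dependency-free. The names `Hyphenator`, `hyphenate_html` and `toy_hyphenate` are hypothetical, and `toy_hyphenate` is only a stand-in for `hyphenate_word`. Note that it would also hyphenate the contents of `<script>` and `<style>` elements, which a real tool should skip.

```python
import re
from html.parser import HTMLParser

SOFT_HYPHEN = '\u00ad'
LETTERS_RE = re.compile(r'[^\W\d_]+')

class Hyphenator(HTMLParser):
    """Re-emit HTML unchanged except for text nodes, which are hyphenated."""

    def __init__(self, hyphenate):
        # convert_charrefs=False reports &rdquo; and friends as separate
        # events instead of merging them into the text.
        super().__init__(convert_charrefs=False)
        self.hyphenate = hyphenate
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

    def handle_pi(self, data):
        self.out.append('<?%s>' % data)

    def handle_data(self, data):
        # Only runs of letters are handed to the hyphenator.
        self.out.append(LETTERS_RE.sub(
            lambda m: SOFT_HYPHEN.join(self.hyphenate(m.group())), data))

def hyphenate_html(html, hyphenate):
    parser = Hyphenator(hyphenate)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

def toy_hyphenate(word):
    # Stand-in for hyphenate_word; knows a single word.
    return ['quo', 'ta', 'tion'] if word == 'quotation' else [word]

print(repr(hyphenate_html('<p class="x">a quotation&rdquo;</p>', toy_hyphenate)))
```

The entity comes out exactly as it went in, and only the text between the tags gains soft hyphens.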


Update:

I got it done, finally. It isn't the most efficient code in the world, but it can now scan through an entire directory for HTML files and modify them one by one, inserting soft hyphens. There is, however, one issue I'd like to get around: I hit a stumbling block when I try to edit an HTML file that is encoded in something other than ASCII.

Last edited by Goshzilla; 04-08-2010 at 07:24 PM.
Old 04-08-2010, 10:23 PM   #28
kovidgoyal
creator of calibre
Use the xml_to_unicode function in calibre and pass BeautifulSoup the resulting unicode string.
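For readers without calibre at hand, the general idea can be sketched with the standard library alone: decode the raw bytes to a string before parsing, trying the encoding the file declares first. This is not calibre's implementation, and `decode_html`, its fallback list and the 1024-byte search window are all assumptions made for the example.

```python
import re

# Encoding declared in an XML prolog or an HTML <meta> tag, if any.
DECLARED_RE = re.compile(rb'''(?:encoding|charset)\s*=\s*["']?([-\w]+)''', re.I)

def decode_html(raw, fallbacks=('utf-8', 'cp1252')):
    """Decode raw bytes to str, trying the declared encoding first."""
    candidates = []
    m = DECLARED_RE.search(raw[:1024])
    if m:
        candidates.append(m.group(1).decode('ascii'))
    candidates.extend(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next one.
            continue
    # Last resort: latin-1 maps every byte, so this cannot fail.
    return raw.decode('latin-1'), 'latin-1'

raw = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'
text, encoding = decode_html(raw)
print(encoding, repr(text))
```

The returned string can then be handed to the parser, and the detected encoding reused when the file is written back.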
Old 04-09-2010, 02:55 PM   #29
Goshzilla
Zealot
Okay, I've done as much as I could with it. I've uploaded an epub of Hound of the Baskervilles to demonstrate the soft-hyphen reflow throughout the entire text.

Update: I'm perfecting the code step by step. With the latest iteration, the program can crawl through a directory whose subfolders each contain HTML files and modify them all, subfolder by subfolder. Now I can do bulk editing quickly.
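The code itself was not posted, so the following is only a guess at what such a crawl could look like, using `pathlib` from the standard library. The names `hyphenate_tree` and `transform` are hypothetical, and every file is assumed to share one encoding.

```python
import tempfile
from pathlib import Path

def hyphenate_tree(root, transform, encoding='utf-8'):
    """Rewrite every HTML file under root in place; return the paths."""
    changed = []
    # rglob('*') walks root and all of its subfolders.
    for path in sorted(Path(root).rglob('*')):
        if path.is_file() and path.suffix.lower() in ('.html', '.htm', '.xhtml'):
            text = path.read_text(encoding=encoding)
            path.write_text(transform(text), encoding=encoding)
            changed.append(path)
    return changed

# Try it on a throwaway directory with one subfolder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'book1').mkdir()
    page = Path(tmp) / 'book1' / 'ch1.html'
    page.write_text('<p>hyphenation</p>', encoding='utf-8')
    done = hyphenate_tree(tmp, lambda t: t.replace('hyphenation', 'hy\u00adphen\u00adation'))
    result = page.read_text(encoding='utf-8')
    print(len(done), repr(result))
```

Since the files are overwritten in place, it is worth running something like this on a copy of the books first.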

Last edited by Goshzilla; 04-10-2010 at 01:37 AM.
All times are GMT -4.