Old 10-09-2007, 04:31 PM   #1
Goshzilla
Zealot
Posts: 104
Karma: 346
Join Date: Oct 2007
Device: Rocket Ebook 1150
Any good Perl scripters out there?

I have the software installed on my computer, but I have never written a Perl script before. I have done programming in Java and C++, so I naively thought that a program to hyphenate every word in a paragraph with a soft hyphen (one that does not show up unless the word falls at the end of a line) would be easy. Turns out it isn't. Luckily, the problem was thoroughly researched by Donald Knuth when he wrote TeX, the typesetting system that LaTeX is built on.

There is a Perl equivalent that uses the TeX hyphenation scheme:

http://www.gemjack.com/gems/text-hyp...xt/Hyphen.html

I would like to write a simple script that can take an entire HTML file, hyphenate the body text with soft hyphens, and then save it back to the HTML file. This interests me because a lot of the ebooks I make for the Gemstar have a lot of blank space between words due to full justification.

Note: I would prefer to work in HTML, since the texts I use are Project Gutenberg texts that I convert to HTML using the GutenMark tool.
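For reference, the soft hyphen discussed here is the Unicode character U+00AD, which HTML also accepts as the entity `&shy;`. The following is only a small illustrative sketch in Python (the sample word and the variable names are made up for the example). It shows that the entity and the raw character are the same thing, and that stripping the character gives back the plain word:

```python
import html

SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, written &shy; in HTML source

# A word with one optional break point, as it might appear in an HTML file.
marked_up = "per&shy;fect"

# Decoding the entity yields the raw soft-hyphen character.
decoded = html.unescape(marked_up)

# Removing the soft hyphens gives back the plain word.
plain = decoded.replace(SOFT_HYPHEN, "")

print(repr(decoded))  # repr makes the otherwise invisible character visible
print(plain)
```

Whether a given reader actually breaks lines at this character depends on its rendering software.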
Old 10-09-2007, 05:46 PM   #2
kovidgoyal
creator of calibre
Posts: 43,845
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use BeautifulSoup + the Python hyphenate module; it shouldn't need more than a 100-line script.
Old 10-09-2007, 05:47 PM   #3
JSWolf
Resident Curmudgeon
Posts: 73,901
Karma: 128597114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
If we put soft hyphens in an LRS file, would they work when it is converted to LRF?
Old 10-09-2007, 05:54 PM   #4
kovidgoyal
It depends on whether SONY's reader software supports soft hyphens.
Old 10-09-2007, 06:04 PM   #5
kovidgoyal
I just checked, and it doesn't. In fact, it's worse than that. It converts them to real hyphens. And doesn't line break on them.
Old 10-09-2007, 06:11 PM   #6
NatCh
Gizmologist
Posts: 11,615
Karma: 929550
Join Date: Jan 2006
Location: Republic of Texas Embassy at Jackson, TN
Device: Pocketbook Touch HD3
Quote:
Originally Posted by kovidgoyal View Post
It converts them to real hyphens. And doesn't line break on them.
Ouch.
Old 10-09-2007, 06:32 PM   #7
JSWolf
That's odd, as I've seen it break on a real hyphen if the real hyphen is already in the text. Doesn't break on em dashes for some idiotic reason.
Old 10-09-2007, 06:41 PM   #8
NatCh
Quote:
Originally Posted by JSWolf View Post
Doesn't break on em dashes for some idiotic reason.
Do the em dashes in question have spaces on either side of them?

I.e. "word — word" vs. "word—word"

The reason I ask is that the absence of spaces may make the parser think it's supposed to be a single word — I think the first is the "technically" proper usage, anyway ....
Old 10-09-2007, 06:45 PM   #9
JSWolf
I've seen "word— word" vs. "word—word" vs. "word — word"

If you have "word—word" it's treated as one long word. "word — word" might be wrapped at word leaving "— word" on the next line. If it fits, "word—" would be on the previous line and word would start the next line which looks best.
Old 10-09-2007, 06:58 PM   #10
NatCh
Any idea what it does with non-breaking spaces?
Old 10-09-2007, 07:13 PM   #11
jbenny
Addict
Posts: 323
Karma: 358
Join Date: May 2007
Device: Tablet PC and Nokia N800
Quote:
Originally Posted by NatCh View Post
Do the em dashes in question have spaces on either side of them?

I.e. "word — word" vs. "word—word"

The reason I ask is that the absence of spaces may make the parser think it's supposed to be a single word — I think the first is the "technically" proper usage, anyway ....
Actually, I am pretty sure that the correct usage of an em-dash has no spaces on either side, although some people use them. Of course, the typical ASCII representation of an em-dash is -- (two dashes).
Old 10-09-2007, 09:13 PM   #12
Goshzilla
Quote:
Originally Posted by kovidgoyal View Post
Use BeautifulSoup + the Python hyphenate module; it shouldn't need more than a 100-line script.
How exactly would this work? I want to preserve the original tags, styles, etc., except that all the text within the body of the work would be hyphenated using soft hyphens.

The way I read the hyphenate.py file, it merely returns an array containing each substring of the word between the points where a hyphen can go. I suppose I would then have to write a for loop over that array to build a string with the soft hyphens appended, something like:

say I took a = hyphenate_word('perfect')
a = ['per', 'fect']
string = a[0]
then I would want a for loop iterating i from 1 to len(a)-1
## something that appends "$$softhyphen$$" + a[i] to string
(writing it that way so that the soft hyphen is never inserted at the end of a word)
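The loop described in this post could look roughly like the following in Python. This is only a sketch: `pieces` is hard-coded to the example fragments quoted in the post rather than produced by a real call to `hyphenate_word`, and the function name `join_with_soft_hyphens` is made up for illustration.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN

def join_with_soft_hyphens(pieces):
    """Glue word fragments together with a soft hyphen between each pair.

    Assumes `pieces` holds at least one fragment.
    """
    result = pieces[0]
    # Start at 1 so a soft hyphen is only ever placed *between* fragments,
    # never before the first one or after the last one.
    for i in range(1, len(pieces)):
        result += SOFT_HYPHEN + pieces[i]
    return result

pieces = ["per", "fect"]  # hard-coded example fragments
word = join_with_soft_hyphens(pieces)
print(repr(word))
```

The built-in `SOFT_HYPHEN.join(pieces)` produces the same string in a single call, so the explicit loop is mainly useful for seeing what happens step by step.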

Now I just need to figure out how to use BeautifulSoup to get the text in between the <body></body> tags while preserving formatting tags like <p> and <br>. I don't want to drop the formatting completely; I only want the words to be modified and then placed back into the HTML file.

Last edited by Goshzilla; 10-09-2007 at 09:15 PM.
Old 10-09-2007, 09:20 PM   #13
kovidgoyal
u'\u00ad'.join(result_of_call_to_hyphenate)

You'd need to recursively process the root tag of the BeautifulSoup tree. Every piece of text is an object of class NavigableString; use isinstance to test for it. I'm not sure how easy it is to replace strings in the soup, but once you figure that out, all you need to do is print unicode(soup).encode('utf-8')
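A rough sketch of that recursive walk, written for Python 3 and the current `bs4` package rather than the 2007-era BeautifulSoup, might look like this. Several things here are assumptions made for illustration: `toy_hyphenate` and its tiny `TOY_BREAKS` table stand in for the real `hyphenate_word`, the names `hyphenate_text` and `hyphenate_tree` are made up, and skipping `script` and `style` is a choice of this sketch rather than something the thread specifies. The check uses `type(child) is NavigableString` instead of `isinstance`, because comments and similar nodes are subclasses of `NavigableString` and should be left alone.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

SOFT_HYPHEN = "\u00ad"

# Hypothetical stand-in for hyphenate_word(): break offsets for a few words.
TOY_BREAKS = {"perfect": [3], "hyphenation": [2, 6]}

def toy_hyphenate(word):
    """Split `word` at the offsets listed in TOY_BREAKS, else leave it whole."""
    pieces, start = [], 0
    for cut in TOY_BREAKS.get(word.lower(), []):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces

def hyphenate_text(text):
    """Insert soft hyphens into every run of letters in `text`."""
    return re.sub(r"[^\W\d_]+",
                  lambda m: SOFT_HYPHEN.join(toy_hyphenate(m.group())),
                  text)

def hyphenate_tree(tag):
    """Recursively rewrite plain text nodes below `tag`, leaving markup alone."""
    for child in list(tag.children):        # copy: the tree is edited in place
        if type(child) is NavigableString:  # plain text only, not comments etc.
            child.replace_with(NavigableString(hyphenate_text(str(child))))
        elif isinstance(child, Tag) and child.name not in ("script", "style"):
            hyphenate_tree(child)

source = '<body><p class="perfect">A <b>perfect</b> hyphenation.</p></body>'
soup = BeautifulSoup(source, "html.parser")
hyphenate_tree(soup)
output = str(soup)
print(repr(output))
```

Tags and attribute values (such as `class="perfect"` above) are never passed to the hyphenator, because only text nodes are rewritten. To use a real hyphenator, `toy_hyphenate` would be swapped for `hyphenate_word` from the hyphenate module mentioned in the thread.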
Old 10-09-2007, 10:02 PM   #14
Goshzilla
Maybe I should just hyphenate the original Gutenberg txt file first. For that, at least, I have a rough idea of how to write the script.
Old 10-09-2007, 10:11 PM   #15
kovidgoyal
For a txt file it's as simple as

Code:
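import re
from hyphenate import hyphenate_word as hyphenate

# Read as text (UTF-8 is an assumption) so the soft hyphen U+00AD can be joined in.
with open('file', encoding='utf-8') as f:
    src = f.read()
# Hyphenate every run of non-whitespace characters.
result = re.sub(r'\S+', lambda match: '\u00ad'.join(hyphenate(match.group())), src)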
Actually, if you can come up with a regexp that matches only the text between tags, you can use this technique for HTML as well.
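One possible shape for such a regexp is sketched below, using only the standard library. It is a rough approximation, not a real HTML parser: it treats anything between a `>` and the next `<` as text, so it would also touch the contents of `<script>`, `<style>`, and comments, and it can be confused by a `>` inside an attribute value. The names `BETWEEN_TAGS`, `ENTITY_OR_WORD`, `hyphenate_html`, and the stand-in `toy_hyphenate` with its `TOY_BREAKS` table are all made up for this example. Character entities such as `&mdash;` are matched separately so that they are not split by accident.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Hypothetical stand-in for hyphenate_word(): break offsets for a few words.
TOY_BREAKS = {"perfect": [3], "hyphenation": [2, 6]}

def toy_hyphenate(word):
    """Split `word` at the offsets listed in TOY_BREAKS, else leave it whole."""
    pieces, start = [], 0
    for cut in TOY_BREAKS.get(word.lower(), []):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces

# Text that sits between the end of one tag ('>') and the start of the next ('<').
BETWEEN_TAGS = re.compile(r"(?<=>)[^<>]+(?=<)")
# Either a character entity (left untouched) or a run of letters (hyphenated).
ENTITY_OR_WORD = re.compile(r"&#?\w+;|[^\W\d_]+")

def hyphenate_token(match):
    """Return entities unchanged and words with soft hyphens inserted."""
    token = match.group()
    if token.startswith("&"):
        return token
    return SOFT_HYPHEN.join(toy_hyphenate(token))

def hyphenate_html(html_source):
    """Apply the hyphenator only to text found between tags."""
    return BETWEEN_TAGS.sub(
        lambda m: ENTITY_OR_WORD.sub(hyphenate_token, m.group()),
        html_source,
    )

sample = '<p class="perfect">A perfect&mdash;hyphenation</p>'
result = hyphenate_html(sample)
print(repr(result))
```

The attribute value `class="perfect"` is left alone because it lies inside a tag, not between two tags. For anything beyond simple, well-formed markup, a parser-based approach is likely to be more reliable than a regexp.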

Last edited by kovidgoyal; 10-09-2007 at 10:15 PM.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.