Old 06-07-2009, 10:03 PM   #1
ahi
Wizard
ahi ought to be getting tired of karma fortunes by now.
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Removing Line-breaks / Preserving Paragraphs

Please find attached repartee, a Python script that--I believe--should do a fairly good job of automatically removing line-breaks without interfering with paragraph breaks.

I just finished throwing it together, so it doubtless leaves much to be desired. However, I would be grateful if people could either test it or point me to some unorthodoxly line-broken/paragraph-broken files on which I could try the program myself.

The script doesn't touch the input file (unless you purposely give the input file's name as the output file as well) and is programmed not to output anything if it doesn't think it can tell line-breaks apart from paragraph-breaks.

If you find a file that the script should fix (i.e.: it has both line-breaks and paragraph breaks), but it refuses, saying "Unable to find a clear and/or consistent line break / paragraph break pattern.", please send the file (or a portion thereof) my way for analysis.
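
As a rough illustration of the "refuse rather than guess" behaviour described above, here is a minimal sketch. It is not the actual repartee code (that is only in the attachment). It assumes the simplest possible layout, where paragraphs are separated by blank lines, and the names NoClearPattern and unwrap are made up for the example.

```python
import re


class NoClearPattern(Exception):
    """Raised when line breaks cannot be told apart from paragraph breaks."""


def unwrap(text):
    """Join hard-wrapped lines while keeping paragraph breaks.

    Assumption (not taken from repartee): paragraphs are separated by
    one or more blank lines, possibly containing stray spaces or tabs.
    """
    text = text.replace("\r\n", "\n")
    # Split on a newline followed by one or more (possibly blank) empty lines.
    chunks = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    paragraphs = [c for c in chunks if c.strip()]
    wrapped = [p for p in paragraphs if "\n" in p]

    # Refuse unless the text shows both kinds of break.
    if len(paragraphs) < 2 or not wrapped:
        raise NoClearPattern(
            "Unable to find a clear and/or consistent "
            "line break / paragraph break pattern."
        )

    joined = (
        " ".join(line.strip() for line in p.split("\n")) for p in paragraphs
    )
    return "\n\n".join(joined) + "\n"
```

Because unwrap returns a new string and raises instead of returning partial output, a caller can write the output file only after the call succeeds, which leaves the input untouched in the failure case.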

Keep in mind, though, that the script is meant to be used on full-size plaintext novels or reasonably long short stories. It is more likely to break with very short pieces of text, almost certainly won't do anything useful with flash fiction, and may behave erratically with complexly formatted text files (i.e.: language textbooks and other similarly non-novel types of text).

- Ahi

Ps.: In particular, I would be grateful, Gideon, if you tried it on the file you recently had trouble with and let me know the results.
Attached Files
File Type: zip repartee.zip (1.4 KB, 267 views)
Old 06-08-2009, 01:44 AM   #2
ahi
Here's the updated version that has had the parsing/identification logic fixed to not trip up on files where there are two spaces after punctuation and other similar quirks.

- Ahi

Ps.: It should be noted that repartee does not abuse poems and other similar text, so long as each line is indented by a few spaces at the beginning (which sets it apart from other lines).
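
To make the indentation rule concrete, here is a small sketch of one way such a rule could work. It is an assumption about the general idea, not a copy of what repartee does: the function name unwrap_keep_indented and the min_indent threshold are invented for the example, only leading spaces are counted, and tab-indented paragraph starts are simply joined like ordinary lines.

```python
def unwrap_keep_indented(lines, min_indent=2):
    """Join consecutive flush-left lines; pass indented lines through as-is.

    Assumption: a line with at least `min_indent` leading spaces is verse
    or otherwise set-off text and must keep its own line.
    """
    out, buffer = [], []

    def flush():
        # Emit the prose lines collected so far as one joined line.
        if buffer:
            out.append(" ".join(buffer))
            buffer.clear()

    for line in lines:
        stripped = line.rstrip()
        indent = len(stripped) - len(stripped.lstrip(" "))
        if not stripped:              # blank line: paragraph break
            flush()
            out.append("")
        elif indent >= min_indent:    # indented line: keep untouched
            flush()
            out.append(stripped)
        else:                         # ordinary wrapped prose
            buffer.append(stripped.strip())
    flush()
    return out
```

Called on a list of lines (for example text.splitlines()), it returns a new list in which prose paragraphs are single lines and indented lines are unchanged.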
Attached Files
File Type: zip repartee.zip (1.5 KB, 234 views)
Old 06-08-2009, 02:00 AM   #3
Gideon
Wearer of Pants
Gideon knows the square root of minus one.
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
I gave it a quick go with a file I had that was like my notorious file from the other day was supposed to be (really did have no spaces after each line, just a paragraph). Most paragraphs did begin with a tab, however. But the script just said nothing could be identified and so nothing could be done.
Old 06-08-2009, 02:07 AM   #4
ahi
Quote:
Originally Posted by Gideon View Post
I gave it a quick go with a file I had that was like my notorious file from the other day was supposed to be (really did have no spaces after each line, just a paragraph). Most paragraphs did begin with a tab, however. But the script just said nothing could be identified and so nothing could be done.
Any chance you could put at least 15-20 paragraphs of it back up online for me, Gideon? The script is probably tripping up on something trivial I have yet to think about. (I've only tested it on about half a dozen or so files thus far... but the logic should work on a fairly broad range of reasonably consistent text files.)

- Ahi

Ps.: Or, post the output (assuming you are using the zip file from the second post). That one gives some numbers as to what it identifies as word space, line-break, et al.
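
For readers wondering what such numbers might look like, here is a hypothetical sketch of a whitespace tally. How repartee actually computes its figures is not shown in the thread, so the classify and tally functions below are assumptions for illustration only. Classifying a run by how many newlines it contains means that two spaces after punctuation still count as an ordinary word space.

```python
import re
from collections import Counter


def classify(run):
    """Label one run of whitespace by the number of newlines in it."""
    newlines = run.count("\n")
    if newlines == 0:
        return "word space"        # one or more spaces/tabs, no newline
    if newlines == 1:
        return "line break"
    return "paragraph break"       # blank line(s) between text


def tally(text):
    """Count word spaces, line breaks and paragraph breaks in `text`."""
    runs = re.findall(r"\s+", text.strip())
    return Counter(classify(run) for run in runs)


if __name__ == "__main__":
    sample = "One two.  Three\nfour five.\n\nSix seven\neight."
    for kind, count in tally(sample).most_common():
        print(f"{kind}: {count}")
```

A text with many line breaks and comparatively few paragraph breaks is the hard-wrapped case; a tally with no paragraph breaks at all is the kind of input where refusing to guess seems the safer choice.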

Last edited by ahi; 06-08-2009 at 02:11 AM.
Old 06-08-2009, 02:12 AM   #5
Gideon
This was just something I put together to test it. But I can post it. I only created five paragraphs, though.
Attached Files
File Type: txt drop.txt (4.2 KB, 318 views)
Old 06-08-2009, 02:22 AM   #6
ahi
Quote:
Originally Posted by Gideon View Post
This was just something I put together to test it. But I can post it. I only created five paragraphs, though.
This one works on the attached file (i.e.: the drop.txt file you attached)... some of the script's pickiness was reduced.

It's, generally speaking, more likely to deal well with longer text though.

- Ahi
Attached Files
File Type: zip repartee.zip (1.4 KB, 255 views)

Last edited by ahi; 06-08-2009 at 02:27 AM.