Old 07-28-2008, 01:24 PM   #1
monojohnny
Junior Member
monojohnny began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Jul 2008
Device: Kindle
Quick n' dirty Ruby Program: convert text files (Kindle - others?)

http://www.gutenberg.org is, of course, a great source of free books, which are readily viewable on a Kindle. However, anybody who has tried this will have noticed that the Kindle renders them rather awkwardly, something like this:

//
The text seems to
wrap
at funny places when
reading books downloaded from Gutenberg.org
and it
makes for a not-too-pleasant reading
experience...
//

The following Ruby program seems to do a decent job of pre-converting the Gutenberg texts so they look semi-decent on a Kindle:

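--- 'split.rb' CUT HERE ---
# Run with 'ruby -n': $_ holds the current input line.
if ($_.size==2) then
  # Only "\r\n" on the line: treat it as a paragraph break.
  printf("\n\n");
else
  # Drop the line ending so the paragraph flows on as one long line.
  $_.chomp!
  printf("%s ",$_);
end
--- CUT HERE ---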

Run like this:

ruby -an split.rb war_of_the_worlds.txt > converted_war_of_the_worlds.txt

I hope this helps other people !

Cheers

John
----


Notes:

The program above ASSUMES that any line of exactly two characters is a blank line (CTRL-R/CTRL-M, no text), so we want a paragraph break there, hence the *double* newline. Otherwise the 'chomp!' just removes the end-of-line characters and lets the paragraph flow. Essentially each paragraph becomes one big line, which is what the Kindle seems to like. Normal text editors, incidentally, DON'T like this much (unless you turn on word-wrap, of course!).

I think I have worked out why the Kindle renders the text this way: the Gutenberg texts (the ones I looked at, at least) seem to be pre-wordwrapped, with every line terminated by a DOS-style ending: CTRL-R/CTRL-M.

The Kindle will automatically wrap text, so there's no need to have it pre-wrapped (in fact, because the font is not monospaced, it would be incredibly difficult to do this). When the Kindle sees any 'end of line' (for instance the CTRL-R/CTRL-M above) it will honour that.

The result is the skewed text you see: the original wrapping is preserved and the Kindle applies its own on top.

The Ruby program above (Perl programmers should be able to convert it quite easily) seems to do a decent job of pre-converting: it cuts out all the CTRL-R/CTRL-Ms and puts in a double newline to separate paragraphs.
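
For anyone who would rather not rely on the -n flag, the same idea can be sketched as a plain function. This is only a rough sketch, not the script above: the name unwrap is made up for illustration, and it assumes that one or more blank lines separate paragraphs, whatever the line ending is.

```ruby
# Hypothetical helper (not part of the original script): joins the
# hard-wrapped lines of each paragraph into one long line and keeps a
# blank line between paragraphs.
def unwrap(text)
  unix = text.gsub(/\r\n?/, "\n")        # treat \r\n and a lone \r as \n
  paragraphs = unix.split(/\n{2,}/)      # one or more blank lines = paragraph break
  joined = paragraphs.map do |para|
    para.split("\n").map(&:strip).reject(&:empty?).join(" ")
  end
  joined.reject(&:empty?).join("\n\n")
end

# Possible use: print unwrap(File.read("war_of_the_worlds.txt"))
```

Unlike the two-character test in the script, this version does not care whether the blank lines are "\r\n" or just "\n".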

Ruby Language is here: http://www.ruby-lang.org/en/

Last edited by monojohnny; 07-28-2008 at 01:45 PM. Reason: Correcting typo.
Old 07-28-2008, 01:55 PM   #2
DaleDe
Grand Sorcerer
DaleDe ought to be getting tired of karma fortunes by now.
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
It is CTRL-J, not CTRL-R, and the order is backwards: it is a Return (CTRL-M) followed by a new-line (CTRL-J).
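
A quick way to check which CTRL key gives which character, as a small sketch. The helper name ctrl is made up for illustration; it relies on the ASCII convention that CTRL keeps only the low five bits of the letter's code.

```ruby
# Hypothetical helper: the character produced by holding CTRL with a letter.
# CTRL masks the letter's ASCII code down to its low five bits (1..26).
def ctrl(letter)
  (letter.upcase.ord & 0x1F).chr
end

p ctrl("M")                        # carriage return
p ctrl("J")                        # line feed
p ctrl("M") + ctrl("J") == "\r\n"  # the DOS-style line ending
```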

Dale
Old 07-28-2008, 03:01 PM   #3
monojohnny
CTRL chars...(yup my mistake)

Thanks for the correction, indeed it is : \r\n as you said ..

head -1 origin.txt | od -c
0000000   T   h   e       O   r   i   g   i   n       o   f       S   p
0000020   e   c   i   e   s       b   y       m   e   a   n   s       o
0000040   f       N   a   t   u   r   a   l       S   e   l   e   c   t
0000060   i   o   n   ;  \r  \n
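
The same check can be done from Ruby. This is just a sketch: the helper name line_ending is made up, and the sample string is simply the first line shown in the od output above. To check a real file, the first line could be read with File.open(path, "rb") { |f| f.gets }.

```ruby
# Hypothetical helper: reports how a single line is terminated.
def line_ending(line)
  case line
  when /\r\n\z/ then :crlf   # DOS/Windows style
  when /\n\z/   then :lf     # Unix style
  when /\r\z/   then :cr     # old Mac style
  else :none
  end
end

first_line = "The Origin of Species by means of Natural Selection;\r\n"
p line_ending(first_line)    # prints :crlf
p first_line[-2, 2].bytes    # prints [13, 10]
```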

Additionally, changing the 'printf("\n\n")' to 'printf("\n\t")' saves newlines and indents each new paragraph (I checked a real book :-) ), which makes it look more real...

I wondered whether CTRL-L (form feed, i.e. new page, I think?) would work well for chapter breaks... it doesn't :-( I just tried it.

Are there already some decent TXT->PRC command-line tools out there? I know there is one for Windows, but I wondered about cross-platform ones. Anybody know?

Cheers

John
Old 07-28-2008, 07:56 PM   #4
DaleDe
Quote:
Originally Posted by monojohnny
Thanks for the correction, indeed it is : \r\n as you said ..
The characters and their order date back to before bidirectional printing. It takes longer to send the print head back to the beginning of the line than it does to move the paper one line down, so you send the return first and the new-line second to speed up the printing. Sending the return by itself was used to support overprinting, underlining, etc.

Of course, on an electronic screen you really don't need both characters, so some systems drop one or the other. Macs used to use only the \r, while Unix uses only the \n. I am not sure what OS X does these days. PCs keep them both, and some programs can get confused if they don't see both of them in the correct order.
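
If a program does get confused by mixed endings, one option is to rewrite them all to a single convention first. A minimal sketch follows; the function name and its ending parameter are made up for illustration.

```ruby
# Hypothetical helper: rewrites every line ending (\r\n, lone \r, lone \n)
# to one chosen convention, Unix-style "\n" by default.
def normalize_newlines(text, ending = "\n")
  # \r\n must come first in the alternation so it is replaced as one unit.
  text.gsub(/\r\n|\r|\n/, ending)
end

p normalize_newlines("pc\r\nold mac\runix\n")
p normalize_newlines("unix\n", "\r\n")
```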

Dale
Old 07-29-2008, 10:25 AM   #5
monojohnny
Thanks again for that, makes sense.

Posting the final version of the program here, in case it's of some use to other people.

-- CUT HERE, save as 'convert.rb' --
# Simple Ruby script to convert text-format ebooks from Project Gutenberg
# To a format which is more readable on an Amazon Kindle.
#
# Run with 'ruby -an convert.rb <original.txt> > <new.txt>'
#
# links: http://www.gutenberg.org, http://www.ruby-lang.org
#

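# With -n, Ruby reads the input one line at a time into $_.
BEGIN { BLANKLINE = "\r\n" }   # a line holding nothing but the DOS line ending

if $_ == BLANKLINE then
  # Paragraph break: start a new line, indented with a tab.
  printf("\n\t"); next;
else
  # Strip the line ending and join the line onto the current paragraph.
  $_.chomp!
  printf(" %s",$_);
end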
-- CUT HERE --
Old 01-14-2009, 08:32 AM   #6
hansel
JSR FFD2
hansel can extract oil from cheese
Posts: 305
Karma: 1045
Join Date: Aug 2008
Location: Rotterdam, Netherlands, Europe, Sol 3
Device: iliad
Hello All,
I just came across this old thread by following a link...

I use the following Ruby script (conv.rb) to convert Gutenberg txt files to simple HTML:
Code:
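#!/usr/bin/ruby
# Usage: ruby conv.rb book.txt > book.html
txt = IO.read(ARGV[0])
txt.gsub!(/\r/,'')                 # drop carriage returns (DOS line endings)
parts = txt.split(/\n\n\n\n/)      # three blank lines in a row separate the parts
parts.shift                        # discard the first part (typically the Gutenberg header)
parts.pop                          # discard the last part (typically the Gutenberg footer)

$stderr.print "%s: bytes=%d, parts=%d\n" % [ARGV[0], txt.size, parts.size]
print "<html>\n<head><title>#{ARGV[0]}</title></head>\n<body>\n"
parts.each do |part|
  pars = part.split(/\n\n+/)       # blank lines separate paragraphs
  head = pars.shift                # the first paragraph of a part is its heading
  print "<h1>#{head}</h1>\n"
  pars.each do |par|
   par.gsub!(/\[\d+\].+/, '')      # remove footnote markers like [12] and the rest of that line
   par.gsub!(/_(.*?)_/m, '<i>\1</i>')  # _text_ becomes italics
   print "<p>#{par}</p>\n"
  end
end
print "</body>\n</html>\n"         # close the tags opened above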
After converting, you can turn the HTML into a nice custom PDF (with TOC) using htmldoc:

Code:
ruby conv.rb xxx.txt >xxx.html
htmldoc -f xxx.pdf --header "" --footer "" --top 3mm --bottom 1mm --left 1mm --right 1mm --size 12x15cm xxx.html
You might have to experiment a bit with the script for optimal results (depending on the exact text lay-out)...
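
One thing that may be worth experimenting with: the script writes the text into the HTML as-is, so a stray & or < in a book could upset the converter. Below is a rough sketch of the per-paragraph step with escaping added. The function name paragraph_to_html and the escaping step are my own additions for illustration; the footnote and italics rules are the same two substitutions used in the script.

```ruby
# Hypothetical variant of the per-paragraph step, with HTML escaping added.
def paragraph_to_html(par)
  escapes = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }
  text = par.gsub(/[&<>]/, escapes)           # escape before adding any tags
  text = text.gsub(/\[\d+\].+/, "")           # drop footnote marker and rest of that line
  text = text.gsub(/_(.*?)_/m, '<i>\1</i>')   # _text_ becomes italics
  "<p>#{text}</p>"
end

puts paragraph_to_html("Tom & _Jerry_")
```

Escaping has to happen first, otherwise the <i> tags added by the italics rule would be escaped as well.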

Hope this helps somebody
All times are GMT -4.