#1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2008
Device: Kindle
|
Quick n' dirty Ruby Program: convert text files (Kindle - others?)
http://www.gutenberg.org of course is great for free books - which are readily viewable on a Kindle...however anybody who has tried this will have noticed the Kindle renders them in a rather awkward way, something like this:
// The text seems to wrap at funny places when reading books downloaded from Gutenberg.org and it makes for a not-too-pleasant reading experience... //

The following Ruby program seems to do a decent job of pre-converting the Gutenberg texts so they look semi-decent on a Kindle:

--- 'split.rb' CUT HERE ---
# Run with 'ruby -an': $_ holds the current input line.
if ($_.size==2) then
  # Two characters only: treat as a blank line, i.e. a paragraph break.
  printf("\n\n");
else
  # Explicit receiver; a bare chomp! relies on Ruby 1.8 behaviour.
  $_.chomp!
  printf("%s ",$_);
end
--- CUT HERE ---

Run like this:

ruby -an split.rb war_of_the_worlds.txt > converted_war_of_the_worlds.txt

I hope this helps other people!

Cheers
John

----
Notes:

The program above ASSUMES that any line of exactly two characters is a blank line (CTRL-R+CTRL-M, no text), so we want to break here, as a paragraph break, hence *double* newlines. Otherwise the 'chomp!' just removes any end-of-line chars and lets the paragraph flow. Essentially each paragraph becomes one big line, which is what the Kindle seems to like. Normal text editors incidentally DON'T like this much (unless you turn on word-wrap of course!).

I think I have worked out why: the Gutenberg texts (the ones I looked at, at least) seem to be pre-wordwrapped, with each line terminated by a DOS-style ending: CTRL-R/CTRL-M. The Kindle will automatically wrap text, so there's no need to have it pre-wrapped (in fact, because the font is not monospaced, it would be incredibly difficult to do this). When the Kindle sees any 'end of line' (for instance the CTRL-R/CTRL-M above) it will honour that. The result is the skewed text you see, with the original wrapping preserved and the Kindle applying its own on top.

The Ruby program above (Perl programmers should be able to convert this quite easily) seems to do a decent job of pre-converting: it cuts out all the CTRL-R/CTRL-Ms and puts in a double newline to separate paragraphs.

Ruby Language is here: http://www.ruby-lang.org/en/

Last edited by monojohnny; 07-28-2008 at 01:45 PM. Reason: Correcting typo.
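For anyone who prefers a plain script over the `-an` one-liner style, here is a minimal sketch of the same unwrapping idea as a function. It assumes paragraphs are separated by one or more blank lines and that lines may end in either "\r\n" or "\n". The name `unwrap` is made up for this example and is not from the original post.

```ruby
# Hypothetical helper (not from the original post): joins hard-wrapped
# lines into one line per paragraph, with a blank line between paragraphs.
def unwrap(text)
  text
    .gsub(/\r\n?/, "\n")   # normalise CRLF / bare CR line endings to LF
    .split(/\n{2,}/)       # a run of blank lines separates paragraphs
    .map { |para| para.split("\n").map(&:strip).reject(&:empty?).join(" ") }
    .reject(&:empty?)      # ignore paragraphs that were only whitespace
    .join("\n\n")
end

sample = "It was a dark\r\nand stormy night.\r\n\r\nThe rain fell.\r\n"
puts unwrap(sample)
```

For a real book, something like `print unwrap(File.read("book.txt"))` should work, though files that are not valid UTF-8 may need an explicit encoding when reading.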
#2 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
CTRL-J, not CTRL-R and the order is backwards. It is a Return followed by a new-line.
Dale |
#3 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2008
Device: Kindle
|
CTRL chars...(yup my mistake)
Thanks for the correction; indeed it is \r\n, as you said:

head -1 origin.txt | od -c
0000000   T   h   e       O   r   i   g   i   n       o   f       S   p
0000020   e   c   i   e   s       b   y       m   e   a   n   s       o
0000040   f       N   a   t   u   r   a   l       S   e   l   e   c   t
0000060   i   o   n   ;  \r  \n

Additionally, changing the 'printf("\n\n")' to 'print("\t\n")' saves newlines (I checked a real book :-) ) and makes it look more real...

I wonder if 'CTRL-L' (I think?) New page would work well for chapters... (it doesn't :-( just tried it...)

Are there already some decent TXT->PRC command-line tools out there? I know there is one for Windows, but just wondered about cross-platform... Anybody know?

Cheers
John
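The same line-ending check can be done from Ruby instead of `od`. This is a rough sketch only, assuming the input is small enough to hold in memory as one string; `count_line_endings` is a made-up name for illustration.

```ruby
# Hypothetical helper: counts each style of line ending in a string.
def count_line_endings(data)
  crlf = data.scan(/\r\n/).size
  {
    crlf: crlf,                      # DOS-style pairs
    cr:   data.count("\r") - crlf,   # bare carriage returns
    lf:   data.count("\n") - crlf    # bare line feeds
  }
end

p count_line_endings("The Origin of Species by means of Natural Selection;\r\n")
```

To check a whole file, `count_line_endings(File.binread("origin.txt"))` reads the raw bytes, so no newline translation gets in the way.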
#4 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Of course on an electronic screen you really don't need both characters, so some systems drop one or the other. Macs used to use only the \r while Unix uses only the \n. I am not sure what OS X does these days. PCs keep them both, but some programs can get confused if they don't see both of them in the correct order.
Dale |
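Converting between these conventions is a one-line substitution in either direction. A small sketch, with `to_lf` and `to_crlf` as hypothetical helper names:

```ruby
# Hypothetical helpers for converting between line-ending conventions.

# Any of "\r\n", bare "\r" or bare "\n" becomes a single "\n" (Unix style).
def to_lf(text)
  text.gsub(/\r\n?/, "\n")
end

# Any of "\r\n", bare "\r" or bare "\n" becomes "\r\n" (DOS/Windows style).
def to_crlf(text)
  text.gsub(/\r\n?|\n/, "\r\n")
end

p to_lf("one\r\ntwo\rthree\n")
p to_crlf("one\ntwo\r\nthree\r")
```

Matching "\r\n" before a bare "\r" matters here: otherwise a DOS-style pair would be treated as two separate line endings.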
|
#5 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2008
Device: Kindle
|
Thanks again for that, makes sense.
Posting the final version of the program here, in case it's of some use to other people.

-- CUT HERE, save as 'convert.rb' --
# Simple Ruby script to convert text-format ebooks from Project Gutenberg
# To a format which is more readable on an Amazon Kindle.
#
# Run with 'ruby -an convert.rb <original.txt> > <new.txt>'
#
# links: http://www.gutenberg.org, http://www.ruby-lang.org
#
BEGIN {
  BLANKLINE=sprintf("\r\n");
}
if $_ == BLANKLINE then
  # Blank line: start a new paragraph, indented with a tab.
  printf("\n\t");
  next;
else
  # Explicit receiver; a bare chomp! relies on Ruby 1.8 behaviour.
  $_.chomp!
  printf(" %s",$_);
end
-- CUT HERE --
#6 |
JSR FFD2
Posts: 305
Karma: 1045
Join Date: Aug 2008
Location: Rotterdam, Netherlands, Europe, Sol 3
Device: iliad
|
Hello All,
I just came across this old thread by following a link... I use the following Ruby script (conv.rb) to convert Gutenberg txt files to simple HTML:
Code:
#!/usr/bin/ruby
txt = IO.read(ARGV[0])
txt.gsub!(/\r/,'')
# Four newlines in a row separate the major parts of the file.
parts = txt.split(/\n\n\n\n/)
# Drop the first and last parts (presumably the Gutenberg header and footer).
parts.shift
parts.pop
$stderr.print "%s: bytes=%d, parts=%d\n" % [ARGV[0], txt.size, parts.size]
print "<html>\n<head><title>#{ARGV[0]}</title></head>\n<body>\n"
parts.each do |part|
  pars = part.split(/\n\n+/)
  # The first paragraph of each part becomes its heading.
  head = pars.shift
  print "<h1>#{head}</h1>\n"
  pars.each do |par|
    # Strip footnote markers like [12] and the rest of that line.
    par.gsub!(/\[\d+\].+/, '')
    # _text_ becomes italics.
    par.gsub!(/_(.*?)_/m, '<i>\1</i>')
    print "<p>#{par}</p>\n"
  end
end
# Close the tags opened above.
print "</body>\n</html>\n"
Code:
ruby conv.rb xxx.txt >xxx.html
htmldoc -f xxx.pdf --header "" --footer "" --top 3mm --bottom 1mm --left 1mm --right 1mm --size 12x15cm xxx.html
Hope this helps somebody |
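One caveat with generating HTML this way: the script writes the book text straight into the page, so a stray `&`, `<` or `>` in the source could confuse whatever reads the HTML. Below is a minimal sketch of escaping those characters before applying the italics substitution. It is an assumption that this matters for your texts, and `escape_html` and `paragraph_html` are hypothetical names, not part of the script above.

```ruby
# Hypothetical helpers (not part of the original script).
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Replaces the three characters that have special meaning in HTML text.
def escape_html(text)
  text.gsub(/[&<>]/, HTML_ESCAPES)
end

# Escapes a paragraph first, then turns _text_ into italics, so the
# <i> tags added here are not themselves escaped.
def paragraph_html(par)
  text = escape_html(par)
  text = text.gsub(/_(.*?)_/m, '<i>\1</i>')
  "<p>#{text}</p>"
end

puts paragraph_html("Tom & _Jerry_ <3")
```

The order matters: escaping after the italics step would turn the freshly added tags into visible text.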