Old 04-13-2013, 04:15 AM   #1
b0rsuk
meles meles
Posts: 109
Karma: 163588
Join Date: May 2008
Location: Persepolis
Device: Pocketbook InkPad 3
What are the longest languages?

Hi, I'm not sure where to ask this. But it is related to ebooks, because they make it easier to count characters, words and pages.

I'm wondering which languages are usually the longest. You can attempt a distraction by saying "it depends on circumstances, translation", or "German numerals can theoretically be infinite". But I mean in practice. Even better, I have an idea how to measure this! But I need your help.

The idea is to compare the same book translated into different languages. The measuring stick should be character count. Not word count, because that can be misleading. I want the most verbose languages possible, where speaking usually takes the most time. Not page count, because that depends on other factors like margins, font size, and so on.
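A minimal sketch of that measurement in Python, assuming each translation is already available as plain text. The function names and the toy sample strings are made up for illustration. Whitespace is skipped so that layout does not affect the count, and the text is normalized first so that an accented letter typed as "letter + combining accent" counts once.

```python
import unicodedata


def count_characters(text: str, letters_only: bool = False) -> int:
    """Count characters in text, ignoring whitespace.

    With letters_only=True, punctuation and digits are skipped as well.
    """
    # NFC normalization merges "e" + combining accent into a single "é".
    normalized = unicodedata.normalize("NFC", text)
    if letters_only:
        return sum(1 for ch in normalized if ch.isalpha())
    return sum(1 for ch in normalized if not ch.isspace())


def relative_length(samples: dict[str, str], baseline: str) -> dict[str, float]:
    """Express every sample's length as a ratio of the baseline language."""
    base = count_characters(samples[baseline])
    return {lang: count_characters(text) / base for lang, text in samples.items()}


if __name__ == "__main__":
    # Toy strings only; a real comparison would load whole books here.
    samples = {"en": "I was.", "pl": "Byłam."}
    print(relative_length(samples, baseline="en"))
```

A ratio above 1.0 means the translation is longer than the baseline. With samples this small the numbers mean very little; the point is only the mechanics.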

I was thinking about comparing The Bible and/or War and Peace in different languages. But I can't find a place where a large number of languages is listed. Ebooks are digital, which makes counting characters vastly easier. I think I prefer War and Peace, because The Bible tends to be written in archaic language to make it sound more profound. Or rather, it's often left that way.

I think Russian may be one of the longest. Slavic languages tend to be quite long, and Russian is noticeably longer than Polish. For instance, words with feminine gender often have 2 syllables more.
Old 04-13-2013, 11:44 AM   #2
Rizla
Member Retired
Posts: 3,183
Karma: 11721895
Join Date: Nov 2010
Device: Nook STR (rooted) & Sony T2
I can tell you French is 20-30% longer than English.
Old 04-13-2013, 12:16 PM   #3
Zetmolm
Guru
Posts: 612
Karma: 2031728
Join Date: Jan 2010
Device: PocketBook Touch (622), PocketBook Touch Lux 2, Pocketbook Touch HD 3
War and Peace is perhaps not such a good example. There are several 'original' versions of the book in Russian, so you would need to make sure the translator used the same version as the one you are comparing to. Also, having been published almost 150 years ago, War and Peace may not be the perfect example of 'modern' language.

But anyway, interesting idea, to compare the length of languages. I'll be interested to see the outcome.
Old 04-14-2013, 06:31 PM   #4
barutanseijin
Nxfgrrjks
Posts: 99
Karma: 925422
Join Date: Nov 2012
Location: New York, NY
Device: aura hd
Alphabets will use more characters than ideograms:

In English:

I'm going home.

In Japanese:

帰る。

It helps that i can elide the subject.
Old 04-14-2013, 11:34 PM   #5
ATDrake
Wizzard
Posts: 11,517
Karma: 33048258
Join Date: Mar 2010
Location: Roundworld
Device: Kindle 2 International, Sony PRS-T1, BlackBerry PlayBook, Acer Iconia
Well, if you're looking for a source of multiple languages comparisons, the old "I Can Eat Glass" project where someone solicited a bunch of translations (quality naturally varies) for the phrase "I can eat glass, it doesn't hurt me", seems like it may be of use.

The original seems to be down, but here are three sites where people have mirrored and/or added to the languages presented (including using the original language scripts instead of transliterations).

Personally, I'd say that if written in the Roman alphabet instead of Canadian Aboriginal Syllabics, Inuktitut (which I used to be minorly interested in learning several years ago) seems like a pretty good candidate for length, given its agglutinative structure and the circumlocutory nature of some of the phrasing required due to its vocabulary limitations.

Tamil and Burmese also look to have impressive character counts, based on the length of their sentences as written in script compared to the other samples.
Old 04-15-2013, 12:15 AM   #6
b0rsuk
meles meles
Posts: 109
Karma: 163588
Join Date: May 2008
Location: Persepolis
Device: Pocketbook InkPad 3
I'm asking for a long book like War and Peace (does anyone have a better idea?) to have a sample of significant size. "I can eat glass..." is such a tiny sample.

I forgot about ideograms and Japanese. To level the playing field, we would have to compare with Japanese written using the Roman alphabet. http://en.wikipedia.org/wiki/Romanization_of_Japanese

Romanized Japanese is not perfect, but I'm not looking for a perfect solution.
Old 04-15-2013, 12:34 AM   #7
ATDrake
Wizzard
Posts: 11,517
Karma: 33048258
Join Date: Mar 2010
Location: Roundworld
Device: Kindle 2 International, Sony PRS-T1, BlackBerry PlayBook, Acer Iconia
The difficulty with trying for a long sample is that it would be very hard to track down the necessary translations for it (assuming new enough to still be in print or old enough to be in public domain), especially in e-book format, as well as archaic language reasons (especially with public domain translations, which are often stilted in character due to generally being at least a century old).

I'd suggest that rather than War and Peace, perhaps try looking for samples of a popular, modern international bestseller, such as one of those inevitable thriller novels like The Da Vinci Code, which is bound to have a large amount of modern language translations, some of which you may be lucky enough to find first-chapter samples of online at various vendors and/or publisher websites.

Failing that, I know the Asterix and Tintin comic book adventure albums have all been translated into upwards of 30 languages each and seem to mostly be in print (though may be difficult to track down).

ETA: Your best bet for an e-book available modern language text with lots of translations may be Paulo Coelho's The Alchemist, which Wikipedia says has 67 translations, some of which Coelho himself released to the internet at large and encouraged people to download, compared to TDVC's 40 or so, and Lord of the Rings' 38.

Last edited by ATDrake; 04-15-2013 at 12:41 AM.
Old 04-15-2013, 01:36 AM   #8
barutanseijin
Nxfgrrjks
Posts: 99
Karma: 925422
Join Date: Nov 2012
Location: New York, NY
Device: aura hd
Quote:
Originally Posted by b0rsuk View Post
I'm asking for a long book like War and Peace (does anyone have a better idea ?) to have a sample of significant size. "I can eat glass..." is such a tiny sample.

I forgot about ideograms and Japanese. To level the playing field, we would have to compare with Japanese written using roman alphabet. http://en.wikipedia.org/wiki/Romanization_of_Japanese

Romanized Japanese is not perfect, but I'm not looking for a perfect solution.
Only no one uses romanised Japanese outside of bilingual signs & textbooks. It's really hard to read. Besides, what's the point of levelling the playing field if we're trying to find which language takes up more space?

There's Chinese, too. It's probably at least as efficient as Japanese if written in characters rather than pinyin.
Old 04-15-2013, 06:48 AM   #9
travger
Evangelist
Posts: 480
Karma: 270594
Join Date: Aug 2010
Device: palm tx, Windows7, Galaxy A5
Comparing things only in the Roman alphabet can add several letters to a name. For example:

The name 'Zhenya' has only 4 characters in Russian (Женя).
Old 04-15-2013, 12:40 PM   #10
b0rsuk
meles meles
Posts: 109
Karma: 163588
Join Date: May 2008
Location: Persepolis
Device: Pocketbook InkPad 3
Quote:
Originally Posted by travger View Post
Comparing things only in Roman alphabet can add several letters to the name. For example:

Name 'Zhenya' has only 4 characters in Russian
I wouldn't force Russian into the Roman alphabet, because it already has an alphabet of its own. Japanese and Chinese don't even use an alphabet. One character can mean a whole word.

So okay, let's exclude Japanese and Chinese from the statistics. The point is not to find flaws in this method - it has plenty - but to get a semi-reliable result with a fairly large sample. If we can achieve that, we can talk about improvements.
Old 04-17-2013, 10:31 AM   #11
Freeshadow
temp. out of service
Posts: 2,787
Karma: 24285242
Join Date: May 2010
Location: Duisburg (DE)
Device: PB 623
Even then:
Comparing word lengths only makes sense between languages with similar grammar, because differences in grammatical structure more often than not force you to add words to a sentence to preserve the information in it.
This applies especially when one of the languages is inflected (like PL) and the other is not.
An example would be the following piece of dialogue: two people, one male and the other female, are asked whether they went out to do something specific (e.g. visit a museum).
While in PL the answers
"byłam" (I was) and "nie, odwiedzałem kumpla" (no, I was visiting a pal) are sufficient to indicate who said what, because of the gender suffixes on the verbs,
in English you have to write at least "I was - said $_female" to convey the same amount of information.
Now keep the same in mind for adjectives, adverbs and differences in the grammatical treatment of tenses...
While some languages do a lot of this with prefixes and suffixes, others require heaps of additional words.

German lets you save space with nouns: you can stick several of them together. So while you have to say "Office of Foobaric Affairs" in Polish, in German you simply have a "Foobaroffice". That sounds like it would make things a lot shorter. Nevertheless, because Polish is inflected, it gets by with fewer words in other cases. In fact, texts in Polish come out about 1/3 shorter.

Then you have to keep in mind that not every word has a corresponding equivalent in every language.

There is no word for "toe" in PL (finger of foot is used)
There is no singular "parent" in DE (parentspart is used)

My points are as follows:
  1. Grammar matters more than raw word length
  2. An example like 'I can eat glass (...)' is by far not complex enough to allow reasonable sampling

Last edited by Freeshadow; 04-17-2013 at 10:44 AM.
Old 04-17-2013, 11:42 PM   #12
barutanseijin
Nxfgrrjks
Posts: 99
Karma: 925422
Join Date: Nov 2012
Location: New York, NY
Device: aura hd
Quote:
Originally Posted by Freeshadow View Post
My points are as follows:
  1. Grammar matters more than raw word length
  2. An example like 'I can eat glass (...)' is by far not complex enough to allow reasonable sampling
Excellent points.

It occurs to me that the relative simplicity of English grammar forces writers to use dialogue tags where writers of other languages might get away with grammatical hints. All those "he saids" & "she saids" pad the word count.

The original text of the Tale of Genji has no names because referring to a person by name was considered rude in the Heian court. Honorifics, humble forms and the various inflections of politeness hint at who is speaking about whom, but there are also explicit references to positions, etc. that help identify characters. In other words, there are grammatical efficiencies that are undone by social rules. Translators give the characters names. Nevertheless, it remains a pretty thick book when translated.
Old 04-18-2013, 11:46 AM   #13
b0rsuk
meles meles
Posts: 109
Karma: 163588
Join Date: May 2008
Location: Persepolis
Device: Pocketbook InkPad 3
I understand, but I would still like to compare languages in this simple way. Like any statistic, it would have to be taken with a grain of salt, but could be illuminating. There's no way to make a perfect comparison of languages, and an imperfect one is the next best thing.

Another way to compare could be using a speech synthesizer. If the programs are mature, it would take care of spelling issues. For example French has quite ancient spelling. The French equivalent of "many" - "beaucoup" looks much longer but is actually quick to pronounce. It's roughly "bocoo", "c" like in "corn".

If you keep finding flaws in the idea, it's not constructive. You can go and try to get a job in QA or Software Testing, but probably not in something creative. Your reservations will hold you back.

Last edited by b0rsuk; 04-18-2013 at 11:53 AM.
Old 04-18-2013, 02:02 PM   #14
rkomar
Wizard
Posts: 2,977
Karma: 18343081
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
The nice thing about a short text is that you can count characters by eye. If you go for an entire book, then you have to figure out how to count the text characters in it. It takes work to extract only the text from most ebook formats, and most non-English text needs more than one byte per character (in UTF-8, for example, anything outside ASCII does), so you can't just use the file size to roughly count characters. It is interesting because there are so many other problems to solve before you can even get down to determining relative length.
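To make the byte-versus-character point concrete, here is a small Python sketch. It assumes the text is stored as UTF-8 (other encodings give other byte counts), and the helper name is made up for illustration.

```python
def char_and_byte_counts(text: str, encoding: str = "utf-8") -> tuple[int, int]:
    """Return (number of characters, number of bytes once encoded)."""
    return len(text), len(text.encode(encoding))


# Plain ASCII: one byte per character.
print(char_and_byte_counts("glass"))   # (5, 5)
# Polish "ł" takes two bytes in UTF-8.
print(char_and_byte_counts("byłam"))   # (5, 6)
# Cyrillic letters take two bytes each in UTF-8.
print(char_and_byte_counts("Женя"))    # (4, 8)
# Kanji, kana and the ideographic full stop take three bytes each.
print(char_and_byte_counts("帰る。"))   # (3, 9)
```

So a file size would overstate Russian by roughly a factor of two and Japanese by roughly a factor of three relative to plain English text, before any markup is even considered.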
Old 04-18-2013, 10:46 PM   #15
barutanseijin
Nxfgrrjks
Posts: 99
Karma: 925422
Join Date: Nov 2012
Location: New York, NY
Device: aura hd
This is an interesting topic, and for that reason, i would be very surprised if someone hasn't already done exactly what you're proposing. After all, the relative wordiness of languages is a practical problem for translators, and one would think linguists would also find it interesting. I might try a literature search before reinventing the wheel.

But if i had to do this tonight, i'd use the "Communist Manifesto", mainly because it's in the public domain, has been translated into relatively modern language, and has been rendered into many, many languages. Even more importantly, the marxists.org site has a handy page with links to many -- but not all -- of the translations from a single web page. (It's easier to work from a single page than to poke around on sites in languages i don't understand.) Finally, i'd pipe the output of lynx -dump into wc -m. wc counts newline characters too, but wothehell, quick-n-dirty would fit the parameters you've laid out.
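For anyone without lynx to hand, roughly the same quick-n-dirty idea can be sketched with Python's standard library. This is not what lynx -dump does internally, just a stand-in under some assumptions: the page has already been downloaded into a string, the class and function names are made up, and all whitespace is ignored (so newlines are not counted, unlike wc -m).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, skipping scripts and stylesheets."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script> or <style> is not part of the translation.
        if not self._skip_depth:
            self.parts.append(data)


def visible_char_count(html: str) -> int:
    """Count non-whitespace characters in the text content of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    page = "<html><head><style>p{color:red}</style></head><body><p>Hello, world</p></body></html>"
    print(visible_char_count(page))  # 11
```

Menus, footers and other navigation text on the page still get counted, same as with lynx, so this stays firmly in quick-n-dirty territory.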

When all is said & done, it's your project, so go ahead and do it your way. I'd be curious to know what you find, even if you exclude the languages i'm most interested in.
Tags
comparison, curiosity, languages, research, statistics

All times are GMT -4.