Old 09-22-2022, 01:13 PM   #76
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Good afternoon, M Sarmat89,

Thank you for your message. Please confirm this; you wanted me to go back to the original text file as the file to use with your line of perl code.

The perl code that you provided is: perl -pe "s:^([^[]+?) *(?=\[):\1\t:" <your-file-here >destination.tsv

The resulting converted file is what I am to use with pyglossary.

Now, according to you, this should have converted to the 3 correct stardict files.

I was getting question marks on two of the files. Koreader was seeing the dictionary but not finding the word searched.

Please, kindly confirm the procedure that you recommended and I will try this again.

Cordially,
pz
Old 09-22-2022, 03:15 PM   #77
Sarmat89
Evangelist
Sarmat89 ought to be getting tired of karma fortunes by now.
 
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
Quote:
Originally Posted by pzack View Post
you wanted me to go back to the original text file as the file to use with your line of perl code.
No. First, you need to unfold the lines using the perl commands from post 25. Then insert the tabs with the perl command I gave. Obviously, the source file at each stage is the output file from the previous one.

Then, after fixing missing tabulations manually and converting it, you should have a working dictionary, which you can test with StarDict or GoldenDict.
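
For readers who would rather not run Perl, the tab-insertion step can be sketched in Python. This is only an illustration of what the one-liner's pattern does, assuming (as the pattern implies) that each unfolded entry starts with a headword followed by a bracketed field. The function name `insert_tab` and the sample entry are made up for this example.

```python
import re

# Same pattern as the Perl one-liner: the shortest run of non-"[" characters
# at the start of the line, any spaces after it, then a lookahead for "[".
HEADWORD = re.compile(r"^([^\[]+?) *(?=\[)")


def insert_tab(line: str) -> str:
    """Replace the spaces between the headword and the first '[' with one tab.

    Lines that contain no '[' (or that start with one) are returned unchanged.
    """
    return HEADWORD.sub(lambda m: m.group(1) + "\t", line, count=1)


if __name__ == "__main__":
    sample = "abaisser [abese] v.t. Faire descendre."  # hypothetical entry
    print(repr(insert_tab(sample)))
```

Lines the pattern cannot match pass through untouched, which is why some entries end up without a tab and have to be fixed by hand afterwards.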
Old 09-22-2022, 09:22 PM   #78
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Good evening M Sarmat89,

Thank you for your response. Yes, I may have forgotten about unfolding the lines.

However, this file is over 6000 pages of text and in notepad there are over 500,000 lines. I can't go through this file correcting tabs. And, I don't know where to place the tab.

The big question that I have is: would I have at least a partially functioning dictionary if I don't perform the manual corrections? Would I be able to search and find a good portion of words? Or will pyglossary give me a corrupt stardict dictionary if I don't make the tab corrections?

You seem to be suggesting that the manual corrections must be made before conversion.

cordially,pz
Old 09-22-2022, 10:21 PM   #79
Sarmat89
Evangelist
 
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
Yes, you should have most words in. PyGlossary will tell you which lines didn't get the tab correctly.
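
If you want the list of problem lines before running the conversion, a small Python check can produce it. This is a minimal sketch and not PyGlossary's own logic: it simply assumes one entry per line with a tab after the headword, and `lines_without_tab` is a hypothetical helper name.

```python
def lines_without_tab(lines):
    """Return (line_number, text) pairs for non-empty lines that lack a tab."""
    missing = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")  # ignore the line ending itself
        if text and "\t" not in text:
            missing.append((number, text))
    return missing


if __name__ == "__main__":
    # Made-up sample standing in for the contents of destination.tsv.
    sample = [
        "abaisser\t[abese] v.t. Faire descendre.\n",
        "suite d'un article sans tabulation\n",
        "\n",
    ]
    for number, text in lines_without_tab(sample):
        print(number, text[:60])
```

To run it on the real file, pass an open file object instead of the sample list, for example `open("destination.tsv", encoding="utf-8")`, assuming the file is UTF-8.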
Old 09-23-2022, 11:25 AM   #80
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Hello, M. Sarmat89,

Thank you for confirming my message. I plan to fiddle with this over the weekend and will let you know what I come up with.

I imagine that pyglossary is going to indicate quite a lot of missing tab lines which will still present a grand problem for me if I choose to correct the missing tabs.

If the conversion gives me at least a somewhat serviceable dictionary I might be satisfied with that. I guess the rest I can do at a leisurely pace. I certainly wouldn't bother with manual corrections if pyglossary won't even convert to a half-way usable dictionary.

But, so far, I don't have a dictionary.

cordially,
pz
Old 09-23-2022, 10:00 PM   #81
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Good morning M. Sarmat89,

For some reason I thought that the full text file was a text (.txt) file; in fact, and I did not realise this until today, the file is shown as an .html file.

I can see this file in notepad so I assumed that this was a text or .txt file.

Does this make a difference in your conversion procedure that you outlined to me or do we have to first convert the html to a text file?

If it does indeed need conversion to a text file do you know how this may be done under linux?

Sorry for the possibly grand screw-up on my part.

Cordially,
pz

Last edited by pzack; 09-23-2022 at 10:02 PM.
Old 09-24-2022, 02:40 AM   #82
Sarmat89
Evangelist
 
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
If you can see it as text in a text editor, it is a text file and not an HTML one.
Old 09-25-2022, 04:47 PM   #83
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Hello, M. Sarmat89,

I believe I followed your guide correctly; however, I still had corrupt index files even though pyglossary built the stardict files.

I looked again at the code M. Markismus gave me, the four lines of perl code that built the csv file. If you look at his 9/11 message you have this after the boxed lines of code:

Problem

So what's the problem? You now have an article with the key '©' that has a quite new meaning. Apparently, there are articles that have subsections separated from the main article in the same way that articles are separated.

Stardict
Using my script I've added to the txt-file a csv-extension and ran it using
Code:

perl pocketbookdic.pl zymogène.S-delimiter .txt.csv fr '|,|'

The result in both the xml- and zipped binary form are also uploaded.
____________________________________________________________________
I don't know what this is or whether this operation should have been done; nobody commented on it. Some files generated by the above code are listed there. I don't know whether any of these files needed to be built for the pyglossary conversion.

In any case, when I installed the stardict files this time koreader did not even open the dictionary on a common word search.

Very cordially,
pz
Old 09-25-2022, 05:46 PM   #84
Sarmat89
Evangelist
 
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
Again, what exactly did you do? All commands, please.
Old 09-25-2022, 06:46 PM   #85
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Hello, M. Sarmat89,

Thank you for responding today, Sunday.

I first used the four lines of perl code from post#25 from Markismus to unfold the lines of the original full text file.

Next, I used your perl code on the output file built from the four lines of code, output4.csv:
perl -pe "s:^([^[]+?) *(?=\[):\1\t:" <your-file-here >destination.tsv

Afterwards, I used that tsv file in pyglossary for conversion to stardict.

Did I miss a step? Are the line unfolding codes correct?

cordially,
pz
Old 09-25-2022, 09:01 PM   #86
Sarmat89
Evangelist
 
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
I suppose you used "< output4.csv" instead of "< your-file-here"?

Were there any error messages from pyglossary? When you import the created dictionary in GoldenDict, how many articles are shown there?

To check the unfolding, just open the resulting file and check for line breaks.
Old 09-25-2022, 09:13 PM   #87
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Good evening M Sarmat89,

Thank you for responding.

I indeed used output4.csv in your line of code to achieve the tsv file.

I couldn't count the error messages; pyglossary doesn't show numbers of errors but there were quite a few having to do with lines with no tabs.

I suppose you mean when I imported the dictionary in stardict (under koreader) and not GoldenDict?

I am sorry, but what do you mean by articles? Do you mean words, headwords?

Are you referring to the csv file to check for unfolded lines? How do I check for line breaks, and what should I see?

cordially,
pz
Old 09-25-2022, 10:20 PM   #88
Sarmat89
Evangelist
 
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
When you open "destination.tsv", you should see some very long lines with tabulation after the first word.

You can use any available stardict or goldendict on your PC to check the dictionary before putting it to the device. Check the headword count to tell whether the conversion succeeded.
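
To have a number to compare against, you can count the entries in the TSV yourself. This is a rough sketch under the same one-entry-per-line assumption; `count_headwords` is a hypothetical helper, and the count a dictionary program reports may differ slightly depending on how it treats duplicate headwords.

```python
def count_headwords(lines):
    """Return (entries, distinct_headwords) for tab-separated lines."""
    headwords = []
    for line in lines:
        text = line.rstrip("\r\n")
        if "\t" in text:
            # Everything before the first tab is taken as the headword.
            headwords.append(text.split("\t", 1)[0])
    return len(headwords), len(set(headwords))


if __name__ == "__main__":
    # Made-up sample standing in for the contents of destination.tsv.
    sample = ["a\t[a] first\n", "b\t[be] second\n", "a\t[a] third\n", "no tab\n"]
    entries, distinct = count_headwords(sample)
    print(entries, "entries,", distinct, "distinct headwords")
```

If the program shows far fewer headwords than this count, the conversion is the more likely culprit than the TSV.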
Old 09-26-2022, 12:07 PM   #89
pzack
Connoisseur
pzack began at the beginning.
 
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
Hello, M. Sarmat89,

You must forgive my lack of knowledge here but how do I identify a tab?

When I open the tsv file in VIM in terminal I see ^M ^M ^M before a headword; the headwords are not on separate lines. I see ^M or ^M ^M throughout the text but not the 3 ^M's before headwords. You have paragraphs containing the headwords.

In notepad++ I don't see this; there are no ^M markings, headwords are separate but I don't know if they are tabbed.

Can you tell me how to put the stardict files into goldendict and stardict for testing? I have goldendict on my windows machine.

Cordially,
pz
Old 09-26-2022, 12:39 PM   #90
DNSB
Bibliophagist
DNSB ought to be getting tired of karma fortunes by now.
 
Posts: 24,369
Karma: 111805467
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by pzack View Post
Hello, M. Sarmat89,

You must forgive my lack of knowledge here but how do I identify a tab?

In notepad++ I don't see this; there are no ^M markings, headwords are separate but I don't know if they are tabbed.
In notepad++, under View => Show Symbol, if you enable show white space and tab or show all characters, you will see an arrow pointing right for a tab. If you select show all characters, you will also see the EOL markers. Depending on which line-ending convention the file uses, you will see LF (Linux EOL), CRLF (Windows EOL) or CR (old Mac EOL).
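
The ^M that Vim shows is a carriage return (CR), so the same information can be obtained by counting the line-ending bytes directly. The following is a minimal sketch, with `count_line_endings` as a hypothetical helper name; it reports how many CRLF, lone CR and lone LF endings are in the data.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CR and lone LF bytes in raw file contents."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf  # CRs that are not part of a CRLF pair
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    return {"CRLF": crlf, "CR": cr, "LF": lf}


if __name__ == "__main__":
    # Made-up bytes standing in for the contents of destination.tsv.
    sample = b"mot\t[mo] n.m.\r\nsuite\rautre\t[otr] adj.\n"
    print(count_line_endings(sample))
```

To check the real file, read it in binary mode, for example `open("destination.tsv", "rb").read()`. A large lone-CR count would suggest that stray carriage returns survived the unfolding step inside the entries.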