#76 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good afternoon, M Sarmat89,
Thank you for your message. Please confirm this: you wanted me to go back to the original text file and use it with your line of perl code. The perl code that you provided is:
perl -pe "s:^([^[]+?) *(?=\[):\1\t:" <your-file-here >destination.tsv
The resulting converted file is what I am to use with pyglossary. According to you, this should have converted to the three correct StarDict files. I was getting question marks on two of the files. KOReader was seeing the dictionary but not finding the word searched. Please kindly confirm the procedure that you recommended and I will try this again.
Cordially, pz
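For anyone who wants to see what that substitution does without running perl, here is a rough Python sketch of the same pattern. It assumes each entry sits on one line and that the headword is everything before the first "["; the function name and the sample entries are made up for illustration and are not taken from the real dictionary file.

```python
import re

# Same pattern as the perl one-liner: text from the start of the line up to
# the first "[", minus any spaces just before the bracket.
HEADWORD_BEFORE_BRACKET = re.compile(r"^([^\[]+?) *(?=\[)")


def add_tab(line: str) -> str:
    """Insert a tab between the headword and the first '[' on the line."""
    return HEADWORD_BEFORE_BRACKET.sub(lambda m: m.group(1) + "\t", line, count=1)


if __name__ == "__main__":
    # Hypothetical sample entries, only to show the effect.
    for sample in ["abaisser [abese] v. tr.", "a line with no bracket"]:
        print(repr(add_tab(sample)))
```

A line that contains no "[" comes out unchanged, so it ends up without a tab; those are the lines that would still need manual attention.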
#77 | |
Evangelist
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
|
Quote:
Then, after fixing the missing tabs manually and converting the file, you should have a working dictionary, which you can test with StarDict or GoldenDict.
|
#78 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good evening M. Sarmat89,
Thank you for your response. Yes, I may have forgotten about unfolding the lines. However, this file is over 6000 pages of text, and in Notepad there are over 500,000 lines. I can't go through this file correcting tabs, and I don't know where to place the tab. The big question I have is: would I have at least a partially functioning dictionary if I don't perform the manual corrections? Would I be able to search and find a good portion of the words? Or will pyglossary give me a corrupt StarDict dictionary if I don't make the tab corrections? You seem to be suggesting that the manual corrections must be made before conversion.
Cordially, pz
#79 |
Evangelist
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
|
Yes, you should have most words in. PyGlossary will tell you which lines didn't get the tab correctly.
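To get a list of the problem lines before running the conversion, something like the following Python sketch could be used. It only assumes that a good line has a tab somewhere in it; the function name and the sample lines are invented for the example, and "destination.tsv" is the file name used earlier in this thread.

```python
from typing import Iterable, List, Tuple


def lines_without_tab(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every non-empty line that has no tab."""
    missing = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text and "\t" not in text:
            missing.append((number, text))
    return missing


if __name__ == "__main__":
    # Hypothetical sample; for a real run, pass an open file instead, e.g.
    # open("destination.tsv", encoding="utf-8").
    sample = ["abaisser\t[abese] v. tr.\n", "continuation with no tab\n", "\n"]
    for number, text in lines_without_tab(sample):
        print(f"line {number}: {text[:60]}")
```

The length of the returned list gives a quick idea of how much manual fixing would be left.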
|
#80 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Hello, M. Sarmat89,
Thank you for confirming my message. I plan to fiddle with this over the weekend and will let you know what I come up with. I imagine that pyglossary is going to report quite a lot of lines with missing tabs, which will still be a big problem for me if I choose to correct them. If the conversion gives me at least a somewhat serviceable dictionary, I might be satisfied with that. I guess the rest I can do at a leisurely pace. I certainly wouldn't bother with manual corrections if pyglossary won't even convert to a half-way usable dictionary. But, so far, I don't have a dictionary.
Cordially, pz
#81 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good morning M. Sarmat89,
For some reason I thought that the full text file was a plain text (.txt) file; in fact, and I did not realise this until today, the file is shown as an .html file. I can see this file in Notepad, so I assumed that it was a .txt file. Does this make a difference to the conversion procedure that you outlined, or do we first have to convert the HTML to a text file? If it does need conversion to a text file, do you know how this may be done under Linux? Sorry for the possibly grand screw-up on my part.
Cordially, pz
Last edited by pzack; 09-23-2022 at 10:02 PM.
#82 |
Evangelist
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
|
If you can see it as text in a text editor, it is a text file and not an HTML one.
|
#83 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Hello, M. Sarmat89,
I believe I followed your guide correctly; however, I still had corrupt index files even though pyglossary built the StarDict files. I looked again at the code M. Markismus gave me, the four lines of perl code that built the csv file. If you look at his 9/11 message, you have this after the boxed lines of code:
Problem
So what's the problem? You now have an article with the key '©' that has a quite new meaning. Apparently, there are articles that have subsections separated from the main article in the same way that articles are separated.
Stardict
Using my script I've added to the txt-file a csv-extension and ran it using
Code: perl pocketbookdic.pl zymogène.S-delimiter .txt.csv fr '|,|'
The result in both the xml- and zipped binary form are also uploaded.
____________________________________________________________________
I don't know what this is or whether this operation should have been done; nobody commented on it. Some files generated by the above code are listed there. I don't know whether any of these files were meant to be built for the pyglossary conversion. In any case, when I installed the StarDict files this time, KOReader did not even open the dictionary on a common word search.
Very cordially, pz
#84 |
Evangelist
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
|
Again, what exactly did you do? All commands, please.
|
#85 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Hello, M. Sarmat89,
Thank you for responding today, Sunday. I first used the four lines of perl code from Markismus's post #25 to unfold the lines of the original full text file. Next, I ran your perl code on the output file built from those four lines of code, output4.csv:
perl -pe "s:^([^[]+?) *(?=\[):\1\t:" <your-file-here >destination.tsv
Afterwards, I used that tsv file in pyglossary for conversion to StarDict. Did I miss a step? Is the line-unfolding code correct?
Cordially, pz
#86 |
Evangelist
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
|
I suppose you used "< output4.csv" instead of "< your-file-here"?
Were there any error messages from pyglossary? When you import the created dictionary in GoldenDict, how many articles are shown there? To check the unfolding, just open the resulting file and check for line breaks.
#87 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good evening M Sarmat89,
Thank you for responding. I did use output4.csv in your line of code to produce the tsv file. I couldn't count the error messages; pyglossary doesn't show the number of errors, but there were quite a few having to do with lines with no tabs. I suppose you mean when I imported the dictionary into StarDict (under KOReader), and not GoldenDict? I am sorry, but what do you mean by articles? Do you mean words, headwords? Are you referring to the csv file when you say to check the unfolding, and how do I check for line breaks? What should I see?
Cordially, pz
#88 |
Evangelist
Posts: 431
Karma: 2146264
Join Date: Nov 2015
Device: none
|
When you open "destination.tsv", you should see some very long lines with tabulation after the first word.
You can use any available StarDict or GoldenDict on your PC to check the dictionary before putting it on the device. Check the headword count to tell whether the conversion succeeded.
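The headword count can also be read without installing a viewer: a StarDict dictionary ships with a small plain-text .ifo file that lists a "wordcount" value alongside the .idx and .dict files. The Python sketch below is only a rough illustration of reading it; the function name and the sample content are made up, and it assumes the .ifo file is simple key=value lines after a first header line.

```python
from typing import Dict


def parse_ifo(text: str) -> Dict[str, str]:
    """Parse the key=value lines of a StarDict .ifo file into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # the first header line has no "=" and is skipped
            info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    # Hypothetical .ifo content; for a real run, read the .ifo file that
    # pyglossary wrote, e.g. open("mydict.ifo", encoding="utf-8").read().
    sample = (
        "StarDict's dict ifo file\n"
        "version=2.4.2\n"
        "bookname=Test\n"
        "wordcount=3\n"
        "idxfilesize=42\n"
    )
    print(parse_ifo(sample).get("wordcount"))
```

If the number is far below the number of entries expected from the source text, many lines probably never got their tab.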
#89 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Hello, M. Sarmat89,
You must forgive my lack of knowledge here, but how do I identify a tab? When I open the tsv file in Vim in a terminal, I see ^M ^M ^M before a headword; the headwords are not on separate lines. I see ^M or ^M ^M throughout the text, but the run of three ^M's appears only before headwords. So there are paragraphs containing the headwords. In Notepad++ I don't see this: there are no ^M markings and the headwords are on separate lines, but I don't know if they are tabbed. Can you tell me how to put the StarDict files into GoldenDict and StarDict for testing? I have GoldenDict on my Windows machine.
Cordially, pz
#90 |
Bibliophagist
Posts: 24,369
Karma: 111805467
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
In Notepad++, under View => Show Symbol, if you enable Show White Space and TAB or Show All Characters, you will see a right-pointing arrow for each tab. If you select Show All Characters, you will also see the EOL markers. Depending on which line endings the file uses, you will see LF (Linux EOL), CRLF (Windows EOL) or CR (old Mac EOL).
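The ^M that Vim displays is a carriage return (CR), so a file showing stray ^M's most likely mixes line-ending styles. As a quick check outside an editor, a Python sketch along these lines could count each kind of ending; the function name and the sample bytes are invented for the example.

```python
from typing import Dict


def count_line_endings(data: bytes) -> Dict[str, int]:
    """Count CRLF pairs, lone CR and lone LF in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # carriage returns not followed by LF
        "LF": data.count(b"\n") - crlf,  # line feeds not preceded by CR
    }


if __name__ == "__main__":
    # Hypothetical sample; for a real run, read the file in binary mode, e.g.
    # open("destination.tsv", "rb").read().
    sample = b"first\r\nsecond\nthird\rfourth\r\n"
    print(count_line_endings(sample))
```

A file with a single consistent style will show a non-zero count in only one of the three slots.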
|