#1 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
conversion pyglossary pdf
Good afternoon,
I need help converting a standalone PDF dictionary that I use on my e-reader to a StarDict dictionary for use under KOReader. I have tried converting a full-text file of this PDF (I did not create it), but pyglossary gives me a boatload of no-tab errors, and the StarDict files that it creates from this txt file are empty. I also have an XML file, but I cannot get pyglossary to convert it to StarDict even though pyglossary is supposed to support .xml. Can anyone suggest ways to convert this PDF to a StarDict dictionary? I know of no option in pyglossary that would convert PDF to StarDict. Perhaps there is another conversion tool; I am fishing for a way to do this. The dictionary would be much more useful to me as a StarDict dictionary. Cordially, pz |
#2 | |
Grand Sorcerer
Posts: 5,762
Karma: 24088559
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
You might want to ask about converting the xml version in the Index of Custom Dictionaries for Kobo eReader thread. MR member Markismus might be able to help you, because he often converts non-standard dictionary files to Pocketbook dictionaries. Last edited by Doitsu; 09-07-2022 at 05:25 PM. |
|
#3 |
Guru
Posts: 962
Karma: 149907
Join Date: Jul 2013
Location: Rotterdam
Device: HiSenseA5ProCC, OnyxNotePro, Note5, Kobo Glo, Aura
|
XML is a language for data storage, not a dictionary format, so you can't expect pyglossary to support just any XML file. However, you can put the XML file online and post a link to it. Maybe you're lucky.
Suggestions: The PDF-(2-EPUB-)2-HTML-2-StarDict tool isn't there yet, and probably never will be. The problem is that the nice styling of a PDF puts a lot of extra code in there that has to be differentiated from the words and definitions. Optically that is easy, but not code-wise. You could try to get ABBYY FineReader to recognize it and specify the output format as a spreadsheet or CSV. However, even ABBYY's output will still have a lot of noise that you'll have to deal with. What is the name of the dictionary? Maybe it's already available in a nicer format than PDF. |
#4 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good morning M. Markismus,
Thank you for taking the time to respond to my query. I mentioned .xml because it is a supported format in pyglossary (according to GitHub) for conversion; however, the XML file that I have is not converting. I am not sure if this file actually contains the whole dictionary anyway. The only thing that I can think of is to convert the full-text file that I have, but it is not tab-delimited. When I look at this file in Notepad, I see that the headword is not separated out: it is the leading word, but it is part of the definition, which is a paragraph. Pyglossary asks for a tab-delimited file, citing no-tab errors as it was converting; it produced the three StarDict files from my text file, but they were empty. I did not create the text and XML files. If there is a way to do a mass conversion of the text file, that is, to get the leading headword separated out (and I think that this is what is meant by tab-delimiting a file), then pyglossary may correctly convert the text file. It is almost there but needs the headword separated from the definition. However, I admit that I don't fully understand the structure of a tab-delimited file. I have seen something about dumping the text into Excel or another spreadsheet to build a tab-delimited file but, unfortunately, I have zero experience with and knowledge of spreadsheets. The dictionary has over 100,000 words and I certainly cannot do it manually. And then there is the file converter "penelope", but I don't know if there is any help in that direction. Cordially, pz Last edited by pzack; 09-09-2022 at 12:49 PM. |
#5 |
Guru
Posts: 962
Karma: 149907
Join Date: Jul 2013
Location: Rotterdam
Device: HiSenseA5ProCC, OnyxNotePro, Note5, Kobo Glo, Aura
|
@pzack Why don't you post a link to the non-tab-delimited file?
If what you're saying turns out to be correct, then all you would need is to prefix each line with a repetition of the first word and a delimiter. Sed could do that on Linux; any pattern substitution in Perl, Python, Awk or Lua could do it as well. You could even do it in Excel: first column your line, second column the LEFT function, third column a concatenation of both column values with a delimiter in between. Last edited by Markismus; 09-09-2022 at 01:07 PM. |
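The "prefix each line with its first word and a delimiter" idea can be sketched in Python roughly as follows. This is only an illustration under the assumption that every non-blank line is one complete entry whose first word is the headword; the function names and the sample text are made up for the example and do not come from the thread.

```python
def add_headword_column(line: str, delimiter: str = "\t") -> str:
    """Return the line prefixed with its first word and a delimiter.

    Assumption: the first whitespace-separated word is the headword.
    """
    stripped = line.strip()
    if not stripped:
        return ""  # nothing to index on an empty line
    headword = stripped.split(maxsplit=1)[0]
    return f"{headword}{delimiter}{stripped}"


def convert_lines(lines):
    """Yield tab-delimited lines, skipping blank ones."""
    for line in lines:
        converted = add_headword_column(line)
        if converted:
            yield converted


if __name__ == "__main__":
    # Hypothetical sample lines, only to show the shape of the output.
    sample = ["apple a round fruit", "", "pear a sweet fruit"]
    for out in convert_lines(sample):
        print(out)
```

For a real file, the same functions could be fed the lines of the opened text file, with the results written to a new file. Multi-word headwords or definitions that span several lines would need a different rule for finding where the headword ends.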
#6 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good afternoon M. Markismus,
Thank you for your quick response. As I indicated, I don't know how to work with Excel and spreadsheets. However, you have suggested some other possibilities for tab-delimiting the text file. May I impose upon you to give me an example of how I might do this with one of the tools that you listed? Please choose the one that is simplest to work with. Please understand that I am not a programmer and I am shaky when working with scripts, but I can work in a Linux terminal. Your example could be short and sweet. I figured that there may be a way to do this, and I did see a script for converting this kind of file to tab-delimited, but I can't find it; it was a short script for use on Linux. Please let me see what you come up with before I try a new thread on a tab-delimited conversion. I think, thanks to you, that we may be headed in the right direction. And here's hoping that once the file is converted (if it can be done), pyglossary will cooperate and give me a StarDict dictionary! Cordially, pz |
#7 |
Guru
Posts: 962
Karma: 149907
Join Date: Jul 2013
Location: Rotterdam
Device: HiSenseA5ProCC, OnyxNotePro, Note5, Kobo Glo, Aura
|
I already wrote it out with the Excel example. What prevents you from posting a link to the text file? If it's small, you could even zip it and upload it here.
|
#8 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
M. Markismus,
I want to add to my just-sent reply to you that, though I don't fully understand the structure of tab-delimited text files, I assume that pyglossary needs the headword as a hook on which to hang the definition. My sense is that the tab delimiting isolates or sets apart the headword so that pyglossary sees it as the headword and can build its index or pointers to the headword. This is how I understand it, but it is purely conjecture on my part. If I am correct, then I need an app, maybe among the ones that you have listed for me, to isolate (or tab?) the headword, which is the first word of each of the paragraphs that include the headword and definition. There are spaces between the paragraphs of text. There are no illustrations in the text file. I would need the syntax to instruct the app to tab-delimit the first word, which is the headword. Maybe this helps to clarify things. pz |
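The structure being guessed at here can be illustrated with a small Python sketch: one entry per line, with the headword before the first tab and the definition after it. This is not pyglossary's actual code, only an assumed model of a tab-delimited reader; the function names and sample entries are hypothetical.

```python
from typing import Optional, Tuple


def parse_tab_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one line into (headword, definition) at the first tab.

    Returns None when the line has no tab, which is the situation a
    "no tab" error complains about.
    """
    line = line.rstrip("\n")
    if "\t" not in line:
        return None
    headword, definition = line.split("\t", 1)
    return headword, definition


def load_entries(lines):
    """Collect entries into a dict and count the lines that had no tab."""
    entries = {}
    skipped = 0
    for line in lines:
        parsed = parse_tab_line(line)
        if parsed is None:
            skipped += 1
            continue
        headword, definition = parsed
        entries[headword] = definition
    return entries, skipped


if __name__ == "__main__":
    entries, skipped = load_entries(["libro\tbook", "rana frog"])
    print(entries, skipped)
```

Under this model, a file in which no line contains a tab yields zero entries, which would be consistent with the empty StarDict files described earlier in the thread.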
#9 | |
Grand Sorcerer
Posts: 5,762
Karma: 24088559
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
It also supports compiling and decompiling Babylon BGL dictionaries. The Babylon glossary source file syntax, which supports inflections, is very simple:
Code:
#stripmethod=keep
#sametypesequence=h
#bookname=Spanish-English Dictionary

libro|libros
<p>single line definition of 'libro' (may contain html 3.2 tags, e.g <br>)</p>

rana|ranas
<p>single line definition of 'rana' (may contain html 3.2 tags, e.g <br>)</p>
|
|
#10 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Good evening, Doitsu
Thank you for responding. My dictionary is not in BGL format, so I don't think that the StarDict editor is useful here. Actually, I tried this editor and, like pyglossary, it threw up countless no-tab errors on the full-text file and gave me empty StarDict files. Thank you for the Excel example, but I don't understand Excel. Looking again at the text file, it is laid out like this: headword, space, [pronunciation of headword], space, definition. In other words, the bracketed pronunciation (in the International Phonetic Alphabet) is what separates what follows from the next headword and bracket. So whatever follows the bracketed pronunciation of a headword pertains to that headword until the next bracketed pronunciation with its headword just before it. Now, where would one set the tab that would separate one headword and bracket from the next headword and bracket? Again, I am trying to understand the workings of a tab-delimited file and what pyglossary is looking for. Cordially, pz |
#11 |
Bibliophagist
Posts: 47,927
Karma: 174315098
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
If you can't post the whole text file here as an attachment to your message, then snip a chunk of text and post that. It'll make looking at your issues a lot simpler.
To attach the file, either use the paperclip next to the smiley icon at the top of the message entry box or the Manage Attachments in the Attach files box below the message entry box. A .txt file is limited to 1MB but you can attach a .zip file of up to 20MB. |
#12 |
Fanatic
Posts: 531
Karma: 2268308
Join Date: Nov 2015
Device: none
|
It should be simple.
Get yourself an editor with regex support, like Notepad++ or VSCode. Replace
Code:
^([^[]+?) *(?=\[)
with
Code:
\1\t |
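For anyone who prefers a script to an editor, the same substitution can be sketched with Python's `re` module. This is a hedged adaptation rather than the exact expression from the post: the bracket inside the character class is escaped and newlines are excluded so that a match cannot run across lines, and the function name is invented for the example.

```python
import re

# Headword = everything from the start of a line up to the spaces that
# precede the first "[" on that line. The lookahead leaves the bracket itself
# in place, so only the spaces are replaced by a tab.
HEADWORD_BEFORE_BRACKET = re.compile(r"^([^\[\n]+?) *(?=\[)", flags=re.MULTILINE)


def tab_after_headword(text: str) -> str:
    """Put a tab between each line's leading headword and its "[...]" part."""
    return HEADWORD_BEFORE_BRACKET.sub(r"\1\t", text)


if __name__ == "__main__":
    print(tab_after_headword("cours [kur] n.m. definition text"))
```

Lines without any bracket are left untouched. A continuation line that happens to contain a bracket later on would also receive a tab, so the result is worth spot-checking before it is fed to a converter.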
#13 |
Grand Sorcerer
Posts: 5,762
Karma: 24088559
Join Date: Dec 2010
Device: Kindle PW2
|
Sarmat beat me to the answer.
#14 |
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
|
Dear Sarmat89,
Thank you for the information and for responding. The code is Greek to me. Here is an example of how my text file looks (I did not build this file):
cours [kur] n.m. definition..........
..........
..........
coursier [kursje] n.m. definition..........
..........
Thus, you have headword, space, [pronunciation], gender, definition. The definitions can be in separate paragraphs, sometimes several paragraphs in a long definition, and it is, I think, the bracketed pronunciation with its headword before it that delimits the definitions. If I understand tab-delimiting correctly, then the headword and brackets would get a tab, but I don't know where to place the tab or how to actually tab the text. There are over 100,000 words with definitions (6,000-plus pages), so the program has to run through the file placing the tab somewhere. Or tabs? If your example of code applies here, how would you plug the actual format into this code, that is, what represents what in your code when looking at my example? I don't know what regex is or how it works. I have Notepad++ under Windows 11, and I have never formatted a text file, least of all built a tab-delimited file.
I assume that the problem in pyglossary is getting the headword with the brackets tabbed so that StarDict can find the word. Very cordially, pz |
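Because the definitions described here can run over several paragraphs, a line-by-line substitution alone would leave the continuation paragraphs without a headword. The sketch below shows one possible way to fold them together. It assumes that an entry starts on any line of the form "headword [pronunciation] ..." and that every other non-blank line continues the previous entry. The names are hypothetical, and joining paragraphs with a single space is a simplification chosen for the example.

```python
import re

# A line that opens an entry: headword, spaces, then a bracketed pronunciation.
ENTRY_START = re.compile(r"^(?P<headword>[^\[\]\t]+?) +(?P<rest>\[[^\]]+\].*)$")


def group_entries(lines):
    """Yield (headword, definition) pairs, merging continuation lines."""
    headword = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines only separate paragraphs
        match = ENTRY_START.match(line)
        if match:
            if headword is not None:
                yield headword, " ".join(parts)
            headword = match.group("headword")
            parts = [match.group("rest")]
        elif headword is not None:
            parts.append(line)  # continuation of the current definition
        # Text before the first headword is ignored in this sketch.
    if headword is not None:
        yield headword, " ".join(parts)


def to_tab_lines(lines):
    """Return one 'headword<TAB>definition' string per entry."""
    return [f"{hw}\t{definition}" for hw, definition in group_entries(lines)]


if __name__ == "__main__":
    sample = [
        "cours [kur] n.m. first paragraph",
        "",
        "second paragraph",
        "coursier [kursje] n.m. another definition",
    ]
    for out in to_tab_lines(sample):
        print(out)
```

The main weakness of this rule is that a continuation paragraph which itself looks like "words [something] ..." would be taken for a new entry, so a sample of the real file is needed to judge whether the assumption holds.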
#15 |
Guru
Posts: 962
Karma: 149907
Join Date: Jul 2013
Location: Rotterdam
Device: HiSenseA5ProCC, OnyxNotePro, Note5, Kobo Glo, Aura
|
Dear pzack,
This is not working. The example given reiterates the problem as you've described it, but it is not a sample. We have already given you multiple solutions to that problem, but they don't seem to help you. Zip the text file, post it on a file hoster such as Dropbox or pCloud, and share the link. Maybe we can help you if you stop repeating the same information. |
Tags |
pyglossary |