Old 09-11-2022, 07:55 AM   #25
Markismus
So the problem is of course in the assumptions.

Conversion to csv-file
I've used Sublime Text 3, because it supports Perl-style regex. Other editors differ slightly in their regex implementation, but a bit of googling will turn up the differences. I've also included the equivalent perl commands.

If you use the following substitutions in order, you get a csv-file.

Find --> Replace ALL, e.g. perl -0777 -pe 's/\n\n+/||/g' (-0777 makes perl read the whole file at once; line by line, '\n\n+' can never match)
'\n\n+' --> '||' , masking of the lines separating articles
'\n' --> ' ' , removal of the <EOL>-characters inside an article
'\|\|' --> '\n' , insertion of <EOL>-character at the end of an article. The article is now on 1 line.
'^(\S+)' --> '$1|,|$1' , repeating the first word and introducing a delimiter, e.g. |,|. The reason for such a complex delimiter is that it will not occur naturally in an article.
'^(\S+)' --> '$1,' , appending a comma to the first word to split it off

The last two replacements are alternatives.
I've attached the original text file and the intermediate results.
You can recreate them with these commands:
Code:
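# -0777 reads the whole file at once, so \n\n+ can match across lines
perl -0777 -pe 's/\n\n+/||/g' < original.txt > output1.txt
# join the remaining lines inside each article
perl -pe 's/\n/ /g' < output1.txt > output2.txt
# put every article on its own line
perl -pe 's/\|\|/\n/g' < output2.txt > output3.txt
# append a comma to the first word of each line
perl -pe 's/^(\S+)/$1,/' < output3.txt > output4.csv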
The final result in classic csv format looks like this:
Code:
zymogène, [zims3en] adj. (de zymo- et de -gène, du gr.gennân, engendrer, produire ; 1888, Larousse, comme qualificatif d’une substance qui produit un ferment soluble, par une transformation spontanée ; sens actuel, 1964, Larousse). Pouvoir zymogène, propriété des cellules de fabriquer leurs propres enzymes ; propriété des glandes spécialisées de produire les enzymes néces- saires à l'organisme.
©, n. m. (1964, Robert). Précurseur inactif d'un enzyme. (Syn. PROENZYME.)
zymotechnie, [zimotekni] n. f. (de zymo- et de -fechnie, du gr. tekhné, art [manuel], industrie, métier ; 1762, Acad.). Art de produire et de diriger une fermentation.
zymotechnique, [zimoteknik] adj. (de zymotechnie ; 1872, Littré). Qui se rapporte à la zymotechnie.
zymotique, [zimotik] adj. (gr. zumôtikos, propre à faire fermenter, de zumôtos, fer- menté, dér. de zumoün, faire fermenter, de zum, levain ; 1855 [d'après Robert, 1977], puis 1868, Souviron, 585). Qui se rapporte aux ferments solubles.
zythum, {zitsm] ou zython [zit5] n.m. (lat. zythum, bière, boisson faite avec de l'orge, du gr. zuthos, décoction d'orge, bière ; 1710, Richelet — additions — [zythum], et 1923, Larousse [zython]). Bière que les Égyptiens préparaient avec de l’orge fermentée.
Problem
So what's the problem? You now have an article with the key '©', which has taken on quite a new meaning. Apparently, some articles have subsections that are separated from the main article in the same way that articles are separated from each other.

Stardict
I added a csv extension to the txt file and ran my script on it using
Code:
perl pocketbookdic.pl 'zymogène.S-delimiter .txt.csv' fr '|,|'
The results, in both xml and zipped binary form, are also uploaded.

The screen output (with '$isTestingOn = 1;' in the script) is like this:
[Attached screenshot: Screenshot from 2022-09-11 14-32-07.png]
Attached Files
zymogène.txt (1.3 KB)
n-||.txt (1.2 KB)
n- .txt (1.2 KB)
n.txt (1.2 KB)
zymogène.S-delimiter .txt (1.3 KB)
zymogène.S-, .txt (1.2 KB)
zymogène.S-delimiter .txt_reconstructed.zip (1.8 KB)
zymogène.S-delimiter .txt_reconstructed.xml (2.2 KB)

Last edited by Markismus; 09-11-2022 at 08:50 AM.