Old 09-27-2012, 03:45 PM   #13
SIRSteiner
Nameless Being
 
Quote:
Originally Posted by rkomar
May I ask, what is it that makes this project take so long for you? Is it the downloading of the entries, transforming them to some format that can be converted to PB .dic format,...? The reason I ask is that I presume that the Wiktionary entries are constantly being updated on the website, and I wonder how much work it would be to keep your version in sync with them.
The entries on en.wiktionary.org consist of many, many templates, which is why they are not easy to parse. I took the 2.6 GB dump and applied some simple filters to reduce it to 900 MB. Now I process every title that has English content by sending its source, e.g. "{{en-noun|tylari}}", to wiktionary.org with a script (Linux wget) to get the "translated" (expanded) template, e.g. "tylarus (plural tylari)".
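
For anyone who wants to try the same round trip, here is a minimal sketch in Python instead of wget. It assumes the request goes to the MediaWiki API (action=parse) on en.wiktionary.org; the post does not say which URL the script actually calls, so the endpoint, the function names and the User-Agent string are all illustrative. The API returns HTML, so the tags still have to be stripped to get plain text like "tylarus (plural tylari)".

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public MediaWiki API endpoint of en.wiktionary.org.
API_URL = "https://en.wiktionary.org/w/api.php"


def build_expand_url(wikitext, title):
    """Build the request URL that asks the server to render one snippet.

    `title` is the headword the snippet belongs to; templates such as
    en-noun use the page title to fill in the singular form.
    """
    params = {
        "action": "parse",
        "text": wikitext,
        "title": title,
        "contentmodel": "wikitext",
        "prop": "text",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def render_template(wikitext, title):
    """Send the snippet to the server and return the rendered HTML."""
    request = urllib.request.Request(
        build_expand_url(wikitext, title),
        # Hypothetical client name; use something that identifies your script.
        headers={"User-Agent": "dict-builder-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]["*"]
```

Calling render_template("{{en-noun|tylari}}", "tylarus") makes one HTTP request per entry, which is exactly why the whole run is slow.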

It gets more complex because of the different entry types, special characters, tables, etc. Therefore I look at the results and run some logical tests to find faults.
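
The kind of logical test meant here could look roughly like the sketch below. The specific checks (empty result, leftover braces, leftover HTML tags, unbalanced parentheses) are only assumptions about what a fault might look like, not the actual tests used for this project, and find_faults is a made-up name.

```python
import re


def find_faults(rendered):
    """Return a list of suspicious findings for one converted entry."""
    faults = []
    if not rendered.strip():
        faults.append("empty result")
    # Braces left over mean a template was not expanded.
    if "{{" in rendered or "}}" in rendered:
        faults.append("unexpanded template")
    # Anything that still looks like a tag was not stripped.
    if re.search(r"<[^>]+>", rendered):
        faults.append("leftover HTML tag")
    if rendered.count("(") != rendered.count(")"):
        faults.append("unbalanced parentheses")
    return faults
```

Entries that come back with a non-empty list can then be set aside and checked by hand.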

I searched the internet but couldn't find a parser for the dump that would do the job offline. The only way I see to speed it up is to create a local mirror of en.wiktionary.org.

That's why it takes a long time.

Regards
Ronny