Quote:
Originally Posted by xxyzz
Hi, jhowell:
The `content` string from `convert_to_json_content()` contains footnote numbers, which cause spaCy to mark the words around a number as a single named entity. For example, in "Viktor Chebrikov.69 Gorbachev", 69 is the footnote number and "Gorbachev" is the first word of the next sentence.
Could you please split a paragraph that contains a footnote number into separate paragraphs (and also remove the number), if this feature doesn't take too much time to implement?
Luckily, the KFX format supports a semantic indicator that shows whether text is a footnote reference. It is set as long as the conversion process recognizes a properly formatted footnote.
Using that, I can add an option to replace the characters that make up a footnote link with spaces, so that your example "Viktor_Chebrikov.69_Gorbachev" would become "Viktor_Chebrikov.___Gorbachev" (underscores shown here in place of spaces).
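To illustrate the idea, here is a rough sketch of replacing footnote characters with spaces. It is not the actual conversion code: `blank_spans` and the `(start, end)` span list are hypothetical names, and the sketch simply assumes the footnote positions are already known from the footnote indicator.

```python
def blank_spans(text, spans):
    """Replace each (start, end) character range in text with spaces.

    Hypothetical helper: `spans` stands in for the positions of footnote
    references, however they were detected. `end` is exclusive.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = " "
    return "".join(chars)


content = "Viktor Chebrikov.69 Gorbachev"
start = content.index("69")  # located by hand here, for the example only
cleaned = blank_spans(content, [(start, start + 2)])
print(repr(cleaned))
```

Because every removed character is swapped for a space, the string keeps its length, so any character offsets computed on the original text should still line up with the cleaned text.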
Hopefully that will be enough to solve the problem for you. If there are other cases of numbers appearing in text that cause problems for spaCy, you may need to come up with other ways to recognize them and filter them out.
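As one possible starting point for such a filter, here is a heuristic sketch that blanks a short run of digits attached directly to sentence-ending punctuation and followed by a capitalized word. The pattern and the name `blank_trailing_numbers` are my own assumptions, not part of any tool, and it will miss some cases and could wrongly match others, so check it against your own books.

```python
import re

# Heuristic, an assumption rather than a rule: 1-3 digits that directly follow
# ".", "!" or "?" (where the punctuation itself follows a non-digit,
# non-space character) and are followed by whitespace and a capital letter.
FOOTNOTE_LIKE = re.compile(r"(?<=[^\d\s][.!?])\d{1,3}(?=\s+[A-Z])")


def blank_trailing_numbers(text):
    """Replace footnote-like digits with the same number of spaces."""
    return FOOTNOTE_LIKE.sub(lambda m: " " * len(m.group()), text)


print(repr(blank_trailing_numbers("Viktor Chebrikov.69 Gorbachev")))
print(repr(blank_trailing_numbers("Pi is 3.14 In practice")))
```

The extra check on the character before the punctuation is there so that decimals such as 3.14 are left alone.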