I also use sed and other tools for regex search and replace, but my approach relies on heuristics rather than fixed scripts, because the output of the OCR program varies with the input. In my opinion, chapter detection is best done manually for each file. Once you have seen the pattern in the HTML file, though, you can batch search and replace elements such as chapter headings, page breaks, etc.
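To illustrate the batch step, here is a minimal sketch in Python. It assumes you have already inspected the file by hand and found that the OCR program writes chapter titles as standalone paragraphs like `<p>CHAPTER IV.</p>`. That pattern, the `<h2>` target, and the names `CHAPTER_RE` and `mark_chapters` are all hypothetical; your OCR output will probably need a different regex.

```python
import re

# Assumed pattern (hypothetical): a paragraph containing only the word
# "chapter" followed by a Roman or Arabic numeral and an optional period.
# Adjust this after looking at what your OCR program actually produced.
CHAPTER_RE = re.compile(r"<p>\s*(CHAPTER\s+[IVXLC\d]+)\.?\s*</p>", re.IGNORECASE)


def mark_chapters(html: str) -> str:
    """Rewrite paragraphs that look like chapter titles as <h2> headings."""
    return CHAPTER_RE.sub(r"<h2>\1</h2>", html)


sample = "<p>CHAPTER I.</p>\n<p>It was a dark night.</p>\n<p>Chapter 2</p>"
print(mark_chapters(sample))
```

The same idea should carry over to page breaks or other recurring elements: find the pattern by eye first, then write one substitution per element and check a few results before running it over the whole file.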