Old 06-02-2016, 04:42 PM   #1
gmcclusky
Nameless Being
 
Request for Recipe...or suggestions on best approach

The Goal
Convert several hundred PDF and/or docx files into what looks like a single WordPress XML export file. Optionally, convert each one into an individual WordPress XML export file, or just into plain TXT files.

Why?
I have hundreds of PDFs that I would like to turn into blog entries in Squarespace. Squarespace supports importing from WordPress XML files, and this would GREATLY speed up the process of getting the text from the PDFs into blog entries in Squarespace.

Where I Am in the Conversion Process
I have figured out how to bulk convert PDFs into TXT files using Calibre and to remove a minimal amount of unwanted text and HTML tags. However, I have not been able to reformat the resulting text into the WordPress XML export format. I know which code from the WP XML export file I need to add to my Calibre conversion output, but I don't know whether it is possible to bulk convert the PDFs and append each conversion to a single XML output file that follows the WordPress export format. If I end up with a perfectly formatted XML file for each individual PDF, that would still be great, and a huge time saver. Also acceptable would be TXT or HTML versions of my PDFs that could be opened and manually copied and pasted into a new blog entry on Squarespace.
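For the "append everything into one file" part, here is a minimal Python sketch that runs outside Calibre, after the bulk conversion. It assumes the converted files sit in a folder (the name `converted` is hypothetical) and that a bare-bones WXR skeleton (`rss` > `channel` > `item` with `title`, `content:encoded`, `wp:post_type` and `wp:status`) is enough for the importer. Whether Squarespace needs more fields than these has not been checked, and `build_wxr`, `posts_from_folder` and the `admin` author are made-up names for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespaces used by WordPress export (WXR 1.2) files.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (NS[prefix], tag)


def build_wxr(posts, site_title="Imported documents"):
    """Build one WXR-style tree from an iterable of (title, body_html) pairs."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for post_id, (title, body_html) in enumerate(posts, start=1):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, q("dc", "creator")).text = "admin"  # assumed author
        # ElementTree escapes the markup (&lt;b&gt;...) instead of using CDATA;
        # an XML parser reads both forms as the same text.
        ET.SubElement(item, q("content", "encoded")).text = body_html
        ET.SubElement(item, q("wp", "post_id")).text = str(post_id)
        ET.SubElement(item, q("wp", "status")).text = "publish"
        ET.SubElement(item, q("wp", "post_type")).text = "post"
    return ET.ElementTree(rss)


def posts_from_folder(folder, pattern="*.txt"):
    """Yield (file name without extension, file text) for each converted file."""
    for path in sorted(Path(folder).glob(pattern)):
        yield path.stem, path.read_text(encoding="utf-8")


# Usage (hypothetical folder and output names):
# tree = build_wxr(posts_from_folder("converted"))
# tree.write("import.xml", encoding="utf-8", xml_declaration=True)
```

Each file becomes one `item`, so the same function covers both cases: pass all files for a single combined export, or call it once per file for individual exports.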

The PDFs are unfortunately copyrighted content, so I have included a sample in which I changed the text portions to lorem ipsum or generic data. Essentially, I want to cherry-pick a few short chunks of text from these PDFs and apply basic HTML tags (such as bold, h1, and line breaks) to the output file, so that the text is perfectly formatted HTML that is ready to copy and paste, or to import from a WP XML file, with minimal editing on the Squarespace side.

In Calibre conversion terms, I would like to know:
1. How best to eliminate all unwanted text and HTML tags in the source doc.
2. How to add all relevant WordPress XML tags/code to the Calibre output file (TXT format), wrapping the WP tags around the text so it looks like it was created by exporting from WordPress. (Open any WP XML export file and you can see the format it follows.)
3. How to keep just the text I want, apply HTML formatting to the sentences I keep in the Calibre output file, and have that text placed within the appropriate tags in the resulting WordPress XML file. See example below:

Using Calibre's Search and Replace wizard, here is what Calibre "sees" in a sample source document:

<!-- created by calibre's pdftohtml -->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<title>Microsoft Word - Antifoam.docx</title>

<meta name="generator" content="pdftohtml 0.36"/>
<meta name="author" content="author"/>
<meta name="date" content="2012-03-21T08:56:38+00:00"/>
</head>
<body bgcolor="#A0A0A0" vlink="blue" link="blue">
<a id="1"></a><img src="index-1_1.jpg"/><br>
ZZ<br>
Z Z <br>
X YZ<br>
Z YX<br>
X Q<br>
Q W<br>
V ZX<br>
Q WYXZ<br>
Z X<br>
X XYZ<br>
Y Y<br>
Q Q<br>
Z Z,<br>
Y YY<br>
Q Z. <br>
123 Main Street <br>
City, ST 00000-0000 <br>
www.domain.com <br>
<b>Corporate:</b> 000-555-1212 <br>
<b>Fax: </b>000-555-1212 <br>
<b>East:</b> 000-555-1212 <br>
<b>Type of Document</b><br>
Date: 01/31/16 <br>
<br>
Supersedes: 01/31/15 <br>
<b>PRODUCT #: 12345</b> <br>
<i><b>PRODUCT TITLE</b></i><br>
Product Subtitle<br>
<i><b>Product Description:</b></i> <br>
Lorem ipsum<br>
lorem ipsum. <i><b>PRODUCT TITLE </b></i>lorem ipsum. <br>
lorem ipsum.<i><b> </b></i><br>
<br>
<br>
<i><b>Product Directions:</b></i> <br>
Lorem ipsum<i><b>PRODUCT TITLE</b></i> lorem ipsum <br>
lorem ipsum<br>
Lorem ipsum<br>
lorem ipsum. <br>
<br>
<br>
<i><b>Product Specifications:</b></i> <br>
<br>
<br>
<b>Product Appearance:</b> <br>
Lorem ipsum <br>
<br>
<br>
<b>Density:</b> <br>
lorem ipsum<br>
<br>
<br>
<b>Product Ingredients: </b><br>
None <br>
<br>
<b>(lorem ipsum)</b> <br>
<br>
<b>Product Warnings:</b> <br>
Lorem ipsum <br>
<br>
<br>
<br>
<br>
<br>
<br>
Legal disclaimer line 1<br>
legal disclaimer line 2 <br>
legal disclaimer line 3<br>
<hr/>
</body>
</html>


--------------------------------------------------------------------------------


////END OF SAMPLE DOCUMENT AS CALIBRE SEES IT
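As a rough illustration of points 1 and 3 against the sample above, here is a small Python sketch that keeps only selected sections and emits clean HTML. It is a post-processing script, not a Calibre feature, and it leans on assumptions taken from this one sample: a section starts at a line holding nothing but a bold label ending in a colon, and a section ends at the next such label or after three blank lines in a row (which happens to drop the legal disclaimer). The names `WANTED`, `MAX_BLANKS`, `extract_sections` and `to_post_html`, and the choice of `<h2>`/`<p>` tags, are made up for the example.

```python
import re
from html import escape, unescape

# A heading line holds only a bold label ending in a colon, e.g.
# <i><b>Product Description:</b></i> <br>
HEADING = re.compile(r"^(?:<i>)?<b>([^<]+?):\s*</b>(?:</i>)?\s*<br\s*/?>\s*$")
TAG = re.compile(r"<[^>]+>")

# Sections to keep (assumed; adjust to taste).
WANTED = ("Product Description", "Product Directions", "Product Warnings")
# Assumed heuristic: this many blank lines in a row end the current section.
MAX_BLANKS = 3


def extract_sections(raw_html, wanted=WANTED):
    """Return {label: [text lines]} for the wanted sections, tags removed."""
    sections = {}
    current = None
    blanks = 0
    for line in raw_html.splitlines():
        line = line.strip()
        match = HEADING.match(line)
        if match:
            label = match.group(1).strip()
            current = label if label in wanted else None
            if current:
                sections.setdefault(current, [])
            blanks = 0
            continue
        if current is None:
            continue
        # Replace tags with spaces, decode entities, collapse whitespace.
        text = " ".join(unescape(TAG.sub(" ", line)).split())
        if text:
            sections[current].append(text)
            blanks = 0
        else:
            blanks += 1
            if blanks >= MAX_BLANKS:
                current = None
    return sections


def to_post_html(sections):
    """Render each kept section as a heading followed by one paragraph."""
    parts = []
    for label, lines in sections.items():
        parts.append("<h2>%s</h2>" % escape(label))
        parts.append("<p>%s</p>" % escape(" ".join(lines)))
    return "\n".join(parts)
```

Lines such as `<b>Corporate:</b> 000-555-1212 <br>` and `<b>PRODUCT #: 12345</b> <br>` are not treated as headings because text follows the colon, so the address block and product number are skipped along with everything else outside the wanted sections.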

If I have been unclear, please let me know. Hopefully you get the gist of what I am asking for. Thanks in advance.