libprs500 - title/author matching regex
Megatron-UK 03-31-2008, 12:08 PM I've just started playing with libprs500 (0.4.46) in preparation for a Sony PRS505 I have on the way, and I'm having a spot of bother getting the standard regex to correctly identify the author and title from the filename.
The standard syntax, I believe, is: (?P<author>.+) - (?P<title>[^_]+)
If I paste the string "H.P. Lovecraft - At the Mountains of Madness.txt" into the test box, it correctly reports the following:
Title: "At the Mountains of Madness"
Author: "H.P. Lovecraft"
Series: "No Match"
Series Index: "No Match"
However, actually importing that same file into the library displays the following:
Title: "H.P. Lovecraft - At the Mountains of Madness"
Author: "H.P. Lovecraft"
(all other columns are blank as expected)
Is this standard behaviour or a bug?
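For reference, here is a minimal sketch of what that pattern does to a filename. It assumes Python's `re` module (the `(?P<name>...)` syntax is Python's) and assumes the extension is stripped before matching; `parse_filename` is a hypothetical helper, not a libprs500 function.

```python
import os
import re

# The pattern quoted in the post; the group names come from the post itself.
PATTERN = re.compile(r"(?P<author>.+) - (?P<title>[^_]+)")

def parse_filename(filename):
    """Return (author, title) from a filename, or (None, None) if no match.

    Assumption: the extension is removed before the pattern is applied.
    """
    stem, _ext = os.path.splitext(os.path.basename(filename))
    match = PATTERN.search(stem)
    if match is None:
        return None, None
    return match.group("author"), match.group("title")

author, title = parse_filename("H.P. Lovecraft - At the Mountains of Madness.txt")
print(author)
print(title)
```

The greedy `.+` means that a name containing several " - " separators is split at the last one, so everything before the final separator lands in the author group.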
Megatron-UK 03-31-2008, 01:07 PM Upon further investigation, it only seems to do this with PDF documents; the author and title fields seem to map correctly for html, zip and text-based files.
So if I rename a pdf, an html file, a text file and a zip all to the same name:
wibble - wobble.[pdf|zip|txt|html]
...then the html, text and zip version of the file will all correctly display as title="wobble", author="wibble".
However the pdf file will show as title="wibble - wobble" and author="wibble".
kovidgoyal 03-31-2008, 01:29 PM libprs500 tries to read metadata from the file itself first. Only if that fails does it use the filename.
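The rule described here can be sketched roughly as follows. This is only an illustration of "file metadata first, filename second", not libprs500's actual code: `read_embedded_metadata` and `get_metadata` are hypothetical names, and the reader is a placeholder that always fails.

```python
import os
import re

FILENAME_PATTERN = re.compile(r"(?P<author>.+) - (?P<title>[^_]+)")

def read_embedded_metadata(path):
    """Hypothetical placeholder for a format-specific reader.

    Returns a dict on success, or None when nothing could be read.
    """
    return None

def get_metadata(path, reader=read_embedded_metadata):
    # Step 1: prefer whatever the file itself declares.
    embedded = reader(path)
    if embedded is not None:
        return embedded
    # Step 2: only on failure, fall back to the filename pattern.
    stem = os.path.splitext(os.path.basename(path))[0]
    match = FILENAME_PATTERN.search(stem)
    if match is None:
        return {"author": None, "title": stem}
    return {"author": match.group("author"), "title": match.group("title")}
```

Under this rule, a reader that "succeeds" with poor values prevents the filename pattern from ever being consulted.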
Megatron-UK 03-31-2008, 01:38 PM Is this right though? I've attached an example of the difference in behaviour with the same filename for three different file types. There is no metadata set in the PDF file.
Megatron-UK 03-31-2008, 02:17 PM Ok, after digging a bit: would I be correct in thinking that pdf-meta.exe is used to determine the author and title of PDF documents?
Running pdf-meta on my renamed document I get the following:
pdf-meta.exe author\ -\ title.pdf
Title : author - title
Author : Unknown
Publisher: None
Category : None
Comments : None
ISBN : None
It looks like libprs500 is taking the Title as shown by pdf-meta and not running the regex to split it based on the filename. I have a whole load of PDF docs with metadata in varying states of correctness, and I'd rather load them into libprs500 using the filenames to determine author and title.
Other than using pdftk and writing a script to recurse through all of my files to insert metadata based on the filename, can we force libprs500 to use the filename instead, even for PDFs?
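One way the observed result (title "wibble - wobble", author "wibble") could arise is a per-field fallback in which the PDF reader's default title, the file stem, counts as real metadata. The sketch below is a guess at that mechanism for illustration only. `merge_fields` is a hypothetical name, and treating "Unknown" as a missing value is an assumption.

```python
import re

FILENAME_PATTERN = re.compile(r"(?P<author>.+) - (?P<title>[^_]+)")

def merge_fields(embedded, stem):
    """Per-field fallback: keep the embedded value, else use the filename match."""
    match = FILENAME_PATTERN.search(stem)
    from_name = match.groupdict() if match else {}
    result = {}
    for field in ("title", "author"):
        value = embedded.get(field)
        # Assumption: None and "Unknown" both mean "no value was read".
        if value in (None, "Unknown"):
            value = from_name.get(field)
        result[field] = value
    return result

# Values as pdf-meta reported them for the renamed PDF: the title defaults
# to the file stem, so it never looks "missing".
embedded = {"title": "wibble - wobble", "author": "Unknown"}
print(merge_fields(embedded, "wibble - wobble"))
```

The author falls through to the filename match, while the defaulted title masks it, which matches the symptom reported for PDF files.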
kovidgoyal 03-31-2008, 02:43 PM Open a ticket for a config option to customize this behavior.
Megatron-UK 03-31-2008, 03:27 PM I've recursed through all of my PDF documents and run the following script:
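#!/bin/bash
# Set PDF Author/Title metadata from filenames; expects paths of the form ./DIR/FILE.pdf
find . -name "*.pdf" -print | grep -v '\.pdf\.new' | while IFS= read -r PDFPATH
do
    DIR=$(echo "$PDFPATH" | awk -F/ '{print $2}')
    FILE=$(echo "$PDFPATH" | awk -F/ '{print $3}')
    # Split the name on "-" and trim the spaces around each field
    AUTHOR=$(echo "$FILE" | awk -F\- '{print $1}' | sed 's/ *$//')
    VAR2=$(basename "$FILE" .pdf | awk -F\- '{print $2}' | sed 's/ *$//' | sed 's/^ //')
    VAR3=$(basename "$FILE" .pdf | awk -F\- '{print $3}' | sed 's/ *$//' | sed 's/^ //')
    if [ "$VAR3" = "" ]
    then
        # Two fields: AUTHOR - TITLE
        TITLE=$VAR2
        SERIES=""
    else
        # Three fields: AUTHOR - SERIES - TITLE
        TITLE=$VAR3
        SERIES=$VAR2
    fi
    # Write the pdftk info file (lines kept flush left so no stray spaces are added)
    echo "InfoKey: Author
InfoValue: $AUTHOR
InfoKey: Title
InfoValue: $TITLE" > ./metadata
    pdftk "$DIR"/"$FILE" update_info metadata output "$DIR"/"$FILE".new
done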
This correctly sets the PDF metadata, based on my known-good filename format of:
AUTHOR - SERIES - TITLE.pdf
or
AUTHOR - TITLE.pdf
However... libprs500 is still displaying the PDF files that I have correctly set the metadata on in the form of "author - title". Almost as if it is ignoring both the metadata *and* the filename regex pattern matching altogether and simply using the filename, minus the pdf extension.
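The filename convention used by the script (AUTHOR - SERIES - TITLE.pdf or AUTHOR - TITLE.pdf) can also be sketched in Python. This is a rough equivalent rather than a port: it splits on " - " instead of every "-", a deliberate change so that hyphenated words survive. `split_name` and `info_text` are hypothetical helper names.

```python
def split_name(filename):
    """Return (author, series, title); series is "" when absent."""
    stem = filename[:-4] if filename.lower().endswith(".pdf") else filename
    parts = [part.strip() for part in stem.split(" - ")]
    author = parts[0]
    if len(parts) >= 3:
        # AUTHOR - SERIES - TITLE
        series, title = parts[1], parts[2]
    elif len(parts) == 2:
        # AUTHOR - TITLE
        series, title = "", parts[1]
    else:
        series, title = "", ""
    return author, series, title

def info_text(author, title):
    """Build the InfoKey/InfoValue text passed to pdftk's update_info."""
    return (
        "InfoKey: Author\n"
        f"InfoValue: {author}\n"
        "InfoKey: Title\n"
        f"InfoValue: {title}\n"
    )

author, series, title = split_name("E. E. Doc Smith - Lensman 1 - Triplanetary.pdf")
print(info_text(author, title))
```

As in the script, the series is parsed but not written, since only the Author and Title keys are set.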
kovidgoyal 03-31-2008, 03:28 PM What does pdf-meta give you on the corrected PDF files?
Megatron-UK 03-31-2008, 03:38 PM pdf-meta now shows the correct author, but the title is still the filename minus the extension. e.g.
megatron@elderthing:/cygdrive/y/resources/Books/pdf books $ pdf-meta.exe author\ -\ title.pdf
Title : author - title
Author : Unknown
Publisher: None
Category : None
Comments : None
ISBN : None
megatron@elderthing:/cygdrive/y/resources/Books/pdf books $ pdf-meta.exe author\ -\ title.pdf.new
Title : author - title.pdf
Author : author
Publisher: None
Category : None
Comments : None
ISBN : None
On the corrected PDF file, it looks suspiciously like pdf-meta is silently dropping the extension and treating the basename as the title - the metadata certainly doesn't show title as being "author - title.pdf" when I view it in Acrobat.
kovidgoyal 03-31-2008, 04:18 PM Attach one of these PDF files here
Megatron-UK 04-01-2008, 03:16 AM Ok, will do that when I get back in from work.
Megatron-UK 04-01-2008, 12:25 PM Ok, this is a version of Douglas Adams' HHGTTG. Not a great version, but that's not relevant.
Original version
megatron@elderthing:/cygdrive/y/resources/Books/pdf books/Douglas Adams $ pdf-meta.exe Douglas\ Adams\ -\ The\ Hitch\ Hikers\ Guide\ To\ The\ Galaxy.pdf
Title : Douglas Adams - The Hitch Hikers Guide To The Galaxy
Author : Unknown
Publisher: None
Category : None
Comments : None
ISBN : None
Corrected metadata version
megatron@elderthing:/cygdrive/y/resources/Books/pdf books/Douglas Adams $ pdf-meta.exe Douglas\ Adams\ -\ The\ Hitch\ Hikers\ Guide\ To\ The\ Galaxy.pdf.new
Title : Douglas Adams - The Hitch Hikers Guide To The Galaxy.pdf
Author : Douglas Adams
Publisher: None
Category : None
Comments : None
ISBN : None
I had to put an extra ".pdf" on the end of the corrected version in order to upload it.
JSWolf 04-01-2008, 12:29 PM I've had to remove the attachments as they are of a copyrighted book. Please use the libprs500 website's ticket system to attach them there.
Megatron-UK 04-01-2008, 01:20 PM My apologies. Here's one that's now in the public domain: E. E. Smith's 'Triplanetary'.
No metadata to start with. Metadata added with the following command:
pdftk E.\ E.\ Doc\ Smith\ -\ Lensman\ 1\ -\ Triplanetary.pdf update_info metadata output E.\ E.\ Doc\ Smith\ -\ Lensman\ 1\ -\ Triplanetary_new.pdf
The metadata input file is trivial:
megatron@curse:/export/Apps and Resources/resources/Books $ cat metadata
InfoKey: Author
InfoValue: Mr NotaRealName
InfoKey: Title
InfoValue: This is a test document for libprs500
Then check the metadata with pdf-meta:
megatron@elderthing:/cygdrive/y/resources/Books $ pdf-meta.exe E.\ E.\ Doc\ Smith\ -\ Lensman\ 1\ -\ Triplanetary_new.pdf
Title : E. E. Doc Smith - Lensman 1 - Triplanetary_new
Author : Mr NotaRealName
Publisher: None
Category : None
Comments : None
ISBN : None
The Author is displayed correctly, but the Title should be "This is a test document for libprs500"... (as shown in the screengrab of Acrobat below). libprs500 therefore still displays the incorrect Title.
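To cross-check what is actually stored in the file, independently of pdf-meta and Acrobat, the Info dictionary can be dumped as text and parsed. The sketch below assumes the text comes from something like `pdftk file.pdf dump_data`, which prints InfoKey/InfoValue pairs in the same layout as the update_info input. `parse_info` is a hypothetical helper, and the sample text is the metadata file from this post, not real pdftk output.

```python
def parse_info(dump_text):
    """Collect InfoKey/InfoValue pairs into a dict, ignoring other lines."""
    info = {}
    key = None
    for line in dump_text.splitlines():
        if line.startswith("InfoKey:"):
            key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and key is not None:
            info[key] = line[len("InfoValue:"):].strip()
            key = None
    return info

# Same layout as the metadata input file shown above.
SAMPLE = """InfoKey: Author
InfoValue: Mr NotaRealName
InfoKey: Title
InfoValue: This is a test document for libprs500
"""

stored = parse_info(SAMPLE)
print(stored["Title"])
```

If the dumped Title is the test string while pdf-meta still reports the file stem, the fault lies in how the title is read rather than in how it was written.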
kovidgoyal 04-01-2008, 03:27 PM Fixed in svn
Megatron-UK 04-01-2008, 03:39 PM Excellent :-)