Loading 5000 Technical Papers in PDF format Advice

dFGJByjm4898IssG · 06-17-2012, 12:23 AM

I need some advice on how to load my technical directory into Calibre. I have over 5000 technical papers in PDF format. The files are all named by the paper number, for example, SPE000123456. Calibre loads all the files, extracts the paper title and the authors from the metadata, and creates the Calibre library.

The problems I am trying to resolve are:

Calibre no longer keeps the original file name, which contains the paper number, a major reference that I need to find papers etc. So how can I incorporate the paper number when I load the files into Calibre?

Calibre generates the authors from the metadata, but the result contains non-author information, for example:

Y. Cheng, SPE, West Virginia University
K.H. Coats, Coats Engineering; C.H. Whit - authors truncated

So how can I clean this up, or import the files in a "clean" state?

Any advice would be appreciated.
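
One possible way to tidy author strings like the two shown above, before or after import, is a small script. This is only a sketch: it assumes authors are separated by ";" and that everything after the first comma in each entry is affiliation text, which may not hold for every paper. The function name clean_authors is made up for illustration, and a name that was already truncated in the metadata stays truncated.

```python
def clean_authors(raw):
    """Split a raw author string on ';' and keep only the text before the first comma."""
    names = []
    for entry in raw.split(";"):
        # Assumption: "Name, affiliation, ..." so the name is the first comma-separated part.
        name = entry.split(",", 1)[0].strip()
        if name:
            names.append(name)
    return names


print(clean_authors("Y. Cheng, SPE, West Virginia University"))
print(clean_authors("K.H. Coats, Coats Engineering; C.H. Whit"))
```

Names written as "Surname, Initials" would be cut in the wrong place by this rule, so it is worth checking a sample by hand first.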

Divingduck · 06-17-2012, 02:37 AM

Build a new library. Open the setup in the Add books dialog (pic 1) and set the correct entries: leave "Read metadata from file contents rather than file name" unchecked (pic 2) and enter a matching regular expression. You can test it by entering a live example under File name, e.g. "Dirty, And Quick - This is an example.pdf", and pressing Test. You will then see what happens to your metadata.

dFGJByjm4898IssG · 06-17-2012, 05:19 AM

Divingduck,

Thank you for replying.

But these are mutually exclusive: you can get the data either from the metadata or from the file name, yes?

What I want is to pick up the Title and Authors data from the metadata and at the same time pick up the paper number from the file name, so that I can store the paper number, either as a sequence or as a file name, in Calibre. Is there a way to do this?

Again, thanks for replying.

Divingduck · 06-17-2012, 06:26 AM

No, you can do both. The trick is to use another field for your file name.

Let's look at an example. Your file name is "10201155.pdf".
In the PDF you have defined a title and an author. For this example I use the series field. You should create a custom column in your library to store the document number.

Go to the import dialog and enter (?P<series>.+) as the regular expression, check that "Read metadata from file contents rather than file name" is selected, and apply the change. When you test it in the same window with the file name "10201155.pdf", you will see "10201155" in the series field. Now add one book as a test (before running it, check that the metadata in the file are in the correct place).
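
To see what a named group such as (?P<series>.+) captures, the same pattern can be tried in plain Python. This is a minimal sketch, not calibre's own code: it assumes the extension is dropped before matching (as the test result "10201155" above suggests), and the helper name fields_from_filename is made up for illustration.

```python
import os
import re

# The pattern from the post: everything in the file name goes to the "series" group.
pattern = re.compile(r"(?P<series>.+)")


def fields_from_filename(filename):
    """Return a dict of named groups captured from the file name without its extension."""
    stem = os.path.splitext(filename)[0]  # assumption: the extension is not part of the match
    match = pattern.search(stem)
    return match.groupdict() if match else {}


print(fields_from_filename("10201155.pdf"))
```

Each group name becomes a key in the result, which is the idea behind mapping parts of the file name to metadata fields.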

dFGJByjm4898IssG · 06-17-2012, 10:29 AM

That is very nice, and works like a charm for series. However, when I created a custom column called "reference" and use (?P<reference>^.{12}) to get the paper number it does not work.

theducks · 06-17-2012, 11:56 AM

Quote:

Originally Posted by dFGJByjm4898IssG

That is very nice, and works like a charm for series. However, when I created a custom column called "reference" and use (?P<reference>^.{12}) to get the paper number it does not work.

Custom columns (when referenced) all start with a hashmark (#)

Hover the mouse pointer over the column title.

Divingduck · 06-17-2012, 03:11 PM

Oh, I forgot to mention this.
@ theducks, thank you for completing this explanation.

dFGJByjm4898IssG · 06-19-2012, 11:35 AM

Still not working. The test file is

spe00095428 The Application of Cutoffs in Integrated Reservoir Studies.pdf

If I use (?P<title>.{8}) I get 00095428 in the title field which is what I want.

I then create a Custom Column called Reference with a look up name of #reference

I then change regular expression to (?P<#reference>.{8}) and I get

calibre, version 0.8.56
ERROR: Unhandled exception: <b>error</b>:bad character in group name

Traceback (most recent call last):
File "site-packages\calibre\gui2\preferences\main.py", line 324, in commit
File "site-packages\calibre\gui2\preferences\adding.py", line 124, in commit
File "site-packages\calibre\gui2\widgets.py", line 149, in commit
File "site-packages\calibre\gui2\widgets.py", line 146, in pattern
File "re.py", line 190, in compile
File "re.py", line 242, in _compile
error: bad character in group name

What am I doing wrong?

I appreciate very much your help.

theducks · 06-19-2012, 11:41 AM

Quote:

Originally Posted by dFGJByjm4898IssG

I then create a Custom Column called Reference with a look up name of #reference

Just checking: the lookup name used during column creation does NOT start with a #.
The # is used when YOU refer to a custom column name.

BTW I don't use more than the basic import template, so I am not much help there.

chaley · 06-19-2012, 01:33 PM

Calibre does not support using custom columns in the metadata extraction template regular expression. You must use one of the supported fields listed in the test box. Perhaps publisher is one you don't otherwise need so can use. After importing, you would use bulk metadata search/replace to copy the value to your custom column.
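
The "bad character in group name" error in the traceback above comes from Python's re module itself: a named group must be a valid Python identifier, so a name starting with # is rejected when the pattern is compiled. A short sketch that reproduces this outside calibre (the helper name compiles is made up for illustration):

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r"(?P<#reference>.{8})"))  # group name contains '#'
print(compiles(r"(?P<publisher>.{8})"))   # plain identifier as group name
```

So even apart from which fields calibre supports, a lookup name with # cannot appear inside (?P<...>).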

Divingduck · 06-19-2012, 01:49 PM

Yes, this is the way. I thought it was possible to do it in one step, but that didn't work. As chaley mentions, you need a second step to move the data to the right place.
While doing this, you might add a second custom column that indicates the completeness of your metadata, so that you know which books you have already finished. I do this with a yes/no column.

chaley, thanks for helping out.

Edit: Here is an example of how to quickly copy a value from one field to another.

I use your regex (?P<publisher>.{8}) to put the extracted information into the Publisher metadata field.
File name: spe00095428 calibre User Manual — calibre User Manual.pdf
Publisher will become "spe00095428"
In my pic below I imported the file twice. Then mark the imported books, click "Edit metadata", select the Search and replace tab, and choose publisher as the 'Search field' and your custom field as the 'Destination field' (here mine is named '#alt_title'). After performing the change you will find the data in the right place. You can then do a second edit to clean up the publisher field.

Be careful and test it on an example first. When you do it with a larger number of books, be sure you have selected the books you want to change. You can't reverse this action.
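
If the value that lands in the publisher field is not quite the part of the file name you wanted, it can help to try the pattern in plain Python first. The sketch below assumes the pattern is applied with a search over the file name without its extension, which is an assumption about calibre's behaviour; the result in calibre's own test box is what counts. The two stricter patterns are suggestions, not something from the posts above.

```python
import re

# Test file name from the thread, without the ".pdf" extension.
stem = "spe00095428 The Application of Cutoffs in Integrated Reservoir Studies"

# Any eight characters: the first match starts at the beginning of the name.
loose = re.search(r"(?P<publisher>.{8})", stem)

# Only the eight digits that follow a leading "spe" prefix.
digits_only = re.search(r"^spe(?P<publisher>\d{8})", stem, re.IGNORECASE)

# The prefix and the eight digits together.
with_prefix = re.search(r"^(?P<publisher>spe\d{8})", stem, re.IGNORECASE)

print(loose.group("publisher"))        # spe00095
print(digits_only.group("publisher"))  # 00095428
print(with_prefix.group("publisher"))  # spe00095428
```

Anchoring on the prefix and using \d{8} makes the capture independent of what follows the paper number in the file name.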

dFGJByjm4898IssG · 06-24-2012, 06:08 AM

Sorry for the late reply, I have been experimenting. :-)

I have loaded about one third of the papers, and I used the series and series-index fields to keep the paper number. Just a few questions:

1) I wanted to keep the leading zeros, but Calibre deleted them. Is there a way to keep them?

2) After the load, the Published date field contained the current date and not the date from the PDF. Any suggestions?

3) I was thinking of using the "Search Internet" plugin to load the Abstracts, DOI number etc from the OnePetro website

http://www.onepetro.org/mslib/app/Pr...ocietyCode=SPE

But I have not got it to work. I do know that I need the leading zeros though. What I would like to do is just right-click, select OnePetro, and load the information. Any pointers on this would be really appreciated.

theducks · 06-24-2012, 12:03 PM

Quote:

Originally Posted by dFGJByjm4898IssG

Sorry for the late reply, I have been experimenting. :-)

I have loaded about one third of the papers, and I used the series and series-index fields to keep the paper number. Just a few questions:

1) I wanted to keep the leading zeros, but Calibre deleted them. Is there a way to keep them?

2) After the load, the Published date field contained the current date and not the date from the PDF. Any suggestions?

3) I was thinking of using the "Search Internet" plugin to load the Abstracts, DOI number etc from the OnePetro website

http://www.onepetro.org/mslib/app/Pr...ocietyCode=SPE

But I have not got it to work. I do know that I need the leading zeros though. What I would like to do is just right-click, select OnePetro, and load the information. Any pointers on this would be really appreciated.

If you want leading zeros kept, you must use a text field. Numeric fields drop meaningless leading zeros: 001 is still equal to 1.
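
The same thing can be seen in a couple of lines of Python: once the paper number is stored as a number the zeros are gone, and they can only be put back if the intended width is known. A minimal sketch, assuming an eight-digit paper number as in the test file name earlier in the thread:

```python
paper_number = "00095428"

# Stored as a number (like a series index), the leading zeros are lost.
as_number = float(paper_number)
print(as_number)  # 95428.0

# Stored as text, nothing is lost.
as_text = paper_number
print(as_text)  # 00095428

# Zeros can be restored from the number only by assuming a fixed width (here 8).
restored = f"{int(as_number):08d}"
print(restored)  # 00095428
```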

dFGJByjm4898IssG · 06-25-2012, 11:58 AM

Okay, thanks.

I can format this another way, and I have used the "Search Internet" plugin to load the correct page for a given paper. The site address is:

http://www.onepetro.org/mslib/app/newSearch.do

And a manual address to get a paper is:

http://www.onepetro.org/mslib/app/Pr...ocietyCode=SPE

And the plugin address using the series_index is:

http://www.onepetro.org/mslib/app/Preview.do?paperNumber=SPE-{series_index:re(0$,)}-PA&societyCode=SPE

And this works.
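
For anyone who wants to build the same address outside the plugin, here is a rough Python sketch. The base URL, the "SPE-<number>-PA" layout, and the societyCode parameter are taken from the template above; whether the "-PA" suffix applies to every paper is not known, and the function name preview_url is made up for illustration.

```python
from urllib.parse import urlencode

BASE = "http://www.onepetro.org/mslib/app/Preview.do"


def preview_url(paper_number, suffix="PA", society="SPE"):
    """Build a Preview.do address from a paper number given as text or as an int."""
    number = int(paper_number)  # drops any leading zeros, as in the working template
    query = urlencode({
        "paperNumber": f"{society}-{number}-{suffix}",
        "societyCode": society,
    })
    return f"{BASE}?{query}"


print(preview_url("00095428"))
```

This only constructs the address; it does not fetch the page or read anything from it.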

My question is: how do I extract the Title and other fields from the web page and place them in Calibre?

I have also posted this question on the "Search Internet" plugin forum as well.

Many thanks for your help.

