#1 |
Member
Posts: 11
Karma: 10
Join Date: Jun 2012
Device: None
|
Loading 5000 Technical Papers in PDF format Advice
I need some advice on how to load my technical directory into Calibre. I have over 5000 technical papers in PDF format. The files are all named by paper number, for example SPE000123456. Calibre loads all the files, extracts the paper title and the authors from the metadata, and creates the Calibre library.

The problems I am trying to resolve are these. First, Calibre no longer keeps the original file name, which contains the paper number, a major reference that I need to find papers. How can I incorporate the paper number when I load the files into Calibre?

Second, Calibre generates the authors from the metadata, but the field contains non-author information, for example: Y. Cheng, SPE, West Virginia University K.H. Coats, Coats Engineering; C.H. Whit (authors truncated). How can I clean this up, or import the files in a "clean" state?

Any advice would be appreciated.
#2 |
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Build a new library. Open the setup in the add books dialog (pic 1), set the correct entries (leave "read metadata from file contents rather than file name" unchecked, pic 2), and enter a matching regular expression. You can test it by putting a live example under File name, e.g. "Dirty, And Quick - This in an example.pdf", and pressing Test. You will then see what happens to your metadata.
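The Test button can be approximated outside Calibre with Python's `re` module, which uses the same `(?P<name>...)` named-group syntax. This is only a sketch: the pattern below is a hypothetical one for file names shaped like "Author - Title.pdf", and `preview` is an illustrative helper, not a Calibre function.

```python
import re
from pathlib import Path

# Hypothetical pattern for file names shaped like "Author - Title.pdf".
PATTERN = re.compile(r"(?P<author>.+?) - (?P<title>.+)")


def preview(file_name: str) -> dict:
    """Return the fields the pattern captures from the name without extension."""
    stem = Path(file_name).stem  # drop the ".pdf" extension
    match = PATTERN.search(stem)
    return match.groupdict() if match else {}


print(preview("Dirty, And Quick - This in an example.pdf"))
```

Each named group becomes one metadata field, so changing the group names changes where the captured text ends up.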
#3 |
Member
Posts: 11
Karma: 10
Join Date: Jun 2012
Device: None
|
Divingduck,
Thank you for replying. But aren't these mutually exclusive? You can get the data either from the metadata or from the file name, yes? What I want is to pick up the Title and Authors from the metadata and, at the same time, pick up the paper number from the file name, so that I can store the paper number as a sequence or a file name in Calibre. Is there a way to do this? Again, thanks for replying.
#4 |
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
No, you can do both. The trick is to use another field for your file name.
Let's look at an example. Your file name is "10201155.pdf", and in the PDF you have defined a title and an author. I use the series field for this example. You should create a custom column in your library to store the document number. Go to the import dialog, enter (?P<series>.+) as the regular expression, check that "read metadata from file contents rather than file name" is selected, and apply the change. When you test it in the same window with the file name "10201155.pdf", you will see "10201155" in the series field. Now add one book as a test (before running, check that the metadata in the file is in the correct position).
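What the expression (?P<series>.+) captures can be checked in plain Python, assuming the match runs against the file name without its extension, as the test described above suggests. `series_from_file_name` is a hypothetical helper for illustration only.

```python
import re
from pathlib import Path
from typing import Optional


def series_from_file_name(file_name: str, pattern: str = r"(?P<series>.+)") -> Optional[str]:
    """Return the text captured by the 'series' group, or None if nothing matches."""
    stem = Path(file_name).stem  # "10201155.pdf" -> "10201155"
    match = re.search(pattern, stem)
    return match.group("series") if match else None


print(series_from_file_name("10201155.pdf"))
```

Because `.+` takes the whole name, this only suits files whose name is nothing but the document number. A narrower pattern can be passed as the second argument for names that carry extra text.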
#5 |
Member
Posts: 11
Karma: 10
Join Date: Jun 2012
Device: None
|
That is very nice, and works like a charm for series. However, when I created a custom column called "reference" and used (?P<reference>^.{12}) to get the paper number, it did not work.
Last edited by dFGJByjm4898IssG; 06-17-2012 at 10:21 AM. |
#6 | |
Well trained by Cats
Posts: 31,240
Karma: 61360164
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
#7 |
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Oh, I forgot to mention this.
@ theducks, thank you for completing this explanation. |
#8 |
Member
Posts: 11
Karma: 10
Join Date: Jun 2012
Device: None
|
Still not working. The test file is

spe00095428 The Application of Cutoffs in Integrated Reservoir Studies.pdf

If I use (?P<title>.{8}) I get 00095428 in the title field, which is what I want. I then create a custom column called Reference with a lookup name of #reference. I then change the regular expression to (?P<#reference>.{8}) and I get:

calibre, version 0.8.56
ERROR: Unhandled exception: error: bad character in group name

Traceback (most recent call last):
  File "site-packages\calibre\gui2\preferences\main.py", line 324, in commit
  File "site-packages\calibre\gui2\preferences\adding.py", line 124, in commit
  File "site-packages\calibre\gui2\widgets.py", line 149, in commit
  File "site-packages\calibre\gui2\widgets.py", line 146, in pattern
  File "re.py", line 190, in compile
  File "re.py", line 242, in _compile
error: bad character in group name

What am I doing wrong? I very much appreciate your help.
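The traceback ends in `re.py`, so the error comes from Python's regular expression compiler rather than from the custom column itself: a named group must be a valid identifier, and `#` is not allowed in one. A small sketch that reproduces this in plain Python (`compiles` is a hypothetical helper, and the exact error wording differs between Python versions):

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the pattern compiles, printing the reason when it does not."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print(f"{pattern!r} rejected: {exc}")
        return False


compiles(r"(?P<title>.{8})")       # group name is a plain identifier, so it compiles
compiles(r"(?P<#reference>.{8})")  # '#' is not allowed in a group name
```

Dropping the `#` makes the pattern compile, but that alone does not tell Calibre which field the group should fill.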
#9 | |
Well trained by Cats
Posts: 31,240
Karma: 61360164
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
The # is used when YOU refer to a custom column name. BTW, I don't use more than the basic import template, so I am not much help there.
#10 |
Grand Sorcerer
Posts: 12,525
Karma: 8065948
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Calibre does not support using custom columns in the metadata extraction template regular expression. You must use one of the supported fields listed in the test box. Perhaps publisher is one you don't otherwise need so can use. After importing, you would use bulk metadata search/replace to copy the value to your custom column.
#11 |
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
Yes, this is the way. I thought it was possible to do it in one step, but that didn't work. As charley mentioned, you need a second step to move the data to the right position.

While doing this, you might add a second custom column that indicates the completeness of your metadata, so that you know which books you have already finished. I do this with a yes/no column. Chary, thanks for helping out.

Edit: Here is an example of how to make a quick replacement from one field to another. I use your regex (?P<publisher>.{8}) to move the extracted information into the Publisher metadata field.

File name: spe00095428 calibre User Manual — calibre User Manual.pdf
Publisher will become "spe00095428"

In my pic below I imported the file two times. Then mark the imported books, click on "edit metadata", select the Search and replace tab, and choose publisher as the 'Search field' and your custom field as the 'Destination field' (in my case it is named '#alt_title'). After performing the change you will find the data in the right place. You can then do a second edit to clean up the publisher field.

Be careful and test it on an example first. When you do it with a larger number of books, be sure you have selected the books you want to change. You can't reverse this action.

Last edited by Divingduck; 06-19-2012 at 02:15 PM.
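The two-step workflow can be sketched in plain Python to make the data movement explicit. This does not call Calibre: a dict stands in for a book record, `#reference` is the custom column name from earlier in the thread, and the pattern `^(?P<publisher>\S+)` is a hypothetical alternative that captures everything up to the first space (in plain Python, `.{8}` would take only the first eight characters of the name).

```python
import re
from pathlib import Path

# Hypothetical pattern: capture the leading run of non-space characters.
PATTERN = re.compile(r"^(?P<publisher>\S+)")


def import_book(file_name: str) -> dict:
    """Step 1: mimic the import, parking the paper number in 'publisher'."""
    match = PATTERN.search(Path(file_name).stem)
    return {"publisher": match.group("publisher") if match else "", "#reference": ""}


def move_publisher_to_reference(book: dict) -> dict:
    """Step 2: mimic search and replace, copying the value and clearing 'publisher'."""
    moved = dict(book)  # work on a copy so the original record stays untouched
    moved["#reference"] = moved["publisher"]
    moved["publisher"] = ""
    return moved


book = import_book("spe00095428 The Application of Cutoffs in Integrated Reservoir Studies.pdf")
book = move_publisher_to_reference(book)
print(book)
```

Working on a copy mirrors the advice above: in Calibre the bulk change cannot be undone, so it is worth trying it on one or two books first.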
#12 |
Member
Posts: 11
Karma: 10
Join Date: Jun 2012
Device: None
|
Sorry for the late reply, I have been experimenting. :-)
I have loaded about one third of the papers, and I used the series and series-index fields to keep the paper number. Just a few questions:

1) I wanted to keep the leading zeros, but Calibre deleted them. Is there a way to keep them?
2) After the load, the Published date field contained the current date and not the date from the PDF. Any suggestions?
3) I was thinking of using the "Search Internet" plugin to load the abstracts, DOI numbers, etc. from the OnePetro website http://www.onepetro.org/mslib/app/Pr...ocietyCode=SPE but I have not got it to work. I do know that I need the leading zeros, though. What I would like to do is just right-click, select OnePetro, and load the information.

Any pointers on this would be really appreciated.
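On the first question, the zeros are most likely lost because the series index is handled as a number, and a number has no leading zeros. A minimal sketch of that effect and of padding the value back to a fixed width, assuming eight digits as in "00095428" (both function names are hypothetical):

```python
def as_series_index(paper_number: str) -> float:
    """Convert the digits to a number, which is where leading zeros disappear."""
    return float(paper_number)


def restore_paper_number(series_index: float, width: int = 8) -> str:
    """Pad the integer part back out to a fixed width with leading zeros."""
    return f"{int(series_index):0{width}d}"


index = as_series_index("00095428")
print(index, restore_paper_number(index))
```

If the zeros must be stored rather than rebuilt, a text field is the safer home for the paper number, since text is kept exactly as entered.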
#13 | |
Well trained by Cats
Posts: 31,240
Karma: 61360164
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
#14 |
Member
Posts: 11
Karma: 10
Join Date: Jun 2012
Device: None
|
Okay, thanks.
I can format this another way, and I have used the "Search Internet" plugin to load the correct page for a given paper. The site address is:
http://www.onepetro.org/mslib/app/newSearch.do
A manual address to get a paper is:
http://www.onepetro.org/mslib/app/Pr...ocietyCode=SPE
And the plugin address using the series_index is:
http://www.onepetro.org/mslib/app/Preview.do?paperNumber=SPE-{series_index:re(0$,)}-PA&societyCode=SPE
This works. My question is: how do I extract the Title and other fields from the web page and place them in Calibre? I have also posted this question on the "Search Internet" plugin forum. Many thanks for your help.
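The plugin address above follows a simple shape, which can be rebuilt in Python for checking links outside Calibre. This is a sketch only: the base address, the `paperNumber` and `societyCode` parameters, and the `-PA` suffix are copied from the post, `preview_url` is a hypothetical helper, and nothing here fetches or parses the page.

```python
from urllib.parse import urlencode

# Base address taken from the plugin URL in the post.
BASE_URL = "http://www.onepetro.org/mslib/app/Preview.do"


def preview_url(paper_number: int, suffix: str = "PA") -> str:
    """Build the preview address for a paper number without leading zeros."""
    query = urlencode({
        "paperNumber": f"SPE-{paper_number}-{suffix}",
        "societyCode": "SPE",
    })
    return f"{BASE_URL}?{query}"


print(preview_url(95428))
```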