How to grab plain (Sciencedirect) HTML?

johndoesecond · 02-01-2010, 12:58 PM

Hi all,

I'd like to save the plain HTML version of ScienceDirect articles, like this one:
http://dx.doi.org/10.1016/j.compenvurbsys.2009.06.001

I need only the article body itself (no menus, not boxed, full width) for subsequent conversion to Mobi.

Is there any tool, browser add-on, or anything similar to get just that part of the HTML? Maybe something that would allow me to select the desired part of the Web page and save it as HTML?

Thank you in advance.

Regards.

Jonas777 · 02-01-2010, 04:12 PM

There is an option to purchase the full article in pdf or html. Does it look the same as the free sample?

johndoesecond · 02-01-2010, 05:47 PM

Quote:

Originally Posted by Jonas777

There is an option to purchase the full article in pdf or html. Does it look the same as the free sample?

The HTML formatting is pretty much the same.

Here's a link that should show you a full article:

http://www.sciencedirect.com/science...b&artImgPref=F

As I said, PDF pages are just too big to fit comfortably, even on my 9.7" Kindle DX display.

Any ideas how to effectively strip that HTML box (to MOBIze it after that)?

frabjous · 02-01-2010, 07:49 PM

Here's one thought. With Firefox (I'm using 3.5), go to the site, and then highlight the part you want. You might be able to copy and paste that into a word processor, but there's a good chance that won't work out too well.

So try this instead: after selecting the part you want, right-click and choose "View Selection Source". Copy the HTML code it gives there into your favorite text editor. Precede it with:

Code:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>

follow with:

Code:

</body></html>

Save it as an HTML file, and open it in a browser to see how it looks.
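
If you'd rather script that wrapping step than paste by hand, here is a minimal Python sketch. It only does what is described above: it puts the same header in front of the copied fragment and the same closing tags after it. The names wrap_fragment and save_wrapped are made up for illustration, and it assumes the fragment is already a string you copied from "View Selection Source".

```python
from pathlib import Path

# The same header and footer as in the manual steps above.
HEAD = (
    "<html>\n"
    "<head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
    "</head>\n"
    "<body>\n"
)
TAIL = "\n</body></html>\n"


def wrap_fragment(fragment: str) -> str:
    """Return the copied HTML fragment wrapped in a minimal HTML skeleton."""
    return HEAD + fragment.strip() + TAIL


def save_wrapped(fragment: str, out_path: str) -> None:
    """Write the wrapped fragment as UTF-8, matching the charset in the meta tag."""
    Path(out_path).write_text(wrap_fragment(fragment), encoding="utf-8")
```

Calling save_wrapped(copied_html, "article.html") would then give you a file to open in a browser, exactly as with the hand-made version.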

You might lose some formatting. If it's important, go back to the page, and look at the full web page source (Ctrl-U in Firefox), find the parts that look like:

<link rel="stylesheet" ... type="text/css">

and copy them into your new .html file between the <head> </head> parts. Make sure the full URL for the CSS file is in the href="..." part. Save again.
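
The "full URL" part is the easy one to get wrong, so here is a rough Python sketch of that step. It scans the page source for stylesheet links and rewrites each href as an absolute URL relative to the page address. It uses only the standard library; stylesheet_tags and the example.com address are placeholders I made up, not anything specific to ScienceDirect.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetCollector(HTMLParser):
    """Collect the href of every <link rel="stylesheet"> as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "stylesheet" in rel and href:
            # urljoin resolves relative hrefs against the page address.
            self.links.append(urljoin(self.base_url, href))


def stylesheet_tags(page_source, base_url):
    """Return ready-to-paste <link> tags with full URLs in href."""
    collector = StylesheetCollector(base_url)
    collector.feed(page_source)
    return [
        '<link rel="stylesheet" href="%s" type="text/css">' % escape(url, quote=True)
        for url in collector.links
    ]
```

For example, stylesheet_tags(source, "http://example.com/science/article") turns href="/css/main.css" into href="http://example.com/css/main.css", and the resulting lines go between the <head> </head> parts as described.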

Worth a shot.

johndoesecond · 02-02-2010, 12:49 PM

Quote:

Originally Posted by frabjous

So try this instead: after selecting the part you want, right-click and choose "View Selection Source". Copy the HTML code it gives there into your favorite text editor. Precede it with:

Hi frabjous,

I'm not using Firefox, so I wasn't aware of this feature. (I will from now on!)

However, I'm not sure this will do the trick. The way you're suggesting will get only the HTML, but I'd also need to download all the images (JPGs etc.) linked in the document.

Any further hints?

Thanks again.

DDHarriman · 02-02-2010, 04:28 PM

Hi

I advise you to try the PDF version of the article on your DX. You can get it free from here if you want: http://198.81.200.2/science/journal/01989715 (the 3rd text).

Before using it, process it with “soPDF” (you can get the files in the forum for free), and choose “Fit 2x Width” with “White Space Cropping” for your DX. It will give you a PDF file without all the white margins and with each page cut in two, so you can read it in landscape.
That is probably enough to make the small text readable, and it will retain all the images and tables.

Here is an example of that; I can read it even on my 6” eBook reader(s).

Let me know if this was of some help.

Best regards,

frabjous · 02-02-2010, 04:33 PM

Quote:

Originally Posted by johndoesecond

However, I'm not sure this will do the trick. The way you're suggesting will get only the HTML, but I'd also need to download all the images (JPGs & C) linked in the document.

Any further hint?

I don't have a lot of time to think about this today, but here's one thought, adding one level of complexity.

Navigate to the page on that website, and then go to "Save Page As...". In the "Save As" dialog box, be sure to choose "Web Page, Complete" as the format. That will save the page as an .html file (say, science.html) and will create a folder (science_files) where it will put all the images. The only problem is that the page you just saved has all the menus and other nonsense.

So NOW open the science.html file you just saved in Firefox, highlight the part you want, view its source, and follow the procedure I outlined above. The image links will point to the copies on your hard drive rather than the remote site. So long as you save the new .html file in the same folder as you saved the original, I think calibre (or whatever) should be able to find them when you convert to .mobi.
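
If the highlight-and-view-source step gets tedious, the same cut can be sketched in Python, under a big assumption: that the article body sits inside a single element with a known id. I haven't checked what ScienceDirect actually uses, so the id "article" below is purely hypothetical, and extract_by_id is just a name I picked. The sketch also assumes reasonably well-balanced markup (every non-void tag closed) and drops comments. Because the src attributes are copied through unchanged, the images keep pointing at the science_files folder.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SubtreeExtractor(HTMLParser):
    """Re-serialise the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the wanted element
        self.done = False
        self.parts = []

    def _emit_start(self, tag, attrs):
        bits = [tag]
        for name, value in attrs:
            bits.append(name if value is None
                        else '%s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<" + " ".join(bits) + ">")

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0 and dict(attrs).get("id") != self.target_id:
            return  # still looking for the wanted element
        self._emit_start(tag, attrs)
        if tag not in VOID:
            self.depth += 1
        elif self.depth == 0:
            self.done = True  # the wanted element was itself a void tag

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID:
            return
        self.parts.append("</%s>" % tag)
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # closed the wanted element

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(escape(data, quote=False))


def extract_by_id(page_source, target_id):
    """Return the HTML of the element with the given id, or '' if absent."""
    parser = SubtreeExtractor(target_id)
    parser.feed(page_source)
    parser.close()
    return "".join(parser.parts)
```

The returned string is the fragment to put between the <body> and </body> lines from the earlier procedure.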

I'll have to test that later, however.

P.S. Didn't see DDHarriman's post. I'm a big fan of soPDF, but I'm not sure that's the way to go here. Try both methods and see what you prefer.

johndoesecond · 02-02-2010, 05:17 PM

Quote:

Originally Posted by DDHarriman

Hi

Here you have an example of that, and even in my 6” eBook reader(s) I can read it.

Let me know if this was of some help.

Best regards,

Quote:

Originally Posted by frabjous

I don't have a lot of time to think about this today, but here's one thought, adding one level of complexity.

Thanks DDHarriman and frabjous!

Both hints were useful, and will do the job, depending on the article/PDF's formatting.


