Input - getOnePage

 

Description

Queries an HTTP URL and copies the resulting page, along with any .gif or .jpg files that the page references. Links are resolved so that the page can be read from the local disk and still have working links to remote pages.

Assumes that the root URL is an HTML page.

Anything that is copied must come from the same site as the root URL.
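The same-site rule can be illustrated with a short Python sketch. This is not the PIM's actual code: the function name same_site is hypothetical, and comparing host names is only one plausible reading of "same site".

```python
from urllib.parse import urljoin, urlparse


def same_site(root_url, candidate):
    """Return True if candidate resolves to the same host as root_url.

    Assumption: "same site" means "same host name". The candidate may be
    relative, so it is resolved against the root URL first.
    """
    absolute = urljoin(root_url, candidate)
    return urlparse(absolute).netloc.lower() == urlparse(root_url).netloc.lower()
```

Under this reading, an image referenced as img/logo.gif from the root page would be copied, while an image hosted on a different server would be left as a remote reference.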

 

Configuration Variables

url
The source URL.

 

Product

Builds a VectorProduct with a root page and a list of all files that were copied.

 

How it works

Maintains a list of URLs to be fetched and a list of URLs that have already been fetched. Every fetched file is renamed and stored in a single flat directory. The PIM records each original URL together with its new local name, then scans all .html pages and rewrites: 1) copied URLs so that they point to the local copies, 2) remote URLs so that they still work when the page is viewed locally.
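The bookkeeping and rewriting steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PIM's implementation: the names flat_name and rewrite_page are hypothetical, links are found with a simple regular expression that only handles double-quoted href and src attributes, and no fetching is done. The returned dictionary plays the role of the original-URL-to-new-name record.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Assumption: only double-quoted href/src attributes are considered.
ATTR = re.compile(r'(?P<attr>\b(?:href|src))\s*=\s*"(?P<url>[^"]*)"', re.IGNORECASE)
IMAGE_EXTS = (".gif", ".jpg")


def flat_name(url, taken):
    """Pick a unique file name for url inside a single flat directory."""
    base = posixpath.basename(urlparse(url).path) or "index.html"
    name, n = base, 1
    while name in taken:
        stem, ext = posixpath.splitext(base)
        name = f"{stem}_{n}{ext}"
        n += 1
    taken.add(name)
    return name


def rewrite_page(root_url, html):
    """Return (rewritten_html, local_names) for one root page.

    local_names maps each original URL that would be copied to its new
    flat file name. Same-site images are pointed at their local names;
    every other link is made absolute so it still works from local disk.
    """
    root_host = urlparse(root_url).netloc.lower()
    taken = set()
    local_names = {root_url: flat_name(root_url, taken)}

    def replace(match):
        absolute = urljoin(root_url, match.group("url"))
        parsed = urlparse(absolute)
        is_image = parsed.path.lower().endswith(IMAGE_EXTS)
        if is_image and parsed.netloc.lower() == root_host:
            if absolute not in local_names:
                local_names[absolute] = flat_name(absolute, taken)
            target = local_names[absolute]
        else:
            target = absolute
        return f'{match.group("attr")}="{target}"'

    return ATTR.sub(replace, html), local_names
```

In this sketch the keys of local_names double as the list of URLs still to be fetched; a real implementation would also need to track which of them have already been retrieved.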

Revised: 12 January 1999