urlgrabber.grabber (version 3.1.0)
A high-level cross-protocol url-grabber.
GENERAL ARGUMENTS (kwargs)
Where possible, the module-level default is indicated, and legal
values are provided.
copy_local = 0 [0|1]
ignored except for file:// urls, in which case it specifies
whether urlgrab should still make a copy of the file, or simply
point to the existing copy. The module level default for this
option is 0.
close_connection = 0 [0|1]
tells URLGrabber to close the connection after a file has been
transferred. This is ignored unless the download happens with the
http keepalive handler (keepalive=1). Otherwise, the connection
is left open for further use. The module level default for this
option is 0 (keepalive connections will not be closed).
keepalive = 1 [0|1]
specifies whether keepalive should be used for HTTP/1.1 servers
that support it. The module level default for this option is 1
(keepalive is enabled).
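For example, you can set these as per-instance defaults and override
them on individual calls (the URL here is hypothetical):
  from urlgrabber.grabber import URLGrabber
  g = URLGrabber(keepalive=1)    # reuse connections by default
  g.urlgrab('http://example.com/a.txt', close_connection=1)
  # the close_connection kwarg closes the connection after this one
  # transfer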
progress_obj = None
a class instance that supports the following methods:
po.start(filename, url, basename, length, text)
# length will be None if unknown
po.update(read) # read == bytes read so far
po.end()
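For example, a minimal progress object might look like this sketch
(illustrative only; it is not a class shipped with urlgrabber):
  import sys
  class SimpleProgress:
      def start(self, filename, url, basename, length, text):
          self.total = length          # None if the size is unknown
          self.name = text or basename
      def update(self, read):
          # read == bytes read so far
          if self.total:
              sys.stdout.write('\r%s: %i/%i bytes'
                               % (self.name, read, self.total))
          else:
              sys.stdout.write('\r%s: %i bytes' % (self.name, read))
          sys.stdout.flush()
      def end(self):
          sys.stdout.write('\n')
  # usage:
  # urlgrab('http://example.com/f.txt', progress_obj=SimpleProgress())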
text = None
specifies an alternative text item at the beginning of the progress
bar line. If not given, the basename of the file is used.
throttle = 1.0
a number - if it's an int, it's the bytes/second throttle limit.
If it's a float, it is first multiplied by bandwidth. If throttle
== 0, throttling is disabled. If None, the module-level default
(which can be set on default_grabber.throttle) is used. See
BANDWIDTH THROTTLING for more information.
timeout = None
a positive float expressing the number of seconds to wait for socket
operations. If the value is None or 0.0, socket operations will block
forever. Setting this option causes urlgrabber to call the settimeout
method on the Socket object used for the request. See the Python
documentation on settimeout for more information.
http://www.python.org/doc/current/lib/socket-objects.html
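For example (hypothetical URL; a socket timeout surfaces as a failed
grab, which urlgrabber reports via URLGrabError):
  from urlgrabber import urlgrab
  from urlgrabber.grabber import URLGrabError
  try:
      urlgrab('http://example.com/big.iso', timeout=30.0)
  except URLGrabError, e:
      print 'download failed: %s' % e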
bandwidth = 0
the nominal max bandwidth in bytes/second. If throttle is a float
and bandwidth == 0, throttling is disabled. If None, the
module-level default (which can be set on
default_grabber.bandwidth) is used. See BANDWIDTH THROTTLING for
more information.
range = None
a tuple of the form (first_byte, last_byte) describing a byte
range to retrieve. Either or both of the values may be set to
None. If first_byte is None, byte offset 0 is assumed. If
last_byte is None, the last byte available is assumed. Note that
the range specification is python-like in that (0,10) will yield
the first 10 bytes of the file.
If set to None, no range will be used.
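For example, with a hypothetical URL, the first call below fetches
the first 10 bytes and the second everything from byte 100 on:
  from urlgrabber import urlread
  head = urlread('http://example.com/file.bin', range=(0, 10))
  rest = urlread('http://example.com/file.bin', range=(100, None))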
reget = None [None|'simple'|'check_timestamp']
whether to attempt to reget a partially-downloaded file. Reget
only applies to .urlgrab and (obviously) only if there is a
partially downloaded file. Reget has two modes:
'simple' -- the local file will always be trusted. If there
are 100 bytes in the local file, then the download will always
begin 100 bytes into the requested file.
'check_timestamp' -- the timestamp of the server file will be
compared to the timestamp of the local file. ONLY if the
local file is newer than or the same age as the server file
will reget be used. If the server file is newer, or the
timestamp is not returned, the entire file will be fetched.
NOTE: urlgrabber can do very little to verify that the partial
file on disk is identical to the beginning of the remote file.
You may want to either employ a custom "checkfunc" or simply avoid
using reget in situations where corruption is a concern.
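For example (hypothetical URL and filename):
  from urlgrabber import urlgrab
  # trust whatever portion of big.iso is already on disk
  urlgrab('http://example.com/big.iso', filename='big.iso',
          reget='simple')
  # resume only if the local copy is at least as new as the server's
  urlgrab('http://example.com/big.iso', filename='big.iso',
          reget='check_timestamp')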
user_agent = 'urlgrabber/VERSION'
a string, usually of the form 'AGENT/VERSION' that is provided to
HTTP servers in the User-agent header. The module level default
for this option is "urlgrabber/VERSION".
http_headers = None
a tuple of 2-tuples, each containing a header and value. These
will be used for http and https requests only. For example, you
can do
http_headers = (('Pragma', 'no-cache'),)
ftp_headers = None
this is just like http_headers, but will be used for ftp requests.
proxies = None
a dictionary that maps protocol schemes to proxy hosts. For
example, to use a proxy server on host "foo" port 3128 for http
and https URLs:
proxies={ 'http' : 'http://foo:3128', 'https' : 'http://foo:3128' }
note that proxy authentication information may be provided using
normal URL constructs:
proxies={ 'http' : 'http://user:password@foo:3128' }
Lastly, if proxies is None, the default environment settings will
be used.
prefix = None
a url prefix that will be prepended to all requested urls. For
example:
g = URLGrabber(prefix='http://foo.com/mirror/')
g.urlgrab('some/file.txt')
## this will fetch 'http://foo.com/mirror/some/file.txt'
This option exists primarily to allow identical behavior to
MirrorGroup (and derived) instances. Note: a '/' will be inserted
if necessary, so you cannot specify a prefix that ends with a
partial file or directory name.
opener = None
Overrides the default urllib2.OpenerDirector provided to urllib2
when making requests. This option exists so that the urllib2
handler chain may be customized. Note that the range, reget,
proxy, and keepalive features require that custom handlers be
provided to urllib2 in order to function properly. If an opener
option is provided, no attempt is made by urlgrabber to ensure
chain integrity. You are responsible for ensuring that any
extension handlers are present if said features are required.
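For example, this sketch (hypothetical proxy and URL) supplies a
plain urllib2 opener; as noted above, doing so bypasses the handlers
urlgrabber would otherwise install for range, reget, and keepalive:
  import urllib2
  from urlgrabber import urlgrab
  opener = urllib2.build_opener(
      urllib2.ProxyHandler({'http': 'http://foo:3128'}))
  urlgrab('http://example.com/file.txt', opener=opener)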
data = None
Only relevant for the HTTP family (and ignored for other
protocols), this allows HTTP POSTs. When the data kwarg is
present (and not None), an HTTP request will automatically become
a POST rather than GET. This is done by direct passthrough to
urllib2. If you use this, you may also want to set the
'Content-length' and 'Content-type' headers with the http_headers
option. Note that python 2.2 handles the case of these headers
badly, and if you do not use the proper case (shown here), your
values will be overridden with the defaults.
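For example, a hypothetical form POST (using the header case noted
above):
  import urllib
  from urlgrabber import urlread
  body = urllib.urlencode({'user': 'alice', 'action': 'login'})
  response = urlread('http://example.com/cgi-bin/login', data=body,
                     http_headers=(('Content-length', str(len(body))),
                                   ('Content-type',
                                    'application/x-www-form-urlencoded')))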
RETRY RELATED ARGUMENTS
retry = None
the number of times to retry the grab before bailing. If this is
zero, it will retry forever. This was intentional... really, it
was :). If this value is not supplied, or is supplied but is None,
retrying does not occur.
retrycodes = [-1,2,4,5,6,7]
a sequence of errorcodes (values of e.errno) for which it should
retry. See the doc on URLGrabError for more details on this. You
might consider modifying a copy of the default codes rather than
building yours from scratch so that if the list is extended in the
future (or one code is split into two) you can still enjoy the
benefits of the default list. You can do that with something like
this:
retrycodes = urlgrabber.grabber.URLGrabberOptions().retrycodes
if 12 not in retrycodes:
    retrycodes.append(12)
checkfunc = None
a function to do additional checks. This defaults to None, which
means no additional checking. The function should simply return
on a successful check. It should raise URLGrabError on an
unsuccessful check. Raising any other exception will be
considered immediate failure and no retries will occur.
If it raises URLGrabError, the error code will determine the retry
behavior. Negative error numbers are reserved for use by these
passed in functions, so you can use many negative numbers for
different types of failure. By default, -1 results in a retry,
but this can be customized with retrycodes.
If you simply pass in a function, it will be given exactly one
argument: a CallbackObject instance with the .url attribute
defined and either .filename (for urlgrab) or .data (for urlread).
For urlgrab, .filename is the name of the local file. For
urlread, .data is the actual string data. If you need other
arguments passed to the callback (program state of some sort), you
can do so like this:
checkfunc=(function, ('arg1', 2), {'kwarg': 3})
if the downloaded file has filename /tmp/stuff, then this will
result in this call (for urlgrab):
function(obj, 'arg1', 2, kwarg=3)
# obj.filename = '/tmp/stuff'
# obj.url = 'http://foo.com/stuff'
NOTE: both the "args" tuple and "kwargs" dict must be present if
you use this syntax, but either (or both) can be empty.
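For example, here is a sketch of a checker (the size threshold is
illustrative) that rejects suspiciously small downloads so that they
are retried:
  import os
  from urlgrabber import urlgrab
  from urlgrabber.grabber import URLGrabError
  def size_check(obj, min_size):
      # obj.filename is set for urlgrab calls
      if os.path.getsize(obj.filename) < min_size:
          # negative codes are reserved for checkfuncs; -1 retries
          raise URLGrabError(-1, 'file too small')
  urlgrab('http://foo.com/stuff', filename='/tmp/stuff', retry=3,
          checkfunc=(size_check, (1024,), {}))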
failure_callback = None
The callback that gets called during retries when an attempt to
fetch a file fails. The syntax for specifying the callback is
identical to checkfunc, except for the attributes defined in the
CallbackObject instance. The attributes for failure_callback are:
exception = the raised exception
url = the url we're trying to fetch
tries = the number of tries so far (including this one)
retry = the value of the retry option
The callback is present primarily to inform the calling program of
the failure, but if it raises an exception (including the one it's
passed) that exception will NOT be caught and will therefore cause
future retries to be aborted.
The callback is called for EVERY failure, including the last one.
On the last try, the callback can raise an alternate exception,
but it cannot (without severe trickiness) prevent the exception
from being raised.
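For example, a simple callback (illustrative only) that just reports
each failed attempt:
  import sys
  from urlgrabber import urlgrab
  def report_failure(obj):
      # obj carries .exception, .url, .tries and .retry (see above)
      print >> sys.stderr, 'try %s/%s for %s failed: %s' % \
            (obj.tries, obj.retry, obj.url, obj.exception)
  urlgrab('http://foo.com/stuff', retry=3,
          failure_callback=report_failure)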
interrupt_callback = None
This callback is called if KeyboardInterrupt is received at any
point in the transfer. Basically, this callback can have three
impacts on the fetch process based on the way it exits:
1) raise no exception: the current fetch will be aborted, but
any further retries will still take place
2) raise a URLGrabError: if you're using a MirrorGroup, then
this will prompt a failover to the next mirror according to
the behavior of the MirrorGroup subclass. It is recommended
that you raise URLGrabError with code 15, 'user abort'. If
you are NOT using a MirrorGroup subclass, then this is the
same as (3).
3) raise some other exception (such as KeyboardInterrupt), which
will not be caught at either the grabber or mirror levels.
That is, it will be raised up all the way to the caller.
This callback is very similar to failure_callback. They are
passed the same arguments, so you could use the same function for
both.
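For example, a callback following recommendation (2) above:
  from urlgrabber import urlgrab
  from urlgrabber.grabber import URLGrabError
  def user_abort(obj):
      # with a MirrorGroup this follows its failover behavior;
      # without one it propagates to the caller, as in case (3)
      raise URLGrabError(15, 'user abort')
  urlgrab('http://foo.com/stuff', interrupt_callback=user_abort)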
urlparser = URLParser()
The URLParser class handles pre-processing of URLs, including
auth-handling for user/pass encoded in http urls, file handling
(that is, filenames not sent as a URL), and URL quoting. If you
want to override any of this behavior, you can pass in a
replacement instance. See also the 'quote' option.
quote = None
Whether or not to quote the path portion of a url.
quote = 1 -> quote the URLs (they're not quoted yet)
quote = 0 -> do not quote them (they're already quoted)
quote = None -> guess what to do
This option only affects proper urls like 'file:///etc/passwd'; it
does not affect 'raw' filenames like '/etc/passwd'. The latter
will always be quoted as they are converted to URLs. Also, only
the path part of a url is quoted. If you need more fine-grained
control, you should probably subclass URLParser and pass it in via
the 'urlparser' option.
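For example (hypothetical URL), to hand urlgrabber a path that is
already quoted:
  from urlgrabber import urlgrab
  # the '%20' must survive as-is, so disable quoting
  urlgrab('http://foo.com/some%20file.txt', quote=0)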
BANDWIDTH THROTTLING
urlgrabber supports throttling via two values: throttle and
bandwidth. Between the two, you can either specify an absolute
throttle threshold or specify a threshold as a fraction of the
maximum available bandwidth.
throttle is a number - if it's an int, it's the bytes/second
throttle limit. If it's a float, it is first multiplied by
bandwidth. If throttle == 0, throttling is disabled. If None, the
module-level default (which can be set with set_throttle) is used.
bandwidth is the nominal max bandwidth in bytes/second. If throttle
is a float and bandwidth == 0, throttling is disabled. If None, the
module-level default (which can be set with set_bandwidth) is used.
THROTTLING EXAMPLES:
Let's say you have a 100 Mbps connection. This is (about) 10^8 bits
per second, or 12,500,000 Bytes per second. You have a number of
throttling options:
*) set_bandwidth(12500000); set_throttle(0.5) # throttle is a float
This will limit urlgrab to use half of your available bandwidth.
*) set_throttle(6250000) # throttle is an int
This will also limit urlgrab to use half of your available
bandwidth, regardless of what bandwidth is set to.
*) set_bandwidth(6250000); set_throttle(1.0) # float
Use half your bandwidth
*) set_bandwidth(6250000); set_throttle(2.0) # float
Use up to 12,500,000 Bytes per second (your nominal max bandwidth)
*) set_bandwidth(6250000); set_throttle(0) # throttle = 0
Disable throttling - this is more efficient than a very large
throttle setting.
*) set_bandwidth(0); set_throttle(1.0) # throttle is float, bandwidth = 0
Disable throttling - this is the default when the module is loaded.
SUGGESTED AUTHOR IMPLEMENTATION (THROTTLING)
While this is flexible, it's not extremely obvious to the user. I
suggest you implement a float throttle as a percent to make the
distinction between absolute and relative throttling very explicit.
Also, you may want to convert the units to something more convenient
than bytes/second, such as kbps or kB/s, etc.
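A minimal sketch of that suggestion, assuming the set_throttle and
set_bandwidth helpers used in the examples above; the percent-string
convention and the helper name are illustrative, not part of the
urlgrabber API:
  from urlgrabber.grabber import set_throttle, set_bandwidth
  def apply_user_throttle(setting):
      # hypothetical helper: '50%' -> relative (a float fraction of
      # bandwidth); '512' -> an absolute 512 kB/s (an int, bytes/sec)
      if setting.endswith('%'):
          set_throttle(float(setting[:-1]) / 100.0)
      else:
          set_throttle(int(float(setting) * 1024))
  set_bandwidth(12500000)       # nominal max: ~100 Mbps
  apply_user_throttle('50%')    # use half the available bandwidth
  apply_user_throttle('512')    # hard cap of 512 kB/s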
DATA
DEBUG = None
ST_ATIME = 7     ST_CTIME = 9     ST_DEV = 2       ST_GID = 5
ST_INO = 1       ST_MODE = 0      ST_MTIME = 8     ST_NLINK = 3
ST_SIZE = 6      ST_UID = 4
S_ENFMT = 1024   S_IEXEC = 64     S_IFBLK = 24576  S_IFCHR = 8192
S_IFDIR = 16384  S_IFIFO = 4096   S_IFLNK = 40960  S_IFREG = 32768
S_IFSOCK = 49152 S_IREAD = 256    S_IRGRP = 32     S_IROTH = 4
S_IRUSR = 256    S_IRWXG = 56     S_IRWXO = 7      S_IRWXU = 448
S_ISGID = 1024   S_ISUID = 2048   S_ISVTX = 512    S_IWGRP = 16
S_IWOTH = 2      S_IWRITE = 128   S_IWUSR = 128    S_IXGRP = 8
S_IXOTH = 1      S_IXUSR = 64
(the ST_* and S_* names above are re-exported from the standard
stat module)
__version__ = '3.1.0'
auth_handler = <urllib2.HTTPBasicAuthHandler instance>
default_grabber = <urlgrabber.grabber.URLGrabber instance>
have_keepalive = True
have_range = 1
have_socket_timeout = True
msg = <exceptions.ImportError instance>
range_handlers = (<urlgrabber.byterange.HTTPRangeHandler instance>,
                  <urlgrabber.byterange.HTTPSRangeHandler instance>,
                  <urlgrabber.byterange.FileRangeHandler instance>,
                  <urlgrabber.byterange.FTPRangeHandler instance>)