notevenaperson
Contributor

@scoliono here are some changes for review. 8569763 should fix #1.

Also, the email address visible in your git log is very nice.

…een requests to prevent being blocked

1. Adds the -R flag
2. Should fix jamessucla#1 and adds the -nt flag
Because we no longer overwrite files unless the user explicitly asks for it (-R flag), the prompt is unnecessary. It also got in the way of running the script non-interactively.
It's more informative to log the actual filename, which includes the page number. I also feel that gauging the progress is easy enough with (N/N) to make a percentage indicator unnecessary.

Changed from:
12% (1/8) done
25% (2/8) done
37% (3/8) done
50% (4/8) done
62% (5/8) done
75% (6/8) done
87% (7/8) done
100% (8/8) done

To:
Got ./OL370939M/100.jpg (1/8)
Got ./OL370939M/101.jpg (2/8)
Got ./OL370939M/102.jpg (3/8)
Got ./OL370939M/103.jpg (4/8)
Got ./OL370939M/104.jpg (5/8)
Got ./OL370939M/105.jpg (6/8)
Got ./OL370939M/106.jpg (7/8)
Got ./OL370939M/107.jpg (8/8)

(The command used to generate these logs was: `python3 ripper.py OL370939M -s 100 -e 107 -S 10`)
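
For illustration, a minimal sketch of how a log line in the new format could be produced. The helper names `page_path` and `progress_line` are hypothetical and not taken from ripper.py; only the output format comes from the logs above.

```python
def page_path(book_id: str, page: int) -> str:
    """Build the output path for one page (hypothetical helper)."""
    return f"./{book_id}/{page}.jpg"


def progress_line(path: str, done: int, total: int) -> str:
    """Format one log line per saved page, using the (N/N) style from this PR."""
    return f"Got {path} ({done}/{total})"


if __name__ == "__main__":
    # Same range as the example command: pages 100 through 107.
    start, end = 100, 107
    total = end - start + 1
    for done, page in enumerate(range(start, end + 1), start=1):
        print(progress_line(page_path("OL370939M", page), done, total))
```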
@jamessucla self-requested a review September 14, 2021 08:09
@notevenaperson
Contributor Author

@scoliono merge?

@jamessucla
Owner

From my testing, this does not appear to totally circumvent Archive.org's rate limiting. After around 100 pages, you start downloading 5 KB HTML documents instead of images.
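
One way the script could notice this situation is to check the response body before saving it. The sketch below assumes the pages are JPEGs (the files are saved as .jpg) and relies on the standard JPEG start-of-image bytes; `looks_like_jpeg` and `check_page` are hypothetical names, not part of api.py.

```python
# Every JPEG file starts with the start-of-image marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the payload starts with the JPEG magic bytes."""
    return data.startswith(JPEG_MAGIC)


def check_page(data: bytes, page: int) -> bytes:
    """Return the payload unchanged, or raise if it is not a JPEG.

    An HTML document in place of an image is treated as a possible sign
    of rate limiting (an assumption based on the behavior described above).
    """
    if not looks_like_jpeg(data):
        raise ValueError(
            f"page {page}: response is not a JPEG ({len(data)} bytes), "
            "possibly rate limited"
        )
    return data
```

The caller could then stop, or wait and retry, instead of writing an HTML document to a .jpg file.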
Also, a nitpick: when some pages have already been downloaded, the total count that is displayed is misleading. For example, this book has 556 pages total, and about half of the pages were already downloaded:

./chiltonstoyotace0000unse_b0s4/177.jpg (0/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/178.jpg (0/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/179.jpg (0/556) already on disk, skipping
Got ./chiltonstoyotace0000unse_b0s4/180.jpg (1/556)
Got ./chiltonstoyotace0000unse_b0s4/181.jpg (2/556)
Got ./chiltonstoyotace0000unse_b0s4/182.jpg (3/556)
Got ./chiltonstoyotace0000unse_b0s4/183.jpg (4/556)
Got ./chiltonstoyotace0000unse_b0s4/184.jpg (5/556)
Got ./chiltonstoyotace0000unse_b0s4/185.jpg (6/556)
Got ./chiltonstoyotace0000unse_b0s4/186.jpg (7/556)
Got ./chiltonstoyotace0000unse_b0s4/187.jpg (8/556)
Got ./chiltonstoyotace0000unse_b0s4/188.jpg (9/556)
Got ./chiltonstoyotace0000unse_b0s4/189.jpg (10/556)
Got ./chiltonstoyotace0000unse_b0s4/190.jpg (11/556)
Got ./chiltonstoyotace0000unse_b0s4/191.jpg (12/556)
Got ./chiltonstoyotace0000unse_b0s4/192.jpg (13/556)
Got ./chiltonstoyotace0000unse_b0s4/193.jpg (14/556)
Got ./chiltonstoyotace0000unse_b0s4/194.jpg (15/556)
Got ./chiltonstoyotace0000unse_b0s4/195.jpg (16/556)
./chiltonstoyotace0000unse_b0s4/196.jpg (16/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/197.jpg (16/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/198.jpg (16/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/199.jpg (16/556) already on disk, skipping
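
One possible fix, sketched under the assumption that the counter should show the position within the requested range rather than the number of downloads so far. `progress_lines` and its `on_disk` parameter are hypothetical; the real loop lives in ripper.py and may be structured differently.

```python
from typing import Callable, Iterable, List


def progress_lines(paths: Iterable[str], on_disk: Callable[[str], bool]) -> List[str]:
    """Build log lines where skipped pages also advance the counter."""
    paths = list(paths)
    total = len(paths)
    lines = []
    for position, path in enumerate(paths, start=1):
        if on_disk(path):
            lines.append(f"{path} ({position}/{total}) already on disk, skipping")
        else:
            # A real script would download and save the page here.
            lines.append(f"Got {path} ({position}/{total})")
    return lines


if __name__ == "__main__":
    existing = {"./book/1.jpg", "./book/2.jpg"}
    pages = [f"./book/{n}.jpg" for n in range(1, 5)]
    for line in progress_lines(pages, existing.__contains__):
        print(line)
```

An alternative would be to subtract the pages already on disk from the total, so that the counter only covers pages still to be downloaded.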

@jamessucla
Owner

If you're too persistent with the requests, it looks like you can also get this traceback:

  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/Library/Python/3.9/lib/python/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/util/retry.py", line 410, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/git/archiveripper/ripper.py", line 107, in <module>
    main()
  File "~/git/archiveripper/ripper.py", line 89, in main
    contents = client.download_page(i, args.scale)
  File "~/git/archiveripper/api.py", line 141, in download_page
    res = self.session.get(self.book_page_urls[i] + "&scale=%d" % scale, headers={
  File "~/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "~/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "~/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "~/Library/Python/3.9/lib/python/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
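
A dropped connection like this could be handled by retrying with an increasing delay rather than crashing. The following is only a sketch: `retry_with_backoff` is a hypothetical helper, the delays are arbitrary, and whether Archive.org accepts requests again after waiting has not been tested here.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
    raise ValueError("attempts must be at least 1")
```

Note that `requests.exceptions.ConnectionError` is not a subclass of the built-in `ConnectionError`, so a caller using requests would need to pass it explicitly, for example `retry_with_backoff(lambda: client.download_page(i, args.scale), retry_on=(requests.exceptions.ConnectionError,))`.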

Successfully merging this pull request may close these issues.

Hangs after downloading a few pages