Skip to content

mnist: use mirror only when original server is not available #196

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 9, 2024

Conversation

tikkss
Copy link
Contributor

@tikkss tikkss commented Jun 19, 2024

MNIST's original server is sometimes refusing to provide the requested
resource. So, we use the mirror server only when the original server is
refusing it.

Moreover this includes the following changes.

  • Support fallback in Downloader

Fixes GH-195.

@tikkss
Copy link
Contributor Author

tikkss commented Jun 19, 2024

Some tests have failed. I would like to fix them in a separate pull request.

tikkss and others added 2 commits June 23, 2024 06:18
Because this may be used by subclasses.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@tikkss tikkss force-pushed the mnist-fallback-url branch from c5ef3b7 to 784a1d1 Compare June 25, 2024 20:45
tikkss and others added 3 commits June 27, 2024 06:05
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
tikkss and others added 3 commits July 3, 2024 05:42
Because `i` or `index` is clearer than `idx`.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Because `<` and `>` aren't part of the URL.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Because with no arguments, raises the exception in `$!` or raises a
`RuntimeError` if `$!` is `nil`.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@tikkss
Copy link
Contributor Author

tikkss commented Jul 2, 2024

Thank you so much for a lot of suggestions! @kou

It has been an exciting and enjoyable experience.

I've resolved all your reviewed points, so could you review it again?

@@ -155,6 +160,16 @@ def download(output_path, &block)
$stderr.puts "Redirect to #{url}"
return start_http(url, headers, limit - 1, &block)
else
if response.is_a?(Net::HTTPForbidden)
fallback_url = @fallback_urls.shift
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid changing @fallback_urls destructively?
If we change @fallback_urls destructively, users can't call #download multiple times for the same Downloader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice users can't call #download multiple times for the same Downloader.
I will try to avoid changing @fallback_urls destructively.
Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented in f92ec1a. But I have some concerns.

If the same URL is specified in fallback_urls, it can result in an infinite loop:

diff --git a/lib/datasets/mnist.rb b/lib/datasets/mnist.rb
index 9129be9..a3e6230 100644
--- a/lib/datasets/mnist.rb
+++ b/lib/datasets/mnist.rb
@@ -54,6 +54,8 @@ module Datasets
     private
     def base_urls
       [
+        "http://yann.lecun.com/exdb/mnist/",
+        "http://yann.lecun.com/exdb/mnist/",
         "http://yann.lecun.com/exdb/mnist/",
         "https://ossci-datasets.s3.amazonaws.com/mnist/",
       ]
ruby -I lib test/run-test.rb -t MNISTTest
Loaded suite test
Started
|403: Forbidden: fallback: <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz> -> <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz>
403: Forbidden: fallback: <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz> -> <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz>
403: Forbidden: fallback: <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz> -> <http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz>
(snip)

While I believe this is a rare case, do we need to consider this scenario?

If we do need to consider it, I think we might need to use an instance variable like @fallback_index.
But, this could make the code less readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using argument instead?

diff --git a/lib/datasets/downloader.rb b/lib/datasets/downloader.rb
index fac8a13..3449855 100644
--- a/lib/datasets/downloader.rb
+++ b/lib/datasets/downloader.rb
@@ -40,7 +40,7 @@ module Datasets
             headers["Range"] = "bytes=#{start}-"
           end
 
-          start_http(@url, headers) do |response|
+          start_http(@url, @fallback_urls, headers) do |response|
             if response.is_a?(Net::HTTPPartialContent)
               mode = "ab"
             else
@@ -140,7 +140,7 @@ module Datasets
       end
     end
 
-    private def start_http(url, headers, limit = 10, &block)
+    private def start_http(url, fallback_urls, headers, limit = 10, &block)
       if limit == 0
         raise TooManyRedirects, "too many redirections: #{url}"
       end
@@ -158,17 +158,15 @@ module Datasets
           when Net::HTTPRedirection
             url = URI.parse(response[:location])
             $stderr.puts "Redirect to #{url}"
-            return start_http(url, headers, limit - 1, &block)
+            return start_http(url, fallback_urls, headers, limit - 1, &block)
           else
             if response.is_a?(Net::HTTPForbidden)
-              index = @fallback_urls.index(url)
-              fallback_index = index.nil? ? 0 : index + 1
-              fallback_url = @fallback_urls[fallback_index]
-              unless fallback_url.nil?
+              next_url, *rest_fallback_urls = fallback_urls
+              if next_url
                 message = "#{response.code}: #{response.message}: " +
-                          "fallback: <#{url}> -> <#{fallback_url}>"
+                          "fallback: <#{url}> -> <#{next_url}>"
                 $stderr.puts(message)
-                return start_http(fallback_url, headers, limit, &block)
+                return start_http(next_url, rest_fallback_urls, headers, &block)
               end
             end
 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Using argument, we don't have an instance variable, and the code became to readable.

I will implement it. Thank you so much!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented in 7b78fc2. Could you review it again?

tikkss and others added 2 commits July 8, 2024 06:17
Because users can't call `#download` multiple times for the same
`Downloader`.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
If the same URL is specified in `fallback_urls`.

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@kou kou merged commit 7c19520 into red-data-tools:master Jul 9, 2024
0 of 12 checks passed
@kou
Copy link
Member

kou commented Jul 9, 2024

Thanks!

@tikkss tikkss deleted the mnist-fallback-url branch July 9, 2024 20:40
@tikkss
Copy link
Contributor Author

tikkss commented Jul 9, 2024

Thanks for your review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mnist: CI has failed with 403 forbidden error
2 participants