Cloudml_train and job_collect #210

Open
philipus opened this issue May 14, 2020 · 1 comment

@philipus

I have a problem running mnist_mlp.R (https://github.com/rstudio/keras/blob/master/vignettes/examples/mnist_mlp.R) with cloudml_train on Google Cloud Platform.

Although the job runs properly on Google AI Platform, it does not finish automatically. Possibly because of that, job_collect does not copy any files into the local directory (runs). When I cancel the job manually on Google AI Platform, I can see the new job folder for the corresponding job.

So why the heck does the job run forever on Google AI Platform?

I think the download functionality does not work properly. I also do not have a local runs directory created, as the mnist_mlp.R script does. I think job_collect is the problem:

cloudml::job_collect('Project Name', destination = '../runs', view = 'save')

does not copy anything into the destination folder.

Any idea what we can do?

R commands:

library(cloudml)
cloudml_train("mnist_mlp.R", config = "config.yml")

config.yml:

trainingInput:
  scaleTier: BASIC
  runtimeVersion: "2.1"
  pythonVersion: "3.7"

@herambgadgil

I had the same problem. The problem is with the chunk below in path-to-library/cloudml/cloudml/cloudml/deploy.py:

# Stream output from subprocess to console.
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line.decode('utf-8'))

Once execution has completed, this loop does not halt and so runs forever.

Resolution: comment out the above chunk in deploy.py and the job will complete successfully.
Downside: you won't be able to see step-by-step installation progress, so the logs won't give you a hint if there is an error in the script. The chunk below still checks whether execution was successful. If there is an error in the script, it will keep on running endlessly.

# Finalize the process.
stdout, stderr = process.communicate()

# Detect a non-zero exit code.
if process.returncode != 0:
  fmt = "Command %s failed: exit code %s"
  print(fmt % (commands, process.returncode))
else:
  print("Command %s ran successfully." % (commands, ))
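
A possible alternative to commenting out the streaming loop, offered as a sketch and not a confirmed fix for cloudml: if deploy.py runs under Python 3 and the pipe is opened in binary mode (the `.decode('utf-8')` call suggests this, but it is an assumption), then `readline()` returns `b""` at end of output. That value never equals the string sentinel `""`, so `iter()` never stops. Using a bytes sentinel lets the loop end while keeping the live output. The `stream_command` helper and the example command are hypothetical and are not part of deploy.py.

```python
import subprocess
import sys


def stream_command(commands):
    """Run a command, echo its output line by line, return (exit_code, lines)."""
    process = subprocess.Popen(
        commands,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors show up in the stream
    )
    lines = []
    # In binary mode readline() returns b"" at end of file,
    # so the sentinel passed to iter() must be bytes as well.
    for raw in iter(process.stdout.readline, b""):
        line = raw.decode("utf-8")
        sys.stdout.write(line)
        lines.append(line)
    process.stdout.close()
    returncode = process.wait()
    return returncode, lines


if __name__ == "__main__":
    # Hypothetical stand-in for the real installation commands.
    code, _ = stream_command(
        [sys.executable, "-c", "print('step 1'); print('step 2')"]
    )
    print("exit code:", code)
```

With this variant the step-by-step progress stays visible, and the exit code can still be checked as in the chunk above.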

Note: I am a novice in Python and cloud environments, so take my comments with a pinch of salt. :-)
