Last improvements to GSoD '20 blog post #1242

Merged
merged 2 commits into from
May 5, 2020

@jorgeorpinel (Contributor) commented May 5, 2020

@shcheklein shcheklein temporarily deployed to dvc-landing-gsod-v9trva5mdy8or May 5, 2020 18:08 Inactive
@jorgeorpinel (Contributor, Author)

Hi @casperdcl, me again. This PR has a different link check issue: 3 URLs that work fine from a browser are marked as 404 and 405. Please see https://circleci.com/gh/iterative/dvc.org/3305. Thanks

[Kurian](https://github.com/kurianbenoy), closed
[several tickets](https://github.com/iterative/dvc.org/issues?q=is%3Aissue+kurianbenoy),
produced a DVC intro tutorial in
[Kaggle](https://www.kaggle.com/kurianbenoy/introduction-to-data-version-control-dvc)

$ curl -IL https://www.kaggle.com/kurianbenoy/introduction-to-data-version-control-dvc
HTTP/1.1 404 Not Found
...

looks like an upstream website issue; could add to exceptions?

and
[Colab](https://colab.research.google.com/drive/1O1XmUZ8Roj1dFxWTrpE55_A7lVkWfG04),

$ curl -IL https://colab.research.google.com/drive/1O1XmUZ8Roj1dFxWTrpE55_A7lVkWfG04
HTTP/1.1 405 Method Not Allowed
...

and ended up giving a talk at
[PyCon India](https://in.pycon.org/cfp/2019/proposals/machine-learning-model-and-dataset-versioning~dRqRb/):

$ curl -IL https://in.pycon.org/cfp/2019/proposals/machine-learning-model-and-dataset-versioning~dRqRb/
HTTP/1.1 405 Method Not Allowed
...

@casperdcl (Contributor) left a comment

I suspect there are issues with these three sites' servers (see comments). Likely they don't like HEAD requests; probably need to be added to scripts/exclude-links.txt
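
If the HEAD-request theory above holds, one way to check it is to retry with a normal GET whenever HEAD returns a suspicious status. The following is only a minimal sketch: `needs_get_retry` and `check_link` are hypothetical helper names, and the set of status codes treated as "retry-worthy" is an assumption, not something taken from the project's link checker.

```shell
# Hypothetical helper: succeed (exit 0) when a HEAD status suggests the
# server may simply be rejecting HEAD, so a GET retry is worthwhile.
# The list of codes is an assumption for illustration.
needs_get_retry() {
  case "$1" in
    404|405|501) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical checker: try HEAD first (cheap), then fall back to GET.
# Prints the final HTTP status code for the given URL.
check_link() {
  url=$1
  # -I sends HEAD, -L follows redirects, -w prints only the status code.
  status=$(curl -s -o /dev/null -I -L -w '%{http_code}' "$url")
  if needs_get_retry "$status"; then
    # Same request as a plain GET, discarding the body.
    status=$(curl -s -o /dev/null -L -w '%{http_code}' "$url")
  fi
  echo "$status"
}
```

For example, `check_link https://in.pycon.org/cfp/2019/proposals/machine-learning-model-and-dataset-versioning~dRqRb/` would show whether the 405 persists when GET is used. If it does, the problem is not the request method and an exclusion may be the simpler option.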

@shcheklein (Member)

@casperdcl @jorgeorpinel yeah, let's merge for now (don't add to exclusions, since we don't even test /blog on a regular basis). I know there are quite a few other links there that require special handling. Some of them I mitigated with a user-agent, but for others we'll need some additional tricks (e.g. enable cookies + redirects). For those left (linkedin, reddit, kaggle, etc.: sites that fight hard against scraping) we'll need to support website/* (wildcard) exclusions in the list. Let's deal with this when we start testing blog content properly.
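
The wildcard exclusion idea could look roughly like the sketch below. It assumes exclusion entries are shell-style glob patterns such as `https://www.kaggle.com/*`; the actual format of scripts/exclude-links.txt is not shown in this thread, and `is_excluded` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit 0) if the URL matches any of the
# glob patterns passed after it, fail (exit 1) otherwise.
is_excluded() {
  url=$1
  shift
  for pattern in "$@"; do
    case "$url" in
      # Unquoted on purpose so that * in the pattern acts as a wildcard.
      $pattern) return 0 ;;
    esac
  done
  return 1
}
```

Usage would be along the lines of `is_excluded "$url" 'https://www.kaggle.com/*' 'https://www.linkedin.com/*' && echo "skipping $url"`, so every page on an excluded site is skipped without listing each URL individually.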
