download image from <a href = 'file.png'/> links #4

Open
oleksabor opened this issue Sep 7, 2018 · 2 comments

@oleksabor

Hello

I'm trying to use SkyScraper. Some web pages contain links to image files, like the ones below:

<A HREF="/bbb/yiifrontend/">[To Parent Directory]</A><br><br>
7/26/2018 11:46 AM 30097 <A HREF="/bbb/yiifrontend/formdocs/CertStoreSample.png">CertStoreSample.png</A><br>
3/7/2018 4:20 PM 68519 <A HREF="/bbb/yiifrontend/formdocs/Clipboard01.jpg">Clipboard01.jpg</A><br>

This is actually HTTP directory-browsing output, but I suppose the same pattern can occur in other cases.

I've checked the ImageScraperObserver source code and found that it does not try to parse this kind of image link.
I've created a custom IObserver<HtmlDoc> implementation with logic like ImageScraperObserver's, but modified to process <a href="sample.jpg"/>.

Could you help me understand whether SkyScraper has any other built-in way to parse image links like <a href> and get the image as a byte array?

@JonCanning
Owner

Wow, I can barely remember writing this!

Looking at https://github.com/JonCanning/SkyScraper/blob/master/src/SkyScraper/Observers/ImageScraper/ImageScraperObserver.cs#L28

I would suggest something like

var imgSrcs = html["a"].Select(x => x.GetAttribute("href")).Where(x => x.LinkIsLocal(baseUri.ToString()));

@oleksabor
Author

Thank you for your comment.
I'm not at work right now, so unfortunately I can't post the source code.
However, I did exactly what you suggested: my custom IObserver<HtmlDoc> implementation processes all a links that have an href attribute.
I've added a filter (a regular expression) to distinguish image links from ordinary HTML hyperlinks, so an image is saved only if its link matches the regex.
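
For illustration only, here is a minimal sketch of that kind of filter. It is not the actual observer code and does not use any SkyScraper API: `ImageLinkFilter`, its method names, and the list of extensions are all hypothetical, and it relies only on `System.Uri` and `System.Text.RegularExpressions`. It takes the raw href values, resolves them against the page URI, and keeps only those whose path ends in an image extension.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of SkyScraper.
public static class ImageLinkFilter
{
    // Assumed set of image extensions; extend or replace as needed.
    private static readonly Regex ImagePath = new Regex(
        @"\.(png|jpe?g|gif|bmp)$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // True when the path part of the URI (query string excluded)
    // ends in one of the assumed image extensions.
    public static bool IsImageLink(Uri uri)
    {
        return ImagePath.IsMatch(uri.AbsolutePath);
    }

    // Resolves each href against baseUri and keeps only image links.
    // Null, empty, and unparsable hrefs are skipped.
    public static IEnumerable<Uri> SelectImageUris(Uri baseUri, IEnumerable<string> hrefs)
    {
        foreach (var href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
                continue;

            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute) && IsImageLink(absolute))
                yield return absolute;
        }
    }
}
```

Given the directory listing from the first comment, passing the three href values to `SelectImageUris` would be expected to drop the "[To Parent Directory]" link and keep the `.png` and `.jpg` entries. Each resulting URI could then be fetched as a byte array, for example with `HttpClient.GetByteArrayAsync`.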

Is there built-in handling for such cases in the original code?
I can create a pull request to add this href image observer to the SkyScraper source code, if you don't mind.
