Gain Improvements - Ludaro #46
@gaojiuli, I see we cannot install this project from PyPI because that version is flawed. Can I expect this project to accept PRs, or should I continue with my own fork?
For people reading this message, here is my fork of the project.
Changes
Extraction
With the new Css class the following can be done:
In case you need to cleanup data or extract stuff like a phone number or email address, the following manipulate options are available:
The order in which you supply manipulate options is the order of execution, so you can combine these manipulations. Many more will be added in the future. I have written tests for all features, so take a look at this file if you are interested. With the current version on my dev branch you can:
I hope @gaojiuli is interested in the way I moved forward with this project, so we can merge our code once I am satisfied with a production version. I kept the philosophy of creating a scraper for everyone, and with that in mind I changed the way we extract data.
Accept PR.
@gaojiuli, great news, happy that I can share my code. Before the PR, I have to do the following:
For item.py, I have a question regarding this code:

    if hasattr(self, 'save_url'):
        self.url = getattr(self, 'save_url')
    else:
        self.url = "file:///tmp.data"

Are you using this code, or is it junk code that can be removed?
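If the fallback does turn out to be needed, the branch could be collapsed into a single `getattr` call with a default. This is only a minimal sketch: the `Item` and `SavedItem` classes below are hypothetical stand-ins, not the real code in item.py, and the `file:///saved.data` value is made up for illustration.

```python
class Item:
    """Hypothetical stand-in for the class in item.py; only the URL fallback is modelled."""

    def __init__(self):
        # getattr with a default replaces the hasattr/else branch
        self.url = getattr(self, 'save_url', "file:///tmp.data")


class SavedItem(Item):
    # Hypothetical subclass that defines save_url as a class attribute
    save_url = "file:///saved.data"
```

The behavior is the same as the original branch: the attribute is used when present, and the default otherwise.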
(Welcome any kind of optimization)
re.findall issue
I reviewed the tests in this project after experiencing issues with my regex also capturing some HTML as part of the process.
So I reviewed this test file: https://github.com/gaojiuli/gain/blob/master/tests/test_parse_multiple_items.py and captured the response of abstract_url.py
Version 0.1.4 of this project captures this as the response:
re.findall returns what is requested by your regex but not what is matched!
Test incorrect
The base URL http://quotes.toscrape.com/ and http://quotes.toscrape.com/page/1 are the same page, and if you look into the HTML you will only find a reference to "/page/2" but not to "/page/1". For this reason the test seems to work, but it was actually flawed from the start.
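Both points can be reproduced in a small sketch. The HTML fragment and the rule below are hypothetical, chosen only to mirror the pager described above; they are not the actual page source or the rule used in the test.

```python
import re

# Hypothetical HTML fragment modelled on the pager described above
html = '<li class="next"><a href="/page/2/">Next</a></li>'

# Hypothetical rule with one capture group (not the rule used in the test)
rule = r'/page/(\d+)'

# With a capture group, re.findall returns the group, not the whole match
groups = re.findall(rule, html)  # ['2']

# re.finditer exposes the full matched text through group(0)
full_matches = [m.group(0) for m in re.finditer(rule, html)]  # ['/page/2']

# The first page never links to itself, so "/page/1" is absent
has_page_one = '/page/1' in html  # False
```

A test that expects "/page/1" to be found in such a page can only pass by accident, for example when the extracted value is a captured group rather than a real link.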
re.match
I rewrote the function abstract_url to:
and now this is the result of abstract_url:
This test: tests/test_parse_multiple_items.py now fails as it should.