tests: adds WSP functional test for local package #107

spirosdelviniotis

  • Adds WSP functional test for local package path.
  • Refactored existing WSP functional test (add setup-teardown fixtures).
  • Refactored utils.ftp_list_files to utils.list_files for reusability.
  • Fixed WSP local package crawling mechanism.

Closes #106

Signed-off-by: Spiros Delviniotis [email protected]


@david-caro david-caro left a comment

Nice

@@ -89,16 +89,24 @@ def __init__(self, package_path=None, ftp_folder="WSP", ftp_host=None, ftp_netrc
def start_requests(self):
"""List selected folder on remote FTP and yield new zip files."""
if self.package_path:
yield Request(self.package_path, callback=self.handle_package_file)
dummy, new_files = local_list_files(

In Python, the convention for unused variables like this dummy is to name them just _.
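
A minimal sketch of that convention. The function below is a hypothetical stand-in for local_list_files (its real body is not shown in this diff); the only assumption is that it returns a pair whose first element the caller does not need:

```python
def list_files_stub():
    """Hypothetical stand-in that returns a pair, like local_list_files."""
    all_files = ["a.zip", "b.zip"]
    new_files = ["b.zip"]
    return all_files, new_files


# The first element is not needed, so it is bound to `_` by convention.
_, new_files = list_files_stub()
print(new_files)
```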

self.target_folder
)

for _file in new_files:

Instead of _file and new_files, prefer file_path and new_files_paths, so it is clear that each value is a string holding a file path and not a file object. That is usually obvious when you write the code yourself, but it can be very confusing for someone else who is debugging it.



def local_list_files(local_folder, target_folder):
"""List files from given package folder to target folder."""

I can't tell from the docstring what this function is supposed to do; can you rephrase it? What does it mean for files to be 'listed from one dir to another dir'?

I know it was not clear before either, but better not add more confusion ;)

return list_files(local_folder, target_folder, files)


def list_files(remote_folder, target_folder, files):

Maybe this should be called list_missing_files instead?
The files parameter would be better named file_names.
It feels to me that this does too many things at once. Can you think of a way to separate getting the list of missing files from adding the paths to the files? Then there's no need to return both all_files and missing_files, which makes it easier to understand, test and use.

Actually, the all_files return value does not seem to be used anywhere, so we can remove it; then there's no need for a dummy variable anymore.
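
One possible way to split the two concerns, as a sketch only. It assumes that "missing" means "not yet present in target_folder"; the helper to_paths, the file names and the /remote/WSP folder are hypothetical and are not taken from the hepcrawl code:

```python
import os
import shutil
import tempfile


def list_missing_files(target_folder, file_names):
    """Return the names in file_names that are absent from target_folder."""
    return [
        file_name for file_name in file_names
        if not os.path.exists(os.path.join(target_folder, file_name))
    ]


def to_paths(folder, file_names):
    """Prefix every file name with folder to build full paths."""
    return [os.path.join(folder, file_name) for file_name in file_names]


# Demo: the target folder already holds old.zip, so only new.zip is missing.
target_folder = tempfile.mkdtemp()
open(os.path.join(target_folder, "old.zip"), "w").close()

missing_names = list_missing_files(target_folder, ["old.zip", "new.zip"])
missing_paths = to_paths("/remote/WSP", missing_names)
print(missing_paths)

shutil.rmtree(target_folder)
```

Each function now returns a single list, so callers no longer have to unpack and discard a second value.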

cleaner(package_location + 'IDAQPv20i01-03160015-1510863')


def cleaner(path='/tmp/WSP/'):

This is a bad name for a function: cleaner sounds like an object (an entity that has meaning by itself and that you call methods on), while this is actually a function. Maybe rename it to something like clean_dir?

@david-caro

Can you also try to separate the changes into individual commits? (It's a bit painful to do now, so next time try creating many small commits instead; those can easily be squashed later ;) )

@david-caro

Btw, I ran the functional tests locally with this PR and got an error (checking whether it's an issue on my side):

____________________________________________________ test_wsp_ftp _____________________________________________________

set_up_ftp_environment = {'CRAWLER_ARGUMENTS': {'ftp_host': 'ftp_server', 'ftp_netrc': '/code/tests/functional/WSP/fixtures/ftp_server/.netrc'}, 'CRAWLER_HOST_URL': 'http://scrapyd:6800', 'CRAWLER_PROJECT': 'hepcrawl'}
expected_results = [{'abstracts': [{'source': 'WSP', 'value': 'Abstract L\xe9vy bla-bla bla blaaa blaa bla blaaa blaa, bla blaaa blaa. Bla b..., City, City_code 123456, C. R. Country_1'}], 'full_name': 'author_surname_2, author_name_2'}], 'citeable': True, ...}]

    def test_wsp_ftp(set_up_ftp_environment, expected_results):
        crawler = get_crawler_instance(set_up_ftp_environment.get('CRAWLER_HOST_URL'))
    
        # The test must wait until the docker environment is up (takes about 5 seconds).
        sleep(5)
    
        results = CeleryMonitor.do_crawl(
            app=app,
            monitor_timeout=5,
            monitor_iter_limit=100,
            crawler_instance=crawler,
            project=set_up_ftp_environment.get('CRAWLER_PROJECT'),
            spider='WSP',
            settings={},
>           **set_up_ftp_environment.get('CRAWLER_ARGUMENTS')
        )

tests/functional/WSP/test_wsp.py:110: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hepcrawl/testlib/celery_monitor.py:81: in do_crawl
    **crawler_arguments
/hepcrawl_venv/lib/python2.7/site-packages/scrapyd_api/wrapper.py:184: in schedule
    json = self.client.post(url, data=data)
/hepcrawl_venv/lib/python2.7/site-packages/requests/sessions.py:535: in post
    return self.request('POST', url, data=data, json=json, **kwargs)
/hepcrawl_venv/lib/python2.7/site-packages/scrapyd_api/client.py:38: in request
    return self._handle_response(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <scrapyd_api.client.Client object at 0x4c685d0>, response = <Response [200]>

    def _handle_response(self, response):
        """
            Handles the response received from Scrapyd.
            """
        if not response.ok:
            raise ScrapydResponseError(
                "Scrapyd returned a {0} error: {1}".format(
                    response.status_code,
                    response.text))
    
        try:
            json = response.json()
        except ValueError:
            raise ScrapydResponseError("Scrapyd returned an invalid JSON "
                                       "response: {0}".format(response.text))
        if json['status'] == 'ok':
            json.pop('status')
            return json
        elif json['status'] == 'error':
>           raise ScrapydResponseError(json['message'])
E           ScrapydResponseError: spider 'WSP' not found

@spirosdelviniotis spirosdelviniotis force-pushed the hepcrawl_wsp_local_package_test branch 3 times, most recently from 81edfed to 8c9fab2 Compare May 10, 2017 14:05

def list_missing_files(remote_folder, target_folder, file_names):
missing_files = []
for filename in file_names:

Use the same naming style for both: either filename and filenames, or file_name and file_names.

netrc_location = os.path.join(os.path.dirname(
os.path.realpath(__file__)),
'fixtures/ftp_server/.netrc'
os.path.join('fixtures', 'ftp_server', '.netrc')
)

The indent here is messed up

}

clean_dir()
clean_dir(package_location + 'IDAQPv20i01-03160015-1510863')

Isn't this missing some os.path.join?
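
A short illustration of why plain concatenation is fragile. The /tmp/WSP value is a hypothetical package_location without a trailing slash; the package name is the one from the diff:

```python
import os

package_location = "/tmp/WSP"  # hypothetical value, no trailing slash
package_name = "IDAQPv20i01-03160015-1510863"

# Concatenation silently glues the two parts together.
concatenated = package_location + package_name

# os.path.join inserts the separator only when it is needed.
joined = os.path.join(package_location, package_name)

print(concatenated)
print(joined)
```

Concatenation only works while package_location happens to end with a separator, whereas os.path.join gives the same result either way.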

_, dirs, files = next(os.walk(package_location))
for dir_name in dirs:
    clean_dir(os.path.join(package_location, dir_name))
for file_name in files:
    os.unlink(os.path.join(package_location, file_name))

def set_up_local_environment():
package_location = os.path.join(os.path.dirname(
os.path.realpath(__file__)),
'fixtures/ftp_server/WSP/'

bad indent

)

assert [override_generated_fields(result) for result in results] == \
[override_generated_fields(expected) for expected in expected_results]

Save the lists in temp variables, then just compare them.

spider='WSP',
settings={},
**set_up_environment.get('CRAWLER_ARGUMENTS')
**set_up_local_environment.get('CRAWLER_ARGUMENTS')
)

gottern_results = [override_generated_fields(result) for result in results]

s/gottern/gotten/

@spirosdelviniotis spirosdelviniotis force-pushed the hepcrawl_wsp_local_package_test branch from fe57c0c to 3824bc9 Compare May 10, 2017 16:32
for dir_name in dirs:
clean_dir(os.path.join(package_location, dir_name))
for file_name in files:
if '.zip' not in file_name:

not file_name.endswith('.zip')
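
The difference between the two checks shows up on names that merely contain '.zip'. A small sketch with made-up file names:

```python
file_names = ["package.zip", "package.zip.tmp", "notes.txt"]

# Substring test: anything containing '.zip' is treated as a zip file.
kept_by_substring = [name for name in file_names if '.zip' not in name]

# Suffix test: only names that really end in '.zip' are treated as zip files.
kept_by_suffix = [name for name in file_names if not name.endswith('.zip')]

print(kept_by_substring)
print(kept_by_suffix)
```

With the substring test, package.zip.tmp is wrongly classified as a zip archive; the suffix test classifies it as a non-zip file.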



def remove_generated_files(package_location):
for _, dirs, files in os.walk(package_location):

Instead of a for loop, you can just take the first result:

_, dirs, files = next(os.walk(...))
for dir_name in dirs:
...
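
To see what the first result of os.walk contains, here is a self-contained sketch on a throwaway directory tree (the directory and file names are made up for the demo):

```python
import os
import shutil
import tempfile

# Build a small tree: root/top.txt and root/sub/nested/, root/sub/inner.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub", "nested"))
open(os.path.join(root, "top.txt"), "w").close()
open(os.path.join(root, "sub", "inner.txt"), "w").close()

# os.walk is a generator; its first item describes the top-level directory only.
_, dirs, files = next(os.walk(root))
print(dirs, files)

shutil.rmtree(root)
```

Only the immediate children of root are returned, so nested content has to be handled by the recursive call (or by removing the whole subdirectory).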



def local_list_files(local_folder, target_folder):
"""List files from given local path to target folder."""

This docstring does not make much sense.

@david-caro david-caro left a comment

Minor thing and we merge right away 👍

missing_files.append(source_file)
all_files.append(source_file)
return all_files, missing_files
file_names = host.listdir(host.curdir + '/' + server_folder)

os.path.join

* Adds WSP functional test for local package path.
* Refactored existing WSP functional test (add setup-teardown fixtures).
* Refactored `utils.ftp_list_files` to `utils.list_files` for reusability.
* Fixed WSP local package crawling mechanism.

Closes inspirehep#106

Signed-off-by: Spiros Delviniotis <[email protected]>
@spirosdelviniotis spirosdelviniotis force-pushed the hepcrawl_wsp_local_package_test branch from 1303c08 to 94af814 Compare May 11, 2017 07:20
@spirosdelviniotis

spirosdelviniotis commented May 11, 2017

@david-caro Ready! 💃

@david-caro david-caro merged commit 0c0fdf1 into inspirehep:master May 11, 2017
Successfully merging this pull request may close these issues.

WSP: fix local package crawling
2 participants