JobFunnel 3.0 with localization, ABC and improved scraping #90

Merged on Sep 12, 2020 (66 commits)

Commits
339d0fd
initial work
PaulMcInnis Aug 1, 2020
d47405a
got scraping for indeed going again, with legacy CSV support
PaulMcInnis Aug 2, 2020
d60d780
got delaying with DelayConfig and TFIDF + other filters working again
PaulMcInnis Aug 3, 2020
607cd39
fix some status parsing issues
PaulMcInnis Aug 3, 2020
6fdbb06
connected CLI parser, making no changes to args
PaulMcInnis Aug 4, 2020
ec123a5
getting glassdoor going again
PaulMcInnis Aug 5, 2020
562f300
more progress, converging on having job setters within base scraper, …
PaulMcInnis Aug 6, 2020
edb653d
more progress, working on the CLI side now. looking to streamline thi…
PaulMcInnis Aug 7, 2020
9462b18
add cerberus schema validator
PaulMcInnis Aug 7, 2020
0626bc3
implemented config and validation with Cerberus
PaulMcInnis Aug 8, 2020
33526a0
Got indeed going once more, now with the abstract job creation
PaulMcInnis Aug 9, 2020
7035242
moved bs4 parser into JobFunnelConfig for future CLI/cfg impl
PaulMcInnis Aug 9, 2020
04fec2d
Got set() and get() working with lots of assertions to keep it safe. …
PaulMcInnis Aug 9, 2020
ce37804
cleaning up some descriptions
PaulMcInnis Aug 9, 2020
81ec0c4
got delaying back, but only for fields that need it
PaulMcInnis Aug 10, 2020
6b3b2b0
added --recover CLI and fixed stdout logging handler
PaulMcInnis Aug 10, 2020
29d3573
Added back Monster scraper
PaulMcInnis Aug 11, 2020
af4d55d
Got GlassDoorStatic going again.
PaulMcInnis Aug 13, 2020
67f7b2b
added Remote and wage jobfields and added wage scraping to GlassDoor …
PaulMcInnis Aug 20, 2020
6c11b01
further improve defaults, schema and CLI interaction
PaulMcInnis Aug 21, 2020
ab7d9a3
fix demo settings YAML and make loggers display scraper name / jobfunnel
PaulMcInnis Aug 21, 2020
a124236
Added json entry property to Job, improved logging format, improved c…
PaulMcInnis Aug 22, 2020
f745f2b
Improve jobfunnel behaviour in --no-scrape mode to prevent writing an…
PaulMcInnis Aug 22, 2020
7854850
Fixed monster scraping (multi-page), Fixed TFIDF filter and duplicate…
PaulMcInnis Aug 22, 2020
2058067
Move is_old into Job, minimize JSON and CSV loading by setting as sel…
PaulMcInnis Aug 22, 2020
ccea515
Make filtering dynamic so we can prevent scraping un-promising jobs a…
PaulMcInnis Aug 22, 2020
ee63de7
moved update if newer into Job, Vastly improved duplicates detection …
PaulMcInnis Aug 24, 2020
8030c9e
Make it so that we always scrape duplicates to ensure that we can upd…
PaulMcInnis Aug 24, 2020
5830edb
Update readme.md
Aug 25, 2020
4734523
Formalize minimum required job fields a bit more, improve crash-out b…
PaulMcInnis Aug 29, 2020
9dc4c09
Implement high-priority JobFields so we can stage set() operations + …
PaulMcInnis Aug 29, 2020
caab319
undo delay bypass
PaulMcInnis Aug 29, 2020
f74d500
fix CLI defaults when mixing YAML and arguments and possibly defaults.
PaulMcInnis Aug 29, 2020
8dccf91
make demo not create hidden cache folder
PaulMcInnis Aug 29, 2020
224547e
Update readme.md
PaulMcInnis Aug 29, 2020
61fed01
Fix flake8 issues without introducing circular imports + fix remainin…
PaulMcInnis Aug 29, 2020
af93ccd
JobFunnelConfig -> JobFunnelConfigManager
PaulMcInnis Aug 29, 2020
a3b17a6
Use get since config sub-keys can be missing.
PaulMcInnis Aug 29, 2020
fc65c18
Make logging handled through a Logger class which is inheritable
PaulMcInnis Aug 29, 2020
f0f8b13
Add graphviz generator for call graphs + fix some minor leftovers
PaulMcInnis Aug 29, 2020
18c5224
Fix delay synchronization between worker threads by controlling acces…
PaulMcInnis Aug 29, 2020
dcf7f87
Cleanup + calculate job get/set actions once in BaseScraper
PaulMcInnis Aug 29, 2020
954a079
Fixed some minor validation issues, moved path creation logic out of …
PaulMcInnis Aug 29, 2020
d7f4703
Fix block list file naming + remove create_dirs call
PaulMcInnis Aug 30, 2020
f50f781
Fix PyEnv
PaulMcInnis Aug 30, 2020
0d6a31c
Sig. improved CLI vs YAML vs defaults handling, upped python requires…
PaulMcInnis Aug 31, 2020
408491d
min_delay -> min_duration
PaulMcInnis Aug 31, 2020
f03a6e9
Swap over to pip from pipenv to fix build
PaulMcInnis Aug 31, 2020
082d525
fix comparison
PaulMcInnis Aug 31, 2020
04635ab
Clean up imports + expand travis smoke testing to include USA
PaulMcInnis Aug 31, 2020
90a0398
Fix numerous issues still existing in CLI.
PaulMcInnis Sep 5, 2020
1d64943
remove import pdb!
PaulMcInnis Sep 5, 2020
fe45a4d
Fix test module structure
PaulMcInnis Sep 8, 2020
9733ef4
fix CLI by restricting sub-commands to either load (YAML) or custom (…
PaulMcInnis Sep 8, 2020
d31cbd8
Remove .idea (too-user-specific), and Pipfile (PipEnv is no longer ne…
PaulMcInnis Sep 8, 2020
10a3cbe
make log-level a base argument
PaulMcInnis Sep 10, 2020
82119fa
Add some cli parser testing + fix more minor bugs in CLI parser
PaulMcInnis Sep 10, 2020
28e061f
missed files
PaulMcInnis Sep 10, 2020
39d926f
Fix directory creation for cache folder
PaulMcInnis Sep 10, 2020
0ebaf86
Update travis to use new command format
PaulMcInnis Sep 10, 2020
8fcbe5d
simplify readme.md and remove some old demos I dont want to maintain
PaulMcInnis Sep 10, 2020
d89991e
Add versioning to cache files, cleanup logo files, fix block list def…
PaulMcInnis Sep 10, 2020
b9771d2
Update demo CSV image
PaulMcInnis Sep 11, 2020
fc42eca
Use tad preview b/c it's prettier + lower codecov standards for now
PaulMcInnis Sep 11, 2020
bce106b
Refactor custom --> inline, resolve more pylint warnings, lower proje…
PaulMcInnis Sep 11, 2020
cbbd917
Fix domain validation + minor readme changes
PaulMcInnis Sep 11, 2020
9 changes: 9 additions & 0 deletions .codecov.yml
@@ -0,0 +1,9 @@
+coverage:
+  # FIXME: Set these back to automatic once we up coverage more
+  status:
+    patch:
+      default:
+        target: 30%
+    project:
+      default:
+        target: 30%
12 changes: 9 additions & 3 deletions .gitignore
@@ -1,6 +1,12 @@
data/
# Outputs
*.csv
demo/data/
demo_job_search_results

# IntelliJ/Pycharm configs
.idea/

# GraphViz
*.dot

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -177,4 +183,4 @@ $RECYCLE.BIN/
 .com.apple.timemachine.donotpresent

 # VScode trash
-.vscode/
+.vscode/
15 changes: 0 additions & 15 deletions .idea/JobFunnel.iml

This file was deleted.

4 changes: 0 additions & 4 deletions .idea/misc.xml

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/modules.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/vcs.xml

This file was deleted.

15 changes: 11 additions & 4 deletions .travis.yml
@@ -1,13 +1,20 @@
 language: python
 python:
-- '3.6.9'
+- '3.8.0'
 install:
+- 'pip install -e .'
 - 'pip install flake8 pipenv pytest-cov pytest-mock'
-- 'pipenv sync'
 - 'python -m nltk.downloader stopwords'
-before_script: 'flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics'
+before_script:
+- 'flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics'
 script:
-- 'python -m jobfunnel -s demo/settings.yaml -o demo/'
+# Run CANADA_ENGLISH demo by settings YAML
+- 'funnel load -s demo/settings.yaml -log-level DEBUG'
+# Run an american search by CLI
+- 'funnel inline -kw Python Data Scientist PHD AI -ps WA -c Seattle -l USA_ENGLISH -log-level DEBUG -csv demo_job_search_results/demo_search.csv -cache demo_job_search_results/cache2 -blf demo_job_search_results/demo_block_list.json -dl demo_job_search_results/demo_duplicates_list.json -log-file demo_job_search_results/log.log'
 - 'pytest --cov=jobfunnel --cov-report=xml'
+# - './tests/verify-artifacts.sh' TODO: verify that JSON exist and are good
+# - './tests/verify_time.sh' TODO: some way of verifying execution time
 after_success:
 - 'bash <(curl -s https://codecov.io/bash)'
+# - './demo/gen_call_graphs.sh' TODO: some way of showing .dot on GitHub?
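
The two `script` invocations in this `.travis.yml` exercise the `load` and `inline` sub-commands, with `-log-level` passed after the sub-command name. A minimal `argparse` sketch of that layout is given here for orientation. It is not JobFunnel's actual parser: `build_parser`, the `dest` names, the default level, and the `choices` list are assumptions for illustration, and only a handful of the flags seen in the Travis commands are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `funnel load` / `funnel inline` split."""
    # Shared options live on a parent parser so that they are accepted after
    # the sub-command name, e.g. `funnel load -s ... -log-level DEBUG`.
    base = argparse.ArgumentParser(add_help=False)
    base.add_argument(
        "-log-level",
        dest="log_level",
        default="INFO",  # assumed default, not taken from the PR
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    )

    parser = argparse.ArgumentParser(prog="funnel")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `load`: everything comes from a settings YAML file.
    load = subparsers.add_parser("load", parents=[base])
    load.add_argument("-s", dest="settings_yaml_file", required=True)

    # `inline`: the search is described entirely on the command line.
    inline = subparsers.add_parser("inline", parents=[base])
    inline.add_argument("-kw", dest="keywords", nargs="+", required=True)
    inline.add_argument("-ps", dest="province_or_state")
    inline.add_argument("-c", dest="city")
    inline.add_argument("-l", dest="locale")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["load", "-s", "demo/settings.yaml", "-log-level", "DEBUG"]
    )
    print(args.command, args.settings_yaml_file, args.log_level)
```

Putting the shared option on a parent parser, rather than on the top-level parser, is what lets it appear after `load` or `inline`; an option defined only on the top-level parser would have to come before the sub-command.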
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2017 Paul McInnis
+Copyright (c) 2020 Paul McInnis

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
6 changes: 4 additions & 2 deletions MANIFEST.in
@@ -1,2 +1,4 @@
-include jobfunnel/config/settings.yaml
-include jobfunnel/text/user_agent_list.txt
+include jobfunnel/demo/settings.yaml
+include jobfunnel/resources/user_agent_list.txt
+include readme.md
+include LICENSE
14 changes: 0 additions & 14 deletions Pipfile

This file was deleted.
