JobSpy Extractor
A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.
Big picture
JobSpy runs as a Python script once per search term and writes JSON. The orchestrator then ingests that output and normalizes it into the internal job shape.
1) Inputs and defaults
Key environment variables:
- JOBSPY_SITES (default: indeed,linkedin)
- JOBSPY_SEARCH_TERM (default: web developer)
- JOBSPY_LOCATION (default: UK)
- JOBSPY_RESULTS_WANTED (default: 200)
- JOBSPY_HOURS_OLD (default: 72)
- JOBSPY_COUNTRY_INDEED (default: UK)
- JOBSPY_LINKEDIN_FETCH_DESCRIPTION (default: true)
- JOBSPY_IS_REMOTE (unset by default)
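To make the defaults concrete, here is a minimal sketch of how these variables could be resolved into a typed config. The JobSpyConfig shape, the helper names, and the rule that "0", "false", "no", and "off" count as false are assumptions for illustration, not the project's actual code.

```typescript
// Environment-like map; in Node this would be process.env.
type Env = Record<string, string | undefined>;

// Hypothetical config shape; field names are illustrative only.
interface JobSpyConfig {
  sites: string[];
  searchTerm: string;
  location: string;
  resultsWanted: number;
  hoursOld: number;
  countryIndeed: string;
  linkedinFetchDescription: boolean;
  isRemote: boolean | undefined; // undefined means "no remote filter"
}

// Assumption: "0", "false", "no", "off" are false; any other non-empty value is true.
function parseBool(raw: string | undefined): boolean | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  return ["0", "false", "no", "off"].indexOf(raw.trim().toLowerCase()) === -1;
}

// Falls back to the default when the value is missing or not an integer.
function parseIntOr(raw: string | undefined, fallback: number): number {
  const n = parseInt(raw ?? "", 10);
  return isNaN(n) ? fallback : n;
}

function resolveConfig(env: Env): JobSpyConfig {
  return {
    sites: (env["JOBSPY_SITES"] ?? "indeed,linkedin")
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s !== ""),
    searchTerm: env["JOBSPY_SEARCH_TERM"] ?? "web developer",
    location: env["JOBSPY_LOCATION"] ?? "UK",
    resultsWanted: parseIntOr(env["JOBSPY_RESULTS_WANTED"], 200),
    hoursOld: parseIntOr(env["JOBSPY_HOURS_OLD"], 72),
    countryIndeed: env["JOBSPY_COUNTRY_INDEED"] ?? "UK",
    linkedinFetchDescription:
      parseBool(env["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"]) ?? true,
    isRemote: parseBool(env["JOBSPY_IS_REMOTE"]),
  };
}
```

Taking the environment as a parameter keeps the function easy to test. In a Node process it would typically be called as resolveConfig(process.env).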
2) Orchestrator flow
The service in orchestrator/src/server/services/jobspy.ts:
- Builds search-term list from UI or env
- Runs Python once per term with unique output file
- Reads JSON and maps to CreateJobInput
- De-dupes by jobUrl
- Deletes temp output files best-effort
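Two of the steps above can be sketched as small pure functions: choosing a unique output file per term and de-duping by jobUrl. The file-name pattern, the runId parameter, and the choice to keep the first occurrence of each URL are assumptions. Only the data/imports/ directory and the jobUrl key come from this page.

```typescript
// Minimal stand-in for the real CreateJobInput, which has more fields.
interface CreateJobInputSketch {
  jobUrl: string;
  title: string;
}

// Hypothetical naming scheme: one JSON file per search term and run.
function outputPathForTerm(term: string, index: number, runId: string): string {
  const slug = term
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `data/imports/jobspy-${runId}-${index}-${slug}.json`;
}

// Keeps the first job seen for each jobUrl and preserves input order.
function dedupeByJobUrl<T extends { jobUrl: string }>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    if (seen.has(job.jobUrl)) continue;
    seen.add(job.jobUrl);
    unique.push(job);
  }
  return unique;
}
```

De-duping after all terms have been ingested matters because overlapping search terms often return the same posting more than once.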
3) Mapping and cleanup
- Normalizes salary ranges
- Converts empty values to null
- Keeps metadata like skills, ratings, remote flags when available
- Skips rows with invalid site or missing URL
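The cleanup rules above could look roughly like the following sketch. The raw key names (site, job_url, title, min_amount, max_amount), the MappedJob shape, and the specific salary normalization (coercing to numbers and swapping reversed bounds) are assumptions. This page does not specify how salary ranges are normalized.

```typescript
type Site = "indeed" | "linkedin" | "glassdoor";
const VALID_SITES: readonly string[] = ["indeed", "linkedin", "glassdoor"];

// Raw row from the JobSpy JSON output; key names are assumed.
type RawRow = Record<string, unknown>;

// Hypothetical subset of the internal job shape.
interface MappedJob {
  source: Site;
  jobUrl: string;
  title: string | null;
  salaryMin: number | null;
  salaryMax: number | null;
}

// Empty or non-string values become null; strings are trimmed.
function emptyToNull(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  return trimmed === "" ? null : trimmed;
}

// Accepts numbers or numeric strings; anything else becomes null.
function toNumberOrNull(value: unknown): number | null {
  if (typeof value === "number") return isFinite(value) ? value : null;
  if (typeof value === "string" && value.trim() !== "") {
    const n = Number(value);
    return isFinite(n) ? n : null;
  }
  return null;
}

function isSite(value: unknown): value is Site {
  return typeof value === "string" && VALID_SITES.indexOf(value) !== -1;
}

// Returns null for rows that should be skipped.
function mapRow(row: RawRow): MappedJob | null {
  const site = row["site"];
  const jobUrl = emptyToNull(row["job_url"]);
  if (!isSite(site) || jobUrl === null) return null;

  let salaryMin = toNumberOrNull(row["min_amount"]);
  let salaryMax = toNumberOrNull(row["max_amount"]);
  // Assumed normalization: swap bounds that arrive reversed.
  if (salaryMin !== null && salaryMax !== null && salaryMin > salaryMax) {
    const tmp = salaryMin;
    salaryMin = salaryMax;
    salaryMax = tmp;
  }

  return { source: site, jobUrl, title: emptyToNull(row["title"]), salaryMin, salaryMax };
}
```

Returning null for skipped rows lets the caller filter them out in one pass before de-duping.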
Notes
- JOBSPY_SEARCH_TERMS can be a JSON array or text delimited by |, commas, or newlines.
- Set JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 to speed up runs.
- Temp output files are stored under data/imports/.
- If workplace type is only Remote, JobSpy runs with JOBSPY_IS_REMOTE=true.
- If workplace type includes Hybrid or Onsite, JobSpy cannot enforce those filters precisely, so the JobSpy-backed sources run without a workplace-type filter and may return broader results.
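As an illustration of the accepted JOBSPY_SEARCH_TERMS formats, here is one way the value could be parsed. The function name and the fallback to delimiter splitting when the JSON is malformed are assumptions, not the service's confirmed behavior.

```typescript
// Parses a JSON array, or text delimited by "|", commas, or newlines.
function parseSearchTerms(raw: string | undefined): string[] {
  const text = (raw ?? "").trim();
  if (text === "") return [];

  // Default interpretation: delimited text.
  let candidates: string[] = text.split(/[|,\n]/);

  // If the value looks like a JSON array, prefer that interpretation.
  if (text.charAt(0) === "[") {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) {
        candidates = parsed.filter((v): v is string => typeof v === "string");
      }
    } catch {
      // Not valid JSON: keep the delimited interpretation (assumed fallback).
    }
  }

  // Trim each term and drop empty entries.
  return candidates.map((c) => c.trim()).filter((c) => c !== "");
}
```

The JSON form is the only one that can carry a term containing a comma or pipe, since the delimited form would split it.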
Common Problems
Hybrid or Onsite was selected, but Indeed, LinkedIn, or Glassdoor still returned remote jobs
JobSpy only supports a strict remote toggle. Any workplace-type selection that includes Hybrid or Onsite broadens those source results.
A run returned fewer LinkedIn descriptions than expected
JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 disables description fetching to speed up runs.
Different cities need different workplace-type filters
This is not supported in the current automatic-run flow. JobSpy receives one global workplace-type selection per run/query invocation.
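The workplace-type behavior described in the notes and problems above comes down to a single decision per run. A minimal sketch, assuming a hypothetical WorkplaceType union and assuming that an empty selection leaves the variable unset:

```typescript
type WorkplaceType = "Remote" | "Hybrid" | "Onsite";

// Returns the value to export as JOBSPY_IS_REMOTE, or undefined to leave it unset.
// Only a selection consisting solely of Remote enables the strict remote toggle.
function resolveIsRemoteEnv(selected: WorkplaceType[]): "true" | undefined {
  const onlyRemote =
    selected.length > 0 && selected.every((t) => t === "Remote");
  return onlyRemote ? "true" : undefined;
}
```

For example, resolveIsRemoteEnv(["Remote", "Hybrid"]) leaves the variable unset, which is why such a selection can still return remote jobs.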