JobSpy Extractor
A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.
Big picture
JobSpy runs as a Python script, once per search term, and writes its results to JSON. The orchestrator then ingests that JSON and normalizes it into the internal job shape.
1) Inputs and defaults
Key environment variables:
- JOBSPY_SITES (default: indeed,linkedin)
- JOBSPY_SEARCH_TERM (default: web developer)
- JOBSPY_LOCATION (default: UK)
- JOBSPY_RESULTS_WANTED (default: 200)
- JOBSPY_HOURS_OLD (default: 72)
- JOBSPY_COUNTRY_INDEED (default: UK)
- JOBSPY_LINKEDIN_FETCH_DESCRIPTION (default: true)
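As a rough illustration of how these variables and their defaults could be resolved, here is a minimal TypeScript sketch. The function and field names (readJobSpyConfig, JobSpyConfig, intOr, boolOr) are hypothetical and not taken from the codebase. Treating JOBSPY_SITES as comma-separated, and "0" or "false" as off, are assumptions. The environment is passed in as a plain object so the sketch does not depend on any runtime.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface JobSpyConfig {
  sites: string[];
  searchTerm: string;
  location: string;
  resultsWanted: number;
  hoursOld: number;
  countryIndeed: string;
  linkedinFetchDescription: boolean;
}

type Env = Record<string, string | undefined>;

// Parse an integer, falling back when the variable is unset or not numeric.
function intOr(raw: string | undefined, fallback: number): number {
  const n = parseInt(raw ?? "", 10);
  return isNaN(n) ? fallback : n;
}

// Assumption: "0" and "false" mean off, any other non-empty value means on,
// and an unset or blank variable keeps the default.
function boolOr(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = raw.trim().toLowerCase();
  return !(value === "0" || value === "false");
}

function readJobSpyConfig(env: Env): JobSpyConfig {
  return {
    // Assumption: sites are given as a comma-separated list.
    sites: (env["JOBSPY_SITES"] ?? "indeed,linkedin")
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
    searchTerm: env["JOBSPY_SEARCH_TERM"] ?? "web developer",
    location: env["JOBSPY_LOCATION"] ?? "UK",
    resultsWanted: intOr(env["JOBSPY_RESULTS_WANTED"], 200),
    hoursOld: intOr(env["JOBSPY_HOURS_OLD"], 72),
    countryIndeed: env["JOBSPY_COUNTRY_INDEED"] ?? "UK",
    linkedinFetchDescription: boolOr(env["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"], true),
  };
}
```

In a Node process this could be called as readJobSpyConfig(process.env); passing an explicit object also makes the defaults easy to check in isolation.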
2) Orchestrator flow
The service in orchestrator/src/server/services/jobspy.ts:
- Builds search-term list from UI or env
- Runs Python once per term with unique output file
- Reads JSON and maps to CreateJobInput
- De-dupes by jobUrl
- Deletes temp output files best-effort
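The per-term loop and the de-duplication step can be sketched as follows. This is not the service's actual code: JobLike is a minimal stand-in for CreateJobInput, collectJobs and dedupeByJobUrl are hypothetical names, and the output file naming is invented for illustration. The runner is injected so the sketch stays independent of how the Python process is spawned. Keeping the first occurrence of each jobUrl is also an assumption.

```typescript
// Minimal stand-in for CreateJobInput, which carries more fields in practice.
interface JobLike {
  jobUrl: string;
  title: string;
}

// Keep the first job seen for each jobUrl and preserve input order.
function dedupeByJobUrl<T extends { jobUrl: string }>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    if (seen.has(job.jobUrl)) continue;
    seen.add(job.jobUrl);
    unique.push(job);
  }
  return unique;
}

// runTerm is a hypothetical callback that runs the scraper for one term,
// reads the given output file, and returns the mapped jobs.
function collectJobs(
  terms: string[],
  runTerm: (term: string, outputFile: string) => JobLike[],
): JobLike[] {
  const all: JobLike[] = [];
  terms.forEach((term, index) => {
    // One output file per term so runs never overwrite each other.
    // The file name pattern here is illustrative only.
    const outputFile = `data/imports/jobspy-${index}.json`;
    all.push(...runTerm(term, outputFile));
  });
  return dedupeByJobUrl(all);
}
```

De-duplicating after all terms have run, rather than per term, matters because overlapping search terms often return the same posting.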
3) Mapping and cleanup
- Normalizes salary ranges
- Converts empty values to null
- Keeps metadata like skills, ratings, remote flags when available
- Skips rows with invalid site or missing URL
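A minimal sketch of the mapping and cleanup rules above, under stated assumptions. The raw column names (site, job_url, title, company, min_amount, max_amount), the MappedJob shape, and the helper names are assumptions for illustration, not the real CreateJobInput. The source does not say how salary ranges are normalized, so this sketch only coerces values to numbers and swaps a reversed range.

```typescript
// Hypothetical raw row; real scraper output has many more columns.
type RawRow = Record<string, unknown>;

// Illustrative subset of the internal job shape.
interface MappedJob {
  site: string;
  jobUrl: string;
  title: string | null;
  company: string | null;
  salaryMin: number | null;
  salaryMax: number | null;
}

const VALID_SITES = ["indeed", "linkedin", "glassdoor"];

// Non-strings and blank strings become null; other strings are trimmed.
function textOrNull(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  return trimmed === "" ? null : trimmed;
}

// Finite numbers and numeric strings pass through; anything else becomes null.
function numberOrNull(value: unknown): number | null {
  let n = NaN;
  if (typeof value === "number") n = value;
  else if (typeof value === "string" && value.trim() !== "") n = Number(value);
  return isFinite(n) ? n : null;
}

// Returns null for rows that should be skipped.
function mapRow(row: RawRow): MappedJob | null {
  const site = textOrNull(row["site"]);
  const jobUrl = textOrNull(row["job_url"]);
  if (site === null || VALID_SITES.indexOf(site) === -1 || jobUrl === null) {
    return null;
  }
  let salaryMin = numberOrNull(row["min_amount"]);
  let salaryMax = numberOrNull(row["max_amount"]);
  // Assumed normalization: make sure min never exceeds max.
  if (salaryMin !== null && salaryMax !== null && salaryMin > salaryMax) {
    [salaryMin, salaryMax] = [salaryMax, salaryMin];
  }
  return {
    site,
    jobUrl,
    title: textOrNull(row["title"]),
    company: textOrNull(row["company"]),
    salaryMin,
    salaryMax,
  };
}
```

Returning null for skipped rows lets the caller filter them out in one pass before de-duplication.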
Notes
- JOBSPY_SEARCH_TERMS can be a JSON array or text delimited by |, commas, or newlines.
- Set JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 to speed runs.
- Temp output files are stored under data/imports/.
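The accepted JOBSPY_SEARCH_TERMS formats can be illustrated with a small parser. This is a sketch under assumptions, not the service's actual logic: parseSearchTerms is a hypothetical name, and trimming whitespace, dropping empty entries, and falling back to delimiter splitting when the JSON is invalid are choices made here.

```typescript
// Accept either a JSON array of strings or text split on |, comma, or newline.
function parseSearchTerms(raw: string): string[] {
  const text = raw.trim();
  if (text.charAt(0) === "[") {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) {
        return parsed
          .filter((t): t is string => typeof t === "string")
          .map((t) => t.trim())
          .filter((t) => t.length > 0);
      }
    } catch {
      // Not valid JSON: fall through to delimiter splitting.
    }
  }
  return text
    .split(/[|,\n]/)
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}
```

For example, parseSearchTerms('["web developer", "data engineer"]') and parseSearchTerms("web developer | data engineer") both yield the same two terms.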