JobSpy Extractor
A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.
Original websites:
Big picture
JobSpy runs as a Python script once per search term and writes JSON. The orchestrator then ingests that output and normalizes it into the internal job shape.
1) Inputs and defaults
Key environment variables:
- JOBSPY_SITES (default: indeed,linkedin)
- JOBSPY_SEARCH_TERM (default: web developer)
- JOBSPY_LOCATION (default: empty)
- JOBSPY_RESULTS_WANTED (default: 200)
- JOBSPY_HOURS_OLD (default: 72)
- JOBSPY_COUNTRY_INDEED (default: empty)
- JOBSPY_LINKEDIN_FETCH_DESCRIPTION (default: true)
- JOBSPY_IS_REMOTE (unset by default)
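As a rough illustration of how the defaults above could be applied, here is a minimal TypeScript sketch. The JobSpyConfig shape, the helper names, and the boolean parsing rule are all assumptions made for this example; the source only lists the variable names and their defaults, and the actual Python script may read them differently.

```typescript
// Hypothetical config shape; the field names are illustrative only.
interface JobSpyConfig {
  sites: string[];
  searchTerm: string;
  location: string;
  resultsWanted: number;
  hoursOld: number;
  countryIndeed: string;
  linkedinFetchDescription: boolean;
  isRemote: boolean | undefined; // undefined means "no remote filter"
}

type Env = Record<string, string | undefined>;

// Return the trimmed value, or the fallback when the variable is unset or blank.
function envOr(env: Env, key: string, fallback: string): string {
  const value = (env[key] ?? "").trim();
  return value.length > 0 ? value : fallback;
}

// Assumption: "0", "false", "no", and "off" (any case) mean false; anything else is true.
const FALSE_VALUES = new Set(["0", "false", "no", "off"]);
function parseBool(raw: string): boolean {
  return !FALSE_VALUES.has(raw.toLowerCase());
}

// Fall back to the default when the value is not a positive integer.
function parsePositiveInt(raw: string, fallback: number): number {
  const n = Number.parseInt(raw, 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

function readJobSpyConfig(env: Env): JobSpyConfig {
  const remote = envOr(env, "JOBSPY_IS_REMOTE", "");
  return {
    sites: envOr(env, "JOBSPY_SITES", "indeed,linkedin")
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
    searchTerm: envOr(env, "JOBSPY_SEARCH_TERM", "web developer"),
    location: envOr(env, "JOBSPY_LOCATION", ""),
    resultsWanted: parsePositiveInt(envOr(env, "JOBSPY_RESULTS_WANTED", "200"), 200),
    hoursOld: parsePositiveInt(envOr(env, "JOBSPY_HOURS_OLD", "72"), 72),
    countryIndeed: envOr(env, "JOBSPY_COUNTRY_INDEED", ""),
    linkedinFetchDescription: parseBool(envOr(env, "JOBSPY_LINKEDIN_FETCH_DESCRIPTION", "true")),
    // Unset stays undefined so that no remote filter is applied by default.
    isRemote: remote === "" ? undefined : parseBool(remote),
  };
}
```

In a Node process this would typically be called as readJobSpyConfig(process.env). Taking the environment as a parameter keeps the function easy to exercise with a plain object.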
2) Orchestrator flow
The service in orchestrator/src/server/services/jobspy.ts:
- Builds search-term list from UI or env
- Runs Python once per term with unique output file
- Reads JSON and maps to CreateJobInput
- De-dupes by jobUrl
- Deletes temp output files best-effort
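The per-term loop and the de-duplication step above can be sketched as follows. This is not the code in jobspy.ts: collectJobs, dedupeByJobUrl, the injected runTerm callback (standing in for "run Python, then read its JSON"), and the output file naming are all hypothetical. Only the data/imports/ directory and the jobUrl key come from this document.

```typescript
// Keep the first job seen for each jobUrl, preserving the original order.
function dedupeByJobUrl<T extends { jobUrl: string }>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    if (seen.has(job.jobUrl)) continue;
    seen.add(job.jobUrl);
    unique.push(job);
  }
  return unique;
}

// runTerm is a stand-in for running the Python script for one term and
// reading back the JSON it wrote to outputPath.
function collectJobs<T extends { jobUrl: string }>(
  terms: string[],
  runTerm: (term: string, outputPath: string) => T[],
): T[] {
  const all: T[] = [];
  terms.forEach((term, index) => {
    // Hypothetical naming scheme: one distinct output file per term.
    const outputPath = `data/imports/jobspy-${index}.json`;
    all.push(...runTerm(term, outputPath));
  });
  return dedupeByJobUrl(all);
}
```

De-duplicating after all terms have run matters because overlapping search terms can return the same posting more than once.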
3) Mapping and cleanup
- Normalizes salary ranges
- Converts empty values to null
- Keeps metadata like skills, ratings, remote flags when available
- Skips rows with invalid site or missing URL
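A minimal sketch of the mapping rules listed above, under several assumptions: the raw field names (site, job_url, title, min_amount, max_amount) are taken to match JobSpy's output columns, MappedJob is a simplified stand-in for CreateJobInput, and the salary normalization shown (coercing to numbers and ordering min before max) is only one plausible reading of "normalizes salary ranges".

```typescript
type RawRow = Record<string, unknown>;

// Simplified, hypothetical stand-in for the real CreateJobInput type.
interface MappedJob {
  source: string;
  jobUrl: string;
  title: string | null;
  salaryMin: number | null;
  salaryMax: number | null;
}

const KNOWN_SITES = new Set(["indeed", "linkedin", "glassdoor"]);

// Non-strings and blank strings both become null.
function emptyToNull(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  return trimmed.length > 0 ? trimmed : null;
}

// Accept finite numbers or numeric strings; everything else becomes null.
function toNumberOrNull(value: unknown): number | null {
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "string" && value.trim() !== "") {
    const n = Number(value);
    return Number.isFinite(n) ? n : null;
  }
  return null;
}

// Returns null for rows that should be skipped (invalid site or missing URL).
function mapRow(row: RawRow): MappedJob | null {
  const site = emptyToNull(row["site"]);
  const jobUrl = emptyToNull(row["job_url"]);
  if (site === null || jobUrl === null) return null;
  const source = site.toLowerCase();
  if (!KNOWN_SITES.has(source)) return null;

  let salaryMin = toNumberOrNull(row["min_amount"]);
  let salaryMax = toNumberOrNull(row["max_amount"]);
  // Assumed normalization: make sure min is not greater than max.
  if (salaryMin !== null && salaryMax !== null && salaryMin > salaryMax) {
    const swap = salaryMin;
    salaryMin = salaryMax;
    salaryMax = swap;
  }
  return { source, jobUrl, title: emptyToNull(row["title"]), salaryMin, salaryMax };
}
```

Returning null for skipped rows lets the caller filter them out in one pass before de-duplication.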
Notes
- JOBSPY_SEARCH_TERMS can be a JSON array or text delimited by |, commas, or newlines.
- Set JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 to speed up runs.
- Temp output files are stored under data/imports/.
- If workplace type is only Remote, JobSpy runs with JOBSPY_IS_REMOTE=true.
- If workplace type includes Hybrid or Onsite, JobSpy cannot enforce those filters precisely, so the JobSpy-backed sources run without a workplace-type filter and may return broader results.
- If no city is configured, JobSpy runs without a hidden UK location fallback. The selected pipeline country is passed through when available, but no fake city or UK default is injected.
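To make the accepted JOBSPY_SEARCH_TERMS formats concrete, here is a small sketch of a parser that tries a JSON array first and otherwise splits on |, commas, and newlines. The function names and the exact precedence between the two formats are assumptions; the real parsing code may differ.

```typescript
// Trim each term and drop blanks.
function cleanTerms(terms: string[]): string[] {
  return terms.map((t) => t.trim()).filter((t) => t.length > 0);
}

function parseSearchTerms(raw: string): string[] {
  const text = raw.trim();
  if (text === "") return [];

  // Try a JSON array first so that terms containing commas stay intact.
  if (text.startsWith("[")) {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) {
        return cleanTerms(parsed.filter((v): v is string => typeof v === "string"));
      }
    } catch {
      // Not valid JSON: fall through to delimiter splitting.
    }
  }

  // Otherwise split on pipe, comma, or newline.
  return cleanTerms(text.split(/[|,\n]/));
}
```

For example, parseSearchTerms('["web developer", "data engineer"]') and parseSearchTerms("web developer|data engineer") both yield the same two terms.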
Common Problems
- Hybrid or Onsite was selected, but Indeed, LinkedIn, or Glassdoor still returned remote jobs. JobSpy only supports a strict remote toggle. Any workplace-type selection that includes Hybrid or Onsite broadens those source results.
- A run returned fewer LinkedIn descriptions than expected. JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 disables description fetching to speed up runs.
- Different cities need different workplace-type filters. This is not supported in the current automatic-run flow. JobSpy receives one global workplace-type selection per run/query invocation.
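The remote-toggle behavior behind the first problem can be summarized in a short sketch. The WorkplaceType union and both function names are hypothetical; the rule itself (only a Remote-only selection sets JOBSPY_IS_REMOTE=true, anything else runs unfiltered) is taken from the notes in this document.

```typescript
type WorkplaceType = "Remote" | "Hybrid" | "Onsite";

// true only when the selection is exactly Remote; otherwise no filter at all.
function resolveIsRemote(selected: WorkplaceType[]): boolean | undefined {
  if (selected.length === 0) return undefined;
  return selected.every((t) => t === "Remote") ? true : undefined;
}

// Extra environment variables to pass to the Python run for this selection.
function remoteEnv(selected: WorkplaceType[]): Record<string, string> {
  const extra: Record<string, string> = {};
  if (resolveIsRemote(selected) === true) {
    extra["JOBSPY_IS_REMOTE"] = "true";
  }
  return extra;
}
```

Because the result applies to the whole run, a selection such as Remote plus Hybrid leaves the variable unset, which is why remote jobs can still appear in those results.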