Version: 0.1.21

JobSpy Extractor

A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.

Big picture

JobSpy runs as a Python script, once per search term, and writes its results to a JSON file. The orchestrator then ingests that file and normalizes each row into the internal job shape.

1) Inputs and defaults

Key environment variables:

  • JOBSPY_SITES (default: indeed,linkedin)
  • JOBSPY_SEARCH_TERM (default: web developer)
  • JOBSPY_LOCATION (default: UK)
  • JOBSPY_RESULTS_WANTED (default: 200)
  • JOBSPY_HOURS_OLD (default: 72)
  • JOBSPY_COUNTRY_INDEED (default: UK)
  • JOBSPY_LINKEDIN_FETCH_DESCRIPTION (default: true)
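
As a rough illustration of how these variables could be resolved, the sketch below reads each one from the environment and falls back to the defaults listed above. The `load_config` and `parse_bool` helpers, the output key names, and the set of accepted truthy strings are assumptions made for this example, not the script's actual code.

```python
import os

# Defaults copied from the list above.
DEFAULTS = {
    "JOBSPY_SITES": "indeed,linkedin",
    "JOBSPY_SEARCH_TERM": "web developer",
    "JOBSPY_LOCATION": "UK",
    "JOBSPY_RESULTS_WANTED": "200",
    "JOBSPY_HOURS_OLD": "72",
    "JOBSPY_COUNTRY_INDEED": "UK",
    "JOBSPY_LINKEDIN_FETCH_DESCRIPTION": "true",
}


def parse_bool(value):
    """Assumed truthy spellings; anything else (e.g. "0") counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env=None):
    """Hypothetical loader: merge the environment over the documented defaults."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "sites": [s.strip() for s in raw["JOBSPY_SITES"].split(",") if s.strip()],
        "search_term": raw["JOBSPY_SEARCH_TERM"],
        "location": raw["JOBSPY_LOCATION"],
        "results_wanted": int(raw["JOBSPY_RESULTS_WANTED"]),
        "hours_old": int(raw["JOBSPY_HOURS_OLD"]),
        "country_indeed": raw["JOBSPY_COUNTRY_INDEED"],
        "linkedin_fetch_description": parse_bool(
            raw["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"]
        ),
    }
```

Passing a plain dict instead of `os.environ` (for example `load_config({"JOBSPY_SITES": "glassdoor"})`) makes it easy to try overrides without touching the real environment.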

2) Orchestrator flow

The service in orchestrator/src/server/services/jobspy.ts:

  • Builds search-term list from UI or env
  • Runs Python once per term with unique output file
  • Reads JSON and maps to CreateJobInput
  • De-dupes by jobUrl
  • Deletes temp output files best-effort
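
The steps above can be sketched roughly as follows. The real service is TypeScript; this Python version only illustrates the control flow. `collect_jobs`, the `run_scraper` callable, and the output file naming are hypothetical, and rows are assumed to already carry a `jobUrl` key by the time they are de-duplicated.

```python
import json
import uuid
from pathlib import Path


def collect_jobs(terms, run_scraper, out_dir):
    """Run the scraper once per term and merge results, de-duplicated by jobUrl.

    run_scraper(term, output_path) is a hypothetical callable that writes a
    JSON array of rows to output_path (standing in for the Python script).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen_urls = set()
    jobs = []
    for term in terms:
        # A unique file per term keeps runs from overwriting each other.
        output_path = out_dir / f"jobspy-{uuid.uuid4().hex}.json"
        try:
            run_scraper(term, output_path)
            rows = json.loads(output_path.read_text(encoding="utf-8"))
            for row in rows:
                url = row.get("jobUrl")
                if not url or url in seen_urls:
                    continue  # missing URL or already collected
                seen_urls.add(url)
                jobs.append(row)
        finally:
            # Best-effort cleanup: a failed delete should not fail the run.
            try:
                output_path.unlink()
            except OSError:
                pass
    return jobs
```

Keeping the cleanup in a `finally` block means the temp file is removed even when the scraper or the JSON parse fails, while errors from the delete itself are swallowed.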

3) Mapping and cleanup

  • Normalizes salary ranges
  • Converts empty values to null
  • Keeps metadata like skills, ratings, remote flags when available
  • Skips rows with invalid site or missing URL
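
A minimal sketch of this kind of cleanup, in Python for illustration only. The input keys (`site`, `job_url`, `min_amount`, `max_amount`, `is_remote`), the output key names, and the rule that a reversed salary range is swapped are all assumptions made for the example; the actual mapping lives in the TypeScript service and may differ.

```python
import math

VALID_SITES = {"indeed", "linkedin", "glassdoor"}


def clean(value):
    """Convert empty values (None, NaN, blank strings) to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str):
        value = value.strip()
        return value or None
    return value


def to_number(value):
    """Parse a salary bound; return None when it is empty or not numeric."""
    value = clean(value)
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number


def map_row(row):
    """Return a normalized job dict, or None when the row should be skipped."""
    site = clean(row.get("site"))
    url = clean(row.get("job_url"))
    if not isinstance(site, str) or site.lower() not in VALID_SITES or url is None:
        return None
    low = to_number(row.get("min_amount"))
    high = to_number(row.get("max_amount"))
    if low is not None and high is not None and low > high:
        low, high = high, low  # assumed rule: keep min <= max
    return {
        "source": site.lower(),
        "jobUrl": url,
        "title": clean(row.get("title")),
        "salaryMin": low,
        "salaryMax": high,
        "isRemote": clean(row.get("is_remote")),
    }
```

Returning `None` for rejected rows lets the caller filter them out in one pass before de-duplication.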

Notes

  • JOBSPY_SEARCH_TERMS can be a JSON array, or text delimited by |, commas, or newlines.
  • Set JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 to skip fetching LinkedIn descriptions and speed up runs.
  • Temp output files are stored under data/imports/.
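
To make the accepted JOBSPY_SEARCH_TERMS formats concrete, here is one way such a value could be parsed. `parse_search_terms` is a hypothetical helper written for this example, and the fallback from malformed JSON to delimiter splitting is an assumption rather than documented behavior.

```python
import json
import re


def parse_search_terms(raw):
    """Parse a JSON array, or |, comma, or newline-delimited text, into terms."""
    raw = (raw or "").strip()
    if not raw:
        return []
    if raw.startswith("["):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None  # not valid JSON; fall through to delimiter splitting
        if isinstance(parsed, list):
            return [str(term).strip() for term in parsed if str(term).strip()]
    # Split on any of the supported delimiters and drop blank entries.
    return [term.strip() for term in re.split(r"[|,\n]", raw) if term.strip()]
```

For example, `parse_search_terms('["web developer", "data engineer"]')` and `parse_search_terms("web developer|data engineer")` both yield the same two terms.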