Version: 0.2.2

Extractors Overview

This page helps you choose the right extractor for your run, understand key constraints, and navigate to detailed technical guides.

Extractor integrations are registered through manifests and loaded automatically at orchestrator startup. Runtime discovery scans only `extractors/*/(manifest.ts|src/manifest.ts)` and does not read manifests from `orchestrator/**`. Extractor-specific run logic should also live in `extractors/<name>/` so the orchestrator stays source-agnostic. To add a new source, follow Add an Extractor.
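As a rough illustration of the manifest-based registration, a file at `extractors/<name>/manifest.ts` might export a small descriptor like the one below. The field names here (`id`, `displayName`, `envVars`, `run`) are assumptions for illustration only, not the project's actual manifest contract:

```typescript
// Hypothetical manifest shape -- field names are illustrative,
// not the real contract defined by this project.
interface ExtractorManifest {
  id: string;          // source id used in routes such as /api/<id>/health
  displayName: string; // human-readable name
  envVars: string[];   // configuration knobs the extractor reads
  run: (searchTerm: string, limit: number) => Promise<unknown[]>;
}

// Example: a minimal manifest for a fictional "exampleboard" source.
export const manifest: ExtractorManifest = {
  id: "exampleboard",
  displayName: "Example Board",
  envVars: ["EXAMPLEBOARD_SEARCH_TERMS", "EXAMPLEBOARD_MAX_JOBS_PER_TERM"],
  run: async (_searchTerm, _limit) => {
    // A real extractor would scrape or call an API here; this stub returns nothing.
    return [];
  },
};
```

Because discovery only looks under `extractors/*/`, a manifest placed anywhere in `orchestrator/**` would simply never be loaded.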

Extractor chooser

| Extractor | Best use case | Core constraints/dependencies | Notable controls | Output/behavior notes |
| --- | --- | --- | --- | --- |
| Gradcracker | UK graduate roles from Gradcracker | Crawling stability depends on page structure and anti-bot behavior; tuned for low concurrency | `GRADCRACKER_SEARCH_TERMS`, `GRADCRACKER_MAX_JOBS_PER_TERM`, `JOBOPS_SKIP_APPLY_FOR_EXISTING` | Scrapes listing metadata, then detail pages and apply URL resolution |
| JobSpy | Multi-source discovery (Indeed, LinkedIn, Glassdoor) | Requires Python wrapper execution per term; source availability and quality vary by site/location | `JOBSPY_SITES`, `JOBSPY_SEARCH_TERMS`, `JOBSPY_RESULTS_WANTED`, `JOBSPY_HOURS_OLD`, `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` | Produces JSON per term, then the orchestrator normalizes and de-duplicates by jobUrl |
| Adzuna | API-based multi-country discovery with low scraping overhead | Requires valid App ID/App Key; country must be in the Adzuna-supported list | `ADZUNA_APP_ID`, `ADZUNA_APP_KEY`, `ADZUNA_MAX_JOBS_PER_TERM` | API pagination to dataset output; the orchestrator maps progress and de-duplicates by sourceJobId/jobUrl |
| Hiring Cafe | Browser-backed discovery using Hiring Cafe search APIs | Subject to upstream anti-bot checks; uses a browser context and encoded search-state payloads | `HIRING_CAFE_SEARCH_TERMS`, `HIRING_CAFE_COUNTRY`, `HIRING_CAFE_MAX_JOBS_PER_TERM`, `HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS` | Uses existing pipeline term/country/budget knobs and maps directly to normalized jobs |
| startup.jobs | Startup-focused discovery through the published startup-jobs-scraper package | No credentials required; detail enrichment depends on Playwright browser binaries being installed | Existing pipeline searchTerms, selected country/cities, jobspyResultsWanted; `npx playwright install` for fresh environments | Algolia-backed search plus detail-page enrichment via package import; the orchestrator maps normalized records and de-duplicates by jobUrl |
| Working Nomads | Remote-only discovery through the public Working Nomads jobs API | Public API is curated and remote-only; available fields are limited to the API payload shape | Existing pipeline searchTerms, selected country/cities, jobspyResultsWanted, workplace type | Fetches a single public JSON feed, filters locally by terms/location, infers job type when possible, and de-duplicates by source id/URL |
| UKVisaJobs | UK visa sponsorship-focused roles | Requires an authenticated session and periodic token/cookie refresh | `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`, `UKVISAJOBS_MAX_JOBS`, `UKVISAJOBS_SEARCH_KEYWORD` | API pagination + dataset output; the orchestrator de-dupes and may fetch missing descriptions |
| Manual Import | One-off jobs not covered by scrapers | Inference quality depends on model/provider and input quality; some URLs cannot be fetched reliably | App/API endpoints (`/api/manual-jobs/infer`, `/api/manual-jobs/import`) | Accepts text/HTML/URL, runs inference, then saves and scores the job after review |
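Most of the controls in the table are plain environment variables. The helpers below are hypothetical (not part of this codebase, and the default values are illustrative), but they sketch the typical pattern for reading a numeric knob such as `JOBSPY_RESULTS_WANTED` or a comma-separated list knob such as `JOBSPY_SITES` with a fallback:

```typescript
// Hypothetical helper: read a numeric env knob with a fallback default.
// Not part of the actual orchestrator; shown only to illustrate the pattern.
function numericEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Comma-separated list knobs follow a similar pattern.
function listEnv(name: string, fallback: string[]): string[] {
  const raw = process.env[name];
  if (!raw) return fallback;
  return raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

// Default values here are illustrative, not the project's real defaults.
const resultsWanted = numericEnv("JOBSPY_RESULTS_WANTED", 20);
const sites = listEnv("JOBSPY_SITES", ["indeed", "linkedin"]);
```

Falling back on malformed values (rather than crashing the run) keeps a misconfigured knob from taking down an otherwise valid multi-source run.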

Which extractor should I use?

  • Use JobSpy for broad first-pass sourcing across common boards.
  • Use Adzuna when you want API-first discovery in supported non-UK markets.
  • Use Hiring Cafe when you want another term/country-driven source without adding credentials.
  • Use startup.jobs when you want startup-heavy listings without maintaining another scraper locally.
  • Use Working Nomads when you want another remote-only source with a public API and curated global roles.
  • Use Gradcracker when targeting graduate pipelines in the UK.
  • Use UKVisaJobs for sponsorship-specific UK searches.
  • Use Manual Import when you already have a specific posting and need direct import.

Many runs combine sources: broad discovery first, then manual import for high-priority jobs that scraping misses.

Common problems

How do I health-check an extractor?

  • Every runtime-registered extractor source exposes GET /api/:source/health.
  • Use the source id from the shared extractor catalog, for example /api/linkedin/health or /api/gradcracker/health.
  • The health check runs a minimal live probe for that source with a fixed search term and a result limit of 1.
  • A 200 response means the extractor returned at least one valid normalized job for that source.
  • A 503 response means the extractor could not run successfully, returned no jobs, or returned invalid jobs.
  • A 404 response means the source id is unknown or not registered at runtime on that instance.
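Putting the status codes above together, a minimal client-side check might look like the sketch below. The endpoint path comes from this page; the helper names and the base URL are assumptions:

```typescript
// Map a health-endpoint HTTP status to a verdict, following the
// semantics described above (200 / 503 / 404).
function interpretHealth(status: number): string {
  switch (status) {
    case 200:
      return "healthy: at least one valid normalized job returned";
    case 503:
      return "unhealthy: extractor failed, returned no jobs, or returned invalid jobs";
    case 404:
      return "unknown: source id not registered at runtime on this instance";
    default:
      return `unexpected status ${status}`;
  }
}

// Usage sketch against a local instance (base URL is an assumption).
async function checkSource(source: string): Promise<string> {
  const res = await fetch(`http://localhost:3000/api/${source}/health`);
  return interpretHealth(res.status);
}
```

Because the probe runs a live fetch with a result limit of 1, it is cheap enough to run before a large scheduled job, but it still exercises the real upstream source, so a transient 503 may reflect upstream flakiness rather than a local misconfiguration.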