Using the jobdata API for Machine Learning with Cleaned Job Descriptions

Use our pre-processed job descriptions for machine learning, focusing on job title classification and skills extraction.

4 min read · Oct. 13, 2024

In this tutorial, we will explore how to utilize the jobdata API to create a machine learning data pipeline for two specific use cases: Job Title Classification and Skills Extraction from Job Descriptions. We will focus on using the description_string field, which provides a cleaned version of job descriptions, making it ideal for text processing tasks. Additionally, we will filter our data to include only English job postings using the language parameter.

Prerequisites

Before we begin, ensure you have the following:

  1. API Key: You need a valid API key with an active access subscription.
  2. Python Environment: Set up a Python environment with the following libraries installed:
  • requests for making API calls.
  • pandas for data manipulation.
  • scikit-learn for machine learning tasks.
  • nltk for natural language processing tasks (optional; not used directly in this tutorial, but useful for further text preprocessing).

You can install the required libraries using pip:

pip install requests pandas scikit-learn nltk

Step 1: Fetching Job Data

We will start by fetching job data from the API, using the /api/jobs/ endpoint to retrieve job listings. We filter for English postings with the language parameter and set the description_str parameter so that each result includes the cleaned description_string field.

Fetching Job Listings

Here’s how to fetch job listings using Python:

import requests
import pandas as pd

# Define your API key and endpoint
API_KEY = 'YOUR_API_KEY'
url = "https://jobdataapi.com/api/jobs/"

# Set parameters for the API request
params = {
    "language": "en",  # Filter for English job postings
    "description_str": "true",  # Include cleaned description
    "page_size": 5000  # Number of results per page
}

# Set headers for authorization
headers = {
    "Authorization": f"Api-Key {API_KEY}"
}

# Fetch job listings
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()  # Fail early on authentication or quota errors
data = response.json()

# Convert results to a DataFrame
job_listings = pd.DataFrame(data['results'])
print(job_listings.head())

Data Structure

The job_listings DataFrame will contain various fields, including:

  • id: Job ID
  • title: Job title
  • description_string: Cleaned job description
  • company: Company information
  • location: Job location
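
Depending on your dataset, a quick cleanup pass before modeling can help. The sketch below drops rows that lack a cleaned description and flattens the nested company object into a plain column; the .get('name') access is an assumption about the payload shape, so adjust it to the fields you actually receive.

# Keep only rows that actually have a cleaned description
job_listings = job_listings.dropna(subset=['description_string'])

# Flatten the nested company object into a plain company_name column
# (assumes the object carries a 'name' key -- adjust if your payload differs)
job_listings['company_name'] = job_listings['company'].apply(
    lambda c: c.get('name') if isinstance(c, dict) else c
)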

Step 2: Dynamic Job Title Classification

Use Case Overview

In this use case, we will classify job titles into categories dynamically using unsupervised learning techniques, such as clustering. This approach allows us to categorize job titles without predefined lists.

Data Preparation

  1. Feature Extraction: We will use the description_string for feature extraction. We can use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text data into numerical features.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the job descriptions (missing descriptions become empty strings)
X = vectorizer.fit_transform(job_listings['description_string'].fillna(''))
  2. Clustering Job Titles: We will use K-Means clustering to group job titles into categories.

from sklearn.cluster import KMeans

# Define the number of clusters (categories)
num_clusters = 8  # You can adjust this number based on your data
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)  # explicit n_init keeps behavior stable across scikit-learn versions

# Fit the model
kmeans.fit(X)

# Assign cluster labels to job titles
job_listings['cluster'] = kmeans.labels_

Analyzing Clusters

You can analyze the clusters to see which job titles belong to which category.

# Display job titles and their assigned clusters
for cluster in range(num_clusters):
    print(f"\nCluster {cluster}:")
    print(job_listings[job_listings['cluster'] == cluster]['title'].values)
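
To get a better sense of what each cluster represents, you can also inspect the highest-weighted TF-IDF terms in each cluster centroid. Here is a minimal sketch, assuming scikit-learn 1.0+ (for get_feature_names_out):

import numpy as np

# Show the ten most influential terms for each cluster centroid
terms = vectorizer.get_feature_names_out()
for cluster in range(num_clusters):
    top_idx = np.argsort(kmeans.cluster_centers_[cluster])[::-1][:10]
    print(f"Cluster {cluster}: {', '.join(terms[i] for i in top_idx)}")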

Step 3: Skills Extraction from Job Descriptions

Use Case Overview

In this use case, we will extract skills mentioned in job descriptions. This can help job seekers understand the skills required for various positions.

Data Preparation

  1. Define Skills List: Create a list of skills you want to extract.

# Example skills list
skills_list = [
    "Python", "Java", "SQL", "Machine Learning", "Data Analysis", "Project Management",
    "JavaScript", "HTML", "CSS", "Marketing", "Communication", "Teaching"
]
  2. Extract Skills: We will create a function to extract skills from the description_string, matching whole words only so that, for example, "Java" does not match inside "JavaScript".

import re

def extract_skills(description, skills):
    # Match whole words/phrases only to avoid substring false positives
    return [skill for skill in skills
            if re.search(r"\b" + re.escape(skill) + r"\b", description, re.IGNORECASE)]

# Apply the function to extract skills
job_listings['extracted_skills'] = job_listings['description_string'].fillna('').apply(
    lambda x: extract_skills(x, skills_list)
)

Analyzing Extracted Skills

You can analyze the extracted skills to see which skills are most frequently mentioned across job postings.

# Flatten the list of extracted skills and count occurrences
all_skills = [skill for sublist in job_listings['extracted_skills'] for skill in sublist]
skills_count = pd.Series(all_skills).value_counts()

print(skills_count)
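
Because extract_skills returns each skill at most once per posting, dividing these counts by the number of postings gives the share of postings that mention each skill:

# Share of postings mentioning each skill, as a percentage
skills_share = (skills_count / len(job_listings) * 100).round(1)
print(skills_share.head(10))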

Conclusion

In this tutorial, we demonstrated how to leverage the jobdata API to create a machine learning data pipeline for two use cases: dynamic job title classification and skills extraction from job descriptions. By using the description_string field, we could extract relevant information for our models right away, without first stripping HTML tags and other unusable content that usually clutters raw job posts.

Next Steps

  • Experiment with clustering: Adjust the number of clusters in K-Means to see how it affects the categorization of job titles.
  • Expand the skills list: Include more skills to improve the extraction process.
  • Paginate through more job results: To increase your dataset, implement pagination in your API requests to fetch additional job listings. This can be done by iterating through pages (e.g. by using the URL provided by the next attribute when more pages for a query result are available) until you reach the desired number of job postings (see the sketch after this list).
  • Deploy the model: Consider deploying the trained model as a web service for real-time predictions.
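
As a starting point for the pagination idea above, here is a minimal sketch that follows the next URLs returned by the API; the 20,000-posting cap is an arbitrary example value:

# Follow the API's `next` links to accumulate more postings
all_results = []
next_url = url  # first page, as defined in Step 1
while next_url and len(all_results) < 20000:  # arbitrary example cap
    # The next URL already encodes the query string, so params are
    # only passed explicitly on the first request
    resp = requests.get(next_url, headers=headers,
                        params=params if next_url == url else None)
    resp.raise_for_status()
    payload = resp.json()
    all_results.extend(payload['results'])
    next_url = payload.get('next')

job_listings = pd.DataFrame(all_results)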

By following these steps, you can effectively utilize job data for various machine learning applications, enhancing the job search experience for users.
