Using the jobdata API for Machine Learning with Cleaned Job Descriptions
Use our pre-processed job descriptions for machine learning, focusing on job title classification and skills extraction.
In this tutorial, we will explore how to use the jobdata API to build a machine learning data pipeline for two specific use cases: job title classification and skills extraction from job descriptions. We will focus on the `description_string` field, which provides a cleaned version of each job description, making it ideal for text processing tasks. Additionally, we will filter our data to include only English job postings using the `language` parameter.
Prerequisites
Before we begin, ensure you have the following:
- API Key: You need a valid API key with an active access subscription.
- Python Environment: Set up a Python environment with the following libraries installed:
  - `requests` for making API calls.
  - `pandas` for data manipulation.
  - `scikit-learn` for machine learning tasks.
  - `nltk` for natural language processing tasks.
You can install the required libraries using pip:
```shell
pip install requests pandas scikit-learn nltk
```
Step 1: Fetching Job Data
We will start by fetching job data from the API. We will use the `/api/jobs/` endpoint to retrieve job listings, filtering for English job postings and setting the `description_str` parameter so that the cleaned `description_string` value is included for each job posting in the results.
Fetching Job Listings
Here’s how to fetch job listings using Python:
```python
import requests
import pandas as pd

# Define your API key and endpoint
API_KEY = 'YOUR_API_KEY'
url = "https://jobdataapi.com/api/jobs/"

# Set parameters for the API request
params = {
    "language": "en",           # Filter for English job postings
    "description_str": "true",  # Include cleaned description
    "page_size": 5000           # Number of results per page
}

# Set headers for authorization
headers = {
    "Authorization": f"Api-Key {API_KEY}"
}

# Fetch job listings
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()  # Fail fast on authentication or quota errors
data = response.json()

# Convert results to a DataFrame
job_listings = pd.DataFrame(data['results'])
print(job_listings.head())
```
Data Structure
The `job_listings` DataFrame will contain various fields, including:
- `id`: Job ID
- `title`: Job title
- `description_string`: Cleaned job description
- `company`: Company information
- `location`: Job location
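Before moving on, it is worth guarding against postings whose cleaned description is missing or empty, since such rows would otherwise break the TF-IDF step in the next section. Here is a minimal sketch, using a small hypothetical DataFrame in place of the real `job_listings`:

```python
import pandas as pd

# Hypothetical sample standing in for the fetched job_listings DataFrame
job_listings = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Data Scientist", "Backend Engineer", "Marketing Manager"],
    "description_string": ["Python and SQL required.", None, "Communication skills."],
})

# Drop rows whose cleaned description is missing or blank
job_listings = job_listings.dropna(subset=["description_string"])
job_listings = job_listings[job_listings["description_string"].str.strip() != ""]

print(job_listings["id"].tolist())  # → [1, 3]
```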
Step 2: Dynamic Job Title Classification
Use Case Overview
In this use case, we will classify job titles into categories dynamically using unsupervised learning techniques, such as clustering. This approach allows us to categorize job titles without predefined lists.
Data Preparation
- Feature Extraction: We will use the `description_string` field for feature extraction, applying TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text data into numerical format.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the job descriptions
X = vectorizer.fit_transform(job_listings['description_string'])
```
- Clustering Job Titles: We will use K-Means clustering to group job titles into categories.
```python
from sklearn.cluster import KMeans

# Define the number of clusters (categories)
num_clusters = 8  # You can adjust this number based on your data
kmeans = KMeans(n_clusters=num_clusters, random_state=42)

# Fit the model
kmeans.fit(X)

# Assign cluster labels to job titles
job_listings['cluster'] = kmeans.labels_
```
Analyzing Clusters
You can analyze the clusters to see which job titles belong to which category.
```python
# Display job titles and their assigned clusters
for cluster in range(num_clusters):
    print(f"\nCluster {cluster}:")
    print(job_listings[job_listings['cluster'] == cluster]['title'].values)
```
Step 3: Skills Extraction from Job Descriptions
Use Case Overview
In this use case, we will extract skills mentioned in job descriptions. This can help job seekers understand the skills required for various positions.
Data Preparation
- Define Skills List: Create a list of skills you want to extract.
```python
# Example skills list
skills_list = [
    "Python", "Java", "SQL", "Machine Learning", "Data Analysis", "Project Management",
    "JavaScript", "HTML", "CSS", "Marketing", "Communication", "Teaching"
]
```
- Extract Skills: We will create a function to extract skills from the `description_string` field.
```python
def extract_skills(description, skills):
    found_skills = [skill for skill in skills if skill.lower() in description.lower()]
    return found_skills

# Apply the function to extract skills
job_listings['extracted_skills'] = job_listings['description_string'].apply(
    lambda x: extract_skills(x, skills_list)
)
```
Analyzing Extracted Skills
You can analyze the extracted skills to see which skills are most frequently mentioned across job postings.
```python
# Flatten the list of extracted skills and count occurrences
all_skills = [skill for sublist in job_listings['extracted_skills'] for skill in sublist]
skills_count = pd.Series(all_skills).value_counts()
print(skills_count)
```
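Note that plain substring matching can over-count: with the list above, any posting mentioning "JavaScript" will also be credited with "Java". A word-boundary variant of the extraction function, shown here as a self-contained sketch, avoids such false positives:

```python
import re

def extract_skills_strict(description, skills):
    """Match each skill as a whole word (case-insensitive) rather than
    as a bare substring, so "Java" is not found inside "JavaScript"."""
    found = []
    for skill in skills:
        pattern = r"\b" + re.escape(skill) + r"\b"
        if re.search(pattern, description, flags=re.IGNORECASE):
            found.append(skill)
    return found

skills_list = ["Python", "Java", "JavaScript", "SQL"]
print(extract_skills_strict("We need JavaScript and SQL experience.", skills_list))
# → ['JavaScript', 'SQL']
```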
Conclusion
In this tutorial, we demonstrated how to leverage the jobdata API to create a machine learning data pipeline for two use cases: dynamic job title classification and skills extraction from job descriptions. By using the `description_string` field, we were able to quickly extract relevant information for our models without having to clean or filter out HTML tags and other unusable content that usually appears in raw job posts.
Next Steps
- Experiment with clustering: Adjust the number of clusters in K-Means to see how it affects the categorization of job titles.
- Expand the skills list: Include more skills to improve the extraction process.
- Paginate through more job results: To increase your dataset, implement pagination in your API requests to fetch additional job listings. This can be done by iterating through pages (e.g. by following the URL provided in the `next` attribute when more pages are available for a query result) until you reach the desired number of job postings.
- Deploy the model: Consider deploying the trained model as a web service for real-time predictions.
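The pagination idea above can be sketched as a small helper that keeps following each page's `next` URL until it is exhausted or a page budget is reached. This is an illustrative sketch, not part of the API client: `fetch_all_pages` and its `fetch_page` argument are hypothetical names, where `fetch_page` is any callable that takes a URL and returns the decoded JSON payload.

```python
def fetch_all_pages(fetch_page, first_url, max_pages=5):
    """Collect results across pages by following each page's `next` link.

    fetch_page: callable taking a URL and returning the decoded JSON
    payload, i.e. a dict with "results" and "next" keys.
    """
    results = []
    url = first_url
    pages_fetched = 0
    while url and pages_fetched < max_pages:
        data = fetch_page(url)
        results.extend(data["results"])
        url = data.get("next")  # None when there are no more pages
        pages_fetched += 1
    return results
```

With the real API you could pass something like `lambda u: requests.get(u, headers=headers).json()` as `fetch_page`, reusing the `headers` dict from Step 1, and a first URL that already carries the query string (e.g. `https://jobdataapi.com/api/jobs/?language=en&description_str=true&page_size=5000`).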
By following these steps, you can effectively utilize job data for various machine learning applications, enhancing the job search experience for users.