# Using the jobdata API for Machine Learning with Cleaned Job Descriptions

Use our pre-processed job descriptions for machine learning, focusing on job title classification and skills extraction.

---

In this tutorial, we will use the jobdata API to build a machine learning data pipeline for two use cases: **Job Title Classification** and **Skills Extraction from Job Descriptions**. We will work with the `description_string` field, which provides a cleaned version of each job description and is therefore well suited to text processing tasks. We will also restrict our data to English job postings using the `language` parameter.

## Prerequisites

Before we begin, ensure you have the following:

1. **API Key**: You need a valid API key with an active [access subscription](/accounts/pricing/).
2. **Python Environment**: Set up a [Python](https://www.python.org/) environment with the following libraries installed:

   - `requests` for making API calls.
   - `pandas` for data manipulation.
   - `scikit-learn` for machine learning tasks.
   - `nltk` for natural language processing tasks (optional; the examples in this tutorial do not call it directly).

You can install the required libraries using pip:

```bash
pip install requests pandas scikit-learn nltk
```

## Step 1: Fetching Job Data

We will start by fetching job data from the API. We will use the `/api/jobs/` endpoint to retrieve job listings, filtering for English job postings with the `language` parameter and setting the `description_str` parameter to `true` so that each job posting in the results includes the `description_string` field.

### Fetching Job Listings

Here’s how to fetch job listings using Python:

```python
import requests
import pandas as pd

# Define your API key and endpoint
API_KEY = 'YOUR_API_KEY'
url = "https://jobdataapi.com/api/jobs/"

# Set parameters for the API request
params = {
    "language": "en",  # Filter for English job postings
    "description_str": "true",  # Include cleaned description
    "page_size": 5000  # Number of results per page
}

# Set headers for authorization
headers = {
    "Authorization": f"Api-Key {API_KEY}"
}

# Fetch job listings and fail early on HTTP errors (e.g. an invalid API key)
response = requests.get(url, headers=headers, params=params, timeout=60)
response.raise_for_status()
data = response.json()

# Convert results to a DataFrame
job_listings = pd.DataFrame(data['results'])
print(job_listings.head())
```
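
A single request returns one page of results. When more pages are available for a query, the response's `next` attribute provides the URL of the following page. The sketch below follows those links until they run out or a cap is reached. The names `iter_jobs`, `make_requests_getter`, and `max_jobs` are illustrative and not part of the API, and the code assumes that the `next` URL already carries the original query string, so `params` is sent only with the first request.

```python
def iter_jobs(first_url, params, get_page, max_jobs=None):
    """Yield job dicts page by page, following `next` links.

    `get_page(url, params)` must return the decoded JSON of one page.
    """
    url, query, seen = first_url, params, 0
    while url:
        page = get_page(url, query)
        for job in page.get("results", []):
            yield job
            seen += 1
            if max_jobs is not None and seen >= max_jobs:
                return
        url = page.get("next")  # None (or missing) on the last page
        query = None  # assumption: the `next` URL already includes the query string


def make_requests_getter(headers, timeout=60):
    """Build a `get_page` function backed by `requests` (hypothetical helper)."""
    import requests  # imported here so `iter_jobs` itself stays dependency-free

    def get_page(url, params):
        response = requests.get(url, headers=headers, params=params, timeout=timeout)
        response.raise_for_status()
        return response.json()

    return get_page
```

With the variables defined above, `pd.DataFrame(list(iter_jobs(url, params, make_requests_getter(headers), max_jobs=20000)))` would build a larger `job_listings` DataFrame. Passing the page fetcher in as an argument also makes the loop easy to try out against canned responses before spending API requests.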

### Data Structure

The `job_listings` DataFrame will contain various fields, including:

- `id`: Job ID
- `title`: Job title
- `description_string`: Cleaned job description
- `company`: Company information
- `location`: Job location
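
Before vectorizing, it can be worth checking that every row has usable text in `description_string`. The sketch below is an optional safeguard, written under the assumption that some postings may come back with a missing, very short, or duplicated description. The function name `prepare_descriptions` and the `min_chars` threshold are hypothetical choices for illustration.

```python
import pandas as pd


def prepare_descriptions(df, min_chars=50):
    """Drop rows with missing, very short, or duplicated descriptions."""
    out = df.copy()
    # Treat missing descriptions as empty text and trim surrounding whitespace
    out["description_string"] = out["description_string"].fillna("").str.strip()
    # Keep only rows with enough text to be informative (threshold is arbitrary)
    out = out[out["description_string"].str.len() >= min_chars]
    # Identical descriptions would otherwise be counted several times
    out = out.drop_duplicates(subset="description_string")
    return out.reset_index(drop=True)


# Tiny illustrative frame standing in for `job_listings`
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "description_string": ["x" * 60, None, "short", "x" * 60],
})
clean_demo = prepare_descriptions(demo)
```

Applied to the real data, this would be `job_listings = prepare_descriptions(job_listings)`.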

## Step 2: Dynamic Job Title Classification

### Use Case Overview

In this use case, we will group job titles into categories dynamically using an unsupervised learning technique: clustering. The clusters are computed from the job descriptions, and each title takes the cluster of its posting, so no predefined list of categories is needed.

### Data Preparation

1. **Feature Extraction**: We will use the `description_string` for feature extraction. We can use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text into numerical feature vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the job descriptions (missing descriptions become empty strings)
X = vectorizer.fit_transform(job_listings['description_string'].fillna(''))
```

2. **Clustering Job Titles**: We will use K-Means clustering on the TF-IDF vectors to group the postings, and with them their job titles, into categories.

```python
from sklearn.cluster import KMeans

# Define the number of clusters (categories)
num_clusters = 8  # You can adjust this number based on your data
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)

# Fit the model
kmeans.fit(X)

# Assign cluster labels to job titles
job_listings['cluster'] = kmeans.labels_
```
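
The number of clusters is a guess until it is checked against the data. One common heuristic, which is a general technique rather than anything specific to this API, is to fit K-Means for several candidate values and compare their silhouette scores (closer to 1 means tighter, better separated clusters). The sketch below does that on a toy corpus; the helper name `score_cluster_counts` and the sample texts are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def score_cluster_counts(X, candidate_ks, random_state=42):
    """Return {k: silhouette score} for each candidate number of clusters."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Toy corpus standing in for job_listings['description_string']
docs = [
    "python developer building backend services with sql",
    "java engineer maintaining backend services and sql databases",
    "registered nurse providing patient care in a hospital",
    "nurse practitioner supporting patient care and clinical teams",
    "marketing manager running social media campaigns",
    "digital marketing specialist planning advertising campaigns",
]
X_demo = TfidfVectorizer(stop_words="english").fit_transform(docs)

scores = score_cluster_counts(X_demo, candidate_ks=[2, 3])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real matrix, something like `score_cluster_counts(X, range(4, 16))` would cover a wider range. The silhouette computation compares every pair of postings, so for large datasets the `sample_size` argument of `silhouette_score` can keep the run time manageable.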

### Analyzing Clusters

You can analyze the clusters to see which job titles belong to which category.

```python
# Display job titles and their assigned clusters
for cluster in range(num_clusters):
    print(f"\nCluster {cluster}:")
    print(job_listings[job_listings['cluster'] == cluster]['title'].values)
```

## Step 3: Skills Extraction from Job Descriptions

### Use Case Overview

In this use case, we will extract skills mentioned in job descriptions. This can help job seekers understand the skills required for various positions.

### Data Preparation

1. **Define Skills List**: Create a list of skills you want to extract.

```python
# Example skills list
skills_list = [
    "Python", "Java", "SQL", "Machine Learning", "Data Analysis", "Project Management",
    "JavaScript", "HTML", "CSS", "Marketing", "Communication", "Teaching"
]
```

2. **Extract Skills**: We will create a function to extract skills from the `description_string`.

```python
import re

def extract_skills(description, skills):
    # Match whole words only, so that "Java" does not also match "JavaScript"
    text = description if isinstance(description, str) else ""
    return [s for s in skills if re.search(rf"(?<!\w){re.escape(s)}(?!\w)", text, re.IGNORECASE)]

# Apply the function to extract skills
job_listings['extracted_skills'] = job_listings['description_string'].apply(lambda x: extract_skills(x, skills_list))
```

### Analyzing Extracted Skills

You can analyze the extracted skills to see which skills are most frequently mentioned across job postings.

```python
# Flatten the list of extracted skills and count occurrences
all_skills = [skill for sublist in job_listings['extracted_skills'] for skill in sublist]
skills_count = pd.Series(all_skills).value_counts()

print(skills_count)
```
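
The overall counts can also be broken down by the clusters from Step 2, which shows which skills characterize each category. This is a minimal sketch using only the standard library; `skills_by_cluster` is a hypothetical helper, and it counts each skill at most once per posting.

```python
from collections import Counter, defaultdict


def skills_by_cluster(cluster_labels, skill_lists):
    """Return {cluster label: Counter of skills} from two aligned sequences."""
    counts = defaultdict(Counter)
    for label, skills in zip(cluster_labels, skill_lists):
        counts[label].update(set(skills))  # one count per skill per posting
    return dict(counts)


# Illustrative values standing in for the `cluster` and `extracted_skills` columns
demo_clusters = [0, 0, 1]
demo_skills = [["Python", "SQL"], ["Python"], ["Teaching"]]

per_cluster = skills_by_cluster(demo_clusters, demo_skills)
for label, counter in per_cluster.items():
    print(label, counter.most_common(3))
```

On the real data, the call would be `skills_by_cluster(job_listings['cluster'], job_listings['extracted_skills'])`, since both columns can be iterated row by row.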

## Conclusion

In this tutorial, we used the jobdata API to build a machine learning data pipeline for two use cases: dynamic job title classification and skills extraction from job descriptions. Because the `description_string` field is already cleaned, we could feed it directly into our models without first stripping the HTML tags and other unusable content that usually appear in raw job posts.

### Next Steps

- **Experiment with clustering**: Adjust the number of clusters in K-Means to see how it affects the categorization of job titles.
- **Expand the skills list**: Include more skills to improve the extraction process.
- **Paginate through more job results**: To increase your dataset, [implement pagination](/c/optimizing-api-requests-a-guide-to-efficient-jobdata-api-usage/#pagination-with-apijobs) in your API requests to fetch additional job listings. This can be done by iterating through pages (e.g. by using the URL provided by the `next` attribute when more pages for a query result are available) until you reach the desired number of job postings.
- **Deploy the model**: Consider deploying the trained model as a web service for real-time predictions.

By following these steps, you can effectively utilize job data for various machine learning applications, enhancing the job search experience for users.
