diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.ipynb b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.ipynb new file mode 100644 index 0000000000..2446f11fc8 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.ipynb @@ -0,0 +1,1386 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "muD_5mFUYB-x" + }, + "source": [ + "# Job recommendation system\n", + "\n", + "The code sample contains the following parts:\n", + "\n", + "1. Data exploration and visualization\n", + "2. Data cleaning/pre-processing\n", + "3. Fake job postings identification and removal\n", + "4. Job recommendation by showing the most similar job postings\n", + "\n", + "The scenario is that someone wants to find the best posting for themselves. They have collected the data, but he is not sure if all the data is real. Therefore, based on a trained model, as in this sample, they identify with a high degree of accuracy which postings are real, and it is among them that they choose the best ad for themselves.\n", + "\n", + "For simplicity, only one dataset will be used within this code, but the process using one dataset is not significantly different from the one described earlier.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GTu2WLQmZU-L" + }, + "source": [ + "## Data exploration and visualization\n", + "\n", + "For the purpose of this code sample we will use Real or Fake: Fake Job Postings dataset available over HuggingFace API. In this first part we will focus on data exploration and visualization. In standard end-to-end workload it is the first step. Engineer needs to first know the data to be able to work on it and prepare solution that will utilize dataset the best.\n", + "\n", + "Lest start with loading the dataset. We are using datasets library to do that." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "saMOoStVs0-s", + "outputId": "ba4623b9-0533-4062-b6b0-01e96bd4de39" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"victor/real-or-fake-fake-jobposting-prediction\")\n", + "dataset = dataset['train']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To better analyze and understand the data we are transferring it to pandas DataFrame, so we are able to take benefit from all pandas data transformations. Pandas library provides multiple useful functions for data manipulation so it is usual choice at this stage of machine learning or deep learning project.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rRkolJQKtAzt" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = dataset.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see 5 first and 5 last rows in the dataset we are working on." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 556 + }, + "id": "WYGIRBUJSl3N", + "outputId": "ccd4abaf-1b4d-4fbd-85c8-54408c4f9f8a" + }, + "outputs": [], + "source": [ + "df.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, lets print a concise summary of the dataset. This way we will see all the column names, know the number of rows and types in every of the column. It is a great overview on the features of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UtxA6fmaSrQ8", + "outputId": "e8a1ce15-88e8-487c-d05e-74c024aca994" + }, + "outputs": [], + "source": [ + "df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this point it is a good idea to make sure our dataset doen't contain any data duplication that could impact the results of our future system. To do that we firs need to remove `job_id` column. It contains unique number for each job posting so even if the rest of the data is the same between 2 postings it makes it different." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 556 + }, + "id": "f4LJCdKHStca", + "outputId": "b1db61e1-a909-463b-d369-b38c2349cba6" + }, + "outputs": [], + "source": [ + "# Drop the 'job_id' column\n", + "df = df.drop(columns=['job_id'])\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now, the actual duplicates removal. We first pring the number of duplicates that are in our dataset, than using `drop_duplicated` method we are removing them and after this operation printing the number of the duplicates. If everything works as expected after duplicates removal we should print `0` as current number of duplicates in the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Ow8SgJg2vJkB", + "outputId": "9a6050bf-f4bf-4b17-85a1-8d2980cd77ee" + }, + "outputs": [], + "source": [ + "# let's make sure that there are no duplicated jobs\n", + "\n", + "print(df.duplicated().sum())\n", + "df = df.drop_duplicates()\n", + "print(df.duplicated().sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tcpcjR8UUQCJ" + }, + "source": [ + "Now we can visualize the data from the dataset. First let's visualize data as it is all real, and later, for the purposes of the fake data detection, we will also visualize it spreading fake and real data.\n", + "\n", + "When working with text data it can be challenging to visualize it. Thankfully, there is a `wordcloud` library that shows common words in the analyzed texts. The bigger word is, more often the word is in the text. Wordclouds allow us to quickly identify the most important topic and themes in a large text dataset and also explore patterns and trends in textural data.\n", + "\n", + "In our example, we will create wordcloud for job titles, to have high-level overview of job postings we are working with." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 544 + }, + "id": "c0jsAvejvzQ5", + "outputId": "7622e54f-6814-47e1-d9c6-4ee13173b4f4" + }, + "outputs": [], + "source": [ + "from wordcloud import WordCloud # module to print word cloud\n", + "from matplotlib import pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# On the basis of Job Titles form word cloud\n", + "job_titles_text = ' '.join(df['title'])\n", + "wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)\n", + "\n", + "# Plotting Word Cloud\n", + "plt.figure(figsize=(10, 6))\n", + "plt.imshow(wordcloud, interpolation='bilinear')\n", + "plt.title('Job Titles')\n", + "plt.axis('off')\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Different possibility to get some information from this type of dataset is by showing top-n most common values in given column or distribution of the values int his column.\n", + "Let's show top 10 most common job titles and compare this result with previously showed wordcould." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "0Ut0qo0ywv3_", + "outputId": "705fbbf0-4dc0-4ee1-d821-edccaff78a85" + }, + "outputs": [], + "source": [ + "# Get Count of job title\n", + "job_title_counts = df['title'].value_counts()\n", + "\n", + "# Plotting a bar chart for the top 10 most common job titles\n", + "top_job_titles = job_title_counts.head(10)\n", + "plt.figure(figsize=(10, 6))\n", + "top_job_titles.sort_values().plot(kind='barh')\n", + "plt.title('Top 10 Most Common Job Titles')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Job Titles')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can do the same for different columns, as `employment_type`, `required_experience`, `telecommuting`, `has_company_logo` and `has_questions`. These should give us reale good overview of different parts of our dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "OaBkEWNLxkqK", + "outputId": "efbd9955-5630-4fdb-a6dd-f4ffe4b0a7d8" + }, + "outputs": [], + "source": [ + "# Count the occurrences of each work type\n", + "work_type_counts = df['employment_type'].value_counts()\n", + "\n", + "# Plotting the distribution of work types\n", + "plt.figure(figsize=(8, 6))\n", + "work_type_counts.sort_values().plot(kind='barh')\n", + "plt.title('Distribution of Work Types Offered by Jobs')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Work Types')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "5uTBPGXgyZEV", + "outputId": "d6c76b5f-25ce-4730-f849-f881315ca883" + }, + "outputs": [], + "source": [ + "# Count the occurrences of required experience types\n", + "work_type_counts = df['required_experience'].value_counts()\n", + "\n", + "# Plotting the distribution of work types\n", + "plt.figure(figsize=(8, 6))\n", + "work_type_counts.sort_values().plot(kind='barh')\n", + "plt.title('Distribution of Required Experience by Jobs')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Required Experience')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For employment_type and required_experience we also created matrix to see if there is any corelation between those two. To visualize it we created heatmap. If you think that some of the parameters can be related, creating similar heatmap can be a good idea." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 696 + }, + "id": "nonO2cHR1I-t", + "outputId": "3101b8b2-cf0a-413b-b0aa-eb2a3a96a582" + }, + "outputs": [], + "source": [ + "from matplotlib import pyplot as plt\n", + "import seaborn as sns\n", + "import pandas as pd\n", + "\n", + "plt.subplots(figsize=(8, 8))\n", + "df_2dhist = pd.DataFrame({\n", + " x_label: grp['required_experience'].value_counts()\n", + " for x_label, grp in df.groupby('employment_type')\n", + "})\n", + "sns.heatmap(df_2dhist, cmap='viridis')\n", + "plt.xlabel('employment_type')\n", + "_ = plt.ylabel('required_experience')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "mXdpeQFJ1VMu", + "outputId": "eb9a893f-5087-4dad-ceca-48a1dfeb0b02" + }, + "outputs": [], + "source": [ + "# Count the occurrences of unique values in the 'telecommuting' column\n", + "telecommuting_counts = df['telecommuting'].value_counts()\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "telecommuting_counts.sort_values().plot(kind='barh')\n", + "plt.title('Counts of telecommuting vs Non-telecommuting')\n", + "plt.xlabel('count')\n", + "plt.ylabel('telecommuting')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "8kEu4IKcVmSV", + "outputId": "94ae873f-9178-4c63-e855-d677f135e552" + }, + "outputs": [], + "source": [ + "has_company_logo_counts = df['has_company_logo'].value_counts()\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "has_company_logo_counts.sort_values().plot(kind='barh')\n", + "plt.ylabel('has_company_logo')\n", + "plt.xlabel('Count')\n", + 
"plt.title('Counts of With_Logo vs Without_Logo')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "Esv8b51EVvxx", + "outputId": "40355e3f-fc6b-4b16-d459-922cfede2f71" + }, + "outputs": [], + "source": [ + "has_questions_counts = df['has_questions'].value_counts()\n", + "\n", + "# Plot the counts\n", + "plt.figure(figsize=(8, 6))\n", + "has_questions_counts.sort_values().plot(kind='barh')\n", + "plt.ylabel('has_questions')\n", + "plt.xlabel('Count')\n", + "plt.title('Counts Questions vs NO_Questions')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the job recommendations point of view the salary and location can be really important parameters to take into consideration. In given dataset we have salary ranges available so there is no need for additional data processing rather than removal of empty ranges but if the dataset you're working on has specific values, consider organizing it into appropriate ranges and only then displaying the result." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "6SQO5PVLy8vt", + "outputId": "f0dbdf21-af94-4e56-cd82-938b7258c26f" + }, + "outputs": [], + "source": [ + "# Splitting benefits by comma and creating a list of benefits\n", + "benefits_list = df['salary_range'].str.split(',').explode()\n", + "benefits_list = benefits_list[benefits_list != 'None']\n", + "benefits_list = benefits_list[benefits_list != '0-0']\n", + "\n", + "\n", + "# Counting the occurrences of each skill\n", + "benefits_count = benefits_list.str.strip().value_counts()\n", + "\n", + "# Plotting the top 10 most common benefits\n", + "top_benefits = benefits_count.head(10)\n", + "plt.figure(figsize=(10, 6))\n", + "top_benefits.sort_values().plot(kind='barh')\n", + "plt.title('Top 10 Salaries Range Offered by Companies')\n", + "plt.xlabel('Frequency')\n", + "plt.ylabel('Salary Range')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the location we have both county, state and city specified, so we need to split it into individual columns, and then show top 10 counties and cities." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_242StA_UZTF" + }, + "outputs": [], + "source": [ + "# Split the 'location' column into separate columns for country, state, and city\n", + "location_split = df['location'].str.split(', ', expand=True)\n", + "df['Country'] = location_split[0]\n", + "df['State'] = location_split[1]\n", + "df['City'] = location_split[2]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 959 + }, + "id": "HS9SH6p9UaJU", + "outputId": "6562e31f-6719-448b-c290-1a9610eb50c2" + }, + "outputs": [], + "source": [ + "# Count the occurrences of unique values in the 'Country' column\n", + "Country_counts = df['Country'].value_counts()\n", + "\n", + "# Select the top 10 most frequent occurrences\n", + "top_10_Country = Country_counts.head(10)\n", + "\n", + "# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels\n", + "plt.figure(figsize=(14, 10))\n", + "sns.barplot(y=top_10_Country.index, x=top_10_Country.values)\n", + "plt.ylabel('Country')\n", + "plt.xlabel('Count')\n", + "plt.title('Top 10 Most Frequent Country')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 959 + }, + "id": "j_cPJl8pUcWT", + "outputId": "bb87ec2d-750d-45b0-f64f-ae4b84b00544" + }, + "outputs": [], + "source": [ + "# Count the occurrences of unique values in the 'City' column\n", + "City_counts = df['City'].value_counts()\n", + "\n", + "# Select the top 10 most frequent occurrences\n", + "top_10_City = City_counts.head(10)\n", + "\n", + "# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels\n", + "plt.figure(figsize=(14, 10))\n", + "sns.barplot(y=top_10_City.index, x=top_10_City.values)\n", + "plt.ylabel('City')\n", + "plt.xlabel('Count')\n", + "plt.title('Top 10 Most Frequent City')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-R8hkAIjVF_s" + }, + "source": [ + "### Fake job postings data visualization \n", + "\n", + "What about fraudulent class? Let see how many of the jobs in the dataset are fake. Whether there are equally true and false offers, or whether there is a significant disproportion between the two. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 651 + }, + "id": "KJ5Aq2IizZ4r", + "outputId": "e1c10006-9f5a-4321-d90a-28915e02f8c3" + }, + "outputs": [], + "source": [ + "## fake job visualization\n", + "# Count the occurrences of unique values in the 'fraudulent' column\n", + "fraudulent_counts = df['fraudulent'].value_counts()\n", + "\n", + "# Plot the counts using a rainbow color palette\n", + "plt.figure(figsize=(8, 6))\n", + "sns.barplot(x=fraudulent_counts.index, y=fraudulent_counts.values)\n", + "plt.xlabel('Fraudulent')\n", + "plt.ylabel('Count')\n", + "plt.title('Counts of Fraudulent vs Non-Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "oyeB2MFRVIWi", + "outputId": "9236f907-4c16-49f7-c14b-883d21ae6c2c" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(10, 6))\n", + "sns.countplot(data=df, x='employment_type', hue='fraudulent')\n", + "plt.title('Count of Fraudulent Cases by Employment Type')\n", + "plt.xlabel('Employment Type')\n", + "plt.ylabel('Count')\n", + "plt.legend(title='Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 564 + }, + "id": "ORGFxjVVVJBi", + "outputId": "084304de-5618-436a-8958-6f36abd72be7" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(10, 6))\n", + "sns.countplot(data=df, x='required_experience', hue='fraudulent')\n", + "plt.title('Count of Fraudulent Cases by Required Experience')\n", + "plt.xlabel('Required Experience')\n", + "plt.ylabel('Count')\n", + "plt.legend(title='Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "GnRPXpBWVL7O", + "outputId": "8d347181-83d8-44ae-9c57-88d98825694d" + }, + "outputs": [], + "source": [ + "plt.figure(figsize=(30, 18))\n", + "sns.countplot(data=df, x='required_education', hue='fraudulent')\n", + "plt.title('Count of Fraudulent Cases by Required Education')\n", + "plt.xlabel('Required Education')\n", + "plt.ylabel('Count')\n", + "plt.legend(title='Fraudulent')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8qKuYrkvVPlO" + }, + "source": [ + "We can see that there is no connection between those parameters and fake job postings. This way in the future processing we can remove them." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BbOwzXmdaJTw" + }, + "source": [ + "## Data cleaning/pre-processing\n", + "\n", + "One of the really important step related to any type of data processing is data cleaning. For texts it usually includes removal of stop words, special characters, numbers or any additional noise like hyperlinks. \n", + "\n", + "In our case, to prepare data for Fake Job Postings recognition we will first, combine all relevant columns into single new record and then clean the data to work on it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jYLwp2wSaMdi" + }, + "outputs": [], + "source": [ + "# List of columns to concatenate\n", + "columns_to_concat = ['title', 'location', 'department', 'salary_range', 'company_profile',\n", + " 'description', 'requirements', 'benefits', 'employment_type',\n", + " 'required_experience', 'required_education', 'industry', 'function']\n", + "\n", + "# Concatenate the values of specified columns into a new column 'job_posting'\n", + "df['job_posting'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)\n", + "\n", + "# Create a new DataFrame with columns 'job_posting' and 'fraudulent'\n", + "new_df = df[['job_posting', 'fraudulent']].copy()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "FulR3zMiaMgI", + "outputId": "995058f3-f5f7-4aec-e1e0-94d42aad468f" + }, + "outputs": [], + "source": [ + "new_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0TpoEx1-YgCs", + "outputId": "8eaaf021-ae66-477a-f07b-fd8a353d17eb", + "scrolled": true + }, + "outputs": [], + "source": [ + "# import spacy\n", + "import re\n", + "import nltk\n", + "from nltk.corpus import stopwords\n", + "\n", + "nltk.download('stopwords')\n", + "\n", + "def preprocess_text(text):\n", + " # Remove newlines, carriage returns, and tabs\n", + " text = re.sub('\\n','', text)\n", + " text = re.sub('\\r','', text)\n", + " text = re.sub('\\t','', text)\n", + " # Remove URLs\n", + " text = re.sub(r\"http\\S+|www\\S+|https\\S+\", \"\", text, flags=re.MULTILINE)\n", + "\n", + " # Remove special characters\n", + " text = re.sub(r\"[^a-zA-Z0-9\\s]\", \"\", text)\n", + "\n", + " # Remove punctuation\n", + " text = re.sub(r'[^\\w\\s]', '', text)\n", + "\n", + " # Remove digits\n", + " text = re.sub(r'\\d', '', text)\n", + "\n", + " # Convert to lowercase\n", + " text = text.lower()\n", + "\n", + " # Remove stop words\n", + " stop_words = set(stopwords.words('english'))\n", + " words = [word for word in text.split() if word.lower() not in stop_words]\n", + " text = ' '.join(words)\n", + "\n", + " return text\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "p9NHS6Vx2BE8" + }, + "outputs": [], + "source": [ + "new_df['job_posting'] = new_df['job_posting'].apply(preprocess_text)\n", + "\n", + "new_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The next step in the pre-processing is lemmatization. It is a process to reduce a word to its root form, called a lemma. For example the verb 'planning' would be changed to 'plan' world." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kZnHODi-ZK33" + }, + "outputs": [], + "source": [ + "# Lemmatization\n", + "import en_core_web_sm\n", + "\n", + "nlp = en_core_web_sm.load()\n", + "\n", + "def lemmatize_text(text):\n", + " doc = nlp(text)\n", + " return \" \".join([token.lemma_ for token in doc])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uQauVQdw2LWa" + }, + "outputs": [], + "source": [ + "new_df['job_posting'] = new_df['job_posting'].apply(lemmatize_text)\n", + "\n", + "new_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dQDR6_SpZW0B" + }, + "source": [ + "At this stage we can also visualize the data with wordcloud by having special text column. We can show it for both fake and real dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 411 + }, + "id": "fdR9GAG6ZnPh", + "outputId": "57e9b5ae-87b4-4523-d0ae-8c8fd56cd9bc" + }, + "outputs": [], + "source": [ + "from wordcloud import WordCloud\n", + "\n", + "non_fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 0]['job_posting'])\n", + "fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 1]['job_posting'])\n", + "\n", + "wordcloud_non_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(non_fraudulent_text)\n", + "\n", + "wordcloud_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(fraudulent_text)\n", + "\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))\n", + "\n", + "ax1.imshow(wordcloud_non_fraudulent, interpolation='bilinear')\n", + "ax1.axis('off')\n", + "ax1.set_title('Non-Fraudulent Job Postings')\n", + "\n", + "ax2.imshow(wordcloud_fraudulent, interpolation='bilinear')\n", + "ax2.axis('off')\n", + "ax2.set_title('Fraudulent Job Postings')\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ihtfhOr7aNMa" + }, + "source": [ + "## Fake job postings identification and removal\n", + "\n", + "Nowadays, it is unfortunate that not all the job offers that are posted on papular portals are genuine. Some of them are created only to collect personal data. Therefore, just detecting fake job postings can be very essential. \n", + "\n", + "We will create bidirectional LSTM model with one hot encoding. Let's start with all necessary imports." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VNdX-xcjtVS2" + }, + "outputs": [], + "source": [ + "from tensorflow.keras.layers import Embedding\n", + "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", + "from tensorflow.keras.models import Sequential\n", + "from tensorflow.keras.preprocessing.text import one_hot\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.layers import Bidirectional\n", + "from tensorflow.keras.layers import Dropout" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Make sure, you're using Tensorflow version 2.15.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "IxY47-s7tbjU", + "outputId": "02d68552-ff52-422b-9044-e55e35ef1236" + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "tf.__version__" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let us import Intel Extension for TensorFlow*. We are using Python API `itex.experimental_ops_override()`. It automatically replace some TensorFlow operators by Custom Operators under `itex.ops` namespace, as well as to be compatible with existing trained parameters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import intel_extension_for_tensorflow as itex\n", + "\n", + "itex.experimental_ops_override()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We need to prepare data for the model we will create. First let's assign job_postings to X and fraudulent values to y (expected value)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "U-7klPFFtZgo" + }, + "outputs": [], + "source": [ + "X = new_df['job_posting']\n", + "y = new_df['fraudulent']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One hot encoding is a technique to represent categorical variables as numerical values. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3FFtUrPbtbmD" + }, + "outputs": [], + "source": [ + "voc_size = 5000\n", + "onehot_repr = [one_hot(words, voc_size) for words in X]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ygHx6LSg6ZUr", + "outputId": "5b152a4f-621b-400c-a65b-5fa19a934aa2" + }, + "outputs": [], + "source": [ + "sent_length = 40\n", + "embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)\n", + "print(embedded_docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating model\n", + "\n", + "We are creating Deep Neural Network using Bidirectional LSTM. The architecture is as followed:\n", + "\n", + "* Embedding layer\n", + "* Bidirectiona LSTM Layer\n", + "* Dropout layer\n", + "* Dense layer with sigmod function\n", + "\n", + "We are using Adam optimizer with binary crossentropy. We are optimism accuracy.\n", + "\n", + "If Intel® Extension for TensorFlow* backend is XPU, `tf.keras.layers.LSTM` will be replaced by `itex.ops.ItexLSTM`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Vnhm4huG-Mat", + "outputId": "dbc59ef1-168a-4e11-f38d-b47674dd4be6" + }, + "outputs": [], + "source": [ + "embedding_vector_features = 50\n", + "model_itex = Sequential()\n", + "model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))\n", + "model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))\n", + "model_itex.add(Dropout(0.3))\n", + "model_itex.add(Dense(1, activation='sigmoid'))\n", + "model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", + "print(model_itex.summary())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1-tz3hyc-PvN" + }, + "outputs": [], + "source": [ + "X_final = np.array(embedded_docs)\n", + "y_final = np.array(y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "POVN7X60-TnQ" + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's train the model. We are using standard `model.fit()` method providing training and testing dataset. You can easily modify number of epochs in this training process but keep in mind that the model can become overtrained, so that it will have very good results on training data, but poor results on test data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "U0cGa7ei-Ufh", + "outputId": "68ce942d-ea51-458f-ac6c-ab619ab1ce74" + }, + "outputs": [], + "source": [ + "model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The values returned by the model are in the range [0,1] Need to map them to integer values of 0 or 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "u4I8Y-R5EcDw", + "outputId": "be384d88-b27c-49c5-bebb-e9bdba986692" + }, + "outputs": [], + "source": [ + "y_pred = (model_itex.predict(X_test) > 0.5).astype(\"int32\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To demonstrate the effectiveness of our models we presented the confusion matrix and classification report available within the `scikit-learn` library." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 675 + }, + "id": "0lB3N6fxtbom", + "outputId": "97b1713d-b373-44e1-a5b2-15e41aa84016" + }, + "outputs": [], + "source": [ + "from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report\n", + "\n", + "conf_matrix = confusion_matrix(y_test, y_pred)\n", + "print(\"Confusion matrix:\")\n", + "print(conf_matrix)\n", + "\n", + "ConfusionMatrixDisplay.from_predictions(y_test, y_pred)\n", + "\n", + "class_report = classification_report(y_test, y_pred)\n", + "print(\"Classification report:\")\n", + "print(class_report)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ioa6oZNuaPnJ" + }, + "source": [ + "## Job recommendation by showing the most similar ones" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZGReO9ziJyXm" + }, + "source": [ + "Now, as we are sure that the data we are processing is real, we can get back to the original columns and create our recommendation system.\n", + "\n", + "Also use much more simple solution for recommendations. Even, as before we used Deep Learning to check if posting is fake, we can use classical machine learning algorithms to show similar job postings.\n", + "\n", + "First, let's filter fake job postings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 556 + }, + "id": "RsCZLWU0aMqN", + "outputId": "503c1b4e-26db-46fd-d69f-f8ee17c1519c" + }, + "outputs": [], + "source": [ + "real = df[df['fraudulent'] == 0]\n", + "real.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After that, we create a common column containing those text parameters that we want to be compared between theses and are relevant to us when making recommendations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "NLc-uoYeaMsy", + "outputId": "452602b0-88e2-4c9c-a6f0-5b069cc34009" + }, + "outputs": [], + "source": [ + "cols = ['title', 'description', 'requirements', 'required_experience', 'required_education', 'industry']\n", + "real = real[cols]\n", + "real.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 293 + }, + "id": "mX-xc2OetVzx", + "outputId": "e0f24240-d8dd-4f79-fda6-db15d2f4c54f" + }, + "outputs": [], + "source": [ + "real = real.fillna(value='')\n", + "real['text'] = real['description'] + real['requirements'] + real['required_experience'] + real['required_education'] + real['industry']\n", + "real.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see the mechanism that we will use to prepare recommendations - we will use sentence similarity based on prepared `text` column in our dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "\n", + "model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's prepare a few example sentences that cover 4 topics. 
On these sentences it will be easier to show how the similarities between the texts work than on the whole large dataset we have." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "messages = [\n", + " # Smartphones\n", + " \"I like my phone\",\n", + " \"My phone is not good.\",\n", + " \"Your cellphone looks great.\",\n", + "\n", + " # Weather\n", + " \"Will it snow tomorrow?\",\n", + " \"Recently a lot of hurricanes have hit the US\",\n", + " \"Global warming is real\",\n", + "\n", + " # Food and health\n", + " \"An apple a day, keeps the doctors away\",\n", + " \"Eating strawberries is healthy\",\n", + " \"Is paleo better than keto?\",\n", + "\n", + " # Asking about age\n", + " \"How old are you?\",\n", + " \"what is your age?\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we are preparing functions to show similarities between given sentences in the for of heat map. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import seaborn as sns\n", + "\n", + "def plot_similarity(labels, features, rotation):\n", + " corr = np.inner(features, features)\n", + " sns.set(font_scale=1.2)\n", + " g = sns.heatmap(\n", + " corr,\n", + " xticklabels=labels,\n", + " yticklabels=labels,\n", + " vmin=0,\n", + " vmax=1,\n", + " cmap=\"YlOrRd\")\n", + " g.set_xticklabels(labels, rotation=rotation)\n", + " g.set_title(\"Semantic Textual Similarity\")\n", + "\n", + "def run_and_plot(messages_):\n", + " message_embeddings_ = model.encode(messages_)\n", + " plot_similarity(messages_, message_embeddings_, 90)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_and_plot(messages)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's move back to our job postings dataset. First, we are using sentence encoding model to be able to calculate similarities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "encodings = []\n", + "for text in real['text']:\n", + " encodings.append(model.encode(text))\n", + "\n", + "real['encodings'] = encodings" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, we can chose job posting we wan to calculate similarities to. In our case it is first job posting in the dataset, but you can easily change it to any other job posting, by changing value in the `index` variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "index = 0\n", + "corr = np.inner(encodings[index], encodings)\n", + "real['corr_to_first'] = corr" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And based on the calculated similarities, we can show top most similar job postings, by sorting them according to calculated correlation value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "real.sort_values(by=['corr_to_first'], ascending=False).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this code sample we created job recommendation system. First, we explored and analyzed the dataset, then we pre-process the data and create fake job postings detection model. 
At the end we used sentence similarities to show top 5 recommendations - the most similar job descriptions to the chosen one. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Tensorflow", + "language": "python", + "name": "tensorflow" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.py b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.py new file mode 100644 index 0000000000..425ab1f5dd --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/JobRecommendationSystem.py @@ -0,0 +1,634 @@ +# %% [markdown] +# # Job recommendation system +# +# The code sample contains the following parts: +# +# 1. Data exploration and visualization +# 2. Data cleaning/pre-processing +# 3. Fake job postings identification and removal +# 4. Job recommendation by showing the most similar job postings +# +# The scenario is that someone wants to find the best posting for themselves. They have collected the data, but he is not sure if all the data is real. Therefore, based on a trained model, as in this sample, they identify with a high degree of accuracy which postings are real, and it is among them that they choose the best ad for themselves. +# +# For simplicity, only one dataset will be used within this code, but the process using one dataset is not significantly different from the one described earlier. +# + +# %% [markdown] +# ## Data exploration and visualization +# +# For the purpose of this code sample we will use Real or Fake: Fake Job Postings dataset available over HuggingFace API. In this first part we will focus on data exploration and visualization. In standard end-to-end workload it is the first step. Engineer needs to first know the data to be able to work on it and prepare solution that will utilize dataset the best. +# +# Lest start with loading the dataset. We are using datasets library to do that. + +# %% +from datasets import load_dataset + +dataset = load_dataset("victor/real-or-fake-fake-jobposting-prediction") +dataset = dataset['train'] + +# %% [markdown] +# To better analyze and understand the data we are transferring it to pandas DataFrame, so we are able to take benefit from all pandas data transformations. Pandas library provides multiple useful functions for data manipulation so it is usual choice at this stage of machine learning or deep learning project. +# + +# %% +import pandas as pd +df = dataset.to_pandas() + +# %% [markdown] +# Let's see 5 first and 5 last rows in the dataset we are working on. + +# %% +df.head() + +# %% +df.tail() + +# %% [markdown] +# Now, lets print a concise summary of the dataset. This way we will see all the column names, know the number of rows and types in every of the column. It is a great overview on the features of the dataset. + +# %% +df.info() + +# %% [markdown] +# At this point it is a good idea to make sure our dataset doen't contain any data duplication that could impact the results of our future system. 
To do that we first need to remove the `job_id` column. It contains a unique number for each job posting, so even when the rest of the data is identical between two postings, this column makes them look different.
+
+# %%
+# Drop the 'job_id' column
+df = df.drop(columns=['job_id'])
+df.head()
+
+# %% [markdown]
+# And now, the actual duplicates removal. We first print the number of duplicates in our dataset, then remove them using the `drop_duplicates` method, and after this operation print the number of duplicates again. If everything works as expected, after duplicates removal we should print `0` as the current number of duplicates in the dataset.
+
+# %%
+# let's make sure that there are no duplicated jobs
+
+print(df.duplicated().sum())
+df = df.drop_duplicates()
+print(df.duplicated().sum())
+
+# %% [markdown]
+# Now we can visualize the data from the dataset. First let's visualize the data as if it were all real; later, for the purposes of fake data detection, we will also visualize it split into fake and real postings.
+#
+# Text data can be challenging to visualize. Thankfully, there is a `wordcloud` library that shows common words in the analyzed texts. The bigger a word appears, the more often it occurs in the text. Wordclouds allow us to quickly identify the most important topics and themes in a large text dataset and also explore patterns and trends in textual data.
+#
+# In our example, we will create a wordcloud for job titles, to have a high-level overview of the job postings we are working with.
+
+# %%
+from wordcloud import WordCloud # module to print word cloud
+from matplotlib import pyplot as plt
+import seaborn as sns
+
+# On the basis of Job Titles form word cloud
+job_titles_text = ' '.join(df['title'])
+wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)
+
+# Plotting Word Cloud
+plt.figure(figsize=(10, 6))
+plt.imshow(wordcloud, interpolation='bilinear')
+plt.title('Job Titles')
+plt.axis('off')
+plt.tight_layout()
+plt.show()
+
+# %% [markdown]
+# Another way to get information from this type of dataset is to show the top-n most common values in a given column, or the distribution of values in that column.
+# Let's show the top 10 most common job titles and compare this result with the wordcloud shown previously.
+
+# %%
+# Get Count of job title
+job_title_counts = df['title'].value_counts()
+
+# Plotting a bar chart for the top 10 most common job titles
+top_job_titles = job_title_counts.head(10)
+plt.figure(figsize=(10, 6))
+top_job_titles.sort_values().plot(kind='barh')
+plt.title('Top 10 Most Common Job Titles')
+plt.xlabel('Frequency')
+plt.ylabel('Job Titles')
+plt.show()
+
+# %% [markdown]
+# Now we can do the same for other columns, such as `employment_type`, `required_experience`, `telecommuting`, `has_company_logo` and `has_questions`. These should give us a really good overview of different parts of our dataset.
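+
+# %% [markdown]
+# The per-column plots below all repeat the same pattern, so as an optional side note, a small helper
+# like the sketch here could replace the copy-pasted plotting code. This helper is only an illustration
+# (it is not used by the rest of the sample) and assumes `df` and `plt` are already available from the
+# cells above.
+
+# %%
+def plot_value_counts(frame, column, title=None):
+    # Count the unique values of a column and draw a horizontal bar chart of the counts
+    counts = frame[column].value_counts()
+    plt.figure(figsize=(8, 6))
+    counts.sort_values().plot(kind='barh')
+    plt.title(title or f'Distribution of {column}')
+    plt.xlabel('Frequency')
+    plt.ylabel(column)
+    plt.show()
+
+# Hypothetical usage: plot_value_counts(df, 'employment_type')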
+ +# %% +# Count the occurrences of each work type +work_type_counts = df['employment_type'].value_counts() + +# Plotting the distribution of work types +plt.figure(figsize=(8, 6)) +work_type_counts.sort_values().plot(kind='barh') +plt.title('Distribution of Work Types Offered by Jobs') +plt.xlabel('Frequency') +plt.ylabel('Work Types') +plt.show() + +# %% +# Count the occurrences of required experience types +work_type_counts = df['required_experience'].value_counts() + +# Plotting the distribution of work types +plt.figure(figsize=(8, 6)) +work_type_counts.sort_values().plot(kind='barh') +plt.title('Distribution of Required Experience by Jobs') +plt.xlabel('Frequency') +plt.ylabel('Required Experience') +plt.show() + +# %% [markdown] +# For employment_type and required_experience we also created matrix to see if there is any corelation between those two. To visualize it we created heatmap. If you think that some of the parameters can be related, creating similar heatmap can be a good idea. + +# %% +from matplotlib import pyplot as plt +import seaborn as sns +import pandas as pd + +plt.subplots(figsize=(8, 8)) +df_2dhist = pd.DataFrame({ + x_label: grp['required_experience'].value_counts() + for x_label, grp in df.groupby('employment_type') +}) +sns.heatmap(df_2dhist, cmap='viridis') +plt.xlabel('employment_type') +_ = plt.ylabel('required_experience') + +# %% +# Count the occurrences of unique values in the 'telecommuting' column +telecommuting_counts = df['telecommuting'].value_counts() + +plt.figure(figsize=(8, 6)) +telecommuting_counts.sort_values().plot(kind='barh') +plt.title('Counts of telecommuting vs Non-telecommuting') +plt.xlabel('count') +plt.ylabel('telecommuting') +plt.show() + +# %% +has_company_logo_counts = df['has_company_logo'].value_counts() + +plt.figure(figsize=(8, 6)) +has_company_logo_counts.sort_values().plot(kind='barh') +plt.ylabel('has_company_logo') +plt.xlabel('Count') +plt.title('Counts of With_Logo vs Without_Logo') +plt.show() + +# %% +has_questions_counts = df['has_questions'].value_counts() + +# Plot the counts +plt.figure(figsize=(8, 6)) +has_questions_counts.sort_values().plot(kind='barh') +plt.ylabel('has_questions') +plt.xlabel('Count') +plt.title('Counts Questions vs NO_Questions') +plt.show() + +# %% [markdown] +# From the job recommendations point of view the salary and location can be really important parameters to take into consideration. In given dataset we have salary ranges available so there is no need for additional data processing rather than removal of empty ranges but if the dataset you're working on has specific values, consider organizing it into appropriate ranges and only then displaying the result. + +# %% +# Splitting benefits by comma and creating a list of benefits +benefits_list = df['salary_range'].str.split(',').explode() +benefits_list = benefits_list[benefits_list != 'None'] +benefits_list = benefits_list[benefits_list != '0-0'] + + +# Counting the occurrences of each skill +benefits_count = benefits_list.str.strip().value_counts() + +# Plotting the top 10 most common benefits +top_benefits = benefits_count.head(10) +plt.figure(figsize=(10, 6)) +top_benefits.sort_values().plot(kind='barh') +plt.title('Top 10 Salaries Range Offered by Companies') +plt.xlabel('Frequency') +plt.ylabel('Salary Range') +plt.show() + +# %% [markdown] +# For the location we have both county, state and city specified, so we need to split it into individual columns, and then show top 10 counties and cities. 
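+
+# %% [markdown]
+# As a quick illustration of the split used in the next cell, a location string in this dataset
+# typically follows a "country, state, city" layout. The value below is a made-up example, not a
+# row taken from the dataset.
+
+# %%
+example_location = "US, NY, New York"  # hypothetical location value
+print(example_location.split(', '))    # ['US', 'NY', 'New York']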
+ +# %% +# Split the 'location' column into separate columns for country, state, and city +location_split = df['location'].str.split(', ', expand=True) +df['Country'] = location_split[0] +df['State'] = location_split[1] +df['City'] = location_split[2] + +# %% +# Count the occurrences of unique values in the 'Country' column +Country_counts = df['Country'].value_counts() + +# Select the top 10 most frequent occurrences +top_10_Country = Country_counts.head(10) + +# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels +plt.figure(figsize=(14, 10)) +sns.barplot(y=top_10_Country.index, x=top_10_Country.values) +plt.ylabel('Country') +plt.xlabel('Count') +plt.title('Top 10 Most Frequent Country') +plt.show() + +# %% +# Count the occurrences of unique values in the 'City' column +City_counts = df['City'].value_counts() + +# Select the top 10 most frequent occurrences +top_10_City = City_counts.head(10) + +# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels +plt.figure(figsize=(14, 10)) +sns.barplot(y=top_10_City.index, x=top_10_City.values) +plt.ylabel('City') +plt.xlabel('Count') +plt.title('Top 10 Most Frequent City') +plt.show() + +# %% [markdown] +# ### Fake job postings data visualization +# +# What about fraudulent class? Let see how many of the jobs in the dataset are fake. Whether there are equally true and false offers, or whether there is a significant disproportion between the two. + +# %% +## fake job visualization +# Count the occurrences of unique values in the 'fraudulent' column +fraudulent_counts = df['fraudulent'].value_counts() + +# Plot the counts using a rainbow color palette +plt.figure(figsize=(8, 6)) +sns.barplot(x=fraudulent_counts.index, y=fraudulent_counts.values) +plt.xlabel('Fraudulent') +plt.ylabel('Count') +plt.title('Counts of Fraudulent vs Non-Fraudulent') +plt.show() + +# %% +plt.figure(figsize=(10, 6)) +sns.countplot(data=df, x='employment_type', hue='fraudulent') +plt.title('Count of Fraudulent Cases by Employment Type') +plt.xlabel('Employment Type') +plt.ylabel('Count') +plt.legend(title='Fraudulent') +plt.show() + +# %% +plt.figure(figsize=(10, 6)) +sns.countplot(data=df, x='required_experience', hue='fraudulent') +plt.title('Count of Fraudulent Cases by Required Experience') +plt.xlabel('Required Experience') +plt.ylabel('Count') +plt.legend(title='Fraudulent') +plt.show() + +# %% +plt.figure(figsize=(30, 18)) +sns.countplot(data=df, x='required_education', hue='fraudulent') +plt.title('Count of Fraudulent Cases by Required Education') +plt.xlabel('Required Education') +plt.ylabel('Count') +plt.legend(title='Fraudulent') +plt.show() + +# %% [markdown] +# We can see that there is no connection between those parameters and fake job postings. This way in the future processing we can remove them. + +# %% [markdown] +# ## Data cleaning/pre-processing +# +# One of the really important step related to any type of data processing is data cleaning. For texts it usually includes removal of stop words, special characters, numbers or any additional noise like hyperlinks. +# +# In our case, to prepare data for Fake Job Postings recognition we will first, combine all relevant columns into single new record and then clean the data to work on it. 
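+
+# %% [markdown]
+# To make the cleaning steps concrete before the full pipeline below, here is a minimal, self-contained
+# sketch of the idea on a single made-up sentence: lowercase the text, strip URLs, punctuation and digits,
+# and drop a few stop words. The tiny stop-word set is only for illustration; the real pipeline below uses
+# the NLTK stop-word corpus.
+
+# %%
+import re
+
+toy_text = "Visit http://example.com! We offer 3 AMAZING roles for the right candidate."
+toy_text = re.sub(r"http\S+|www\S+", "", toy_text.lower())   # remove URLs, lowercase
+toy_text = re.sub(r"[^a-z\s]", "", toy_text)                 # keep letters and spaces only
+toy_stop_words = {"we", "for", "the"}                        # illustrative subset of stop words
+print(" ".join(w for w in toy_text.split() if w not in toy_stop_words))
+# -> "visit offer amazing roles right candidate"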
+ +# %% +# List of columns to concatenate +columns_to_concat = ['title', 'location', 'department', 'salary_range', 'company_profile', + 'description', 'requirements', 'benefits', 'employment_type', + 'required_experience', 'required_education', 'industry', 'function'] + +# Concatenate the values of specified columns into a new column 'job_posting' +df['job_posting'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1) + +# Create a new DataFrame with columns 'job_posting' and 'fraudulent' +new_df = df[['job_posting', 'fraudulent']].copy() + +# %% +new_df.head() + +# %% +# import spacy +import re +import nltk +from nltk.corpus import stopwords + +nltk.download('stopwords') + +def preprocess_text(text): + # Remove newlines, carriage returns, and tabs + text = re.sub('\n','', text) + text = re.sub('\r','', text) + text = re.sub('\t','', text) + # Remove URLs + text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE) + + # Remove special characters + text = re.sub(r"[^a-zA-Z0-9\s]", "", text) + + # Remove punctuation + text = re.sub(r'[^\w\s]', '', text) + + # Remove digits + text = re.sub(r'\d', '', text) + + # Convert to lowercase + text = text.lower() + + # Remove stop words + stop_words = set(stopwords.words('english')) + words = [word for word in text.split() if word.lower() not in stop_words] + text = ' '.join(words) + + return text + + + +# %% +new_df['job_posting'] = new_df['job_posting'].apply(preprocess_text) + +new_df.head() + +# %% [markdown] +# The next step in the pre-processing is lemmatization. It is a process to reduce a word to its root form, called a lemma. For example the verb 'planning' would be changed to 'plan' world. + +# %% +# Lemmatization +import en_core_web_sm + +nlp = en_core_web_sm.load() + +def lemmatize_text(text): + doc = nlp(text) + return " ".join([token.lemma_ for token in doc]) + +# %% +new_df['job_posting'] = new_df['job_posting'].apply(lemmatize_text) + +new_df.head() + +# %% [markdown] +# At this stage we can also visualize the data with wordcloud by having special text column. We can show it for both fake and real dataset. + +# %% +from wordcloud import WordCloud + +non_fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 0]['job_posting']) +fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 1]['job_posting']) + +wordcloud_non_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(non_fraudulent_text) + +wordcloud_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(fraudulent_text) + +fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10)) + +ax1.imshow(wordcloud_non_fraudulent, interpolation='bilinear') +ax1.axis('off') +ax1.set_title('Non-Fraudulent Job Postings') + +ax2.imshow(wordcloud_fraudulent, interpolation='bilinear') +ax2.axis('off') +ax2.set_title('Fraudulent Job Postings') + +plt.show() + +# %% [markdown] +# ## Fake job postings identification and removal +# +# Nowadays, it is unfortunate that not all the job offers that are posted on papular portals are genuine. Some of them are created only to collect personal data. Therefore, just detecting fake job postings can be very essential. +# +# We will create bidirectional LSTM model with one hot encoding. Let's start with all necessary imports. 
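+
+# %% [markdown]
+# Before the full set of imports in the next cell, here is a tiny standalone illustration of the text
+# encoding used later: Keras' `one_hot` hashes each word to an integer index below the chosen vocabulary
+# size (it does not build explicit one-hot vectors), and `pad_sequences` pads every sequence to the same
+# length. The sentences and sizes below are made up purely for illustration; the exact indices depend on
+# the hash function.
+
+# %%
+from tensorflow.keras.preprocessing.text import one_hot
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+
+toy_sentences = ["data engineer wanted", "senior data scientist"]
+toy_encoded = [one_hot(s, 50) for s in toy_sentences]       # e.g. [[17, 4, 31], [8, 17, 22]]
+print(pad_sequences(toy_encoded, padding='pre', maxlen=5))  # zero-padded to length 5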
+
+# %%
+from tensorflow.keras.layers import Embedding
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.preprocessing.text import one_hot
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.layers import Bidirectional
+from tensorflow.keras.layers import Dropout
+
+# %% [markdown]
+# Make sure you're using TensorFlow version 2.15.0.
+
+# %%
+import tensorflow as tf
+tf.__version__
+
+# %% [markdown]
+# Now, let us import Intel Extension for TensorFlow*. We are using the Python API `itex.experimental_ops_override()`. It automatically replaces some TensorFlow operators with custom operators under the `itex.ops` namespace, while staying compatible with existing trained parameters.
+
+# %%
+import intel_extension_for_tensorflow as itex
+
+itex.experimental_ops_override()
+
+# %% [markdown]
+# We need to prepare the data for the model we will create. First, let's assign the job postings to X and the fraudulent labels to y (the expected values).
+
+# %%
+X = new_df['job_posting']
+y = new_df['fraudulent']
+
+# %% [markdown]
+# One-hot encoding is a technique to represent categorical variables as numerical values.
+
+# %%
+voc_size = 5000
+onehot_repr = [one_hot(words, voc_size) for words in X]
+
+# %%
+sent_length = 40
+embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
+print(embedded_docs)
+
+# %% [markdown]
+# ### Creating model
+#
+# We are creating a Deep Neural Network using a Bidirectional LSTM. The architecture is as follows:
+#
+# * Embedding layer
+# * Bidirectional LSTM layer
+# * Dropout layer
+# * Dense layer with sigmoid activation
+#
+# We are using the Adam optimizer with binary cross-entropy loss and optimizing for accuracy.
+#
+# If the Intel® Extension for TensorFlow* backend is XPU, `tf.keras.layers.LSTM` will be replaced by `itex.ops.ItexLSTM`.
+
+# %%
+embedding_vector_features = 50
+model_itex = Sequential()
+model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
+model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))
+model_itex.add(Dropout(0.3))
+model_itex.add(Dense(1, activation='sigmoid'))
+model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
+print(model_itex.summary())
+
+# %%
+import numpy as np
+
+X_final = np.array(embedded_docs)
+y_final = np.array(y)
+
+# %% [markdown]
+#
+
+# %%
+from sklearn.model_selection import train_test_split
+X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)
+
+# %% [markdown]
+# Now, let's train the model. We are using the standard `model.fit()` method, providing the training and testing datasets. You can easily modify the number of epochs in this training process, but keep in mind that the model can overfit, so that it has very good results on the training data but poor results on the test data.
+
+# %%
+model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)
+
+# %% [markdown]
+# The values returned by the model are in the range [0, 1]. We need to map them to integer values of 0 or 1.
+
+# %%
+y_pred = (model_itex.predict(X_test) > 0.5).astype("int32")
+
+# %% [markdown]
+# To demonstrate the effectiveness of our model we present the confusion matrix and classification report available within the `scikit-learn` library.
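+
+# %% [markdown]
+# As a reminder of how the report in the next cell is derived, precision, recall and the F1 score can be
+# computed directly from the four confusion-matrix counts. The numbers below are made up for illustration
+# and are not results from this model.
+
+# %%
+tn, fp, fn, tp = 2500, 20, 40, 90           # hypothetical counts: [[tn, fp], [fn, tp]]
+precision = tp / (tp + fp)                  # how many postings predicted as fake really are fake
+recall = tp / (tp + fn)                     # how many of the fake postings were caught
+f1 = 2 * precision * recall / (precision + recall)
+print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")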
+
+# %%
+from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report
+
+conf_matrix = confusion_matrix(y_test, y_pred)
+print("Confusion matrix:")
+print(conf_matrix)
+
+ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
+
+class_report = classification_report(y_test, y_pred)
+print("Classification report:")
+print(class_report)
+
+# %% [markdown]
+# ## Job recommendation by showing the most similar ones
+
+# %% [markdown]
+# Now that we are sure the data we are processing is real, we can get back to the original columns and create our recommendation system.
+#
+# We also use a much simpler solution for recommendations. Even though we used Deep Learning to check whether a posting is fake, classical machine learning techniques are enough to show similar job postings.
+#
+# First, let's filter out the fake job postings.
+
+# %%
+real = df[df['fraudulent'] == 0]
+real.head()
+
+# %% [markdown]
+# After that, we create a common column containing the text fields that we want to compare between postings and that are relevant to us when making recommendations.
+
+# %%
+cols = ['title', 'description', 'requirements', 'required_experience', 'required_education', 'industry']
+real = real[cols]
+real.head()
+
+# %%
+real = real.fillna(value='')
+# Join the fields with spaces so that words from different columns are not glued together
+real['text'] = real['description'] + ' ' + real['requirements'] + ' ' + real['required_experience'] + ' ' + real['required_education'] + ' ' + real['industry']
+real.head()
+
+# %% [markdown]
+# Let's look at the mechanism we will use to prepare recommendations - sentence similarity based on the prepared `text` column in our dataset.
+
+# %%
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+
+# %% [markdown]
+# Let's prepare a few example sentences that cover 4 topics. It is easier to show how similarity between texts works on these sentences than on the whole large dataset we have.
+
+# %%
+messages = [
+    # Smartphones
+    "I like my phone",
+    "My phone is not good.",
+    "Your cellphone looks great.",
+
+    # Weather
+    "Will it snow tomorrow?",
+    "Recently a lot of hurricanes have hit the US",
+    "Global warming is real",
+
+    # Food and health
+    "An apple a day, keeps the doctors away",
+    "Eating strawberries is healthy",
+    "Is paleo better than keto?",
+
+    # Asking about age
+    "How old are you?",
+    "what is your age?",
+]
+
+# %% [markdown]
+# Now, we prepare functions to show the similarities between the given sentences in the form of a heat map.
+
+# %%
+import numpy as np
+import seaborn as sns
+
+def plot_similarity(labels, features, rotation):
+    corr = np.inner(features, features)
+    sns.set(font_scale=1.2)
+    g = sns.heatmap(
+        corr,
+        xticklabels=labels,
+        yticklabels=labels,
+        vmin=0,
+        vmax=1,
+        cmap="YlOrRd")
+    g.set_xticklabels(labels, rotation=rotation)
+    g.set_title("Semantic Textual Similarity")
+
+def run_and_plot(messages_):
+    message_embeddings_ = model.encode(messages_)
+    plot_similarity(messages_, message_embeddings_, 90)
+
+# %%
+run_and_plot(messages)
+
+# %% [markdown]
+# Now, let's move back to our job postings dataset. First, we use the sentence encoding model to compute the embeddings we will compare.
+
+# %%
+encodings = []
+for text in real['text']:
+    encodings.append(model.encode(text))
+
+real['encodings'] = encodings
+
+# %% [markdown]
+# Then, we can choose the job posting we want to calculate similarities to.
 In our case it is the first job posting in the dataset, but you can easily change it to any other job posting by changing the value of the `index` variable.
+
+# %%
+index = 0
+corr = np.inner(encodings[index], encodings)
+real['corr_to_first'] = corr
+
+# %% [markdown]
+# Based on the calculated similarities, we can show the most similar job postings by sorting them according to the calculated correlation value.
+
+# %%
+real.sort_values(by=['corr_to_first'], ascending=False).head()
+
+# %% [markdown]
+# In this code sample we created a job recommendation system. First, we explored and analyzed the dataset, then we pre-processed the data and created a fake job posting detection model. At the end we used sentence similarities to show the top 5 recommendations - the job descriptions most similar to the chosen one.
+
+# %%
+print("[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]")
+
+
diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/License.txt b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/License.txt
new file mode 100644
index 0000000000..e63c6e13dc
--- /dev/null
+++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/License.txt
@@ -0,0 +1,7 @@
+Copyright Intel Corporation
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/README.md b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/README.md
new file mode 100644
index 0000000000..6964819ee4
--- /dev/null
+++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/README.md
@@ -0,0 +1,177 @@
+# Job Recommendation System: End-to-End Deep Learning Workload
+
+
+This sample illustrates the use of Intel® Extension for TensorFlow* to build and run an end-to-end AI workload on the example of a job recommendation system.
+
+| Property | Description
+|:--- |:---
+| Category | Reference Designs and End to End
+| What you will learn | How to use Intel® Extension for TensorFlow* to build an end-to-end AI workload
+| Time to complete | 30 minutes
+
+## Purpose
+
+This code sample shows an end-to-end Deep Learning workload using the example of a job recommendation system. It consists of four main parts:
+
+1. Data exploration and visualization - showing what the dataset looks like, what its main features are, and how the data is distributed.
+2. Data cleaning and pre-processing - removal of duplicates and an explanation of all necessary steps for text pre-processing.
+3. 
Fraudulent job postings removal - finding which of the job postings are fake using an LSTM DNN and filtering them out.
+4. Job recommendation - calculating and providing the top-n job descriptions most similar to the chosen one.
+
+## Prerequisites
+
+| Optimized for | Description
+| :--- | :---
+| OS | Linux, Ubuntu* 20.04
+| Hardware | GPU
+| Software | Intel® Extension for TensorFlow*
+> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation).
+
+
+## Key Implementation Details
+
+This sample creates a Deep Neural Network for fake job posting detection using the Intel® Extension for TensorFlow* LSTM layer on GPU. It also utilizes `itex.experimental_ops_override()` to automatically replace some TensorFlow operators with Custom Operators from Intel® Extension for TensorFlow*.
+
+The sample tutorial contains one Jupyter Notebook and one Python script. You can use either.
+
+## Environment Setup
+You will need to download and install the following toolkits, tools, and components to use the sample.
+
+
+**1. Get AI Tools**
+
+Required AI Tools: Intel® Extension for TensorFlow* - GPU
+
+If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on AI Tools Offline Installer. It is recommended to select the Offline Installer option in the AI Tools Selector.
+
+>**Note**: If the Docker option is chosen in the AI Tools Selector, refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples.
+
+**2. (Offline Installer) Activate the AI Tools bundle base environment**
+
+If the default path is used during the installation of AI Tools:
+```
+source $HOME/intel/oneapi/intelpython/bin/activate
+```
+If a non-default path is used:
+```
+source <custom_path>/bin/activate
+```
+
+**3. (Offline Installer) Activate relevant Conda environment**
+
+```
+conda activate tensorflow-gpu
+```
+
+**4. Clone the GitHub repository**
+
+
+```
+git clone https://github.com/oneapi-src/oneAPI-samples.git
+cd oneAPI-samples/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem
+```
+
+**5. Install dependencies**
+
+>**Note**: Before running the following commands, make sure your Conda/Python environment with AI Tools installed is activated.
+
+```
+pip install -r requirements.txt
+pip install notebook
+```
+For Jupyter Notebook, refer to [Installing Jupyter](https://jupyter.org/install) for detailed installation instructions.
+
+## Run the Sample
+>**Note**: Before running the sample, make sure [Environment Setup](https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/INC-Quantization-Sample-for-PyTorch#environment-setup) is completed.
+
+Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions:
+* [AI Tools Offline Installer (Validated)](#ai-tools-offline-installer-validated)
+* [Conda/PIP](#condapip)
+* [Docker](#docker)
+
+### AI Tools Offline Installer (Validated)
+
+**1. 
Register Conda kernel to Jupyter Notebook kernel**
+
+If the default path is used during the installation of AI Tools:
+```
+$HOME/intel/oneapi/intelpython/envs/tensorflow-gpu/bin/python -m ipykernel install --user --name=tensorflow-gpu
+```
+If a non-default path is used:
+```
+<custom_path>/bin/python -m ipykernel install --user --name=tensorflow-gpu
+```
+**2. Launch Jupyter Notebook**
+
+```
+jupyter notebook --ip=0.0.0.0
+```
+**3. Follow the instructions to open the URL with the token in your browser**
+
+**4. Select the Notebook**
+
+```
+JobRecommendationSystem.ipynb
+```
+**5. Change the kernel to `tensorflow-gpu`**
+
+**6. Run every cell in the Notebook in sequence**
+
+### Conda/PIP
+> **Note**: Before running the instructions below, make sure your Conda/Python environment with AI Tools installed is activated.
+
+**1. Register Conda/Python kernel to Jupyter Notebook kernel**
+
+For Conda:
+```
+<CONDA_PATH_TO_ENV>/bin/python -m ipykernel install --user --name=tensorflow-gpu
+```
+To find `<CONDA_PATH_TO_ENV>`, run `conda env list` and note your Conda environment path.
+
+For PIP:
+```
+python -m ipykernel install --user --name=tensorflow-gpu
+```
+**2. Launch Jupyter Notebook**
+
+```
+jupyter notebook --ip=0.0.0.0
+```
+**3. Follow the instructions to open the URL with the token in your browser**
+
+**4. Select the Notebook**
+
+```
+JobRecommendationSystem.ipynb
+```
+**5. Change the kernel to `tensorflow-gpu`**
+
+
+**6. Run every cell in the Notebook in sequence**
+
+### Docker
+AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples.
+
+
+
+## Example Output
+
+If successful, the sample displays [CODE_SAMPLE_COMPLETED_SUCCESSFULLY]. Additionally, the sample shows multiple diagrams explaining the dataset, the training progress for fake job posting detection, and the top job recommendations.
+
+## Related Samples
+
+
+* [Intel Extension For TensorFlow Getting Started Sample](https://github.com/oneapi-src/oneAPI-samples/blob/development/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted/README.md)
+* [Leveraging Intel Extension for TensorFlow with LSTM for Text Generation Sample](https://github.com/oneapi-src/oneAPI-samples/blob/master/AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_TextGeneration_with_LSTM/README.md)
+
+## License
+
+Code samples are licensed under the MIT license. See
+[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt)
+for details.
+
+Third party program Licenses can be found here:
+[third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt)
+
+*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/requirements.txt b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/requirements.txt new file mode 100644 index 0000000000..15bcd710c6 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/requirements.txt @@ -0,0 +1,10 @@ +ipykernel +matplotlib +sentence_transformers +transformers +datasets +accelerate +wordcloud +spacy +jinja2 +nltk \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/sample.json b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/sample.json new file mode 100644 index 0000000000..31e14cab36 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/sample.json @@ -0,0 +1,29 @@ +{ + "guid": "80708728-0BD4-435E-961D-178E5ED1450C", + "name": "JobRecommendationSystem: End-to-End Deep Learning Workload", + "categories": ["Toolkit/oneAPI AI And Analytics/End-to-End Workloads"], + "description": "This sample illustrates the use of Intel® Extension for TensorFlow* to build and run an end-to-end AI workload on the example of the job recommendation system", + "builder": ["cli"], + "toolchain": ["jupyter"], + "languages": [{"python":{}}], + "os":["linux"], + "targetDevice": ["GPU"], + "ciTests": { + "linux": [ + { + "env": [], + "id": "JobRecommendationSystem_py", + "steps": [ + "source /intel/oneapi/intelpython/bin/activate", + "conda env remove -n user_tensorflow-gpu", + "conda create --name user_tensorflow-gpu --clone tensorflow-gpu", + "conda activate user_tensorflow-gpu", + "pip install -r requirements.txt", + "python -m ipykernel install --user --name=user_tensorflow-gpu", + "python JobRecommendationSystem.py" + ] + } + ] +}, +"expertise": "Reference Designs and End to End" +} \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/third-party-programs.txt b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/third-party-programs.txt new file mode 100644 index 0000000000..e9f8042d0a --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/JobRecommendationSystem/third-party-programs.txt @@ -0,0 +1,253 @@ +oneAPI Code Samples - Third Party Programs File + +This file contains the list of third party software ("third party programs") +contained in the Intel software and their required notices and/or license +terms. This third party software, even if included with the distribution of the +Intel software, may be governed by separate license terms, including without +limitation, third party license terms, other Intel software license terms, and +open source software license terms. These separate license terms govern your use +of the third party programs as set forth in the “third-party-programs.txt” or +other similarly named text file. + +Third party programs and their corresponding required notices and/or license +terms are listed below. + +-------------------------------------------------------------------------------- + +1. Nothings STB Libraries + +stb/LICENSE + + This software is available under 2 licenses -- choose whichever you prefer. 
+ ------------------------------------------------------------------------------ + ALTERNATIVE A - MIT License + Copyright (c) 2017 Sean Barrett + Permission is hereby granted, free of charge, to any person obtaining a copy of + this software and associated documentation files (the "Software"), to deal in + the Software without restriction, including without limitation the rights to + use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies + of the Software, and to permit persons to whom the Software is furnished to do + so, subject to the following conditions: + The above copyright notice and this permission notice shall be included in all + copies or substantial portions of the Software. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + ------------------------------------------------------------------------------ + ALTERNATIVE B - Public Domain (www.unlicense.org) + This is free and unencumbered software released into the public domain. + Anyone is free to copy, modify, publish, use, compile, sell, or distribute this + software, either in source code form or as a compiled binary, for any purpose, + commercial or non-commercial, and by any means. + In jurisdictions that recognize copyright laws, the author or authors of this + software dedicate any and all copyright interest in the software to the public + domain. We make this dedication for the benefit of the public at large and to + the detriment of our heirs and successors. We intend this dedication to be an + overt act of relinquishment in perpetuity of all present and future rights to + this software under copyright law. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION + WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +-------------------------------------------------------------------------------- + +2. FGPA example designs-gzip + + SDL2.0 + +zlib License + + + This software is provided 'as-is', without any express or implied + warranty. In no event will the authors be held liable for any damages + arising from the use of this software. + + Permission is granted to anyone to use this software for any purpose, + including commercial applications, and to alter it and redistribute it + freely, subject to the following restrictions: + + 1. The origin of this software must not be misrepresented; you must not + claim that you wrote the original software. If you use this software + in a product, an acknowledgment in the product documentation would be + appreciated but is not required. + 2. Altered source versions must be plainly marked as such, and must not be + misrepresented as being the original software. + 3. This notice may not be removed or altered from any source distribution. 
+ + +-------------------------------------------------------------------------------- + +3. Nbody + (c) 2019 Fabio Baruffa + + Plotly.js + Copyright (c) 2020 Plotly, Inc + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +© 2020 GitHub, Inc. + +-------------------------------------------------------------------------------- + +4. GNU-EFI + Copyright (c) 1998-2000 Intel Corporation + +The files in the "lib" and "inc" subdirectories are using the EFI Application +Toolkit distributed by Intel at http://developer.intel.com/technology/efi + +This code is covered by the following agreement: + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + +Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. THE EFI SPECIFICATION AND ALL OTHER INFORMATION +ON THIS WEB SITE ARE PROVIDED "AS IS" WITH NO WARRANTIES, AND ARE SUBJECT +TO CHANGE WITHOUT NOTICE. + +-------------------------------------------------------------------------------- + +5. Edk2 + Copyright (c) 2019, Intel Corporation. All rights reserved. + + Edk2 Basetools + Copyright (c) 2019, Intel Corporation. All rights reserved. + +SPDX-License-Identifier: BSD-2-Clause-Patent + +-------------------------------------------------------------------------------- + +6. Heat Transmission + +GNU LESSER GENERAL PUBLIC LICENSE +Version 3, 29 June 2007 + +Copyright © 2007 Free Software Foundation, Inc. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 
+ +This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. + +0. Additional Definitions. +As used herein, “this License” refers to version 3 of the GNU Lesser General Public License, and the “GNU GPL” refers to version 3 of the GNU General Public License. + +“The Library” refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. + +An “Application” is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. + +A “Combined Work” is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the “Linked Version”. + +The “Minimal Corresponding Source” for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. + +The “Corresponding Application Code” for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. + +1. Exception to Section 3 of the GNU GPL. +You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. + +2. Conveying Modified Versions. +If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: + +a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or +b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. +3. Object Code Incorporating Material from Library Header Files. +The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: + +a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. +b) Accompany the object code with a copy of the GNU GPL and this license document. +4. Combined Works. +You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: + +a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. 
+b) Accompany the Combined Work with a copy of the GNU GPL and this license document. +c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. +d) Do one of the following: +0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. +1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. +e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) +5. Combined Libraries. +You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: + +a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. +b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. +6. Revised Versions of the GNU Lesser General Public License. +The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. 
+ +If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. + +-------------------------------------------------------------------------------- +7. Rodinia + Copyright (c)2008-2011 University of Virginia +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted without royalty fees or other restrictions, provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + * Neither the name of the University of Virginia, the Dept. of Computer Science, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA OR THE SOFTWARE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +If you use this software or a modified version of it, please cite the most relevant among the following papers: + + - M. A. Goodrum, M. J. Trotter, A. Aksel, S. T. Acton, and K. Skadron. Parallelization of Particle Filter Algorithms. In Proceedings of the 3rd Workshop on Emerging Applications and Many-core Architecture (EAMA), in conjunction with the IEEE/ACM International +Symposium on Computer Architecture (ISCA), June 2010. + + - S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, Sang-Ha Lee and K. Skadron. +Rodinia: A Benchmark Suite for Heterogeneous Computing. IEEE International Symposium +on Workload Characterization, Oct 2009. + +- J. Meng and K. Skadron. "Performance Modeling and Automatic Ghost Zone Optimization +for Iterative Stencil Loops on GPUs." In Proceedings of the 23rd Annual ACM International +Conference on Supercomputing (ICS), June 2009. + +- L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems +Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International +Symposium on Computer Architecture (ISCA), June 2009. + +- M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: +A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel +and Distributed Processing Symposium (IPDPS), May 2009. + +- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. 
"A Performance +Study of General Purpose Applications on Graphics Processors using CUDA" Journal of +Parallel and Distributed Computing, Elsevier, June 2008. + +-------------------------------------------------------------------------------- +Other names and brands may be claimed as the property of others. + +-------------------------------------------------------------------------------- \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Dataset/get_dataset.py b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Dataset/get_dataset.py new file mode 100644 index 0000000000..f30a8d06e7 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Dataset/get_dataset.py @@ -0,0 +1,35 @@ +import os +import shutil +import argparse +from datasets import load_dataset +from tqdm import tqdm + +language_to_code = { + "japanese": "ja", + "swedish": "sv-SE" +} + +def download_dataset(output_dir): + for lang, lang_code in language_to_code.items(): + print(f"Processing dataset for language: {lang_code}") + + # Load the dataset for the specific language + dataset = load_dataset("mozilla-foundation/common_voice_11_0", lang_code, split="train", trust_remote_code=True) + + # Create a language-specific output folder + output_folder = os.path.join(output_dir, lang, lang_code, "clips") + os.makedirs(output_folder, exist_ok=True) + + # Extract and copy MP3 files + for sample in tqdm(dataset, desc=f"Extracting and copying MP3 files for {lang}"): + audio_path = sample['audio']['path'] + shutil.copy(audio_path, output_folder) + + print("Extraction and copy complete.") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Extract and copy audio files from a dataset to a specified directory.") + parser.add_argument("--output_dir", type=str, default="/data/commonVoice", help="Base output directory for saving the files. 
Default is /data/commonVoice") + args = parser.parse_args() + + download_dataset(args.output_dir) \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/clean.sh b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/clean.sh index 7ea1719af4..34747af45c 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/clean.sh +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/clean.sh @@ -1,7 +1,5 @@ #!/bin/bash -rm -R RIRS_NOISES -rm -R tmp -rm -R speechbrain -rm -f rirs_noises.zip noise.csv reverb.csv vad_file.txt +echo "Deleting .wav files, tmp" rm -f ./*.wav +rm -R tmp diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_commonVoice.py b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_commonVoice.py index 6442418bf0..7effb2df76 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_commonVoice.py +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_commonVoice.py @@ -29,7 +29,7 @@ def __init__(self, dirpath, filename): self.sampleRate = 0 self.waveData = '' self.wavesize = 0 - self.waveduriation = 0 + self.waveduration = 0 if filename.endswith(".wav") or filename.endswith(".wmv"): self.wavefile = filename self.wavepath = dirpath + os.sep + filename @@ -173,12 +173,12 @@ def main(argv): data = datafile(testDataDirectory, filename) predict_list = [] use_entire_audio_file = False - if data.waveduration < sample_dur: + if int(data.waveduration) <= sample_dur: # Use entire audio file if the duration is less than the sampling duration use_entire_audio_file = True sample_list = [0 for _ in range(sample_size)] else: - start_time_list = list(range(sample_size - int(data.waveduration) + 1)) + start_time_list = list(range(0, int(data.waveduration) - sample_dur)) sample_list = [] for i in range(sample_size): sample_list.append(random.sample(start_time_list, 1)[0]) @@ -198,10 +198,6 @@ def main(argv): predict_list.append(' ') pass - # Clean up - if use_entire_audio_file: - os.remove("./" + data.filename) - # Pick the top rated prediction result occurence_count = Counter(predict_list) total_count = sum(occurence_count.values()) diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_custom.py b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_custom.py index b4f9d6adee..2b4a331c0b 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_custom.py +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/inference_custom.py @@ -30,7 +30,7 @@ def __init__(self, dirpath, filename): self.sampleRate = 0 self.waveData = '' self.wavesize = 0 - self.waveduriation = 0 + self.waveduration = 0 if filename.endswith(".wav") or filename.endswith(".wmv"): self.wavefile = filename self.wavepath = dirpath + os.sep + filename @@ -61,41 +61,45 @@ def __init__(self, ipex_op=False, bf16=False, int8_model=False): self.model_int8 = load(source_model_int8_path, self.language_id) self.model_int8.eval() elif ipex_op: + self.language_id.eval() + # Optimize for inference with IPEX print("Optimizing inference with IPEX") - self.language_id.eval() - sampleInput = (torch.load("./sample_input_features.pt"), torch.load("./sample_input_wav_lens.pt")) if bf16: print("BF16 enabled") self.language_id.mods["compute_features"] = 
ipex.optimize(self.language_id.mods["compute_features"], dtype=torch.bfloat16) self.language_id.mods["mean_var_norm"] = ipex.optimize(self.language_id.mods["mean_var_norm"], dtype=torch.bfloat16) - self.language_id.mods["embedding_model"] = ipex.optimize(self.language_id.mods["embedding_model"], dtype=torch.bfloat16) self.language_id.mods["classifier"] = ipex.optimize(self.language_id.mods["classifier"], dtype=torch.bfloat16) else: self.language_id.mods["compute_features"] = ipex.optimize(self.language_id.mods["compute_features"]) self.language_id.mods["mean_var_norm"] = ipex.optimize(self.language_id.mods["mean_var_norm"]) - self.language_id.mods["embedding_model"] = ipex.optimize(self.language_id.mods["embedding_model"]) self.language_id.mods["classifier"] = ipex.optimize(self.language_id.mods["classifier"]) # Torchscript to resolve performance issues with reorder operations + print("Applying Torchscript") + sampleWavs = torch.load("./sample_wavs.pt") + sampleWavLens = torch.ones(sampleWavs.shape[0]) with torch.no_grad(): - I2 = self.language_id.mods["embedding_model"](*sampleInput) + I1 = self.language_id.mods["compute_features"](sampleWavs) + I2 = self.language_id.mods["mean_var_norm"](I1, sampleWavLens) + I3 = self.language_id.mods["embedding_model"](I2, sampleWavLens) + if bf16: with torch.cpu.amp.autocast(): - self.language_id.mods["compute_features"] = torch.jit.trace( self.language_id.mods["compute_features"] , example_inputs=(torch.rand(1,32000))) - self.language_id.mods["mean_var_norm"] = torch.jit.trace(self.language_id.mods["mean_var_norm"], example_inputs=sampleInput) - self.language_id.mods["embedding_model"] = torch.jit.trace(self.language_id.mods["embedding_model"], example_inputs=sampleInput) - self.language_id.mods["classifier"] = torch.jit.trace(self.language_id.mods["classifier"], example_inputs=I2) + self.language_id.mods["compute_features"] = torch.jit.trace( self.language_id.mods["compute_features"] , example_inputs=sampleWavs) + self.language_id.mods["mean_var_norm"] = torch.jit.trace(self.language_id.mods["mean_var_norm"], example_inputs=(I1, sampleWavLens)) + self.language_id.mods["embedding_model"] = torch.jit.trace(self.language_id.mods["embedding_model"], example_inputs=(I2, sampleWavLens)) + self.language_id.mods["classifier"] = torch.jit.trace(self.language_id.mods["classifier"], example_inputs=I3) self.language_id.mods["compute_features"] = torch.jit.freeze(self.language_id.mods["compute_features"]) self.language_id.mods["mean_var_norm"] = torch.jit.freeze(self.language_id.mods["mean_var_norm"]) self.language_id.mods["embedding_model"] = torch.jit.freeze(self.language_id.mods["embedding_model"]) self.language_id.mods["classifier"] = torch.jit.freeze( self.language_id.mods["classifier"]) else: - self.language_id.mods["compute_features"] = torch.jit.trace( self.language_id.mods["compute_features"] , example_inputs=(torch.rand(1,32000))) - self.language_id.mods["mean_var_norm"] = torch.jit.trace(self.language_id.mods["mean_var_norm"], example_inputs=sampleInput) - self.language_id.mods["embedding_model"] = torch.jit.trace(self.language_id.mods["embedding_model"], example_inputs=sampleInput) - self.language_id.mods["classifier"] = torch.jit.trace(self.language_id.mods["classifier"], example_inputs=I2) + self.language_id.mods["compute_features"] = torch.jit.trace( self.language_id.mods["compute_features"] , example_inputs=sampleWavs) + self.language_id.mods["mean_var_norm"] = torch.jit.trace(self.language_id.mods["mean_var_norm"], example_inputs=(I1, 
sampleWavLens)) + self.language_id.mods["embedding_model"] = torch.jit.trace(self.language_id.mods["embedding_model"], example_inputs=(I2, sampleWavLens)) + self.language_id.mods["classifier"] = torch.jit.trace(self.language_id.mods["classifier"], example_inputs=I3) self.language_id.mods["compute_features"] = torch.jit.freeze(self.language_id.mods["compute_features"]) self.language_id.mods["mean_var_norm"] = torch.jit.freeze(self.language_id.mods["mean_var_norm"]) @@ -114,11 +118,11 @@ def predict(self, data_path="", ipex_op=False, bf16=False, int8_model=False, ver with torch.no_grad(): if bf16: with torch.cpu.amp.autocast(): - prediction = self.language_id.classify_batch(signal) + prediction = self.language_id.classify_batch(signal) else: - prediction = self.language_id.classify_batch(signal) + prediction = self.language_id.classify_batch(signal) else: # default - prediction = self.language_id.classify_batch(signal) + prediction = self.language_id.classify_batch(signal) inference_end_time = time() inference_latency = inference_end_time - inference_start_time @@ -195,13 +199,13 @@ def main(argv): with open(OUTPUT_SUMMARY_CSV_FILE, 'w') as f: writer = csv.writer(f) writer.writerow(["Audio File", - "Input Frequency", + "Input Frequency (Hz)", "Expected Language", "Top Consensus", "Top Consensus %", "Second Consensus", "Second Consensus %", - "Average Latency", + "Average Latency (s)", "Result"]) total_samples = 0 @@ -273,12 +277,12 @@ def main(argv): predict_list = [] use_entire_audio_file = False latency_sum = 0.0 - if data.waveduration < sample_dur: + if int(data.waveduration) <= sample_dur: # Use entire audio file if the duration is less than the sampling duration use_entire_audio_file = True sample_list = [0 for _ in range(sample_size)] else: - start_time_list = list(range(sample_size - int(data.waveduration) + 1)) + start_time_list = list(range(int(data.waveduration) - sample_dur)) sample_list = [] for i in range(sample_size): sample_list.append(random.sample(start_time_list, 1)[0]) @@ -346,17 +350,36 @@ def main(argv): avg_latency, result ]) + else: + # Write results to a .csv file + with open(OUTPUT_SUMMARY_CSV_FILE, 'a') as f: + writer = csv.writer(f) + writer.writerow([ + filename, + sample_rate_for_csv, + "N/A", + top_occurance, + str(topPercentage) + "%", + sec_occurance, + str(secPercentage) + "%", + avg_latency, + "N/A" + ]) + if ground_truth_compare: # Summary of results print("\n\n Correctly predicted %d/%d\n" %(correct_predictions, total_samples)) - print("\n See %s for summary\n" %(OUTPUT_SUMMARY_CSV_FILE)) + + print("\n See %s for summary\n" %(OUTPUT_SUMMARY_CSV_FILE)) elif os.path.isfile(path): print("\nIt is a normal file", path) else: print("It is a special file (socket, FIFO, device file)" , path) + print("Done.\n") + if __name__ == "__main__": import sys sys.exit(main(sys.argv)) diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/initialize.sh b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/initialize.sh deleted file mode 100644 index 935debac44..0000000000 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/initialize.sh +++ /dev/null @@ -1,23 +0,0 @@ -#!/bin/bash - -# Activate the oneAPI environment for PyTorch -source activate pytorch - -# Install speechbrain -git clone https://github.com/speechbrain/speechbrain.git -cd speechbrain -pip install -r requirements.txt -pip install --editable . -cd .. 
- -# Add speechbrain to environment variable PYTHONPATH -export PYTHONPATH=$PYTHONPATH:/Inference/speechbrain - -# Install PyTorch and Intel Extension for PyTorch (IPEX) -pip install torch==1.13.1 torchaudio -pip install --no-deps torchvision==0.14.0 -pip install intel_extension_for_pytorch==1.13.100 -pip install neural-compressor==2.0 - -# Update packages -apt-get update && apt-get install libgl1 diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/interfaces.patch b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/interfaces.patch deleted file mode 100644 index 762ae5ebee..0000000000 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/interfaces.patch +++ /dev/null @@ -1,11 +0,0 @@ ---- interfaces.py 2022-10-07 16:58:26.836359346 -0700 -+++ interfaces_new.py 2022-10-07 16:59:09.968110128 -0700 -@@ -945,7 +945,7 @@ - out_prob = self.mods.classifier(emb).squeeze(1) - score, index = torch.max(out_prob, dim=-1) - text_lab = self.hparams.label_encoder.decode_torch(index) -- return out_prob, score, index, text_lab -+ return out_prob, score, index # removed text_lab to get torchscript to work - - def classify_file(self, path): - """Classifies the given audiofile into the given set of labels. diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/lang_id_inference.ipynb b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/lang_id_inference.ipynb index 0ed44139b3..1cd1afee01 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/lang_id_inference.ipynb +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/lang_id_inference.ipynb @@ -47,7 +47,7 @@ "metadata": {}, "outputs": [], "source": [ - "!python inference_commonVoice.py -p /data/commonVoice/test" + "!python inference_commonVoice.py -p ${COMMON_VOICE_PATH}/processed_data/test" ] }, { @@ -55,7 +55,7 @@ "metadata": {}, "source": [ "## inference_custom.py for Custom Data \n", - "To generate an overall results output summary, the audio_ground_truth_labels.csv file needs to be modified with the name of the audio file and expected audio label (i.e. en for English). By default, this is disabled but if desired, the *--ground_truth_compare* can be used. To run inference on custom data, you must specify a folder with WAV files and pass the path in as an argument. " + "To run inference on custom data, you must specify a folder with .wav files and pass the path in as an argument. You can do so by creating a folder named `data_custom` and then copy 1 or 2 .wav files from your test dataset into it. .mp3 files will NOT work. " ] }, { @@ -65,7 +65,7 @@ "### Randomly select audio clips from audio files for prediction\n", "python inference_custom.py -p DATAPATH -d DURATION -s SIZE\n", "\n", - "An output file output_summary.csv will give the summary of the results." + "An output file `output_summary.csv` will give the summary of the results." ] }, { @@ -104,6 +104,8 @@ "### Optimizations with Intel® Extension for PyTorch (IPEX) \n", "python inference_custom.py -p data_custom -d 3 -s 50 --vad --ipex --verbose \n", "\n", + "This will apply ipex.optimize to the model(s) and TorchScript. You can also add the --bf16 option along with --ipex to run in the BF16 data type, supported on 4th Gen Intel® Xeon® Scalable processors and newer.\n", + "\n", "Note that the *--verbose* option is required to view the latency measurements. 
" ] }, @@ -121,7 +123,7 @@ "metadata": {}, "source": [ "## Quantization with Intel® Neural Compressor (INC)\n", - "To improve inference latency, Intel® Neural Compressor (INC) can be used to quantize the trained model from FP32 to INT8 by running quantize_model.py. The *-datapath* argument can be used to specify a custom evaluation dataset but by default it is set to */data/commonVoice/dev* which was generated from the data preprocessing scripts in the *Training* folder. " + "To improve inference latency, Intel® Neural Compressor (INC) can be used to quantize the trained model from FP32 to INT8 by running quantize_model.py. The *-datapath* argument can be used to specify a custom evaluation dataset but by default it is set to `$COMMON_VOICE_PATH/processed_data/dev` which was generated from the data preprocessing scripts in the `Training` folder. " ] }, { @@ -130,14 +132,46 @@ "metadata": {}, "outputs": [], "source": [ - "!python quantize_model.py -p ./lang_id_commonvoice_model -datapath $COMMON_VOICE_PATH/dev" + "!python quantize_model.py -p ./lang_id_commonvoice_model -datapath $COMMON_VOICE_PATH/processed_data/dev" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After quantization, the model will be stored in lang_id_commonvoice_model_INT8 and neural_compressor.utils.pytorch.load will have to be used to load the quantized model for inference. If self.language_id is the original model and data_path is the path to the audio file:\n", + "\n", + "```\n", + "from neural_compressor.utils.pytorch import load\n", + "model_int8 = load(\"./lang_id_commonvoice_model_INT8\", self.language_id)\n", + "signal = self.language_id.load_audio(data_path)\n", + "prediction = self.model_int8(signal)\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "After quantization, the model will be stored in *lang_id_commonvoice_model_INT8* and *neural_compressor.utils.pytorch.load* will have to be used to load the quantized model for inference. " + "The code above is integrated into inference_custom.py. You can now run inference on your data using this INT8 model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!python inference_custom.py -p data_custom -d 3 -s 50 --vad --int8_model --verbose" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### (Optional) Comparing Predictions with Ground Truth\n", + "\n", + "You can choose to modify audio_ground_truth_labels.csv to include the name of the audio file and expected audio label (like, en for English), then run inference_custom.py with the --ground_truth_compare option. By default, this is disabled." 
] }, { diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/quantize_model.py b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/quantize_model.py index 428e24142e..e5ce7f9bbc 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/quantize_model.py +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/quantize_model.py @@ -18,8 +18,6 @@ from neural_compressor.utils.pytorch import load from speechbrain.pretrained import EncoderClassifier -DEFAULT_EVAL_DATA_PATH = "/data/commonVoice/dev" - def prepare_dataset(path): data_list = [] for dir_name in os.listdir(path): @@ -33,7 +31,7 @@ def main(argv): import argparse parser = argparse.ArgumentParser() parser.add_argument('-p', type=str, required=True, help="Path to the model to be optimized") - parser.add_argument('-datapath', type=str, default=DEFAULT_EVAL_DATA_PATH, help="Path to evaluation dataset") + parser.add_argument('-datapath', type=str, required=True, help="Path to evaluation dataset") args = parser.parse_args() model_path = args.p diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/sample_input_features.pt b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/sample_input_features.pt deleted file mode 100644 index 61114fe706..0000000000 Binary files a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/sample_input_features.pt and /dev/null differ diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/sample_wavs.pt b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/sample_wavs.pt new file mode 100644 index 0000000000..72ea7cc659 Binary files /dev/null and b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Inference/sample_wavs.pt differ diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/README.md b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/README.md index a44d562f57..623a82d85b 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/README.md +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/README.md @@ -6,7 +6,7 @@ Languages are selected from the CommonVoice dataset for training, validation, an | Area | Description |:--- |:--- -| What you will learn | How to use training and inference with SpeechBrain, Intel® Extension for PyTorch (IPEX) inference, Intel® Neural Compressor (INC) quantization, and a oneapi-aikit container +| What you will learn | How to use training and inference with SpeechBrain, Intel® Extension for PyTorch* (IPEX) inference, Intel® Neural Compressor (INC) quantization | Time to complete | 60 minutes ## Purpose @@ -17,16 +17,14 @@ Spoken audio comes in different languages and this sample uses a model to identi | Optimized for | Description |:--- |:--- -| OS | Ubuntu* 18.04 or newer -| Hardware | Intel® Xeon® processor family -| Software | Intel® OneAPI AI Analytics Toolkit
Hugging Face SpeechBrain +| OS | Ubuntu* 22.04 or newer +| Hardware | Intel® Xeon® and Core® processor families +| Software | Intel® AI Tools
Hugging Face SpeechBrain ## Key Implementation Details The [CommonVoice](https://commonvoice.mozilla.org/) dataset is used to train an Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network (ECAPA-TDNN). This is implemented in the [Hugging Face SpeechBrain](https://huggingface.co/SpeechBrain) library. Additionally, a small Convolutional Recurrent Deep Neural Network (CRDNN) pretrained on the LibriParty dataset is used to process audio samples and output the segments where speech activity is detected. -After you have downloaded the CommonVoice dataset, the data must be preprocessed by converting the MP3 files into WAV format and separated into training, validation, and testing sets. - The model is then trained from scratch using the Hugging Face SpeechBrain library. This model is then used for inference on the testing dataset or a user-specified dataset. There is an option to utilize SpeechBrain's Voice Activity Detection (VAD) where only the speech segments from the audio files are extracted and combined before samples are randomly selected as input into the model. To improve performance, the user may quantize the trained model to INT8 using Intel® Neural Compressor (INC) to decrease latency. The sample contains three discreet phases: @@ -39,93 +37,94 @@ For both training and inference, you can run the sample and scripts in Jupyter N ## Prepare the Environment -### Downloading the CommonVoice Dataset +### Create and Set Up Environment ->**Note**: You can skip downloading the dataset if you already have a pretrained model and only want to run inference on custom data samples that you provide. +1. Create your conda environment by following the instructions on the Intel [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). You can follow these settings: -Download the CommonVoice dataset for languages of interest from [https://commonvoice.mozilla.org/en/datasets](https://commonvoice.mozilla.org/en/datasets). +* Tool: AI Tools +* Preset or customize: Customize +* Distribution Type: conda* or pip +* Python Versions: Python* 3.9 or 3.10 +* PyTorch* Framework Optimizations: Intel® Extension for PyTorch* (CPU) +* Intel®-Optimized Tools & Libraries: Intel® Neural Compressor -For this sample, you will need to download the following languages: **Japanese** and **Swedish**. Follow Steps 1-6 below or you can execute the code. +>**Note**: Be sure to activate your environment before installing the packages. If using pip, install using `python -m pip` instead of just `pip`. + +2. Create your dataset folder and set the environment variable `COMMON_VOICE_PATH`. This needs to match with where you downloaded your dataset. +```bash +mkdir -p /data/commonVoice +export COMMON_VOICE_PATH=/data/commonVoice +``` -1. On the CommonVoice website, select the Version and Language. -2. Enter your email. -3. Check the boxes, and right-click on the download button to copy the link address. -4. Paste this link into a text editor and copy the first part of the URL up to ".tar.gz". -5. Use **GNU wget** on the URL to download the data to `/data/commonVoice`. +3. Install packages needed for MP3 to WAV conversion +```bash +sudo apt-get update && apt-get install -y ffmpeg libgl1 +``` - Alternatively, you can use a directory on your local drive (due to the large amount of data). If you opt to do so, you must change the `COMMON_VOICE_PATH` environment in `launch_docker.sh` before running the script. +4. 
Navigate to your working directory, clone the `oneapi-src` repository, and navigate to this code sample. +```bash +git clone https://github.com/oneapi-src/oneAPI-samples.git +cd oneAPI-samples/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification +``` -6. Extract the compressed folder, and rename the folder with the language (for example, English). +5. Run the bash script to install additional necessary libraries, including SpeechBrain. +```bash +source initialize.sh +``` - The file structure **must match** the `LANGUAGE_PATHS` defined in `prepareAllCommonVoice.py` in the `Training` folder for the script to run properly. +### Download the CommonVoice Dataset -These commands illustrate Steps 1-6. Notice that it downloads Japanese and Swedish from CommonVoice version 11.0. +>**Note**: You can skip downloading the dataset if you already have a pretrained model and only want to run inference on custom data samples that you provide. + +First, change to the `Dataset` directory. ``` -# Create the commonVoice directory under 'data' -sudo chmod 777 -R /data -cd /data -mkdir commonVoice -cd commonVoice - -# Download the CommonVoice data -wget \ -https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-11.0-2022-09-21/cv-corpus-11.0-2022-09-21-ja.tar.gz \ -https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-11.0-2022-09-21/cv-corpus-11.0-2022-09-21-sv-SE.tar.gz - -# Extract and organize the CommonVoice data into respective folders by language -tar -xf cv-corpus-11.0-2022-09-21-ja.tar.gz -mv cv-corpus-11.0-2022-09-21 japanese -tar -xf cv-corpus-11.0-2022-09-21-sv-SE.tar.gz -mv cv-corpus-11.0-2022-09-21 swedish +cd ./Dataset ``` -### Configuring the Container +The `get_dataset.py` script downloads the Common Voice dataset by doing the following: -1. Pull the `oneapi-aikit` docker image. -2. Set up the Docker environment. - ``` - docker pull intel/oneapi-aikit - ./launch_docker.sh - ``` - >**Note**: By default, the `Inference` and `Training` directories will be mounted and the environment variable `COMMON_VOICE_PATH` will be set to `/data/commonVoice` and mounted to `/data`. `COMMON_VOICE_PATH` is the location of where the CommonVoice dataset is downloaded. +- Gets the train set of the [Common Voice dataset from Huggingface](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) for Japanese and Swedish +- Downloads each mp3 and moves them to the `output_dir` folder +1. If you want to add additional languages, then modify the `language_to_code` dictionary in the file to reflect the languages to be included in the model. +3. Run the script with options. + ```bash + python get_dataset.py --output_dir ${COMMON_VOICE_PATH} + ``` + | Parameters | Description + |:--- |:--- + | `--output_dir` | Base output directory for saving the files. Default is /data/commonVoice + +Once the dataset is downloaded, navigate back to the parent directory +``` +cd .. +``` ## Train the Model with Languages This section explains how to train a model for language identification using the CommonVoice dataset, so it includes steps on how to preprocess the data, train the model, and prepare the output files for inference. -### Configure the Training Environment - -1. Change to the `Training` directory. - ``` - cd /Training - ``` -2. Source the bash script to install the necessary components. - ``` - source initialize.sh - ``` - This installs PyTorch*, the Intel® Extension for PyTorch (IPEX), and other components. 
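+If you are continuing in a new terminal session, make sure `COMMON_VOICE_PATH` still points at your dataset folder before running the preprocessing and training steps below. A minimal sketch, assuming the default location used in the setup steps above:
+```bash
+export COMMON_VOICE_PATH=/data/commonVoice
+```
+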
+First, change to the `Training` directory. +``` +cd ./Training +``` -### Run in Jupyter Notebook +### Option 1: Run in Jupyter Notebook -1. Install Jupyter Notebook. - ``` - pip install notebook - ``` -2. Launch Jupyter Notebook. +1. Launch Jupyter Notebook. ``` jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root ``` -3. Follow the instructions to open the URL with the token in your browser. -4. Locate and select the Training Notebook. +2. Follow the instructions to open the URL with the token in your browser. +3. Locate and select the Training Notebook. ``` lang_id_training.ipynb ``` -5. Follow the instructions in the Notebook. +4. Follow the instructions in the Notebook. -### Run in a Console +### Option 2: Run in a Console If you cannot or do not want to use Jupyter Notebook, use these procedures to run the sample and scripts locally. @@ -133,13 +132,13 @@ If you cannot or do not want to use Jupyter Notebook, use these procedures to ru 1. Acquire copies of the training scripts. (The command retrieves copies of the required VoxLingua107 training scripts from SpeechBrain.) ``` - cp speechbrain/recipes/VoxLingua107/lang_id/create_wds_shards.py create_wds_shards.py - cp speechbrain/recipes/VoxLingua107/lang_id/train.py train.py - cp speechbrain/recipes/VoxLingua107/lang_id/hparams/train_ecapa.yaml train_ecapa.yaml + cp ../speechbrain/recipes/VoxLingua107/lang_id/create_wds_shards.py create_wds_shards.py + cp ../speechbrain/recipes/VoxLingua107/lang_id/train.py train.py + cp ../speechbrain/recipes/VoxLingua107/lang_id/hparams/train_ecapa.yaml train_ecapa.yaml ``` 2. From the `Training` directory, apply patches to modify these files to work with the CommonVoice dataset. - ``` + ```bash patch < create_wds_shards.patch patch < train_ecapa.patch ``` @@ -154,8 +153,8 @@ The `prepareAllCommonVoice.py` script performs the following data preprocessing 1. If you want to add additional languages, then modify the `LANGUAGE_PATHS` list in the file to reflect the languages to be included in the model. 2. Run the script with options. The samples will be divided as follows: 80% training, 10% validation, 10% testing. - ``` - python prepareAllCommonVoice.py -path /data -max_samples 2000 --createCsv --train --dev --test + ```bash + python prepareAllCommonVoice.py -path $COMMON_VOICE_PATH -max_samples 2000 --createCsv --train --dev --test ``` | Parameters | Description |:--- |:--- @@ -166,27 +165,28 @@ The `prepareAllCommonVoice.py` script performs the following data preprocessing #### Create Shards for Training and Validation -1. If the `/data/commonVoice_shards` folder exists, delete the folder and the contents before proceeding. +1. If the `${COMMON_VOICE_PATH}/processed_data/commonVoice_shards` folder exists, delete the folder and the contents before proceeding. 2. Enter the following commands. + ```bash + python create_wds_shards.py ${COMMON_VOICE_PATH}/processed_data/train ${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/train + python create_wds_shards.py ${COMMON_VOICE_PATH}/processed_data/dev ${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/dev ``` - python create_wds_shards.py /data/commonVoice/train/ /data/commonVoice_shards/train - python create_wds_shards.py /data/commonVoice/dev/ /data/commonVoice_shards/dev - ``` -3. Note the shard with the largest number as `LARGEST_SHARD_NUMBER` in the output above or by navigating to `/data/commonVoice_shards/train`. +3. 
Note the shard with the largest number as `LARGEST_SHARD_NUMBER` in the output above or by navigating to `${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/train`. 4. Open the `train_ecapa.yaml` file and modify the `train_shards` variable to make the range reflect: `000000..LARGEST_SHARD_NUMBER`. -5. Repeat the process for `/data/commonVoice_shards/dev`. +5. Repeat Steps 3 and 4 for `${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/dev`. #### Run the Training Script -The YAML file `train_ecapa.yaml` with the training configurations should already be patched from the Prerequisite section. +The YAML file `train_ecapa.yaml` with the training configurations is passed as an argument to the `train.py` script to train the model. 1. If necessary, edit the `train_ecapa.yaml` file to meet your needs. | Parameters | Description |:--- |:--- + | `seed` | The seed value, which should be set to a different value for subsequent runs. Defaults to 1987. | `out_n_neurons` | Must be equal to the number of languages of interest. | `number_of_epochs` | Default is **10**. Adjust as needed. - | `batch_size` | In the trainloader_options, decrease this value if your CPU or GPU runs out of memory while running the training script. + | `batch_size` | In the trainloader_options, decrease this value if your CPU or GPU runs out of memory while running the training script. If you see a "Killed" error message, then the training script has run out of memory. 2. Run the script to train the model. ``` @@ -195,30 +195,48 @@ The YAML file `train_ecapa.yaml` with the training configurations should already #### Move Model to Inference Folder -After training, the output should be inside `results/epaca/SEED_VALUE` folder. By default SEED_VALUE is set to 1987 in the YAML file. You can change the value as needed. +After training, the output should be inside the `results/epaca/1987` folder. By default the `seed` is set to 1987 in `train_ecapa.yaml`. You can change the value as needed. -1. Copy all files with *cp -R* from `results/epaca/SEED_VALUE` into a new folder called `lang_id_commonvoice_model` in the **Inference** folder. - - The name of the folder MUST match with the pretrained_path variable defined in the YAML file. By default, it is `lang_id_commonvoice_model`. +1. Copy all files from `results/epaca/1987` into a new folder called `lang_id_commonvoice_model` in the **Inference** folder. + ```bash + cp -R results/epaca/1987 ../Inference/lang_id_commonvoice_model + ``` + The name of the folder MUST match with the pretrained_path variable defined in `train_ecapa.yaml`. By default, it is `lang_id_commonvoice_model`. 2. Change directory to `/Inference/lang_id_commonvoice_model/save`. + ```bash + cd ../Inference/lang_id_commonvoice_model/save + ``` + 3. Copy the `label_encoder.txt` file up one level. -4. Change to the latest `CKPT` folder, and copy the classifier.ckpt and embedding_model.ckpt files into the `/Inference/lang_id_commonvoice_model/` folder. + ```bash + cp label_encoder.txt ../. + ``` + +4. Change to the latest `CKPT` folder, and copy the classifier.ckpt and embedding_model.ckpt files into the `/Inference/lang_id_commonvoice_model/` folder which is two directories up. By default, the command below will navigate into the single CKPT folder that is present, but you can change it to the specific folder name. + ```bash + # Navigate into the CKPT folder + cd CKPT* + + cp classifier.ckpt ../../. + cp embedding_model.ckpt ../../ + cd ../../../.. 
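+
+    # The Inference/lang_id_commonvoice_model folder should now contain label_encoder.txt,
+    # classifier.ckpt, and embedding_model.ckpt alongside the copied training output
+    # (assuming the default folder names used above).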
+ ``` - You may need to modify the permissions of these files to be executable before you run the inference scripts to consume them. + You may need to modify the permissions of these files to be executable i.e. `sudo chmod 755` before you run the inference scripts to consume them. >**Note**: If `train.py` is rerun with the same seed, it will resume from the epoch number it last run. For a clean rerun, delete the `results` folder or change the seed. You can now load the model for inference. In the `Inference` folder, the `inference_commonVoice.py` script uses the trained model on the testing dataset, whereas `inference_custom.py` uses the trained model on a user-specified dataset and can utilize Voice Activity Detection. ->**Note**: If the folder name containing the model is changed from `lang_id_commonvoice_model`, you will need to modify the `source_model_path` variable in `inference_commonVoice.py` and `inference_custom.py` files in the `speechbrain_inference` class. +>**Note**: If the folder name containing the model is changed from `lang_id_commonvoice_model`, you will need to modify the `pretrained_path` in `train_ecapa.yaml`, and the `source_model_path` variable in both the `inference_commonVoice.py` and `inference_custom.py` files in the `speechbrain_inference` class. ## Run Inference for Language Identification >**Stop**: If you have not already done so, you must run the scripts in the `Training` folder to generate the trained model before proceeding. -To run inference, you must have already run all of the training scripts, generated the trained model, and moved files to the appropriate locations. You must place the model output in a folder name matching the name specified as the `pretrained_path` variable defined in the YAML file. +To run inference, you must have already run all of the training scripts, generated the trained model, and moved files to the appropriate locations. You must place the model output in a folder name matching the name specified as the `pretrained_path` variable defined in `train_ecapa.yaml`. >**Note**: If you plan to run inference on **custom data**, you will need to create a folder for the **.wav** files to be used for prediction. For example, `data_custom`. Move the **.wav** files to your custom folder. (For quick results, you may select a few audio files from each language downloaded from CommonVoice.) @@ -226,35 +244,23 @@ To run inference, you must have already run all of the training scripts, generat 1. Change to the `Inference` directory. ``` - cd /Inference - ``` -2. Source the bash script to install or update the necessary components. - ``` - source initialize.sh - ``` -3. Patch the Intel® Extension for PyTorch (IPEX) to use SpeechBrain models. (This patch is required for PyTorch* TorchScript to work because the output of the model must contain only tensors.) - ``` - patch ./speechbrain/speechbrain/pretrained/interfaces.py < interfaces.patch + cd ./Inference ``` -### Run in Jupyter Notebook +### Option 1: Run in Jupyter Notebook -1. If you have not already done so, install Jupyter Notebook. - ``` - pip install notebook - ``` -2. Launch Jupyter Notebook. +1. Launch Jupyter Notebook. ``` - jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root + jupyter notebook --ip 0.0.0.0 --port 8889 --allow-root ``` -3. Follow the instructions to open the URL with the token in your browser. -4. Locate and select the inference Notebook. +2. Follow the instructions to open the URL with the token in your browser. +3. Locate and select the inference Notebook. 
``` lang_id_inference.ipynb ``` -5. Follow the instructions in the Notebook. +4. Follow the instructions in the Notebook. -### Run in a Console +### Option 2: Run in a Console If you cannot or do not want to use Jupyter Notebook, use these procedures to run the sample and scripts locally. @@ -287,34 +293,32 @@ Both scripts support input options; however, some options can be use on `inferen #### On the CommonVoice Dataset 1. Run the inference_commonvoice.py script. - ``` - python inference_commonVoice.py -p /data/commonVoice/test + ```bash + python inference_commonVoice.py -p ${COMMON_VOICE_PATH}/processed_data/test ``` The script should create a `test_data_accuracy.csv` file that summarizes the results. #### On Custom Data -1. Modify the `audio_ground_truth_labels.csv` file to include the name of the audio file and expected audio label (like, `en` for English). +To run inference on custom data, you must specify a folder with **.wav** files and pass the path in as an argument. You can do so by creating a folder named `data_custom` and then copy 1 or 2 **.wav** files from your test dataset into it. **.mp3** files will NOT work. - By default, this is disabled. If required, use the `--ground_truth_compare` input option. To run inference on custom data, you must specify a folder with **.wav** files and pass the path in as an argument. - -2. Run the inference_ script. - ``` - python inference_custom.py -p - ``` +Run the inference_ script. +```bash +python inference_custom.py -p +``` The following examples describe how to use the scripts to produce specific outcomes. **Default: Random Selections** 1. To randomly select audio clips from audio files for prediction, enter commands similar to the following: - ``` + ```bash python inference_custom.py -p data_custom -d 3 -s 50 ``` This picks 50 3-second samples from each **.wav** file in the `data_custom` folder. The `output_summary.csv` file summarizes the results. 2. To randomly select audio clips from audio files after applying **Voice Activity Detection (VAD)**, use the `--vad` option: - ``` + ```bash python inference_custom.py -p data_custom -d 3 -s 50 --vad ``` Again, the `output_summary.csv` file summarizes the results. @@ -324,18 +328,20 @@ The following examples describe how to use the scripts to produce specific outco **Optimization with Intel® Extension for PyTorch (IPEX)** 1. To optimize user-defined data, enter commands similar to the following: - ``` + ```bash python inference_custom.py -p data_custom -d 3 -s 50 --vad --ipex --verbose ``` + This will apply `ipex.optimize` to the model(s) and TorchScript. You can also add the `--bf16` option along with `--ipex` to run in the BF16 data type, supported on 4th Gen Intel® Xeon® Scalable processors and newer. + >**Note**: The `--verbose` option is required to view the latency measurements. **Quantization with Intel® Neural Compressor (INC)** 1. To improve inference latency, you can use the Intel® Neural Compressor (INC) to quantize the trained model from FP32 to INT8 by running `quantize_model.py`. + ```bash + python quantize_model.py -p ./lang_id_commonvoice_model -datapath $COMMON_VOICE_PATH/processed_data/dev ``` - python quantize_model.py -p ./lang_id_commonvoice_model -datapath $COMMON_VOICE_PATH/dev - ``` - Use the `-datapath` argument to specify a custom evaluation dataset. By default, the datapath is set to the `/data/commonVoice/dev` folder that was generated from the data preprocessing scripts in the `Training` folder. 
+ Use the `-datapath` argument to specify a custom evaluation dataset. By default, the datapath is set to the `$COMMON_VOICE_PATH/processed_data/dev` folder that was generated from the data preprocessing scripts in the `Training` folder. After quantization, the model will be stored in `lang_id_commonvoice_model_INT8` and `neural_compressor.utils.pytorch.load` will have to be used to load the quantized model for inference. If `self.language_id` is the original model and `data_path` is the path to the audio file: ``` @@ -345,9 +351,16 @@ The following examples describe how to use the scripts to produce specific outco prediction = self.model_int8(signal) ``` -### Troubleshooting + The code above is integrated into `inference_custom.py`. You can now run inference on your data using this INT8 model: + ```bash + python inference_custom.py -p data_custom -d 3 -s 50 --vad --int8_model --verbose + ``` + + >**Note**: The `--verbose` option is required to view the latency measurements. + +**(Optional) Comparing Predictions with Ground Truth** -If the model appears to be giving the same output regardless of input, try running `clean.sh` to remove the `RIR_NOISES` and `speechbrain` folders. Redownload that data after cleaning by running `initialize.sh` and either `inference_commonVoice.py` or `inference_custom.py`. +You can choose to modify `audio_ground_truth_labels.csv` to include the name of the audio file and expected audio label (like, `en` for English), then run `inference_custom.py` with the `--ground_truth_compare` option. By default, this is disabled. ## License diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/clean.sh b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/clean.sh index f60b245773..30f1806c10 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/clean.sh +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/clean.sh @@ -1,5 +1,4 @@ #!/bin/bash -rm -R RIRS_NOISES -rm -R speechbrain -rm -f rirs_noises.zip noise.csv reverb.csv +echo "Deleting rir, noise, speechbrain" +rm -R rir noise diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/create_wds_shards.patch b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/create_wds_shards.patch index ddfe37588b..3d60bc627f 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/create_wds_shards.patch +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/create_wds_shards.patch @@ -1,5 +1,5 @@ ---- create_wds_shards.py 2022-09-20 14:55:48.732386718 -0700 -+++ create_wds_shards_commonvoice.py 2022-09-20 14:53:56.554637629 -0700 +--- create_wds_shards.py 2024-11-13 18:08:07.440000000 -0800 ++++ create_wds_shards_modified.py 2024-11-14 14:09:36.225000000 -0800 @@ -27,7 +27,10 @@ t, sr = torchaudio.load(audio_file_path) @@ -12,7 +12,7 @@ return t -@@ -61,27 +64,20 @@ +@@ -66,27 +69,22 @@ sample_keys_per_language = defaultdict(list) for f in audio_files: @@ -23,7 +23,9 @@ - f.as_posix(), - ) + # Common Voice format -+ # commonVoice_folder_path/common_voice__00000000.wav' ++ # commonVoice_folder_path/processed_data//common_voice__00000000.wav' ++ # DATASET_TYPE: dev, test, train ++ # LANG_ID: the label for the language + m = re.match(r"((.*)(common_voice_)(.+)(_)(\d+).wav)", f.as_posix()) + if m: diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/initialize.sh 
b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/initialize.sh deleted file mode 100644 index 78c114f2dc..0000000000 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/initialize.sh +++ /dev/null @@ -1,26 +0,0 @@ -#!/bin/bash - -# Activate the oneAPI environment for PyTorch -source activate pytorch - -# Install speechbrain -git clone https://github.com/speechbrain/speechbrain.git -cd speechbrain -pip install -r requirements.txt -pip install --editable . -cd .. - -# Add speechbrain to environment variable PYTHONPATH -export PYTHONPATH=$PYTHONPATH:/Training/speechbrain - -# Install webdataset -pip install webdataset==0.1.96 - -# Install PyTorch and Intel Extension for PyTorch (IPEX) -pip install torch==1.13.1 torchaudio -pip install --no-deps torchvision==0.14.0 -pip install intel_extension_for_pytorch==1.13.100 - -# Install libraries for MP3 to WAV conversion -pip install pydub -apt-get update && apt-get install ffmpeg diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/lang_id_training.ipynb b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/lang_id_training.ipynb index 0502d223e9..4550b88916 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/lang_id_training.ipynb +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/lang_id_training.ipynb @@ -29,9 +29,9 @@ "metadata": {}, "outputs": [], "source": [ - "!cp speechbrain/recipes/VoxLingua107/lang_id/create_wds_shards.py create_wds_shards.py\n", - "!cp speechbrain/recipes/VoxLingua107/lang_id/train.py train.py\n", - "!cp speechbrain/recipes/VoxLingua107/lang_id/hparams/train_ecapa.yaml train_ecapa.yaml" + "!cp ../speechbrain/recipes/VoxLingua107/lang_id/create_wds_shards.py create_wds_shards.py\n", + "!cp ../speechbrain/recipes/VoxLingua107/lang_id/train.py train.py\n", + "!cp ../speechbrain/recipes/VoxLingua107/lang_id/hparams/train_ecapa.yaml train_ecapa.yaml" ] }, { @@ -75,7 +75,7 @@ "metadata": {}, "outputs": [], "source": [ - "!python prepareAllCommonVoice.py -path /data -max_samples 2000 --createCsv --train --dev --test" + "!python prepareAllCommonVoice.py -path $COMMON_VOICE_PATH -max_samples 2000 --createCsv --train --dev --test" ] }, { @@ -102,15 +102,15 @@ "metadata": {}, "outputs": [], "source": [ - "!python create_wds_shards.py /data/commonVoice/train/ /data/commonVoice_shards/train \n", - "!python create_wds_shards.py /data/commonVoice/dev/ /data/commonVoice_shards/dev" + "!python create_wds_shards.py ${COMMON_VOICE_PATH}/processed_data/train ${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/train \n", + "!python create_wds_shards.py ${COMMON_VOICE_PATH}/processed_data/dev ${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/dev" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Note down the shard with the largest number as LARGEST_SHARD_NUMBER in the output above or by navigating to */data/commonVoice_shards/train*. In *train_ecapa.yaml*, modify the *train_shards* variable to go from 000000..LARGEST_SHARD_NUMBER. Repeat the process for */data/commonVoice_shards/dev*. " + "Note down the shard with the largest number as LARGEST_SHARD_NUMBER in the output above or by navigating to `${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/train`. In `train_ecapa.yaml`, modify the `train_shards` variable to go from 000000..LARGEST_SHARD_NUMBER. Repeat the process for `${COMMON_VOICE_PATH}/processed_data/commonVoice_shards/dev`. 
" ] }, { @@ -126,6 +126,7 @@ "source": [ "### Run the training script \n", "The YAML file *train_ecapa.yaml* with the training configurations should already be patched from the Prerequisite section. The following parameters can be adjusted in the file directly as needed: \n", + "* *seed* should be set to a different value for subsequent runs. Defaults to 1987\n", "* *out_n_neurons* must be equal to the number of languages of interest \n", "* *number_of_epochs* is set to 10 by default but can be adjusted \n", "* In the trainloader_options, the *batch_size* may need to be decreased if your CPU or GPU runs out of memory while running the training script. \n", @@ -147,18 +148,57 @@ "metadata": {}, "source": [ "### Move output model to Inference folder \n", - "After training, the output should be inside results/epaca/SEED_VALUE. By default SEED_VALUE is set to 1987 in the YAML file. This value can be changed. Follow these instructions next: \n", + "After training, the output should be inside the `results/epaca/1987` folder. By default the `seed` is set to 1987 in `train_ecapa.yaml`. You can change the value as needed.\n", "\n", - "1. Copy all files with *cp -R* from results/epaca/SEED_VALUE into a new folder called *lang_id_commonvoice_model* in the Inference folder. The name of the folder MUST match with the pretrained_path variable defined in the YAML file. By default, it is *lang_id_commonvoice_model*. \n", - "2. Navigate to /Inference/land_id_commonvoice_model/save. \n", - "3. Copy the label_encoder.txt file up one level. \n", - "4. Navigate into the latest CKPT folder and copy the classifier.ckpt and embedding_model.ckpt files into the /Inference/lang_id_commonvoice_model/ level. You may need to modify the permissions of these files to be executable before you run the inference scripts to consume them. \n", + "1. Copy all files from `results/epaca/1987` into a new folder called `lang_id_commonvoice_model` in the **Inference** folder.\n", + " The name of the folder MUST match with the pretrained_path variable defined in `train_ecapa.yaml`. By default, it is `lang_id_commonvoice_model`.\n", "\n", - "Note that if *train.py* is rerun with the same seed, it will resume from the epoch number it left off of. For a clean rerun, delete the *results* folder or change the seed. \n", + "2. Change directory to `/Inference/lang_id_commonvoice_model/save`.\n", "\n", - "### Running inference\n", - "At this point, the model can be loaded and used in inference. In the Inference folder, inference_commonVoice.py uses the trained model on \n", - "the testing dataset, whereas inference_custom.py uses the trained model on a user-specified dataset and utilizes Voice Activity Detection. Note that if the folder name containing the model is changed from *lang_id_commonvoice_model*, you will need to modify inference_commonVoice.py and inference_custom.py's *source_model_path* variable in the *speechbrain_inference* class. " + "3. Copy the `label_encoder.txt` file up one level.\n", + "\n", + "4. Change to the latest `CKPT` folder, and copy the classifier.ckpt and embedding_model.ckpt files into the `/Inference/lang_id_commonvoice_model/` folder which is two directories up." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# 1)\n", + "!cp -R results/epaca/1987 ../Inference/lang_id_commonvoice_model\n", + "\n", + "# 2)\n", + "os.chdir(\"../Inference/lang_id_commonvoice_model/save\")\n", + "\n", + "# 3)\n", + "!cp label_encoder.txt ../.\n", + "\n", + "# 4) \n", + "folders = os.listdir()\n", + "for folder in folders:\n", + " if \"CKPT\" in folder:\n", + " os.chdir(folder)\n", + " break\n", + "!cp classifier.ckpt ../../.\n", + "!cp embedding_model.ckpt ../../\n", + "os.chdir(\"../../../..\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may need to modify the permissions of these files to be executable i.e. `sudo chmod 755` before you run the inference scripts to consume them.\n", + "\n", + ">**Note**: If `train.py` is rerun with the same seed, it will resume from the epoch number it last run. For a clean rerun, delete the `results` folder or change the seed.\n", + "\n", + "You can now load the model for inference. In the `Inference` folder, the `inference_commonVoice.py` script uses the trained model on the testing dataset, whereas `inference_custom.py` uses the trained model on a user-specified dataset and can utilize Voice Activity Detection. \n", + "\n", + ">**Note**: If the folder name containing the model is changed from `lang_id_commonvoice_model`, you will need to modify the `pretrained_path` in `train_ecapa.yaml`, and the `source_model_path` variable in both the `inference_commonVoice.py` and `inference_custom.py` files in the `speechbrain_inference` class. " ] } ], diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/prepareAllCommonVoice.py b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/prepareAllCommonVoice.py index ed78ab5c35..a6ab8df1b2 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/prepareAllCommonVoice.py +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/prepareAllCommonVoice.py @@ -124,9 +124,9 @@ def main(argv): createCsv = args.createCsv # Data paths - TRAIN_PATH = commonVoicePath + "/commonVoice/train" - TEST_PATH = commonVoicePath + "/commonVoice/test" - DEV_PATH = commonVoicePath + "/commonVoice/dev" + TRAIN_PATH = commonVoicePath + "/processed_data/train" + TEST_PATH = commonVoicePath + "/processed_data/test" + DEV_PATH = commonVoicePath + "/processed_data/dev" # Prepare the csv files for the Common Voice dataset if createCsv: diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/train_ecapa.patch b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/train_ecapa.patch index 38db22cf39..c95bf540ad 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/train_ecapa.patch +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/Training/train_ecapa.patch @@ -1,60 +1,55 @@ ---- train_ecapa.yaml.orig 2023-02-09 17:17:34.849537612 +0000 -+++ train_ecapa.yaml 2023-02-09 17:19:42.936542193 +0000 -@@ -4,19 +4,19 @@ +--- train_ecapa.yaml 2024-11-13 18:08:40.313000000 -0800 ++++ train_ecapa_modified.yaml 2024-11-14 14:52:31.232000000 -0800 +@@ -4,17 +4,17 @@ # ################################ # Basic parameters -seed: 1988 +seed: 1987 - __set_seed: !apply:torch.manual_seed [!ref ] + __set_seed: !apply:speechbrain.utils.seed_everything [!ref ] output_folder: !ref results/epaca/ save_folder: !ref /save train_log: !ref /train_log.txt 
-data_folder: !PLACEHOLDER +data_folder: ./ - rir_folder: !ref - # skip_prep: False -shards_url: /data/voxlingua107_shards -+shards_url: /data/commonVoice_shards ++shards_url: /data/commonVoice/processed_data/commonVoice_shards train_meta: !ref /train/meta.json val_meta: !ref /dev/meta.json -train_shards: !ref /train/shard-{000000..000507}.tar +train_shards: !ref /train/shard-{000000..000000}.tar val_shards: !ref /dev/shard-000000.tar - # Set to directory on a large disk if you are training on Webdataset shards hosted on the web -@@ -25,7 +25,7 @@ + # Data for augmentation +@@ -32,7 +32,7 @@ ckpt_interval_minutes: 5 # Training parameters -number_of_epochs: 40 -+number_of_epochs: 10 ++number_of_epochs: 3 lr: 0.001 lr_final: 0.0001 sample_rate: 16000 -@@ -38,11 +38,11 @@ +@@ -45,10 +45,10 @@ deltas: False # Number of languages -out_n_neurons: 107 +out_n_neurons: 2 +-num_workers: 4 +-batch_size: 128 ++num_workers: 1 ++batch_size: 64 + batch_size_val: 32 train_dataloader_options: -- num_workers: 4 -- batch_size: 128 -+ num_workers: 1 -+ batch_size: 64 + num_workers: !ref +@@ -60,6 +60,21 @@ - val_dataloader_options: - num_workers: 1 -@@ -138,3 +138,20 @@ - classifier: !ref - normalizer: !ref - counter: !ref -+ -+# Below most relevant for inference using self-trained model: -+ + ############################## Augmentations ################################### + ++# Changes for code sample to work with CommonVoice dataset +pretrained_path: lang_id_commonvoice_model + +label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder @@ -69,3 +64,6 @@ + classifier: !ref /classifier.ckpt + label_encoder: !ref /label_encoder.txt + + # Download and prepare the dataset of noisy sequences for augmentation + prepare_noise_data: !name:speechbrain.augment.preparation.prepare_dataset_from_URL + URL: !ref diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/initialize.sh b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/initialize.sh new file mode 100644 index 0000000000..0021b588b1 --- /dev/null +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/initialize.sh @@ -0,0 +1,23 @@ +#!/bin/bash + +# Install huggingface datasets and other requirements +conda install -c conda-forge -y datasets tqdm librosa jupyter ipykernel ipywidgets + +# Install speechbrain +git clone --depth 1 --branch v1.0.2 https://github.com/speechbrain/speechbrain.git +cd speechbrain +python -m pip install -r requirements.txt +python -m pip install --editable . +cd .. 
+ +# Add speechbrain to environment variable PYTHONPATH +export PYTHONPATH=$PYTHONPATH:$(pwd)/speechbrain + +# Install webdataset +python -m pip install webdataset==0.2.100 + +# Install libraries for MP3 to WAV conversion +python -m pip install pydub + +# Install notebook to run Jupyter notebooks +python -m pip install notebook diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/launch_docker.sh b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/launch_docker.sh deleted file mode 100644 index 546523f6f6..0000000000 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/launch_docker.sh +++ /dev/null @@ -1,13 +0,0 @@ -#!/bin/bash - -export COMMON_VOICE_PATH="/data/commonVoice" -export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} -e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} -e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} -e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} -e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} -e SOCKS_PROXY=${SOCKS_PROXY} -e COMMON_VOICE_PATH=${COMMON_VOICE_PATH}" -docker run --privileged ${DOCKER_RUN_ENVS} -it --rm --network host \ - -v"/home:/home" \ - -v"/tmp:/tmp" \ - -v "${PWD}/Inference":/Inference \ - -v "${PWD}/Training":/Training \ - -v "${COMMON_VOICE_PATH}":/data \ - --shm-size 32G \ - intel/oneapi-aikit - \ No newline at end of file diff --git a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/sample.json b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/sample.json index ba157302ff..768ed8eb6d 100644 --- a/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/sample.json +++ b/AI-and-Analytics/End-to-end-Workloads/LanguageIdentification/sample.json @@ -12,8 +12,19 @@ { "id": "Language_Identification_E2E", "env": [ + "export COMMON_VOICE_PATH=/data/commonVoice" ], "steps": [ + "mkdir -p /data/commonVoice", + "apt-get update && apt-get install ffmpeg libgl1 -y", + "source initialize.sh", + "cd ./Dataset", + "python get_dataset.py --output_dir ${COMMON_VOICE_PATH}", + "cd ..", + "cd ./Training", + "jupyter nbconvert --execute --to notebook --inplace --debug lang_id_training.ipynb", + "cd ./Inference", + "jupyter nbconvert --execute --to notebook --inplace --debug lang_id_inference.ipynb" ] } ] diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/.gitkeep b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/License.txt b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/License.txt new file mode 100644 index 0000000000..e63c6e13dc --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/License.txt @@ -0,0 +1,7 @@ +Copyright Intel Corporation + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/README.md b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/README.md new file mode 100644 index 0000000000..a8fb984dd9 --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/README.md @@ -0,0 +1,140 @@ +# `JAX Getting Started` Sample + +The `JAX Getting Started` sample demonstrates how to train a JAX model and run inference on Intel® hardware. +| Property | Description +|:--- |:--- +| Category | Get Start Sample +| What you will learn | How to start using JAX* on Intel® hardware. +| Time to complete | 10 minutes + +## Purpose + +JAX is a high-performance numerical computing library that enables automatic differentiation. It provides features like just-in-time compilation and efficient parallelization for machine learning and scientific computing tasks. + +This sample code shows how to get started with JAX on CPU. The sample code defines a simple neural network that trains on the MNIST dataset using JAX for parallel computations across multiple CPU cores. The network trains over multiple epochs, evaluates accuracy, and adjusts parameters using stochastic gradient descent across devices. + +## Prerequisites + +| Optimized for | Description +|:--- |:--- +| OS | Ubuntu* 22.0.4 and newer +| Hardware | Intel® Xeon® Scalable processor family +| Software | JAX + +> **Note**: AI and Analytics samples are validated on AI Tools Offline Installer. For the full list of validated platforms refer to [Platform Validation](https://github.com/oneapi-src/oneAPI-samples/tree/master?tab=readme-ov-file#platform-validation). + +## Key Implementation Details + +The getting-started sample code uses the python file 'spmd_mnist_classifier_fromscratch.py' under the examples directory in the +[jax repository](https://github.com/google/jax/). +It implements a simple neural network's training and inference for mnist images. The images are downloaded to a temporary directory when the example is run first. +- **init_random_params** initializes the neural network weights and biases for each layer. +- **predict** computes the forward pass of the network, applying weights, biases, and activations to inputs. +- **loss** calculates the cross-entropy loss between predictions and target labels. +- **spmd_update** performs parallel gradient updates across multiple devices using JAX’s pmap and lax.psum. +- **accuracy** computes the accuracy of the model by predicting the class of each input in the batch and comparing it to the true target class. It uses the *jnp.argmax* function to find the predicted class and then computes the mean of correct predictions. +- **data_stream** function generates batches of shuffled training data. It reshapes the data so that it can be split across multiple cores, ensuring that the batch size is divisible by the number of cores for parallel processing. +- **training loop** trains the model for a set number of epochs, updating parameters and printing training/test accuracy after each epoch. 
The parameters are replicated across devices and updated in parallel using *spmd_update*. After each epoch, the model’s accuracy is evaluated on both training and test data using *accuracy*.
+
+## Environment Setup
+
+You will need to download and install the following toolkits, tools, and components to use the sample.
+
+**1. Get Intel® AI Tools**
+
+Required AI Tools: 'JAX'
+
If you have not already, select and install these Tools via [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). AI and Analytics samples are validated on the AI Tools Offline Installer, so it is recommended to select the Offline Installer option in AI Tools Selector. For details on which tool and framework versions are compatible,
+please see the [supported versions](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html). + +>**Note**: If Docker option is chosen in AI Tools Selector, refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples. + +**2. (Offline Installer) Activate the AI Tools bundle base environment** + +If the default path is used during the installation of AI Tools: +``` +source $HOME/intel/oneapi/intelpython/bin/activate +``` +If a non-default path is used: +``` +source /bin/activate +``` + +**3. (Offline Installer) Activate relevant Conda environment** + +For the system with Intel CPU: +``` +conda activate jax +``` + +**4. Clone the GitHub repository** +``` +git clone https://github.com/google/jax.git +cd jax +export PYTHONPATH=$PYTHONPATH:$(pwd) +``` +## Run the Sample + +>**Note**: Before running the sample, make sure Environment Setup is completed. +Go to the section which corresponds to the installation method chosen in [AI Tools Selector](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-tools-selector.html) to see relevant instructions: +* [AI Tools Offline Installer (Validated)/Conda/PIP](#ai-tools-offline-installer-validatedcondapip) +* [Docker](#docker) +### AI Tools Offline Installer (Validated)/Conda/PIP +``` + python examples/spmd_mnist_classifier_fromscratch.py +``` +### Docker +AI Tools Docker images already have Get Started samples pre-installed. Refer to [Working with Preset Containers](https://github.com/intel/ai-containers/tree/main/preset) to learn how to run the docker and samples. +## Example Output +1. When the program is run, you should see results similar to the following: + +``` +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz to /tmp/jax_example_data/ +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/train-labels-idx1-ubyte.gz to /tmp/jax_example_data/ +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/t10k-images-idx3-ubyte.gz to /tmp/jax_example_data/ +downloaded https://storage.googleapis.com/cvdf-datasets/mnist/t10k-labels-idx1-ubyte.gz to /tmp/jax_example_data/ +Epoch 0 in 2.71 sec +Training set accuracy 0.7381166815757751 +Test set accuracy 0.7516999840736389 +Epoch 1 in 2.35 sec +Training set accuracy 0.81454998254776 +Test set accuracy 0.8277999758720398 +Epoch 2 in 2.33 sec +Training set accuracy 0.8448166847229004 +Test set accuracy 0.8568999767303467 +Epoch 3 in 2.34 sec +Training set accuracy 0.8626833558082581 +Test set accuracy 0.8715999722480774 +Epoch 4 in 2.30 sec +Training set accuracy 0.8752999901771545 +Test set accuracy 0.8816999793052673 +Epoch 5 in 2.33 sec +Training set accuracy 0.8839333653450012 +Test set accuracy 0.8899999856948853 +Epoch 6 in 2.37 sec +Training set accuracy 0.8908833265304565 +Test set accuracy 0.8944999575614929 +Epoch 7 in 2.31 sec +Training set accuracy 0.8964999914169312 +Test set accuracy 0.8986999988555908 +Epoch 8 in 2.28 sec +Training set accuracy 0.9016000032424927 +Test set accuracy 0.9034000039100647 +Epoch 9 in 2.31 sec +Training set accuracy 0.9060333371162415 +Test set accuracy 0.9059999585151672 +``` + +2. Troubleshooting + + If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. 
See the *[Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html)* for more information on using the utility + +## License + +Code samples are licensed under the MIT license. See +[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) +for details. + +Third party program Licenses can be found here: +[third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt) + +*Other names and brands may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html) diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/run.sh b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/run.sh new file mode 100644 index 0000000000..2a8313d002 --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/run.sh @@ -0,0 +1,6 @@ +source $HOME/intel/oneapi/intelpython/bin/activate +conda activate jax +git clone https://github.com/google/jax.git +cd jax +export PYTHONPATH=$PYTHONPATH:$(pwd) +python examples/spmd_mnist_classifier_fromscratch.py diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/sample.json b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/sample.json new file mode 100644 index 0000000000..96c1fffd5b --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/sample.json @@ -0,0 +1,24 @@ +{ + "guid": "9A6A140B-FBD0-4CB2-849A-9CAF15A6F3B1", + "name": "Getting Started example for JAX CPU", + "categories": ["Toolkit/oneAPI AI And Analytics/Getting Started"], + "description": "This sample illustrates how to train a JAX model and run inference", + "builder": ["cli"], + "languages": [{ + "python": {} + }], + "os": ["linux"], + "targetDevice": ["CPU"], + "ciTests": { + "linux": [{ + "id": "JAX CPU example", + "steps": [ + "git clone https://github.com/google/jax.git", + "cd jax", + "conda activate jax", + "python examples/spmd_mnist_classifier_fromscratch.py" + ] + }] + }, + "expertise": "Getting Started" +} diff --git a/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/third-party-programs.txt b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/third-party-programs.txt new file mode 100644 index 0000000000..e9f8042d0a --- /dev/null +++ b/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted/third-party-programs.txt @@ -0,0 +1,253 @@ +oneAPI Code Samples - Third Party Programs File + +This file contains the list of third party software ("third party programs") +contained in the Intel software and their required notices and/or license +terms. This third party software, even if included with the distribution of the +Intel software, may be governed by separate license terms, including without +limitation, third party license terms, other Intel software license terms, and +open source software license terms. These separate license terms govern your use +of the third party programs as set forth in the “third-party-programs.txt” or +other similarly named text file. + +Third party programs and their corresponding required notices and/or license +terms are listed below. + +-------------------------------------------------------------------------------- + +1. Nothings STB Libraries + +stb/LICENSE + + This software is available under 2 licenses -- choose whichever you prefer. 
+ ------------------------------------------------------------------------------ + ALTERNATIVE A - MIT License + Copyright (c) 2017 Sean Barrett + Permission is hereby granted, free of charge, to any person obtaining a copy of + this software and associated documentation files (the "Software"), to deal in + the Software without restriction, including without limitation the rights to + use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies + of the Software, and to permit persons to whom the Software is furnished to do + so, subject to the following conditions: + The above copyright notice and this permission notice shall be included in all + copies or substantial portions of the Software. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + ------------------------------------------------------------------------------ + ALTERNATIVE B - Public Domain (www.unlicense.org) + This is free and unencumbered software released into the public domain. + Anyone is free to copy, modify, publish, use, compile, sell, or distribute this + software, either in source code form or as a compiled binary, for any purpose, + commercial or non-commercial, and by any means. + In jurisdictions that recognize copyright laws, the author or authors of this + software dedicate any and all copyright interest in the software to the public + domain. We make this dedication for the benefit of the public at large and to + the detriment of our heirs and successors. We intend this dedication to be an + overt act of relinquishment in perpetuity of all present and future rights to + this software under copyright law. + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION + WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +-------------------------------------------------------------------------------- + +2. FGPA example designs-gzip + + SDL2.0 + +zlib License + + + This software is provided 'as-is', without any express or implied + warranty. In no event will the authors be held liable for any damages + arising from the use of this software. + + Permission is granted to anyone to use this software for any purpose, + including commercial applications, and to alter it and redistribute it + freely, subject to the following restrictions: + + 1. The origin of this software must not be misrepresented; you must not + claim that you wrote the original software. If you use this software + in a product, an acknowledgment in the product documentation would be + appreciated but is not required. + 2. Altered source versions must be plainly marked as such, and must not be + misrepresented as being the original software. + 3. This notice may not be removed or altered from any source distribution. 
+ + +-------------------------------------------------------------------------------- + +3. Nbody + (c) 2019 Fabio Baruffa + + Plotly.js + Copyright (c) 2020 Plotly, Inc + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. +© 2020 GitHub, Inc. + +-------------------------------------------------------------------------------- + +4. GNU-EFI + Copyright (c) 1998-2000 Intel Corporation + +The files in the "lib" and "inc" subdirectories are using the EFI Application +Toolkit distributed by Intel at http://developer.intel.com/technology/efi + +This code is covered by the following agreement: + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + +Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. THE EFI SPECIFICATION AND ALL OTHER INFORMATION +ON THIS WEB SITE ARE PROVIDED "AS IS" WITH NO WARRANTIES, AND ARE SUBJECT +TO CHANGE WITHOUT NOTICE. + +-------------------------------------------------------------------------------- + +5. Edk2 + Copyright (c) 2019, Intel Corporation. All rights reserved. + + Edk2 Basetools + Copyright (c) 2019, Intel Corporation. All rights reserved. + +SPDX-License-Identifier: BSD-2-Clause-Patent + +-------------------------------------------------------------------------------- + +6. Heat Transmission + +GNU LESSER GENERAL PUBLIC LICENSE +Version 3, 29 June 2007 + +Copyright © 2007 Free Software Foundation, Inc. + +Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. 
+ +This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. + +0. Additional Definitions. +As used herein, “this License” refers to version 3 of the GNU Lesser General Public License, and the “GNU GPL” refers to version 3 of the GNU General Public License. + +“The Library” refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. + +An “Application” is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. + +A “Combined Work” is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the “Linked Version”. + +The “Minimal Corresponding Source” for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. + +The “Corresponding Application Code” for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. + +1. Exception to Section 3 of the GNU GPL. +You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. + +2. Conveying Modified Versions. +If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: + +a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or +b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. +3. Object Code Incorporating Material from Library Header Files. +The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: + +a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. +b) Accompany the object code with a copy of the GNU GPL and this license document. +4. Combined Works. +You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: + +a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. 
+b) Accompany the Combined Work with a copy of the GNU GPL and this license document. +c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. +d) Do one of the following: +0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. +1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. +e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) +5. Combined Libraries. +You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: + +a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. +b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. +6. Revised Versions of the GNU Lesser General Public License. +The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License “or any later version” applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. 
+ +If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. + +-------------------------------------------------------------------------------- +7. Rodinia + Copyright (c)2008-2011 University of Virginia +All rights reserved. + +Redistribution and use in source and binary forms, with or without modification, are permitted without royalty fees or other restrictions, provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + * Neither the name of the University of Virginia, the Dept. of Computer Science, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA OR THE SOFTWARE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +If you use this software or a modified version of it, please cite the most relevant among the following papers: + + - M. A. Goodrum, M. J. Trotter, A. Aksel, S. T. Acton, and K. Skadron. Parallelization of Particle Filter Algorithms. In Proceedings of the 3rd Workshop on Emerging Applications and Many-core Architecture (EAMA), in conjunction with the IEEE/ACM International +Symposium on Computer Architecture (ISCA), June 2010. + + - S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, Sang-Ha Lee and K. Skadron. +Rodinia: A Benchmark Suite for Heterogeneous Computing. IEEE International Symposium +on Workload Characterization, Oct 2009. + +- J. Meng and K. Skadron. "Performance Modeling and Automatic Ghost Zone Optimization +for Iterative Stencil Loops on GPUs." In Proceedings of the 23rd Annual ACM International +Conference on Supercomputing (ICS), June 2009. + +- L.G. Szafaryn, K. Skadron and J. Saucerman. "Experiences Accelerating MATLAB Systems +Biology Applications." in Workshop on Biomedicine in Computing (BiC) at the International +Symposium on Computer Architecture (ISCA), June 2009. + +- M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: +A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel +and Distributed Processing Symposium (IPDPS), May 2009. + +- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. 
"A Performance +Study of General Purpose Applications on Graphics Processors using CUDA" Journal of +Parallel and Distributed Computing, Elsevier, June 2008. + +-------------------------------------------------------------------------------- +Other names and brands may be claimed as the property of others. + +-------------------------------------------------------------------------------- \ No newline at end of file diff --git a/AI-and-Analytics/Getting-Started-Samples/README.md b/AI-and-Analytics/Getting-Started-Samples/README.md index 4aa716713c..14154dc9fd 100644 --- a/AI-and-Analytics/Getting-Started-Samples/README.md +++ b/AI-and-Analytics/Getting-Started-Samples/README.md @@ -27,5 +27,6 @@ Third party program Licenses can be found here: [third-party-programs.txt](https |Classical Machine Learning| Scikit-learn (OneDAL) | [Intel_Extension_For_SKLearn_GettingStarted](Intel_Extension_For_SKLearn_GettingStarted) | Speed up a scikit-learn application using Intel oneDAL. |Deep Learning
Inference Optimization|Intel® Extension of TensorFlow | [Intel® Extension For TensorFlow GettingStarted](Intel_Extension_For_TensorFlow_GettingStarted) | Guides users how to run a TensorFlow inference workload on both GPU and CPU. |Deep Learning Inference Optimization|oneCCL Bindings for PyTorch | [Intel oneCCL Bindings For PyTorch GettingStarted](Intel_oneCCL_Bindings_For_PyTorch_GettingStarted) | Guides users through the process of running a simple PyTorch* distributed workload on both GPU and CPU. | +|Inference Optimization|JAX Getting Started Sample | [IntelJAX GettingStarted](https://github.com/oneapi-src/oneAPI-samples/tree/development/AI-and-Analytics/Getting-Started-Samples/IntelJAX_GettingStarted) | The JAX Getting Started sample demonstrates how to train a JAX model and run inference on Intel® hardware. | *Other names and brands may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html)
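The new table row only links to the IntelJAX_GettingStarted sample and does not show any code. As a rough illustration of the kind of "train a JAX model, then run inference" workload that row refers to, here is a minimal JAX sketch on synthetic data. It is not taken from the sample; every name in it (`predict`, `mse_loss`, `update`) is hypothetical and chosen for brevity.

```python
# Minimal, self-contained JAX sketch: fit a tiny linear model with gradient
# descent, then run inference. Illustrative only -- not the IntelJAX sample code.
import jax
import jax.numpy as jnp

def predict(params, x):
    w, b = params
    return x @ w + b                      # simple linear model

def mse_loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

@jax.jit                                  # JIT-compile the update step
def update(params, x, y):
    grads = jax.grad(mse_loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)

# Synthetic regression data
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (256, 3))
true_w = jnp.array([[1.0], [-2.0], [0.5]])
y = x @ true_w + 0.3

# "Training": full-batch gradient descent
params = (jnp.zeros((3, 1)), jnp.zeros(1))
for _ in range(200):
    params = update(params, x, y)

# "Inference": evaluate the trained model
print(float(mse_loss(params, x, y)))
```

With an environment set up as the sample's README describes (for example with Intel® Extension for OpenXLA* installed), a sketch like this should run unmodified on CPU or on supported Intel GPUs; refer to the linked sample for the actual model, dataset, and setup steps.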