My work for Geescore revolves around Natural Language Processing and Machine Learning. I work as a backend developer, dealing with parsing, scoring, and the behind-the-scenes decisions that drive both.
Geescore aims to innovate in the automated hiring industry by automatically scoring a resume against a given job posting – any job posting anywhere, for that matter. This scoring is automated and also dynamic, meaning the score can increase as jobseekers interact more with the tool and make suggested improvements to their resumes.
One of the core aspects of such a tool is the ability to parse both resumes and job postings, and this is one of the key areas that I deal with.
For resumes, we accept a fixed set of formats: txt, doc, docx and PDF. First, the text is extracted from the resume and cleaned. Then the different sections are identified and separated; a resume is usually segmented into sections such as basic information, education, work experience, skills, interests, certifications and references. These sections are identified using a combination of Machine Learning and advanced parsing rules, with the current solution relying primarily on Machine Learning. For this we have trained two classification models: the first identifies which lines in the resume text are section headers, and the second analyzes the accumulated text of a section to determine which section type it belongs to. These classifiers are currently trained on data from around 500 resumes, and we plan to keep training them on more data as it becomes available to us. The parsing rules involve complex regular expressions coupled with section-specific conditions, and serve as a backup in case the ML models fail to identify some section(s). We do not want to rely too heavily on rules, since resume formats can vary widely, which is why the preference and main focus is the Machine Learning approach.
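The two-stage flow above can be sketched in a few lines. This is a toy illustration only: a simple vocabulary heuristic stands in for the trained header classifier, and the header list, function names and sample resume are assumptions, not Geescore's actual models or data.

```python
# Minimal sketch of the resume segmentation flow: detect header
# lines, then bucket the following body lines under that section.
# A heuristic stands in for the two trained classifiers here.
KNOWN_HEADERS = {"education", "work experience", "skills",
                 "certifications", "references", "interests"}

def looks_like_header(line: str) -> bool:
    """Stand-in for classifier #1: is this line a section header?"""
    return line.strip().rstrip(":").lower() in KNOWN_HEADERS

def section_type(header: str) -> str:
    """Stand-in for classifier #2: which section type is this?
    Here we simply normalize the header; the real model also
    analyzes the accumulated text of the section."""
    return header.strip().rstrip(":").lower()

def segment_resume(lines):
    """Split resume lines into {section_type: [body lines]}."""
    current = "basic information"
    sections = {current: []}
    for line in lines:
        if looks_like_header(line):
            current = section_type(line)
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections

resume = ["Jane Doe, jane@example.com",
          "Work Experience:",
          "Backend developer, 2019-2021.",
          "Skills",
          "Python, SQL, Docker."]
print(segment_resume(resume))
# → {'basic information': ['Jane Doe, jane@example.com'],
#    'work experience': ['Backend developer, 2019-2021.'],
#    'skills': ['Python, SQL, Docker.']}
```

In the real pipeline, both decisions are made by trained models rather than lookups, with the regex-based rules as a fallback.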
Once a resume has been parsed, it is stored in the database.
Parsing job postings
Parsing job postings is a similar but more complicated task. A web scraping engine first extracts the text from the job posting's URL, and this text is then cleaned. Afterwards, we again apply the section-identification technique; the content of a job posting can typically be segmented into sections such as job description, company description, responsibilities, qualifications and contact information. However, parsing job postings is much more difficult, for reasons including:
The formats of job postings vary a lot. Since these are essentially web pages, there are many more ways to design them creatively, and the content can also be dynamic
Often sections do not have explicit headers; they are instead indicated by placement or other visual cues
The text scraped from a web page containing a job posting includes a lot of junk text, such as that of headers, footers, menus and submenus
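The scrape-then-clean step can be sketched with the standard library alone. The source does not name the actual scraping engine; the tag blocklist below is an illustrative assumption about how junk such as menus and footers might be filtered.

```python
from html.parser import HTMLParser

# Sketch of extracting and cleaning visible text from a job-posting
# page, skipping tags that usually hold boilerplate (an assumed list).
class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped tags,
        # collapsing internal whitespace.
        if not self._skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))

def clean_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

html = """<html><head><style>p {color: red}</style></head>
<body><nav>Home | Jobs | Login</nav>
<h1>Backend Developer</h1><p>We are hiring a Python developer.</p>
<footer>Example Corp</footer></body></html>"""
print(clean_text(html))
# → Backend Developer
#   We are hiring a Python developer.
```

A production scraper would also have to render dynamic content, which a static HTML parser like this cannot do.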
Because of these complexities, the Machine Learning solution we developed for parsing job postings is different: here we use a custom Named Entity Recognition (NER) classifier, trained on tagged job postings to identify section headers or, in some cases, section values. Only a single classifier is used, and it handles both cases: it identifies a section together with its value when the value is short (such as location, salary or job type), and only the section header when the corresponding value is too large (such as qualifications, responsibilities or benefits). Generating training data for this classifier is currently a problem: the documents have to be tagged properly, which takes a lot of time, and they also have to be tagged very carefully, since we do not have a lot of tagged data at hand. Currently the NER classifier is trained on a total of 360 documents, but it needs many more for better performance.
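Tagged data for a custom NER model is commonly stored as character-offset annotations over the raw text (this is, for example, the shape spaCy's training data takes). The labels and example posting below are made up for illustration; a small validation pass like this matters because careless offsets silently hurt a model trained on so little data.

```python
# One tagged job posting: the raw text plus (start, end, label)
# character spans. Labels and text here are illustrative only.
TRAIN_DATA = [
    ("Backend Developer - Toronto, ON. Salary: $90k. Full-time.",
     {"entities": [(20, 31, "LOCATION"),
                   (41, 45, "SALARY"),
                   (47, 56, "JOB_TYPE")]}),
]

def check_offsets(example):
    """Resolve each span back to its surface text, so a human can
    verify the tags before they ever reach the model."""
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]

print(check_offsets(TRAIN_DATA[0]))
# → [('Toronto, ON', 'LOCATION'), ('$90k', 'SALARY'),
#    ('Full-time', 'JOB_TYPE')]
```

Short-value sections (location, salary, job type) are tagged as whole entities; for long sections only the header line would carry a tag.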
We also have backup rules for identifying the different sections, but we are close to removing them altogether, since they do not perform well on content that varies as much as job postings do.
Once a job posting has been parsed, it is stored in the database.
For scoring resumes, we have a sophisticated algorithm in place that considers multiple perspectives. Currently, however, we consider two factors: the content similarity between the resume and the job description, and the geographical distance between the job posting's location and the jobseeker's residence.
Content similarity is judged by extracting keywords, keyphrases and acronyms from the jobseeker's resume and looking them up in the corresponding job description: the more matches, the higher the score. This extraction and lookup is not done naively; for example, keywords extracted from the work experience and skills sections of the resume are looked up in the qualifications and responsibilities of the job posting. Similarly, the jobseeker's educational degrees are extracted and fuzzy-matched against the educational qualifications of the job posting, with a strong match scoring highly. Keywords are extracted using RAKE (Rapid Automatic Keyword Extraction); we also experimented with TF-IDF, but it did not give good results. Acronyms are extracted using regular expressions. We have more advanced content-matching techniques in mind, such as Doc2Vec, but these require a huge amount of data for both resumes and job descriptions, which we currently do not possess.
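The extraction-and-lookup idea can be sketched as follows. This is a minimal RAKE-style implementation (candidate phrases split at stopwords, word scores of degree/frequency, phrase score as the sum) plus a simple all-caps acronym regex; the stopword list, sample texts and scoring details are illustrative assumptions, not Geescore's implementation.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stopword list for the sketch.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
             "from", "has", "in", "is", "it", "of", "on", "or", "the",
             "to", "with", "we", "our", "you", "your"}

def rake_keywords(text):
    """RAKE-style extraction: phrases are maximal runs of
    non-stopwords; each word scores degree/frequency and a phrase
    scores the sum of its word scores."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p)
              for p in phrases}
    return sorted(scored, key=scored.get, reverse=True)

def acronyms(text):
    """All-caps tokens of two or more letters, e.g. SQL."""
    return set(re.findall(r"\b[A-Z]{2,}\b", text))

# Keywords from the resume's skills text, looked up in the
# posting's requirements text.
resume_skills = "Built REST APIs with Python and SQL for data pipelines"
posting_reqs = "Experience with Python and SQL and data pipelines"
matches = set(rake_keywords(resume_skills)) & set(rake_keywords(posting_reqs))
print(matches, acronyms(resume_skills))
```

A score would then grow with the number of such matches; the real system additionally fuzzy-matches degrees against educational requirements.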
Geographical distance between the job posting's location and the jobseeker's location contributes in a simple way: the closer the two locations, the higher the score.
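A distance-based score component could be sketched like this. The haversine formula and the decay constant are illustrative assumptions; the source only states that closer locations score higher.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius 6371 km

def distance_score(km, halve_every_km=50.0):
    """Map distance to (0, 1]: 1.0 at zero distance, then halving
    every 50 km (an assumed decay rate)."""
    return 0.5 ** (km / halve_every_km)

# Illustrative coordinates: Toronto city centre vs. Mississauga.
d = haversine_km(43.6532, -79.3832, 43.5890, -79.6441)
print(round(d, 1), round(distance_score(d), 2))
```

The monotone decay is the only property the score relies on, so any decreasing function of distance would fit here.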
Tools and Technologies
All of this has been developed in Python 3. Some of the modules being used are: