Free Seminar for Chula Engineering Students (all majors)
Department of Industrial Engineering, Chula Engineering, and The University of Tokyo, Japan, present
“Text Mining by Using Python: Application to Patent Documents”
Instructor: Prof. Kazuyuki Motohashi (U-Tokyo, Japan)
Teaching Assistant: Dr. Suchit Pongnumkul
Date: May 27, 29, 31, 2019
Time: 09:00-12:00
Venue: Room 407, 4F, Engineering Building 4, Faculty of Engineering, Chulalongkorn University
Language: English (and Thai by TA)
- Contents
– Introduction to patent data analysis
– What is patent data? Why is it used for technology management research?
– Various patent databases: PATSTAT, JPO (IIP patent database), USPTO
– Keyword extraction, TF-IDF, similarity measures
– VIDEO (O’Reilly Media) https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
– HOMEWORK: Apply the video exercise above to patent abstract documents, and extract three keywords using TF-IDF scores
- Patent Similarity
– Review of homework
– Preprocessing of text: tokenization, normalization (lowercasing), stemming/lemmatization, and stop-word removal
– Text processing of tf-idf vectors:
1. Create dictionary: mapping every word to a number
2. Corpus (list of bags of words): for each document, the number of times each word occurs
– Calculation of similarity measures between documents: gensim.similarities
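The dictionary, corpus, and similarity steps above can be sketched as follows. This is a minimal illustration that uses only the Python standard library rather than gensim. The three sample abstracts, the stop-word list, and the helper names (tokenize, tfidf, cosine) are made up for illustration and are not part of the seminar material.

```python
import math
import re
from collections import Counter

# Hypothetical toy "abstracts" (not real patent data).
docs = [
    "A neural network method for image recognition",
    "Image recognition using a deep neural network",
    "A gene editing method for modified plant cells",
]
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "for", "using", "the", "of"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


tokens = [tokenize(d) for d in docs]

# 1. Dictionary: map every distinct word to an integer id.
dictionary = {}
for doc in tokens:
    for word in doc:
        dictionary.setdefault(word, len(dictionary))

# 2. Corpus: one bag of words (word id -> count) per document.
corpus = [Counter(dictionary[w] for w in doc) for doc in tokens]

# TF-IDF weights: term count times log(N / document frequency).
n_docs = len(corpus)
doc_freq = Counter(word_id for bow in corpus for word_id in bow)


def tfidf(bow):
    return {i: tf * math.log(n_docs / doc_freq[i]) for i, tf in bow.items()}


vectors = [tfidf(bow) for bow in corpus]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sim_01 = cosine(vectors[0], vectors[1])  # two image-recognition abstracts
sim_02 = cosine(vectors[0], vectors[2])  # image recognition vs. gene editing
print(f"doc0 vs doc1: {sim_01:.3f}")
print(f"doc0 vs doc2: {sim_02:.3f}")
```

In this toy data, the two image-recognition abstracts share four weighted terms, while the first and third share only "method", so the first pair scores higher.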
– Topic Modeling
– What is topic modeling (with some examples)
1. Understanding the concept: LDA
– Gensim module in Python
– Topic modeling exercises for the following three technologies, with comparison to the JPO classification
1. Artificial intelligence
2. Autonomous driving
3. Gene modification technology
– A good reference on topic modeling with gensim: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
– Assignment for further work (three-week program)
RSVP at https://forms.gle/mPYfDKYx8VmR8jQM8
Capacity: 20 seats
(Registration deadline: May 18, 2019)
Contact: Natt Leelawat, D.Eng. (natt.l@chula.ac.th)