Training camp II / Bigdata challenge

2017 Spring

Project based learning.


To learn how to handle big data and to improve programming skills through project-based learning (PBL).

Schedule and place


2nd, 23rd-25th June 2017

Time table

2nd    13:00-15:00    Pre-meeting and drawing up research plan (PhD students).
23rd   10:00-10:10    Opening talk (Dr. Naho Orita).
       10:10-18:00    Practical work (12:00-13:00 Lunch break)
24th   10:00-18:00    Practical work (12:00-13:00 Lunch break)
25th   10:00-16:30    Practical work (12:00-13:00 Lunch break)


2nd           Small lecture room, 5th floor, GSIS building
23rd - 25th   Room 207 (Medium lecture room), 2nd floor, GSIS building

About the class


In this class, you are expected to solve, as a group, one of the programming problems listed below. The problems relate to manufacturing, marketing, natural language processing and biology. Solve the problem allocated to your group.

Role of PhD student

PhD students in a team should work on the problem as trainers for the master course students. Explain your problem to the master course students, design a research scheme, provide direction and manage your team.

Role of master course student

Master course students in a team should work on the problem as trainees. Write the actual program code, survey the background of your problem and analyze the data.


On the final day (25th), each team gives a research presentation about its project. Each team has about 20 minutes, including a question and answer period. The presentation style and format are up to you. Choose the presenter(s) from among the master course students.

Teaching assistant

Each team has a teaching assistant.

Team     Problem                       Instructor                    Instructor's major
Team A   Manufacturing                 Cherdsak Kingkan              3D modeling, machine learning, surface recognition
Team B   Marketing                     Yinxing Li, Kazunori Yamada   Machine learning, econometrics, optimization, sequence
Team C   Natural language processing   Yinxing Li                    Machine learning, econometrics
Team D   Biology                       Takuro Nakayama               Evolutionary biology, phycology


On this page, code is shown in the following format.

#!/usr/bin/env python

def main():
	# body of the program
	pass

if __name__ == '__main__':
	main()


Terminal input is shown in the following box. Here, "$" stands for the prompt.

$ ls


Text will be shown by the following box.

This is a text file.

Group member

Names in bold face are PhD students.

Group A    Group B    Group C    Group D

Ilya Ardakani

Ishara Perera

Wang Xiyue

Naoya Chiba

Daiki Sato

Jie Chen

Shun Kodate

Wang Zhen


Mirai Igarashi

Xia Danlong


Mulya Agung

Kien Duy Nguyen

Agness Ethel Lakudzala

Ryo Takahashi

Komaki Ninomiya

Qikai Chen

Kentaro Ogawa

Luqman Khan

Zahra Kamalia Putri

Yuki Kagaya

Shan Liang

Shuuki Ri

Evaluation of grade


Please send all presentation materials to The progress and activity of your team will be evaluated.

Report (master course student)

Submit a report of at least half a page and at most one page of A4 paper, in PDF format, to before 4th July. Follow the guidelines below.

  1. Include a summary of your research.
  2. Clarify your contribution to the research.
  3. Describe how your programming and/or research ability improved.

Report (PhD student)

Submit a report of at least half a page and at most one page of A4 paper, in PDF format, to before 4th July. Follow the guidelines below.

  1. Include a summary of your research.
  2. Describe the role of the master course students. Clarify the criteria by which you allocated tasks to them.
  3. Clarify your contribution to the research.
  4. Describe how your teaching, management, leadership and/or research ability improved.

Participation in the class

Instructors will evaluate your participation in the class.

Computational environment

Cluster machine

You can use the cluster machine in GPDS. The user IDs for teams A, B, C and D are bcga, bcgb, bcgc and bcgd, respectively. The address is "". The instructors know the password. You can access the machine via SSH if you are on the Tohoku University network. The cluster consists of 4 computation machines and a login machine; each machine has 1 GPU, 24 threads and 128 GB of memory. The quota for each user is 2 TB. The job scheduler is Son of Grid Engine.
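Because the cluster runs jobs through Son of Grid Engine, work is normally submitted as a job script rather than run on the login machine. The following is a minimal sketch that writes such a script from Python. The job name, "train.py" and "data.csv" are placeholder names, and the directives shown (-S, -cwd, -N, -o, -e) are only the generic Grid Engine options; any queue or parallel environment settings specific to this cluster are not covered here, so ask your instructor.

```python
#!/usr/bin/env python
import shlex


def make_job_script(job_name, command):
    """Return the text of a Grid Engine job script that runs `command`.

    `command` is a list of arguments, e.g. ["python", "train.py"].
    """
    lines = [
        "#!/bin/bash",
        "#$ -S /bin/bash",             # shell used to run the job
        "#$ -cwd",                     # run in the directory where qsub was called
        "#$ -N " + job_name,           # job name shown by qstat
        "#$ -o " + job_name + ".out",  # file that receives standard output
        "#$ -e " + job_name + ".err",  # file that receives standard error
        " ".join(shlex.quote(part) for part in command),
    ]
    return "\n".join(lines) + "\n"


def main():
    # "train.py" and "data.csv" are hypothetical file names.
    print(make_job_script("bcga_train", ["python", "train.py", "data.csv"]))


if __name__ == '__main__':
    main()
```

If the printed text is saved as job.sh, it can be submitted with "qsub job.sh" and monitored with "qstat".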

Programming language

Use any language you like.

Problem A (Bosch production line performance: Reduce manufacturing failures)

This project was a data science competition on Kaggle. Here we work on its dataset to learn the data science process, such as data preprocessing, data visualization and data analysis.

Project Objective

Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.

Please feel free to apply any analytics techniques that you can come up with.


This dataset was provided by Bosch for a Kaggle data science competition. Please visit the following link to download the dataset and read a fuller description : Please be aware that the ground truth (class labels) of this dataset is highly imbalanced.
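To see why the imbalance matters, the sketch below compares plain accuracy with the Matthews correlation coefficient (MCC), one common score for imbalanced binary labels. The labels are made up for illustration (990 passing parts and 10 failures) and are not taken from the Bosch data; choosing MCC here is an assumption, not a requirement of this class.

```python
#!/usr/bin/env python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def main():
    # Made-up labels: 0 = part passed, 1 = part failed.
    y_true = [0] * 990 + [1] * 10
    always_pass = [0] * 1000  # a "classifier" that never predicts a failure
    print("accuracy:", accuracy(y_true, always_pass))
    print("MCC:", mcc(y_true, always_pass))


if __name__ == '__main__':
    main()
```

A model that never predicts a failure looks excellent by accuracy but carries no information by MCC, so accuracy alone is a poor guide on data like this.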

Data example


Problem B (Making classifier for banker to decide lending money)

Data information

The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

* The Dataset has many categorical variables.
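Many classifiers need numeric input, so categorical variables are often converted first. Below is a minimal sketch of one-hot encoding using only the standard library. The column names "job" and "age" and their values are hypothetical examples; check bank-additional-names.txt for the real variables.

```python
#!/usr/bin/env python


def one_hot_encode(rows, column):
    """Replace a categorical column by one 0/1 column per category.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category.
        for category in categories:
            new_row[column + "=" + category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def main():
    # Hypothetical rows for illustration only.
    rows = [
        {"age": 30, "job": "admin"},
        {"age": 22, "job": "student"},
        {"age": 45, "job": "admin"},
    ]
    for row in one_hot_encode(rows, "job"):
        print(row)


if __name__ == '__main__':
    main()
```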




This is the instructor's personal website for machine learning (in Japanese). If you like, please read it.

Problem and procedure

  1. Download the dataset from the above link, and read bank-additional-names.txt for more information about the dataset.
  2. Use "Sample.csv" to build multiple candidate classifier models. Randomly choose 1,000 records as test data. Try both parametric and non-parametric methods and choose the single best model. (Use cross-validation when testing the models.)
  3. Use "Full.csv" to build multiple candidate classifier models. Randomly choose 1,000 records as test data and choose the single best model.
  4. Compare how model selection differs between the small sample data and the big sample data.
  5. Optional: Analyze and visualize as much information as possible.
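The splitting described in steps 2 and 3 can be sketched as follows: hold out 1,000 randomly chosen rows as test data, then run k-fold cross-validation on the remaining rows. This is only an outline of the index bookkeeping under stated assumptions. The row count of 5,000 and the choice of 5 folds are arbitrary, and model fitting is left out.

```python
#!/usr/bin/env python
import random


def holdout_split(n_rows, n_test, seed=0):
    """Randomly split row indices into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def main():
    # 5000 is a hypothetical number of rows; use the real size of your file.
    train, test = holdout_split(5000, 1000)
    print("test rows:", len(test))
    for fold, (training, validation) in enumerate(k_fold_indices(train, 5)):
        # Fit each candidate model on `training` and score it on `validation`.
        print("fold", fold, len(training), len(validation))


if __name__ == '__main__':
    main()
```

The held-out 1,000 rows should be touched only once, to score the model finally chosen by cross-validation.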

Problem C (Labeled topic model for predicting stock market price)


The purpose of this project is to use text analysis methods to predict product sales.

Problem and procedure

  1. Collect text data from the website below.

  3. Choose one Apple product (iPhone or iPad) and collect its sales data. (The data are open source.)
  4. Pre-process the text data (remove punctuation, remove stop words, etc.).
  5. Apply a text analysis algorithm to the text data (topic model, Naïve Bayes, etc.).
  6. Use the result of the topic model to predict the stock price, and evaluate both the prediction model and the text analysis algorithm.
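The pre-processing step can be sketched as below: lowercase the text, strip punctuation, drop stop words, and count the remaining words per document. The stop word list here is a tiny hypothetical sample and the example sentence is made up; a real project would likely use a fuller list. The resulting word counts are the usual input to a topic model or a Naïve Bayes classifier.

```python
#!/usr/bin/env python
import string
from collections import Counter

# A tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it", "its"}

# Map every ASCII punctuation character to a space.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, remove punctuation, and remove stop words."""
    cleaned = text.lower().translate(PUNCTUATION_TABLE)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def bag_of_words(documents):
    """Return one word-count Counter per document."""
    return [Counter(preprocess(document)) for document in documents]


def main():
    # Made-up example text.
    documents = ["The new iPhone is great, and its camera is better!"]
    print(preprocess(documents[0]))
    print(bag_of_words(documents)[0])


if __name__ == '__main__':
    main()
```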

Schematic diagram of procedure


Problem D (Identification of foreign gene in a genome)


The plastid (chloroplast) is a fundamental organelle of plant cells that carries out photosynthesis. Plastids are believed to have originated through an endosymbiosis with a cyanobacterium over 1 billion years ago. Extant plant cells control their plastids, descendants of the bacterium, using genes acquired by endosymbiotic gene transfer (EGT) from the cyanobacterial endosymbiont in the course of evolution. Thus, an EGT event is recognized as a milestone in plastid evolution.

Paulinella chromatophora is a single-celled amoeba that contains photosynthetic inclusions called “chromatophores” in its cells. This amoeba has lately gained attention from biologists because recent studies revealed that the chromatophores originated from a cyanobacterium that is independent of the one that gave rise to ordinary plastids in plants. The chromatophores of Paulinella are thought to have been acquired quite recently (~100 million years ago) compared with plant plastids, so Paulinella chromatophora might exhibit an early stage of plastid evolution, making this amoeba a good model for understanding the origin and evolution of extant plants on the earth.


It had remained unclear whether EGT from the symbionts (chromatophores) to the nuclear genome, which is assumed to be pivotal in plastid evolution, has occurred in Paulinella chromatophora. Recently, the genome sequence of Paulinella chromatophora was determined and ~60,000 protein-coding genes were found in the genome. Check whether there are protein-coding genes thought to have been transferred from the chromatophore (EGT genes) among the provided sequences. If they exist, detect as many EGT genes as possible within 3 days.

  • A data set will be provided on the first day.
  • The data provided will be protein (amino acid) sequences encoded by genes of Paulinella chromatophora.
  • The closest free-living relatives of the chromatophore are cyanobacterial species belonging to the genera Synechococcus/Prochlorococcus.
  • Googling the highlighted words before you get started is highly recommended.
  • The following programs could help in detecting EGT genes/proteins.
    BLAST     Tool for sequence similarity search.
    MAFFT     Multiple sequence alignment program.
    IQ-TREE   Software for reconstructing phylogenetic trees.
    BMGE      Software for selection of phylogenetic informative regions from multiple sequence alignments.
    SEAVIEW   Multiple alignment viewer (GUI).
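One possible first-pass screen, sketched below, is to run BLAST with tabular output (-outfmt 6) and keep the query proteins whose best-scoring hit comes from Synechococcus or Prochlorococcus. This is a heuristic under stated assumptions, not the expected solution. In particular, the sketch assumes that each subject ID begins with the genus name, which depends entirely on how you build your own database, and the example lines are made up. Candidates found this way would still need checking with alignments and phylogenetic trees using the programs listed above.

```python
#!/usr/bin/env python

# Genera named in the problem description as the closest free-living relatives.
CYANOBACTERIAL_GENERA = ("Synechococcus", "Prochlorococcus")


def best_hits(blast_lines):
    """Return {query: (subject, bitscore)} keeping the highest bitscore.

    Expects the 12 default columns of BLAST tabular output:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def egt_candidates(blast_lines):
    """List queries whose best hit has a cyanobacterial subject ID.

    Assumes (hypothetically) that subject IDs start with the genus name.
    """
    hits = best_hits(blast_lines)
    return sorted(query for query, (subject, _) in hits.items()
                  if subject.startswith(CYANOBACTERIAL_GENERA))


def main():
    # Made-up hits for illustration only.
    lines = [
        "gene1\tSynechococcus_sp|s1\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-90\t300.0",
        "gene1\tArabidopsis_thaliana|a1\t40.0\t180\t100\t2\t5\t190\t3\t185\t1e-20\t90.0",
    ]
    print(egt_candidates(lines))


if __name__ == '__main__':
    main()
```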


Takuro Nakayama, Naho Orita, Kazunori Yamada, Aijing Xing, Cherdsak Kingkan, Yinxing Li