# Training camp II / Bigdata challenge

2017 Spring

Project based learning.

## Objective

To learn how to handle big data and to brush up programming skills through project-based learning (PBL).

## Schedule and place

### Duration

2nd, 23rd-25th June 2017

### Time table

| Date | Time | Activity |
|---|---|---|
| 2nd | 13:00-15:00 | Pre-meeting and drawing up research plan (PhD students) |
| 23rd | 10:00-10:10 | Opening talk (Dr. Naho Orita) |
| 23rd | 10:10-18:00 | Practical work (12:00-13:00 lunch break) |
| 24th | 10:00-18:00 | Practical work (12:00-13:00 lunch break) |
| 25th | 10:00-16:30 | Practical work (12:00-13:00 lunch break) |
| 25th | 16:30-18:00 | Presentation |

### Place

| Date | Place |
|---|---|
| 2nd | Small lecture room, 5th floor, GSIS building |
| 23rd-25th | Room 207 (Medium lecture room), 2nd floor, GSIS building |

### Description

In this class, you are expected to solve one of the programming problems listed below as a group. The problems relate to manufacturing, marketing, natural language processing and biology. Solve the problem allocated to your group.

### Role of PhD student

PhD students in a team should work on the problem as trainers for the master course students. Explain your problem to the master course students, design a research scheme, provide direction and manage your team.

### Role of master course student

Master course students in a team should work on the problem as trainees. Write the actual program code, survey the background of your problem and analyze the data.

### Presentation

On the final day (25th), give a research presentation about your project. Each team has about 20 minutes, including a question and answer period. The presentation style and format are up to you. Choose the presenter(s) from among the master course students.

### Teaching assistant

Each team has a teaching assistant.

| Team | Problem | Instructor | Instructor's major |
|---|---|---|---|
| Team A | Manufacturing | Cherdsak Kingkan | 3D modeling, machine learning, surface recognition |
| Team B | Marketing | Yinxing Li, Kazunori Yamada | Machine learning, econometrics, optimization, sequence |
| Team C | Natural language processing | Yinxing Li | Machine learning, econometrics |
| Team D | Biology | Takuro Nakayama | Evolutionary biology, phycology |

### Code

    #!/usr/bin/env python

    def main():
        # body of the program
        pass

    if __name__ == '__main__':
        main()


### Terminal

Terminal will be shown by the following box. Here, "$" stands for a prompt.

    $ ls


### Text

Text will be shown by the following box.

    This is a text file.


## Group member

A person in bold face is a PhD student.

Group A, Group B, Group C, Group D

Ilya Ardakani

Ishara Perera

Wang Xiyue

Naoya Chiba

Daiki Sato

Jie Chen

Shun Kodate

Wang Zhen

Mirai Igarashi

Xia Danlong

Mulya Agung

Kien Duy Nguyen

Agness Ethel Lakudzala

Ryo Takahashi

Komaki Ninomiya

Qikai Chen

Kentaro Ogawa

Luqman Khan

Zahra Kamalia Putri

Yuki Kagaya

Shan Liang

Shuuki Ri

### Presentation

Please send all presentation materials to kyamada@ecei.tohoku.ac.jp. The progress and activity of your team will be evaluated.

### Report (master course student)

Submit a report of between half a page and one page of A4 paper, in PDF format, to kyamada@ecei.tohoku.ac.jp before 4th July. Follow the guidelines below.

1. Include a summary of your research.
2. Clarify your contribution to the research.
3. Describe how your programming and/or research ability improved.

### Report (PhD student)

Submit a report of between half a page and one page of A4 paper, in PDF format, to kyamada@ecei.tohoku.ac.jp before 4th July. Follow the guidelines below.

1. Include a summary of your research.
2. Describe the role of the master course students. Explain the criteria by which you allocated tasks to them.
3. Clarify your contribution to the research.

### Participation in the class

Instructors will evaluate your participation in the class.

## Computational environment

### Cluster machine

You can use the cluster machine in GPDS. The user IDs for teams A, B, C and D are bcga, bcgb, bcgc and bcgd, respectively. The address is "130.34.234.40". The instructors know the passwords. You can access the machine via SSH if you are on the Tohoku University network. The cluster consists of four computation machines and a login machine; each machine has 1 GPU, 24 threads and 128 GB of memory. The quota for each user is 2 TB. Son of Grid Engine is installed as the job scheduler.
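Because jobs on the cluster go through Son of Grid Engine, each long-running task is usually wrapped in a small job script. The following is a minimal sketch of generating such a script from Python. The function name `make_sge_script`, the job name, the command and the log file name are all hypothetical; the `#$` lines use the standard Grid Engine options `-N`, `-cwd`, `-j` and `-o`. Any queue or parallel environment settings specific to this cluster are not known here and are left out.

```python
def make_sge_script(job_name, command, log_file="job.log"):
    """Return the text of a Grid Engine job script (illustrative sketch)."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",   # name shown in the queue listing
        "#$ -cwd",             # run the job in the submission directory
        "#$ -j y",             # merge standard error into standard output
        f"#$ -o {log_file}",   # file that receives the job output
        command,               # the actual work to run
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: a job that runs a Python program called train.py.
script_text = make_sge_script("train_a", "python train.py")
print(script_text)
```

After saving the text to a file such as `job.sh`, it would typically be submitted on the login machine with `qsub job.sh` and monitored with `qstat`.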

### Programming language

Use any language you like.

## Problem A (Bosch production line performance: Reduce manufacturing failures)

This project was a data science competition on Kaggle. Here we work on its dataset to learn about the data science process, such as data preprocessing, data visualization and data analysis.

### Project Objective

Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.

Please feel free to apply any analytics techniques that you can come up with.

### Dataset

This dataset was provided by Bosch for the Kaggle data science competition. Please visit the following link to download the dataset and read a fuller description: https://www.kaggle.com/c/bosch-production-line-performance

Please be aware that the ground truth of this dataset is highly imbalanced.
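With highly imbalanced labels, plain accuracy can look excellent for a model that never predicts a failure. The sketch below illustrates this on made-up labels (995 negatives and 5 positives, not taken from the Bosch data) by comparing accuracy with the Matthews correlation coefficient (MCC), one metric often used for imbalanced binary problems. All function and variable names are illustrative.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up ground truth: 995 good parts (0) followed by 5 failures (1).
y_true = [0] * 995 + [1] * 5

# Predictor 1 always answers "no failure".
always_negative = [0] * 1000

# Predictor 2 raises 10 false alarms but catches 4 of the 5 failures.
some_detection = [1] * 10 + [0] * 985 + [1] * 4 + [0] * 1

for name, pred in [("always negative", always_negative),
                   ("some detection", some_detection)]:
    print(name, "accuracy:", accuracy(y_true, pred), "MCC:", mcc(y_true, pred))
```

The always-negative predictor has the higher accuracy but an MCC of zero, which is why a metric that accounts for both classes is a safer basis for comparing models on this kind of data.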

## Problem B (Making classifier for banker to decide lending money)

### Data information

The data are related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

* The dataset has many categorical variables.
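Many classifiers need numeric inputs, so categorical variables are commonly converted with one-hot encoding. The following is a minimal sketch using only the standard library; the column names `age` and `job` and the example rows are hypothetical and are not taken from the provided files.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category value.
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records for illustration only.
rows = [
    {"age": 30, "job": "admin"},
    {"age": 45, "job": "technician"},
    {"age": 52, "job": "admin"},
]
encoded_rows = one_hot_encode(rows, "job")
print(encoded_rows[1])
```

Tree-based methods can sometimes work with integer-coded categories instead, so whether to one-hot encode is a design choice worth comparing in the experiments.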

Dataset

### Hint

This is the instructor's personal website for machine learning (in Japanese). If you like, please read it.

http://data-science.gr.jp/implementation.html

### Problem and procedure

2. Use "Sample.csv" to build multiple candidate classifier models. Randomly hold out 1,000 records as test data. Try both parametric and non-parametric methods and choose the single best model. (Use cross-validation when testing the models.)
3. Use "Full.csv" to build multiple candidate classifier models. Randomly hold out 1,000 records as test data and choose the single best model.
4. Compare how model selection differs between the small sample data and the big sample data.
5. Optional: Analyze and visualize as much information as possible.
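The hold-out and cross-validation steps in the procedure above can be sketched with the standard library alone. This is an illustration under assumptions: the row count of 5,000 is made up, the function names are hypothetical, and only row indices are produced, so any model and any data loader can be plugged in.

```python
import random


def holdout_split(n_rows, n_test=1000, seed=0):
    """Shuffle row indices and reserve n_test of them as test data."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Made-up data size; replace with the number of rows in the CSV file.
train_idx, test_idx = holdout_split(n_rows=5000)

fold_sizes = []
for fold_train, fold_valid in k_fold(train_idx, k=5):
    # Fit each candidate model on fold_train and score it on fold_valid here.
    fold_sizes.append((len(fold_train), len(fold_valid)))

print(len(train_idx), len(test_idx), fold_sizes)
```

Keeping the 1,000 test records out of the cross-validation loop means they are used only once, for the final comparison of the chosen models.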

## Problem C (Labeled topic model for predicting stock market price)

### Objective

The purpose of this project is to use text analysis methods to predict product sales.

### Problem and procedure

1. Collect text data from the website below: http://www.gsmarena.com/apple-phones-48.php
2. Choose one Apple product (iPhone or iPad) and collect its sales data. (The data are publicly available.)
3. Preprocess the text data (remove punctuation, remove stop words, etc.).
4. Apply a text analysis algorithm to the text data (topic model, Naïve Bayes, etc.).
5. Use the result of the topic model to predict the stock price, and evaluate both the prediction model and the text analysis algorithm.
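The preprocessing step in the list above can be sketched as follows. This is a minimal illustration using only the standard library: the stop-word list is a tiny hypothetical sample, the review sentence is made up, and the resulting word counts are just one possible input format for a topic model or a Naïve Bayes classifier.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "is", "and", "of", "to", "it", "in", "this"})


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return [token for token in text.split() if token not in STOP_WORDS]


def bag_of_words(documents):
    """Turn each document into a word-count mapping."""
    return [Counter(preprocess(doc)) for doc in documents]


# Made-up example sentence.
tokens = preprocess("The battery is great, and the screen is sharp!")
counts = bag_of_words(["Great battery. Great screen!"])
print(tokens)
print(counts[0])
```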

## Problem D (Identification of foreign gene in a genome)

### Background

The plastid (chloroplast) is a fundamental organelle of plant cells that carries out photosynthesis. Plastids are believed to have originated through an endosymbiosis with a cyanobacterium over 1 billion years ago. Extant plant cells control their plastids, descendants of the bacterium, using genes acquired by endosymbiotic gene transfer (EGT) from the cyanobacterial endosymbiont in the course of evolution. Thus, an EGT event is recognized as a milestone in plastid evolution.

Paulinella chromatophora is a single-celled amoeba that contains photosynthetic inclusions called “chromatophores” in its cells. This amoeba has lately gained attention from biologists because recent studies revealed that the chromatophores originated from a cyanobacterium that is independent of the one that gave rise to ordinary plastids in plants. The chromatophores of Paulinella are thought to have been acquired quite recently (~100 million years ago) compared with plants’ plastids, so Paulinella chromatophora might exhibit an early stage of plastid evolution, making this amoeba a good model for understanding the origin and evolution of extant plants on the earth.

### Problem

It had remained unclear whether EGT from the symbionts (chromatophores) to the nuclear genome, which is assumed to be pivotal in plastid evolution, has occurred in Paulinella chromatophora. Recently, the genome sequence of Paulinella chromatophora was determined and ~60,000 protein-coding genes were found in it. Check whether the provided sequences include protein-coding genes thought to have been transferred from the chromatophore (EGT genes). If they exist, detect as many EGT genes as possible within 3 days.

• A data set will be provided on the first day.
• The data provided will be protein (amino acid) sequences encoded by genes of Paulinella chromatophora.
• The closest free-living relatives of the chromatophore are cyanobacterial species belonging to the genera Synechococcus/Prochlorococcus.
• Googling the highlighted words before you get started is highly recommended.
• The following programs could help in detecting EGT genes/proteins.

| Software | Description |
|---|---|
| BLAST | Tool for sequence similarity search. |
| MAFFT | Multiple sequence alignment program. |
| IQ-TREE | Software for reconstructing phylogenetic trees. |
| BMGE | Software for selection of phylogenetic informative regions from multiple sequence alignments. |
| SEAVIEW | Multiple alignment viewer (GUI). |
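One possible first screening step is to run a BLAST similarity search and keep the proteins whose best hit is a cyanobacterial sequence. The sketch below assumes BLAST's 12-column tabular output (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The gene names, subject IDs, E-value cutoff and the set `cyano_ids` are all hypothetical. A good hit alone does not prove EGT; candidates would still need checking, for example with alignments and phylogenetic trees built by the tools listed above.

```python
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query: (subject, bitscore)} keeping the top-scoring hit per query."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                                   # skip blanks and comments
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        if float(hit["evalue"]) > max_evalue:
            continue                                   # ignore weak hits
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (hit["sseqid"], score)
    return best


def egt_candidates(best, cyano_ids):
    """List queries whose best hit is in the given set of cyanobacterial IDs."""
    return sorted(q for q, (subject, _) in best.items() if subject in cyano_ids)


# Made-up search results for two hypothetical genes.
example_rows = [
    ["geneA", "syn_001", "78.5", "200", "43", "0", "1", "200", "5", "204", "1e-50", "310"],
    ["geneA", "euk_042", "40.1", "180", "100", "4", "3", "182", "9", "190", "1e-12", "90"],
    ["geneB", "euk_007", "65.0", "150", "52", "1", "1", "150", "1", "150", "1e-30", "200"],
]
tabular_text = "\n".join("\t".join(row) for row in example_rows)

cyano_ids = {"syn_001"}   # hypothetical IDs of Synechococcus/Prochlorococcus proteins
candidates = egt_candidates(best_hits(tabular_text), cyano_ids)
print(candidates)
```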

## Instructors

Takuro Nakayama, Naho Orita, Kazunori Yamada, Aijing Xing, Cherdsak Kingkan, Yinxing Li