Yi Hong

Shanghai Jiao Tong University

CS 248: Big Data Processing

Overview: We are in the era of big data and deep learning. In this course, we will discuss techniques that are designed for processing big data. In particular, we will cover classical algorithms for mining high-dimensional data, data streams and graphs, machine learning and deep learning techniques for handling big data, and recent advances for understanding medical big data.

Course Information

  • Class meetings: W 10:00am – 11:40am (weeks 1-16), F 10:00am – 11:40am (weeks 9-16) @ Dong Shang Yuan 206

  • Instructor: Yi Hong (yi.hong -at- sjtu.edu.cn, office: SEIEE 3-501)

  • Office hours: by appointment

  • Course webpage: http://cs.sjtu.edu.cn/~yihong/CS248-Fall2021.html

Topics

  • Finding similar items

  • Clustering

  • Dimensionality reduction

  • Mining data streams

  • Graph analysis

  • Large-scale machine learning

  • Deep big data

  • Medical big data

Reference Book and Resources

  • Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining of Massive Datasets, Third Edition. http://www.mmds.org.

  • Ziyu Lin, Principles and Applications of Big Data Technology, Third Edition.

  • Papers from top conferences and journals.

Grading

  • Homework (30%; 3 assignments, 10% each)

  • Paper Presentation (25%)

  • Course Project (40%; Proposal 5%, Project Update 5%, Final Presentation 15%, Final Report 15%)

  • Participation and Attendance (5%)

  • Late Policy: 10% off for 1 day late, 20% off for 2 days late, 30% off for 3 days late. Submissions more than 3 days late are not accepted.
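The weights and late penalties above can be sketched in a few lines of Python. This is an illustrative sketch only, not an official grade calculator: the names `WEIGHTS`, `late_multiplier`, and `final_grade` are hypothetical, and the code assumes the late penalty is taken off the earned score of the late submission, which the policy does not state explicitly.

```python
# Component weights in percent, taken from the Grading list.
WEIGHTS = {
    "homework": 30,       # 3 assignments, 10% each
    "presentation": 25,   # paper presentation
    "project": 40,        # proposal 5, update 5, final presentation 15, report 15
    "participation": 5,   # participation and attendance
}


def late_multiplier(days_late: int) -> float:
    """Fraction of the earned score kept for a late submission.

    Assumption: the penalty is applied to the earned score.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        return 0.0  # not accepted after 3 late days
    return (100 - 10 * days_late) / 100


def final_grade(scores: dict) -> float:
    """Weighted total on a 0-100 scale; each score is a fraction in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, under these assumptions a homework submitted 2 days late keeps `late_multiplier(2)` of its earned score, and the four weights sum to 100.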

Academic Honesty

All students are responsible for maintaining the highest standards of honesty and integrity in every phase of their academic careers. The penalties for academic dishonesty are severe and ignorance is not an acceptable defense.

Tentative Schedule

Date | Topic | Reading | Assignments
Sep 15 (W) | Course Introduction; Introduction to Big Data Processing | Ziyu Lin, Chapter 1 | --
Sep 22 (W) | Finding Similar Items | Leskovec et al., Chapter 3 | --
Sep 29 (W) | Clustering | Leskovec et al., Chapter 7 | --
Oct 6 (W) | Dimensionality Reduction 1 | Leskovec et al., Chapter 11 | Homework 1 Handout
Oct 13 (W) | Dimensionality Reduction 2 | Leskovec et al., Chapter 11 | --
Oct 20 (W) | Mining Data Streams 1 | Leskovec et al., Chapter 4 | Homework 1 Due
Oct 27 (W) | Mining Data Streams 2 | Leskovec et al., Chapter 4 | Course Project Proposal Due on Sunday (Oct 31) 11:59pm
Nov 3 (W) | Graph Analysis 1 | Leskovec et al., Chapter 10 | Homework 2 Handout
Nov 10 (W) | Graph Analysis 2 | Leskovec et al., Chapter 10 | --
Nov 12 (F) | Introduction to MapReduce | Leskovec et al., Chapter 2; Ziyu Lin, Chapter 7 | --
Nov 17 (W) | Paper Presentation 1 | -- | Homework 2 Due
Nov 19 (F) | Paper Presentation 2 | -- | --
Nov 24 (W) | Introduction to Spark | Ziyu Lin, Chapter 10 | Homework 3 Handout
Nov 26 (F) | Large-Scale Machine Learning 1 | Leskovec et al., Chapter 12 | --
Dec 1 (W) | Large-Scale Machine Learning 2 | Leskovec et al., Chapter 2 | --
Dec 3 (F) | Course Project Update | -- | --
Dec 8 (W) | Deep Big Data Analytics 1 | Leskovec et al., Chapter 13; Online Papers | Homework 3 Due
Dec 10 (F) | Deep Big Data Analytics 2 | Leskovec et al., Chapter 13; Online Papers | --
Dec 15 (W) | Paper Discussion/Guest Lecture | -- | --
Dec 17 (F) | Medical Big Data 1 | Online Papers | --
Dec 22 (W) | Medical Big Data 2 | Online Papers | --
Dec 24 (F) | Paper Discussion/Guest Lecture | -- | --
Dec 29 (W) | Course Project Final Presentation 1 | -- | --
Dec 31 (F) | Course Project Final Presentation 2 | -- | Course Project Report Due in One Week (Jan 7, 2022) 11:59pm

Paper Reading List

Literature Review

  • Saxena et al., A Review of Clustering Techniques and Developments, Neurocomputing, 2017.

  • Min et al., A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture, IEEE Access, 2018.

  • Yang et al., Multi-View Clustering: A Survey, Big Data Mining and Analytics, 2018.

  • Zhang et al., Graph Convolutional Networks: A Comprehensive Review, Computational Social Networks, 2019.

  • Tustison et al., Learning image-based spatial transformations via convolutional neural networks: A review, Magnetic Resonance Imaging, 2019.

  • Xu et al., Review of classical dimensionality reduction and sample selection methods for large-scale data processing, Neurocomputing, 2019.

  • Lathuiliere et al., A Comprehensive Analysis of Deep Regression, TPAMI 2020.

  • Ma et al., Image Matching from Handcrafted to Deep Features: A Survey, IJCV 2021.

  • Ramirez-Gallego et al., A survey on data preprocessing for data stream mining: Current status and future directions, Neurocomputing, 2017.

  • Zubaroglu et al., Data stream clustering: a review, Artificial Intelligence Review, 2021.

  • Jan et al., Deep Learning in Big Data Analytics: A Comparative Study, Computers & Electrical Engineering, 2017.

Similarity Search

  • Gong et al., Iterative Quantization: A Procrustean Approach to Learning Binary Codes for Large-Scale Image Retrieval, TPAMI 2013.

  • Yan et al., Deep Multi-View Enhancement Hashing for Image Retrieval, TPAMI 2021.

  • Johnson et al., Billion-Scale Similarity Search with GPUs, Transactions on Big Data, 2021.

Clustering and Dimensionality Reduction

  • Caron et al., Deep Clustering for Unsupervised Learning of Visual Features, ECCV 2018.

  • Sinaga et al., Unsupervised K-Means Clustering Algorithm, IEEE Access, 2020.

  • Zhan et al., Online Deep Clustering for Unsupervised Representation Learning, CVPR 2020.

  • Reddy et al., Analysis of Dimensionality Reduction Techniques on Big Data, IEEE Access, 2020.

  • Migenda et al., Adaptive dimensionality reduction for neural network-based online principal component analysis, PLOS ONE, 2021.

Data Streams

  • Krempl et al., Open challenges for data stream mining research, KDD 2014.

Graph Analysis

  • Gao et al., Large-Scale Learnable Graph Convolutional Networks, KDD 2018.

  • Wu et al., Simplifying Graph Convolutional Networks, ICML 2019.

  • Yao et al., Graph Convolutional Networks for Text Classification, AAAI 2019.

  • Wang et al., Dynamic Graph CNN for Learning on Point Clouds, Transactions on Graphics, 2019.

  • Manessi et al., Dynamic Graph Convolutional Networks, Pattern Recognition, 2020.

Spatio-Temporal Data

  • Zhang et al., A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data, AAAI 2019.

Deep Big Data

  • Dean et al., Large scale distributed deep networks, NeurIPS 2012.

Disclaimer

The instructor reserves the right to make changes to the syllabus, including assignment due dates. These changes will be announced as early as possible.