Thursday, September 18, 2008

The Six Minute Dream Project

That is not to say you will be done and dusted with the project in Six Minutes Flat (unless you are in any way related to this fellow).

 

The objective of the Six Minute Dream is to get your vision, your idea for the prospective project out in class, to get feedback from your peers and professors. 

While you are busy contemplating (gazing pensively into the far distance etc., as mentioned before) what project to take up, keep these two points in mind at all times:

 

Point a) 

It should be useful, i.e., something non-trivial, valuable and, even better, counter-intuitive should emerge from your findings. (For instance, hours and days of crunching terabytes of financial data on a monster of a data mining package should yield something more substantial than the glorious result "Companies go bankrupt because of lack of money".)

 

Point  b) 

It should be feasible, i.e., within the bounds of possibility to complete in one term.

Good quality data must be available. (For instance, while "Finding Patterns in Classified Top Secret MOSSAD Internal Communications about Covert Operations" might seem an interesting project, it won't do well on point b. Unless, of course, you are this guy.)

 

 

Every team is expected to come up with a One Page Writeup about their dream project, upload it to this blog, and present and discuss it in class on 1st October. This presentation will be a 6-minute communication of your idea to the class and profs. Think of it as a Pre-Proposal for the Project.

No powerpoint slides for class presentation, just coherent ideas.

 

 

Guidelines for the One Page Writeup:

Mention:

1) Project Name, Team Number (you can go crazy and give your team a name!) + Team Members

2) The Problem Statement

3) Data Source (preferably with the type of data available)

4) The Benefit / Utility : Who will potentially benefit from the insights mined? How? 

5) Expected Outcomes: Gut feeling about what you expect might be the findings.

 

You are not expected to know for sure which tool or technique you will use (of course, if you have some idea, feel free to mention it!). What is important is that you have a fair idea of the problem you plan to solve. Know thy pain-point!

 

Submission Procedure:

Post this One-Page Write Up on the Blog: 

Log In and Create a New Post with the Title

 "TeamNo_ProjectName_6MinDream". 

 

Copy-Paste the contents of the OnePage Doc to the Post and Press Publish Post.

(In case of any difficulties with blog access, you can e-mail me the one page doc. Please name the file "TeamNo_ProjectName_6MinDream.doc" )

 

 

Timelines:

29 September 09:59:59 - One Page Writeup up on the Blog

1 October -   Presentation and Discussion of your idea in Class

 

 

Tuesday, September 16, 2008

Class Lectures and References

Material referenced during the lectures can all be downloaded from Slideshare:
http://www.slideshare.net/ambujm

Team Co-ordinators for Projects

This is the list of team co-ordinators for the BITT-1 projects.

Please get in touch with them, form your team (max 4 members per team, including the co-ordinator) and add your names to this list (edit this post after logging in to the blog).

(~ If you haven't yet, email me at myshkinonline (at) gmail (dot) com with your name and reg no to get an invite and editing rights).

Team No. | Team Co-ordinator | Team Members (max 3 more) | Project Name
1 | Arjun A.V (413/15) | |
2 | Bhuvan (419/15) | Akshat Patil (402/15), GuruPrasad Shenoy (423/15), Ankit Anand (409/15) | 20) Data Mining in Sports
3 | Ravi Gupta (445/15) | Ankit Singhal (411/15), Anand Justin Cherian (405/15), Sushim Gupta (459/15) | 21) Facebook: Interaction Patterns in Online Social Networks
4 | G.Raviteja (422/15) | D Sushanth Reddy (420/15), Ankit Gupta (410/15), Manmohan Agarwal (433/15) | Stock Markets - Proj No. 23
5 | Mahesh Chayel (432/15) | Preet Pillai (442/15), Ravi Dilip Kumar (444/15), Ashish Kumar Vijan (414/15) | TBA
6 | Hrishikesh Thite (461/15) | Tarun Gupta (460/15), Sreekanth Reddy (457/15) | TBA
7 | Prashant Pande (438/15) | Jose Vinay C (425/15), Poonacha K. M. (440/15), Rittu C. Joseph (446/15) | Correlation between Achievements of a nation with its Macroeconomic factors
8 | Karmendra Jain (428/15) | Ankit Agrawal (408/15), Piyush Mehta (439/15), Rahul Sethia (443/15) | Financial Services (12)
9 | Abhinav Mathur (FP/02/08) | |
10 | Abhinay Puvvala (FP/12/08) | Ayush Garg (417/15), Aman Goel (403/15), Harshit Duggal (424/15) | The Role of Macroeconomic Factors in Growth
11 | Saurav Saha (FP/06/08) | |
12 | Shobhit Bhatnagar (455/15) | Kinjal Sengupta (430/15), Prashant Agarwal (441/15), Nishank Gosain (437/15) | TBA
13 | Rohit Kwatra (448/15) | Anmol Singla (412/15), Amritanshu Kumar (404/15), Manan Mehta (434/15) | 11.) Entrepreneurship
14 | Saurabh Sunil (452/15) | Satwik Sharma (450/15), Devvrat Tripathi (421/15), Shweta Poddar (456/15) |


After your team has been formed, think of the project you would like to take on.

Feel free to add it in to "Project Name" as you go along.

Your team will then come and present this (the "dream project"), informally, in class (Prof will confirm a date in class - most likely Monday).


~Happy Digging!


Note: I've learnt there have been some problems editing this post. The problem is I need to grant authorship rights (to create posts) as well as admin rights (to edit other people's posts).

In case you have been sent an invite and still can't edit this post, then accept the invite, log in and hang on - I'll be granting admin rights shortly.


~InefficientWikisAndozz




Tuesday, September 9, 2008

Getting Started

There are some truly wonderful resources available on the www. "Why do people share knowledge? Why do experts in the field spend their time and effort making great tutorials and primers, to be given away for free?" is a real question. However, while wiser people are busy solving the mysteries of human motivation, you can get smart on data mining at places like these:

http://dml.cs.byu.edu/wiki/index.php/Data_Mining_Resources
http://it.toolbox.com/blogs/opensource-analytics/database-vs-data-warehouse-8286
http://www.statsoft.com/textbook/stdatmin.html
http://www.kaushik.net/avinash/2007/09/data-mining-and-predictive-analytics-on-web-data-works-nyet.html

Here's a comprehensive list of DM blogs ("Comprehensive listings" and aggregators are in the race to be the "most meta" among all metas, aggregating, and aggregating aggregators and so on and so forth... BUT, never mind all that :)...):

http://dataminingresearch.blogspot.com/2008/05/data-mining-blogs-big-list.html

Data Mining is an Adventure. There is a tremendous pleasure in having discovered something that is not apparent, or even better, in having corrected and established as bunk something that seemed initially apparent and so-called "Common-Sense".
For some inspiration, I would recommend you read (and most of you probably have already!) S.Levitt's "Freakonomics"
http://en.wikipedia.org/wiki/Steven_Levitt
http://pricetheory.uchicago.edu/levitt/home.html

Monday, September 8, 2008

Suggested Projects

  • The list below is only indicative.
  • Students are free to choose any other project, modify the topics, or use data sources other than those given below.
  • In fact, students are recommended to do their own digging before finalizing on a project. Make sure your team is convinced that something counter-intuitive (!!), non-trivial and useful can be unearthed.


1 ) A Joke of a Project

What can you find by Collaborative Filtering of User Ratings for Jokes? Available: 4.1 Million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users: collected between April 1999 - May 2003.

Dataset: http://goldberg.berkeley.edu/jester-data/
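To get a feel for what Collaborative Filtering means here, a rough sketch may help. It assumes a tiny, invented ratings table (not the real Jester files, whose format is not assumed) and uses one common variant, user-based filtering with cosine similarity. The names `ratings`, `cosine_similarity` and `predict_rating` are made up for this illustration.

```python
from math import sqrt

# Toy ratings on the Jester scale (-10.00 to +10.00); None means "not rated".
# These numbers are invented for illustration, not taken from the dataset.
ratings = {
    "user_a": [8.5, -3.0, 6.0, None],
    "user_b": [7.0, -4.5, 5.5, 9.0],
    "user_c": [-6.0, 8.0, -7.5, -2.0],
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the jokes both users have rated."""
    shared = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not shared:
        return 0.0
    dot = sum(a * b for a, b in shared)
    norm_u = sqrt(sum(a * a for a, _ in shared))
    norm_v = sqrt(sum(b * b for _, b in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_rating(target, joke, ratings):
    """Similarity-weighted average of the other users' ratings for one joke."""
    weighted_sum = 0.0
    weight_total = 0.0
    for user, row in ratings.items():
        if user == target or row[joke] is None:
            continue
        sim = cosine_similarity(ratings[target], row)
        weighted_sum += sim * row[joke]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


# Guess how user_a would rate the joke at index 3, which they have not rated.
prediction = predict_rating("user_a", 3, ratings)
print(prediction)
```

Because user_a agrees with user_b and disagrees with user_c, the guess leans towards user_b's rating. On the real dataset you would want to check such guesses against held-out ratings before believing them.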



2 ) Open Source Software Development

Study of the Growth Dynamics of OSS

Study of the Social Networks behind OSS Development

Clustering of OSS projects - What are the salient types of differences, especially with respect to Proprietary Software Development? Have a look at related work here: http://www.nd.edu/~oss/Papers/papers.html

Dataset available on demand: http://www.nd.edu/~oss/Data/data.html


3 ) Does Governance Matter?

Governance consists of the traditions and institutions by which authority in a country is exercised. This includes the process by which governments are selected, monitored and replaced; the capacity of the government to effectively formulate and implement sound policies; and the respect of citizens and the state for the institutions that govern economic and social interactions among them. Can you uncover insights on the aspects of governance essential for economic growth, for human development, etc?

Dataset: http://info.worldbank.org/governance/wgi/index.asp


4 ) Telecom Regulation

Can you uncover some fundamental insights relating regulatory governance of telcos to performance related parameters of the industry?

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699152~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


5 ) Patterns in Wikipedia:

Identifying Patterns in Editing and Article Creation. Is there a method to the madness of article creation and edits on Wikipedia?

Dataset: http://en.wikipedia.org/wiki/Wikipedia:Database_download


6) The Internet CD Database

Patterns in CD Releases (artists, releases, tracks etc.)

Patterns in User Generated Content Creation

Datasets: http://www.freedb.org/en/download__database.10.html

http://musicbrainz.org/doc/Database


7 ) IIMC Course Selection:

Is there a Pattern in the Selection of Courses across batches? What inferences can be drawn about type and behaviour of students?

Dataset: Extract from PGP Office


8 ) Course Selection, CGPA and Career:

How does Course Selection affect CGPA ? Does CGPA have a bearing on placement, career, success?

Dataset: Extract from PGP Office, Alumni Cell


9 ) Library Data Mining

Patterns in book issuance. Build a predictive model for future issuance.

Dataset: Contact IIMC Library


10) IPmsger

Simple Frequency Analysis. Co-Occurrence of certain Words.

Text Mining of Log: Is there method to the IPmsg madness? Find Patterns and Insights on Campus Chat.

Dataset: IPmsger log file of past years.

You can use log analyzers. See http://www.hypernews.org/HyperNews/get/www/log-analyzers.html
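As a starting point, frequency analysis and word co-occurrence can be sketched in a few lines. This assumes the log has already been reduced to one message per string; the sample messages below are invented, and nothing is assumed about the real IPmsger log format.

```python
from collections import Counter
from itertools import combinations
import re

# Invented chat lines standing in for a real log file.
messages = [
    "anyone up for chai at the canteen",
    "chai at midnight again",
    "quiz tomorrow so no chai tonight",
]


def tokenize(text):
    """Lower-case a message and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


word_counts = Counter()
pair_counts = Counter()
for message in messages:
    words = tokenize(message)
    word_counts.update(words)
    # Count each unordered pair of distinct words once per message.
    pair_counts.update(combinations(sorted(set(words)), 2))

print(word_counts.most_common(3))
print(pair_counts.most_common(3))
```

On a real log you would probably want to drop very common words ("the", "at", ...) before counting pairs, otherwise they will dominate the co-occurrence table.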


11) Entrepreneurship

What combination of factors leads to an entrepreneurial culture?

2007 World Bank Group Entrepreneurship Survey measures entrepreneurial activity in 84 developing and industrial countries over the period 2003-2005.

Dataset: http://www.ifc.org/ifcext/sme.nsf/Content/Entrepreneurship+Database


12) Financial Services

Finance for All? Find insights on Policies and Pitfalls in Expanding Access for the benefit of the many.

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:21546633~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


13 ) Financial Structure

Construction of financial structure indicators to measure whether a country's banks are larger, more active, and more efficient than its stock markets. These indicators can then be used to investigate the empirical link between the legal, regulatory, and policy environment and indicators of financial structure. They can also be used to analyze the implications of financial structure for economic growth.

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20696167~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


14) Bank Regulation and Supervision

Analysis of the impact of bank regulation on various dimensions of bank performance. Study the factors that determine the decisions countries make on the orientation of the regulatory environment, and draw policy conclusions.

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20345037~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


15) Economic Growth and Environmental Quality

Analysis of Linkages between growth and environment quality

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699819~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


16) Commonalities in Controversial Pages

Can you find patterns in Most Frequently Edited Pages on Wikipedia over time?

What is the relationship between Page Views and Page Edits on Wikipedia?

Dataset: http://en.wikipedia.org/wiki/Wikipedia:Most_frequently_edited_pages

17) Small States, Small Problems?

Patterns in different Nations' Problems and their relation to Size.

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699094~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


18) Data Mining of Electricity Regulation Dataset

How do the different variables in electricity regulatory governance impact performance?

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699165~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


19 ) Fiscal Policy and Economic Growth

Investigating Interrelationships between the two.

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699114~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


20) Data Mining in Sports

Does past performance predict future performance? Using statistical data about the past, is it possible to build a model of predictive value?

Dataset: Play Football Manager 2008, use the SAV file created mid-season and predict the rest of the season!


21) Facebook

Interaction Patterns in Online Social Networks

Dataset: Dummy data is available at: http://developers.facebook.com/fbopen/


22) Retail

Patterns in Buyer Behaviour.

Dataset: Get Data from Pantaloon, Big Bazaar, Monginis, etc. They would probably be willing, if sensitive information is blacked out.


23) Stock Markets

Are Closing Values telling us something valuable? With Cluster Analysis find out stocks that move together. Warning: Successful Completion of this Project could cause you to become a gazillionaire and risk dropping out of the course.

Dataset: Dig around for SENSEX/NIFTY backtesting data.
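One simple way to look for stocks that move together is to correlate their daily returns and link any pair whose correlation crosses a threshold. This is a crude stand-in for a full Cluster Analysis, not a replacement for it. The prices, the ticker names and the 0.8 threshold below are all invented for illustration.

```python
from math import sqrt

# Invented closing prices; real SENSEX/NIFTY data would replace these.
closes = {
    "STOCK_A": [100.0, 102.0, 101.0, 105.0, 107.0],
    "STOCK_B": [50.0, 51.2, 50.6, 52.5, 53.4],
    "STOCK_C": [200.0, 196.0, 199.0, 193.0, 190.0],
}


def daily_returns(prices):
    """Fractional change from each close to the next."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(x, y):
    """Pearson correlation of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def group_by_correlation(closes, threshold=0.8):
    """Link stocks whose return correlation is at least `threshold`."""
    returns = {name: daily_returns(p) for name, p in closes.items()}
    groups = []
    for name in returns:
        # Find existing groups containing at least one strongly correlated stock.
        linked = [
            g for g in groups
            if any(pearson(returns[name], returns[m]) >= threshold for m in g)
        ]
        merged = {name}
        for g in linked:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


groups = group_by_correlation(closes)
print(groups)
```

Working on returns rather than raw closing values matters: two stocks that both drift upwards over a year will look correlated on price even if their day-to-day moves have nothing to do with each other.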


24) Vandal Detection

Wikipedia accepts edits even from anonymous editors. Can you devise a model to identify the Vandal Edits automatically? (Have a look at http://www.research.ibm.com/visual/projects/history_flow/ for ideas)

Dataset: http://en.wikipedia.org/wiki/Wikipedia:Database_download


25) Website Optimization

Experiment with methods for predicting the next Web page a user will access

Dataset: Log Data from any website administrator. One possible source could be ISG.
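One of the simplest methods to experiment with is a first-order Markov model: count which page tends to follow which, and predict the most frequent follower. The sketch below assumes the log has already been split into per-user sessions; the page paths are invented, and `build_transitions` and `predict_next` are names made up for this example.

```python
from collections import Counter, defaultdict

# Invented sessions: each list is the ordered pages one visitor viewed.
sessions = [
    ["/home", "/courses", "/courses/dm", "/library"],
    ["/home", "/courses", "/courses/dm"],
    ["/home", "/library"],
    ["/courses", "/courses/dm", "/projects"],
]


def build_transitions(sessions):
    """Count how often each page is followed directly by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, page):
    """Most frequent follower of `page`, or None if it was never followed."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


transitions = build_transitions(sessions)
print(predict_next(transitions, "/home"))
```

A fair experiment would build the counts on one part of the log and measure prediction accuracy on the rest. Longer histories (the last two or three pages instead of one) are an obvious next thing to try.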


26) News Mining

Mining Live News Data Streams for Patterns

Dataset: Any of the numerous news feeds.


27) Seeker behaviour on the Internet

Is there a pattern to the topics sought out by seekers of information on the Internet?

Dataset: Search Volumes Data from compete.com, alexa, etc. Detailed data from WikiStats about page views on Wikipedia: http://dammit.lt/wikistats/


28) The Role of Macroeconomic Factors in Growth

Is growth really negatively associated with inflation, large budget deficits, and distorted foreign exchange markets? Investigations and Insights required!

Dataset: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20699104~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html


29) Usage of UNIX

What are the trends in the way people use UNIX?

Dataset: http://pages.cpsc.ucalgary.ca/~saul/wiki/pmwiki.php/Resources/DataSets


30) Usage of Web Browsers

How do users use web browsers? Are there identifiable patterns across clusters of users?

Dataset: http://pages.cpsc.ucalgary.ca/~saul/wiki/pmwiki.php/Resources/DataSets


31) Health Risk Analysis of Adolescents in India  

 ---- Live Project  ------

What are the major health risks facing adolescents? What aspects of their psychographic, demographic, socio-cultural-economic factors can they be traced back to? Are there any trends visible? 

Dataset: Survey in progress at a Hospital in Mumbai, Data available on demand.


Feel free to dig around for more. Interesting data is lying around in the unlikeliest places :)



You can start off by having a look here: http://kdd.ics.uci.edu/

A very nice reference is to be found here:

http://delicious.com/pskomoroch/dataset

http://www.datawrangling.com/some-datasets-available-on-the-web


Explore the datasets available.

Think. Ponder. Mull. 

Gaze intensely into the distance, hand on chin. 

Make sure you have a solid Rationale for what you plan to do, in your Project Proposal Document.