Introduction to Big Data Module
The aim of this 3-day DISCnet course, given by Adam Hill (University of Southampton), is to explore how big data techniques can be used to solve massive-scale data analysis problems. The course will introduce students to both the theoretical background of cloud computing and its practical applications. A major focus will be the processing of large datasets using big data techniques, including map-reduce.
By the end of the course, students should be able to:
- Understand the theoretical approaches to big data analysis and the design of modern big data processing pipelines.
- Design a big data processing system.
- Successfully analyse large datasets using Python and Spark.
Practicals will require programming in Python, as well as use of the UNIX command line / bash shell (e.g., skills learnt during the DISC6001 Software Carpentry course, or equivalent). Students do not need significant experience in Python itself, but some serious programming experience is required, as the course exercises will require you to write big data analytics code. This course is not suitable for students who have no practical experience in writing code.
Students who are not confident in Python are expected to use the resources on Python to gain experience before the class. All students are expected to complete the pre-study exercise, which looks at lambda expressions in Python. This should be done at least 2 weeks before the course, to allow sufficient time for your cloud server accounts to be created.
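For readers unfamiliar with lambda expressions, the following is a minimal sketch of the style of code involved. It is not the pre-study exercise itself, and the word list and all variable and function names are hypothetical, chosen only for illustration. It uses plain Python (no Spark) to show lambdas passed to map, filter and reduce, the same pattern that map-reduce style processing builds on.

```python
from functools import reduce

# Hypothetical input data, for illustration only.
words = ["spark", "python", "spark", "data", "python", "spark"]

# Map step: a lambda turns each word into a (word, 1) pair.
pairs = list(map(lambda w: (w, 1), words))

# Filter step: a lambda keeps only words longer than four characters.
long_words = list(filter(lambda w: len(w) > 4, words))


def add_pair(counts, pair):
    """Fold one (word, n) pair into a dictionary of running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


# Reduce step: combine all pairs into per-word totals.
counts = reduce(add_pair, pairs, {})

# A lambda can also act as the reducing function directly.
total = reduce(lambda a, b: a + b, map(lambda p: p[1], pairs), 0)

print(counts)
print(total)
```

Here the functions passed to map and filter are written inline as lambdas, while the dictionary-building reducer is a named function because it needs more than a single expression.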
The course is mandatory for DISCnet core students and is also open to non-core DISCnet and GRADnet students. It is aimed at students in Year 1 or 2 of their PhD, but may also be suitable for students in later years, depending on their computing and programming experience.
You will need a laptop computer for the course. To run the virtual machine image, the laptop should have at least 2 cores, a 2 GHz CPU, 8 GB of RAM, and 30 GB of free disk space.