Sharing Materials

Presentations

PySpark Tutorial
2025 Apr [View]
This deck is based on Chambers, Bill, and Matei Zaharia. "Spark: The definitive guide: Big data processing made simple." (2018), one of the most comprehensive resources on Apache Spark. I’ve distilled its key lessons into essential takeaways: What is PySpark? What are its core components (RDDs, DataFrames, Spark SQL, and Pandas UDFs)? How does a Spark application run, from its lifecycle to its execution breakdown? And finally, how can you monitor job status, and what are the different ways to run Spark jobs?

Design Patterns
2024 Dec [View]
This deck is inspired by Gamma, Erich, et al. "Design patterns: elements of reusable object-oriented software." (1995). It highlights the art of programming beyond just getting code to work, showcasing how OOP can elegantly handle object creation and relationships for better maintainability. The book introduces 23 foundational patterns (creational, structural, and behavioral) to enhance design thinking. Beyond summarizing these patterns with practical examples, I explored six techniques to refactor code, such as turning methods into classes and using abstraction for interaction. These insights serve as a blueprint for crafting smarter, more sustainable codebases.
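
As a minimal sketch of one behavioral pattern from the book, Strategy, the Python snippet below swaps a pricing rule at runtime without changing the class that uses it. The names `Order`, `no_discount`, and `ten_percent_off` are hypothetical and chosen only for illustration; the deck's own examples may differ.

```python
from dataclasses import dataclass
from typing import Callable

# A strategy here is any callable that maps a subtotal to a final price.
PricingStrategy = Callable[[float], float]


def no_discount(subtotal: float) -> float:
    """Hypothetical strategy: charge the full subtotal."""
    return subtotal


def ten_percent_off(subtotal: float) -> float:
    """Hypothetical strategy: apply a flat 10% discount."""
    return subtotal * 0.9


@dataclass
class Order:
    subtotal: float
    # The behavior is injected, so new rules need no change to Order.
    pricing: PricingStrategy = no_discount

    def total(self) -> float:
        return round(self.pricing(self.subtotal), 2)


regular = Order(100.0)
discounted = Order(100.0, pricing=ten_percent_off)
print(regular.total(), discounted.total())
```

Passing the rule in as a parameter keeps `Order` closed to modification while new pricing rules can be added as separate functions.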

Git: The Basics
2024 Dec [View]
This slide deck is based on the book Learn Git in a Month of Lunches (Rick Umali). Git is an essential tool for any developer, providing a robust system for version control and collaboration. While the initial setup may present a learning curve for beginners, Git becomes intuitive and efficient with regular use. Fundamental commands like commit, checkout, and merge are staples of daily workflows. Beyond these basics, advanced techniques enhance collaboration and streamline team projects. The deck also highlights two popular Git workflows, offering a foundation for development teams to establish effective collaboration practices.

The Data Scientist's "Software Stack" and Three Tips in Model Development
2023 Feb [View]
In this talk, I first shared my go-to software setup for working and collaborating well within a team, which makes it easier for MLEs, DEs, and PMs to deliver a satisfactory product. Second, I emphasized the importance of assessing progress at every stage of the development cycle and identifying a clear objective before proceeding to the next stage, with a highlight on "Problem statement" and "Iterated model enhancement with gap-based logical reasoning".

Kedro: Hands-on Walkthrough
2021 Nov [View]
This presentation provides an introduction to Kedro and its three primary commands:
- kedro new: Create a new project codebase.
- kedro run: Execute your data pipeline.
- kedro viz: Visualize your pipeline and compare experiments.
We'll also demonstrate the following features: optimize performance by stacking and running pipelines in parallel; enable efficient tracking of your experiments; use layers and namespaces to simplify and organize your workflow.

Structuring an ML Project: The Decision-making mindset in Model Building
2020 Dec [View]
Inspired by Andrew Ng's course Structuring Machine Learning Projects, I've compiled a practical mindset for building machine learning models. While some points may be outdated in the context of recent advancements in large language models (LLMs), the majority of the concepts remain relevant. This deck focuses on setting the right expectations for model performance, the fine-tuning cycle, and the best practices for closing the gap. Key concepts include choosing a single-number evaluation metric and setting constraints, using human-level performance as a reference point, orthogonalization, error analysis, and building a quick, iterative system. By following these practices, you can effectively build and improve your machine learning models.
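
To make "a single-number evaluation metric with constraints" concrete, here is a small Python sketch, assuming accuracy is the metric to optimize and latency is the constraint to satisfy. The `Candidate` type, the `pick_model` helper, and all numbers are made up for illustration and do not come from the deck.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    accuracy: float    # optimizing metric: higher is better
    latency_ms: float  # constraint metric: only needs to stay under a limit


def pick_model(
    candidates: Sequence[Candidate], max_latency_ms: float
) -> Optional[Candidate]:
    """Return the most accurate candidate that meets the latency constraint."""
    feasible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    # default=None covers the case where no candidate meets the constraint.
    return max(feasible, key=lambda c: c.accuracy, default=None)


# Illustrative values only.
candidates = [
    Candidate("A", accuracy=0.90, latency_ms=80.0),
    Candidate("B", accuracy=0.92, latency_ms=95.0),
    Candidate("C", accuracy=0.95, latency_ms=1500.0),
]
best = pick_model(candidates, max_latency_ms=100.0)
print(best)
```

Model C has the highest accuracy but breaks the latency limit, so the rule selects B. Reducing the decision to one number plus hard constraints keeps model comparisons unambiguous.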

Mind maps

Disclaimer: Each book's content is restructured into a mind map to capture its general ideas. Notes are mostly direct quotes from the books, kept as reference content, unless stated otherwise. The mind maps are for personal use with no commercial intent.

[Book Summary] Data Visualization with Excel Dashboards and Reports, Dick Kusleika
2024 Sep [View]
The book focuses on three core aspects: the dashboard development lifecycle, Excel's data querying, transformation, and visualization capabilities, and the use of purpose-driven visualization elements. It emphasizes the importance of clarifying dashboard requirements with stakeholders, iterative feedback, and understanding data details from the outset. The design section highlights breaking data into layers and using Excel's tools like PivotTables and charts, illustrated through case studies. Finally, the book categorizes charts into three types—Comparison, Composition, and Relationship—and discusses non-chart elements like formatting and sparklines for effective data visualization.

[Book Summary] Practical Time Series Analysis: Prediction with Statistics and Machine Learning, Aileen Nielsen
2023 Jun [View]
The book covers the practical aspects of an end-to-end time series development process well. It discusses best practices for EDA, feature engineering, modelling, and data storage, with valuable tips on the lookahead issue, plotting techniques, temporal characteristics in analysis, and more. The prose and mathematical explanations are not its strong point, but the content is best appreciated once you work on a time series forecasting use case.
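
The lookahead issue mentioned above can be illustrated with a short Python sketch: a rolling feature for time step `t` should use only observations strictly before `t`. The `trailing_mean` helper is a hypothetical name written for this note, not code from the book.

```python
from typing import List, Optional, Sequence


def trailing_mean(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Mean of the `window` observations strictly before each time step.

    The value at step t is never included in its own feature, so the
    feature cannot leak information from the present or the future.
    """
    features: List[Optional[float]] = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]
        # Not enough history yet: leave the feature undefined.
        features.append(sum(past) / window if len(past) == window else None)
    return features


series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(trailing_mean(series, window=2))
# [None, None, 1.5, 2.5, 3.5]
```

A centered or forward-looking window would look smoother in plots but would use values that are unavailable at prediction time.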

Microsoft PowerPoint 2019 & Mind Map icon by Icons8

ho xuan vinh
