
27 October 2016

Data Science Meeting @ Google London Offices (Oct 25th '16) - Takeaways


1. Scala for Data Science - by Pascal Bugnion (ASI Data Science)
  • Python and R are good choices for starting a new data science project. However, when/if you need to scale a project up substantially, Scala is the way to go because of how it handles concurrency (it runs on the JVM, like Java).
  • The 'problem' with Scala is that it does not have powerful visualization libraries, in contrast with Python or R. The solution is to use a tool such as Plotly.
  • Plotly can be used to create and share visualizations online simply by sending your data in a JSON format, among other ways. It takes care of the rest for you. 
  • You can use Plotly for graphs, dashboards and many chart types, and it also works with Python, R, MATLAB and more.
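
To make the "send your data as JSON" point concrete, here is a minimal sketch in Python using only the standard library. It assumes the usual Plotly figure layout (a list of traces under "data" and styling under "layout"). The helper name build_figure and the sample numbers are made up for illustration and were not part of the talk.

```python
import json


def build_figure(x, y, title):
    """Return a Plotly-style figure as a plain dict.

    Traces go under "data" and styling under "layout".
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    trace = {
        "type": "scatter",
        "mode": "lines+markers",
        "x": list(x),
        "y": list(y),
    }
    layout = {"title": {"text": title}}
    return {"data": [trace], "layout": layout}


# Hypothetical sample data, purely for illustration.
figure = build_figure([1, 2, 3], [2, 4, 8], "Example series")

# Serialise to the JSON string that would be handed over to Plotly.
payload = json.dumps(figure)
print(payload)
```

The same payload could be produced from Scala or any other language that can write JSON, which is the appeal here. If the plotly Python package happens to be installed, plotly.io.from_json(payload) should turn the string back into a figure object.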



2. Application Architecture for Big Data - by Tom White (Head of Development at Method Digital, previously CTO of Skin Analytics)
  • Differences between Data Scientists and Developers when working together on the same project 
    • Data Scientists
      • Focus on meaningful results 
      • Exploration and experimentation 
      • Large datasets
      • Preprocessing, model generation 
      • Lots of scripting 
      • Limited scope for effective code-reuse 
      • (Sometimes) little knowledge of how Software Engineering works 
    • Developers
      • Focus on stable, secure, rapid iteration 
      • Agile Development 
      • User Stories
      • Git workflows
      • Continuous Integration 
      • Code Reviews
      • User Acceptance Testing 
      • DRY Coding 
      • (Sometimes) little knowledge of how Data Science works 
  • Antipattern when Data Scientists and Developers work together: Developers write 'all the code', i.e. they link directly to low-level Data Science components, which often change as experimentation continues 
  • Suggested approach 
    • Separate, co-owned app providing an API 
    • Only use the minimum data-science functions you need - 'freeze' them into the API
    • Version the API and make it purely additive 
    • Version any datasets too 
    • Keep a 'live' version on top for tinkering in test environments if need be 
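
One way the suggested approach might look in code is sketched below: a small registry that 'freezes' data-science functions into versioned, purely additive API releases and pins a dataset version to each release. Every name here (publish, call, dataset_for, the mean and spread functions, the dataset tags) is hypothetical and only illustrates the idea. It is not the speaker's implementation.

```python
# Each frozen API version maps a public name to a data-science function.
_VERSIONS = {}
# Each frozen API version is pinned to the dataset version it was built on.
_DATASETS = {}


def publish(version, functions, dataset, base=None):
    """Freeze a new API version on top of `base`, adding functions only."""
    if version in _VERSIONS:
        raise ValueError(f"version {version!r} is already frozen")
    inherited = dict(_VERSIONS[base]) if base else {}
    clashes = inherited.keys() & functions.keys()
    if clashes:
        # Purely additive: existing functions can never be replaced.
        raise ValueError(f"cannot redefine frozen functions: {sorted(clashes)}")
    _VERSIONS[version] = {**inherited, **functions}
    _DATASETS[version] = dataset


def call(version, name, *args, **kwargs):
    """Invoke a frozen function through a specific API version."""
    return _VERSIONS[version][name](*args, **kwargs)


def dataset_for(version):
    """Return the dataset version pinned to an API version."""
    return _DATASETS[version]


# Hypothetical data-science functions handed over by the data scientists.
def mean(values):
    return sum(values) / len(values)


def spread(values):
    return max(values) - min(values)


publish("v1", {"mean": mean}, dataset="dataset-2016-10")
publish("v2", {"spread": spread}, dataset="dataset-2016-11", base="v1")

print(call("v1", "mean", [1, 2, 3]))
print(call("v2", "spread", [1, 2, 3]))
print(dataset_for("v2"))
```

Developers then depend only on call and a version string, so the data scientists can keep changing the underlying components without breaking the app. The 'live' version for tinkering could be a separate, unfrozen registry used only in test environments.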


3. Google Big Data Lifecycle 

