Featured Sessions at Data Day Texas
NLP @HomeAway: how to mine reviews and track competition
Brent Schneeman (HomeAway)
“This talk tells two stories – how HomeAway uses NLP techniques to both mine reviews and keep track of the competition.
One person’s 4-star review could easily be another's 3-star. This talk shows an attempt to cluster reviews based on text similarity to find common patterns in reviews which can then provide guidance to review writers. Additionally, reviews contain a wealth of information for vacation rental suppliers – HomeAway is trying to surface the signal in the reviews so that suppliers can focus on what matters.
The vacation rental industry has some very capable competitors. The second half of the talk shows how we’re finding listings that appear on multiple sites, discusses some of the operational concerns (performance, accuracy), and explains how we hope to use techniques such as Topic Modeling to address those concerns.”
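As a rough illustration of comparing reviews by text similarity, the sketch below scores pairs of reviews with cosine similarity over raw word counts. It is not HomeAway's method: the tokenizer, the sample reviews, and the function names are all assumptions made for this example, and a production system would likely use weighting such as TF-IDF and a real clustering algorithm.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a review and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_pair(reviews):
    """Return (score, i, j) for the two most similar reviews."""
    vectors = [Counter(tokenize(r)) for r in reviews]
    best = None
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


# Invented sample reviews, for illustration only.
REVIEWS = [
    "Great beach house, very clean and close to the water",
    "Clean house close to the beach, great location",
    "The host never answered the phone and the heater was broken",
]

if __name__ == "__main__":
    print(most_similar_pair(REVIEWS))
```

The same pairwise score could, in principle, flag listings whose descriptions appear on more than one site, though the all-pairs loop is quadratic and would not scale to a full catalog as written.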
Turkish Humanism, or Crowdsourcing like a boss: how to recruit, train, promote and employ a virtual workforce.
Russell Jurney (Relato)
Effective crowdsourcing is an interdisciplinary pursuit, involving people management, data processing, analytics, and application development. We'll go over the ins and outs of building hybrid human/machine systems utilizing programs and virtual workers in combination to achieve what is impossible for humans or machines alone. The talk will cover everything from building workflows and hacking dataflows to data entry application development.
NEW: Intro to Graph Data Visualization
Corey Lanum (Cambridge Intelligence)
As new use cases for connected data analysis develop, more and more applications built on graph data find their way into the workplace. As a result, enterprise architects, analysts and data scientists need to communicate their graphs to a non-technical audience of business users.
This is where visualization plays a vital role.
Graph visualization tools give users rapid insight into complex connected data without technical knowledge. They can make decisions, perform graph analysis and query graph databases without the need to learn obscure query languages.
Increasingly too, these advanced tools not only help answer the ‘who / why / how’ questions, but also the where and when.
This tutorial will give a developer an overview of the different approaches and techniques for visualizing graphs, as well as a summary of the different tools available.
NEW: Why human-in-the-loop machine learning is the future of data science
Lukas Biewald - CrowdFlower
Last year Fortune 500 CEOs were asking their teams "what's our big data strategy?" 2016 is going to be the year when they ask "what's our machine learning strategy?"
Lukas Biewald, founder and CEO of CrowdFlower, will offer his view of the data science ecosystem and what he thinks the answer to that question actually is. Drawing on myriad real-world examples, Biewald will show the limits of iterating on good models, how important high quality training data really is, and how human-in-the-loop machine learning makes good algorithms great.
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
Chris Moody - Stitch Fix
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec and LDA and introducing our hybrid algorithm, lda2vec.
Beyond Shuffling - Tips & Tricks for Scaling Apache Spark Programs
Holden Karau - IBM
This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with a very brief introduction to Datasets (new in Spark 1.6) and how they can be used to make our Spark jobs faster.
Sentiment Analysis is a Market for Lemons ... Here's How to Fix it
Robert Munro - Idibon
There are easily 100+ companies offering some form of sentiment analysis. Despite this, accuracy for sentiment analysis has been hovering around 70% for real-world applications for almost a decade. This talk will explain why competition is hurting the industry rather than fostering innovation, and propose new economic and technical models to address this.
When consumers are unable to distinguish good from bad products, you have a "market for lemons" where the uncertainty drives down the perceived value. The most successful way to increase the accuracy for sentiment analysis is to label more training data. However, the low price-point means that the 100 competing companies do not have the margins needed to create comprehensive training data, meaning that any one company's results are domain-specific. When consumers try out-of-domain analysis, they lose confidence from the poor results.
The solution is to consider different economic models for useful sentiment analysis: ones that allow data-sharing for more accurate training data, while protecting sensitive data from public release. I will propose an architecture for sharing training data that can allow multiple organizations to share in the benefits of better sentiment analysis, and talk about how this applies to the market for NLP technology more broadly.
Spark: The Good, the Bad, and the Ugly
Sarah Guido - Bitly
Apache Spark has been hailed as a trail-blazing new tool for doing distributed data science. However, since it's so new, it can be difficult to set up and hard to use. In this talk, I'll discuss the journey I've had using Spark for data science at Bitly over the past year. I'll talk about the benefits of using Spark, the challenges I've had to overcome, the caveats for using a cutting-edge technology such as this, and my hopes for the Spark project as a whole.
Force multiplier for data science: Introduction to H2O
Hank Roark - H2O
H2O is an open source platform for doing scalable machine learning and data science. It is designed for speed and usability: no more waiting hours or days to build, evaluate, and deploy machine learning models. This session will introduce the H2O ecosystem, overview the H2O distributed, parallel architecture, and demonstrate the abilities of H2O from data ingest to production scoring.
Fast, Distributed Machine Learning for Python using H2O
Hank Roark - H2O
H2O is open source software for doing machine learning in memory. All algorithms supported by H2O are parallel and can run distributed on a compute cluster. With H2O's recent launch of its Python API, H2O brings another set of tools to the PyData and Python machine learning community. Specifically, those tools are data frame and machine learning capabilities completely in memory across as many CPUs as needed to achieve performance. This talk will introduce H2O and H2O Python to the attendees, provide a demonstration of H2O through Python, and help practitioners understand where H2O might fit in their current Python pipelines.
Algorithm Marketplaces and the new "algorithm economy"
Diego Oppenheimer - Algorithmia
Peter Sondergaard, VP of Research for Gartner, recently said the next digital gold rush is "How we do something with data not just what you do with it". During this talk we will cover a brief history of the different algorithmic advances in computer vision, natural language processing, machine learning and general AI and how they are being applied to Big Data today. From there we will talk about how algorithms are playing a crucial part in the next Big Data revolution, the new opportunities that are opening up for startups and large companies alike, and give a first look into the role Algorithm Marketplaces will play in this space.
Natural Language Processing With Graph Databases
William Lyon - Neo4j
Graph databases are a type of NoSQL database that use a graph (think nodes and relationships) data model. This talk will show how a graph database can be used in a variety of natural language processing techniques. A brief overview of graph databases and the property graph data model will be presented, followed by a survey of the role for graph databases in natural language processing tasks, including: modeling text as a graph, mining word associations from a text corpus using a graph data model, mining opinions from a corpus of product reviews, and a demonstration of how graphs can enable content recommendation based on keyword extraction.
Learning (and Teaching) Data Science from First Principles
Joel Grus - Google
Everyone wants to either be a data scientist or hire a data scientist. Yet most of us spend very little time thinking about the best way to teach (or learn) data science. Should one start with math and stats? Dive into machine learning? Learn all the tools? I've tried them all and more. I'll give examples of what's worked and what hasn't, bludgeon you with my opinionated pedagogy, and share some broader thoughts about tech education.
Data Decisions and Society
Ellen Friedman - MapR
With an increasingly data-driven world, organizations and individuals are looking for better ways to gain actionable insights from data. In many cases that means working with real time analytics on streams of data; in others it’s a matter of putting together the powerful combination of in-the-moment data with historical data to support anomaly detection, recommendation modeling or preventive maintenance.
In all these cases there is the important interplay between technologies for gathering and analyzing data and the human element to decide what actions to take based on data. Data driven decisions not only influence individual business success, they also can have an enormous impact even at the level of changing society.
With a rich collection of stories from early medical choices to 19th century navigation to modern big data projects, this talk explores how data-driven decisions have worked out, including practical advice about avoiding pitfalls as a data scientist and about the decisions that underlie successful projects.
Modern Streaming Analytics, Flow versus State
Ted Dunning - MapR
The leading edge of big data architectural practice is rapidly moving to flow-based computing using streaming architectures as opposed to state-based computing based on batch programs plus workflow schedulers. This transition is part of the larger movement towards micro-services and DevOps-oriented development and maintenance of large systems.
This fashion is spreading quickly, but the understanding of why flow-based computing is different from state-based computing and what this means in practice is lagging behind. In fact, there is a huge difference and this has the potential of massively simplifying big data systems, thereby improving reliability and time to market.
I will describe the necessary key concepts and illustrate them with practical examples. I will also describe why this matters in the real world.
A Journey from Relational to Graph Database
Nakul Jeirath - WellAware
The decision to switch database technologies is a huge undertaking with plenty of operational and technical challenges. At WellAware, an oil & gas data startup, we made the decision to move from Postgres to Titan about two years ago. We learned a lot during the switch and in the subsequent two years of operating Titan+Cassandra in a production environment. In this talk, I’ll give an overview of the major factors that went into the decision to switch, challenges we’ve faced, and the lessons learned along the way to assist anyone looking to make the plunge into the world of graph databases.
MySQL 5.7 in a Nutshell
Peter Zaitsev - Percona
MySQL 5.7 is a great release which has a lot to offer, especially in the development and replication areas. It provides a lot of new optimizer features for developers to take advantage of, much more powerful GIS functions, and a high-performance JSON data type, allowing for a more powerful store for semi-structured data. It also features a dramatically improved Performance Schema and parallel and multi-source replication, allowing you to scale much further than ever before, just to give you a taste. In this talk, we will provide an overview of the most important MySQL 5.7 features.
Data Lake Architectures and the fast path with Cask Hydrator
Jonathan Gray - Cask
Data lakes represent a powerful new data architecture, providing enterprises with the scale and flexibility required for big data: unbounded storage for unbounded questions. Hadoop is the de facto standard for implementing data lakes, but significant expertise, time, and effort are still required for organizations to deliver one. Today, enterprises building their own data lakes on Hadoop are effectively implementing their own internal platforms from a collection of individual open source technologies.
The many projects provided by open source and commercial Hadoop distributions must be integrated with each other, integrated with the existing environment, and operationalized into new and existing processes. With no established best practices or standards, each organization is left to find its own way and rely on expensive, external experts. Data lake proofs of concept can take months.
This talk introduces Cask Hydrator, an open source data lake framework included in the Cask Data App Platform (CDAP). Hydrator is a self-service data ingestion and ETL framework with a drag-and-drop user interface and JSON-based pipeline configurations. Enforcing best practices and providing out-of-the-box functionality, Hydrator enables enterprises to build data lakes in a matter of days. Integrations are included with open source and traditional data sources, from Kafka and Flume to Oracle and Teradata. Completely open source, Cask Hydrator is highly extensible and can be easily integrated with new data sources and sinks, and extended with custom transformations and validations.
Attendees will learn about data lakes, the different approaches and architectures enterprises are utilizing, the benefits and challenges associated with them, and how Cask Hydrator can enable the rapid creation of data lakes and dramatically decrease the complexity in operationalizing them.
Apache Kafka and the Stream Data Platform
Jay Kreps - Confluent
What happens if you take everything that is happening in your company (every click, every database change, every application log) and make it all available as a real-time stream of well-structured data?
I will discuss the experience at LinkedIn and other companies moving from batch-oriented data integration and data processing to real-time streams and real-time processing using Apache Kafka. I’ll talk about how the design and implementation of Kafka was driven by this goal of acting as a real-time platform for streams of event data. I will cover some of the challenges of scaling Kafka to hundreds of billions of events per day at LinkedIn, supporting thousands of engineers, applications, and data systems in a self-service fashion.
I will describe how real-time streams can become the source of ETL into Hadoop or a relational data warehouse, and how real-time data can supplement the role of batch-oriented analytics in Hadoop or a traditional data warehouse.
I will also describe the role of emerging stream processing layers such as Storm and Spark that work with Kafka to allow applications to make use of these streams for sophisticated real-time data processing applications.
Data Warehousing - 2016
Kent Graziano - Snowflake
The world of data warehousing has changed! With the advent of Big Data, Streaming Data, IoT, and The Cloud, what is a modern data management professional to do? It may seem to be a very different world with different concepts, terms, and techniques. Or is it? Lots of people still talk about having a data warehouse or several data marts across their organization. But what does that really mean today in 2016? How about the Corporate Information Factory (CIF), the Data Vault, an Operational Data Store (ODS), or just star schemas? Where do they fit now (or do they)? And now we have the Extended Data Warehouse (XDW) as well. How do all these things help us bring value and data-based decisions to our organizations? Where do Big Data and the Cloud fit? Is there a coherent architecture we can define? This talk will endeavor to cut through the hype and the buzzword bingo to help you figure out what part of this is helpful. I will discuss what I have seen in the real world (working and not working!) and a bit of where I think we are going and need to go in 2016 and beyond.
High cardinality time series search: a new level of scale
Eric Sammer - Rocana
Modern search systems provide incredible feature sets, developer-friendly APIs, and low latency indexing and query response. By some measures, these systems operate "at scale," but rarely is that quantified. Customers of Rocana typically look to push ingest rates in excess of 1 million events per second, retaining years of data online for query, with the expectation of sub-second response times for any reasonably sized subset of data. We quickly found that the tradeoffs made by general purpose search systems, while right for common use cases, were less appropriate for these high cardinality, large scale use cases.
This session details the architecture, tradeoffs, and interesting implementation decisions made in building a new time series optimized distributed search system using Apache Lucene, Kafka, and HDFS. Data ingestion and durability, index and metadata organization, storage, query scheduling and optimization, and failure modes will be covered. Finally, a summary of the results achieved will be shown.
Storing Time Series Data with Apache Cassandra
Patrick McFadin - DataStax
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice, and now you can learn how to do it. We'll look at possible data models and the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
Eulogy to the Click: How predictive modeling and big data is killing our favorite metrics!
Claudia Perlich - Dstillery
Data-driven decision making is not an invention of the recent past. We have a vast collection of metrics, KPIs, and other statistics that help decision makers to be more data driven. While the recent technological advances have increased our ability to see the world through the objective lenses of data, one unintended side effect of exactly this technology is the diminishing value of many metrics we cherish.
Consider the clickthrough rate (CTR) in advertising. With the rise of Doubleclick, CTR has been a core metric of an advertiser’s ability to identify a relevant audience: more clicks equals more interest equals more qualified potential consumers. While this might have been true in the past, the sad reality is that measures like CTR have lost their meaning.
The issue is not as much adversarial attempts of gaming, but the fact that even highly correlated proxies like CTR lose their power once predictive modeling and optimization are drawing from more granular information. Given much more detailed information, modern optimization approaches can find signals in the noise, and identify users with poor vision or bad fine motor skills who accidentally (but not randomly) click on ads, rather than interested potential consumers. This fundamental issue extends beyond clicks and calls for a close examination of many of our favorite KPIs.
This talk takes a provocative stand: Many metrics we cherish lose their value because the granularity of modern data collection enables us to identify and optimize towards hidden signals that used to be noise and now come to the forefront. One such metric is the clickthrough rate in advertising, but the mechanism is ubiquitous and we should pay close attention to it.
Maslow's Hierarchy of Needs for Databases
Charity Majors - Hound
Are you an accidental DBA? A software engineer, or operations engineer, or startup founder who suddenly found yourself responsible for designing/scaling/not destroying a bunch of data? Yo me too -- we should form a support group or something. In this talk we’ll cover devops/DBA best practices from the earliest seed stages (survival, selecting the right storage layer) all the way up through what you should expect from a mature, self-actualized database tier. Along the way we’ll talk about how to ensure that your databases are a first-class citizen of your engineering and operational processes, and how your observability and reliability requirements are likely to evolve along with your organization as it matures.
Laying down the SMACK on your data pipelines
Patrick McFadin - DataStax
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scaling, resilient and performant data processing powerhouse. We will go through the basics of Akka, Kafka and Mesos and then deep dive into putting them together in an end2end (and back again) distributed transaction. Distributed transactions mean producers waiting for one or more consumers to respond. On the backend, you will see how Apache Cassandra and Spark can be combined to add the incredibly scaling storage and data analysis needed for fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is default.
Data Security for Smart Cities
Eddie Garcia - Cloudera
Governments and the private sector are now quickly understanding the unlimited potential of data as the driving force behind smart cities. Data can optimize traffic patterns, reduce power consumption, and improve overall quality of life in many other ways. But when this data becomes part of our critical infrastructure, this data needs to be secure and protected, because a data breach could trigger a massive outage in a smart city.
Data security must be considered from the start in any project that can put lives at risk. And, data security must be a cornerstone in the platforms on which smart cities will be built.
The major data breaches of recent years only emphasize the importance of data security. This talk will explore how and why traditional security tools and technologies have failed us, and why we need new solutions to secure and protect today's data for the cities of tomorrow.
Data Science for the Masses: Mission Impossible?
Michael Berthold - KNIME
The vision of “Data Science for the Masses” is simple: allow non-data scientists to make use of the power of data science without really understanding the machinery under the hood. While this is possible for narrow disciplines in which a nice GUI can hide complexity, it is not so simple for serious data science. First of all, it is already hard enough to make use of the wisdom of fellow data scientists who use their own favorite tools. Secondly, and still very much an open problem, is the issue of injecting feedback from casual users. Using the KNIME Analytics Platform I will demonstrate a number of methods to encapsulate, reuse, and deploy analytical procedures at various levels of abstraction, giving data scientists control to expose select parts of the analytics processes to users unfamiliar with the underlying tool or without in-depth knowledge of the analytical algorithms used.
Building a NoSQL database from scratch
Ed Capriolo - Huffington Post
If you have been hanging around the big data space, you have probably heard of things like SSTables, Bloom filters, and the CAP theorem, and you want to know more. I started building a memory mapped SSTable implementation for fun. Next, I started writing an in memory cache. The a-ha moment came when I realized that what I was building was something the NoSQL space was missing: a data store designed by interface/API, which allows developers and users to choose the features that matter most to them.
This presentation is a technical deep dive of Nibiru (https://github.com/edwardcapriolo/nibiru), a NoSQL data store designed for maximum plug-ability and configure-ability. Unlike a typical NoSQL talk where someone might circle two sides of the CAP pyramid and describe why or how a tool implements those features, we are going to show how Nibiru carries out a 'quorum read' by launching futures to multiple networked nodes and merging the results! We are going to show how a compaction process happens by walking through a unit test! Do not worry, we are not going to just be reading code blocks for an hour: the talk includes diagrams and other technical discussion! Fun!
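To make the 'quorum read' idea concrete, here is a minimal sketch of the general pattern: submit one future per replica, wait until a quorum has answered, and merge by keeping the value with the newest timestamp. This is not Nibiru's code. The `Replica` class is a hypothetical in-process stand-in for a networked node, and last-write-wins merging is an assumption chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class Replica:
    """Hypothetical stand-in for a networked node (no real network I/O)."""

    def __init__(self, data):
        self.data = data  # maps key -> (timestamp, value)

    def read(self, key):
        # Returns (timestamp, value), or None if this replica lacks the key.
        return self.data.get(key)


def quorum_read(replicas, key, quorum):
    """Query all replicas in parallel and merge the first `quorum` answers.

    Merging here is last-write-wins: the value with the highest timestamp
    among the collected answers is returned (None if no replica had the key).
    """
    responses = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica.read, key) for replica in replicas]
        for future in as_completed(futures):
            responses.append(future.result())
            if len(responses) >= quorum:
                break  # enough answers; slower replicas are ignored
    found = [r for r in responses if r is not None]
    if not found:
        return None
    return max(found, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    nodes = [
        Replica({"user:1": (2, "new")}),
        Replica({"user:1": (1, "old")}),  # stale replica
        Replica({"user:1": (2, "new")}),
    ]
    print(quorum_read(nodes, "user:1", quorum=2))
```

With three replicas, a write acknowledged by two of them, and a read quorum of two, any two answers must include at least one copy of the latest write, which is why the stale replica cannot win the merge.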
Starting from scratch: Exploring and analyzing text data in Idibon Studio
Nick Gaylord - Idibon
Text analytics projects often start with a big pile of data and some questions. This poses a challenge: how can I tell if this data is well-suited to my questions, and how can I tell if my questions are well-suited to this data? Maximizing the fit between questions and data is essential to getting the most out of your project.
This talk showcases data exploration and analysis in Idibon's text analytics platform, Idibon Studio. Using an example dataset of about 30,000 Reddit comments, I will demonstrate a workflow that highlights several of Studio's native capabilities:
- Unsupervised data exploration via topic modeling
- Creation of a set of connected document classification models
- Training these models in an active learning annotation environment
- Optimizing these models using measures of interannotator agreement
I conclude with a series of data visualizations (created in Tableau) that highlight what we have learned from this data over the course of the exercise, combining document classification results with dynamically-scaled sentiment findings over a 5-month time series.
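For readers unfamiliar with interannotator agreement, the sketch below computes Cohen's kappa for two annotators, one common agreement measure. Whether Idibon Studio uses this particular statistic is not stated in the abstract, so treat the choice of kappa, the function name, and the sample labels as assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented labels for four comments.
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohens_kappa(annotator_1, annotator_2))
```

In this example the annotators agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving a kappa of 0.5. Low kappa on a label is often a sign that the label definition, rather than the model, needs attention.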
From Sentiment to Persuasion Analysis
Jason Kessler - CDK Digital Marketing
The task of product review mining has a major blind spot—reader reaction. While much work has been done on identifying positive and negative evaluations, little has been done on determining which evaluations or what type of language change reader behavior. The research Jason will present shows that the intuition that reviews with positive sentiment cause readers to be more engaged is misguided.
Jason and his team spent a year monitoring a large set of car reviews posted to a major automotive website, and assigned each review a “persuasiveness” score. Scores were based on the web-browsing behavior users exhibited after reading a review. While they found a slight correlation between a review’s persuasiveness and its star rating, they found that they could better predict a review’s persuasiveness using a classifier with unigram and bigram features. In fact, terms highly indicative of positive sentiment would often be predictive of reviews being unpersuasive.
Frequently used contexts of features with extremely high and low weights revealed contrasting examples of engaging and un-engaging ways to describe product facets. For example, when discussing a car’s power, the word “passing” was identified as persuasive (e.g., “I do a lot of two lane passing and it has never left me wishing I had more power”) while the more technical and less experiential “torque” was found to predict low persuasiveness (e.g., “nice torque and acceleration.”)
This talk will also describe the methodology in conducting this study, which can serve as a template or jumping off point for future research in measuring content effectiveness.
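The unigram and bigram features mentioned above can be sketched as follows. This only shows feature extraction plus a linear score; the tokenizer is an assumption, and the weights are invented to echo the "passing" and "torque" examples from the abstract, not taken from the study.

```python
import re
from collections import Counter


def ngram_features(text):
    """Count unigram and bigram features in a review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def linear_score(features, weights):
    """Score a review as the weighted sum of its feature counts."""
    return sum(count * weights.get(name, 0.0) for name, count in features.items())


# Hypothetical weights for illustration only; a real classifier would
# learn these from reviews labeled with persuasiveness scores.
WEIGHTS = {"passing": 1.0, "two lane": 0.5, "torque": -1.0}

if __name__ == "__main__":
    for review in ("I do a lot of two lane passing", "nice torque and acceleration"):
        print(review, linear_score(ngram_features(review), WEIGHTS))
```

Inspecting the highest and lowest learned weights, as the study describes, is what surfaces contrasts like "passing" versus "torque".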
Google Cloud Dataflow - Two Worlds Become A Much Better One
Eric Schmidt - Google
Big Data processing is challenged by four conflicting desires: reducing the “time to answer” (latency), increasing the accuracy of answers (correctness), finding the easiest implementation path (simplicity), and effectively managing operational expense (cost). These desires are then mapped across batch and stream processing models, two separate worlds evolving in divergent directions, producing a tempest of challenges.
Google Cloud Dataflow intelligently merges the worlds of batch and stream processing into a unified and open sourced programming model. This model and the underlying execution environment aim to remove the polarity between correctness and latency AND strike the right balance between simplicity and operational robustness. Cloud Dataflow is also a fully managed, highly scalable, strongly consistent processing service for both batch and stream based processing.
This session will:
- drill into the Cloud Dataflow programming model and compare development patterns found in MapReduce and Apache Spark.
- teach you how to NOT manage a cluster or tune a job. Cloud Dataflow provides automated cluster management, dynamic work rebalancing, and throughput-based auto-scaling.
- review techniques for monitoring and debugging.
- demonstrate how to execute Cloud Dataflow pipelines on alternate runners like Apache Spark and Apache Flink.
Polyglot Persistence vs Multi-Model Databases
Luca Garulli - OrientDB
Many complex applications scale up by using several different databases, i.e. selecting the best DBMS for each use case. This tends to complicate modern architecture with many products by different vendors, no standards, and a lot of ETL, which ultimately causes unpredictable results and a lot of headaches. Do you find yourself missing the good ol' days when there was just one type of DBMS product: relational? Multi-Model DBMSs were created to make your life easier, giving you the option of using one NoSQL product with powerful multi-purpose engines capable of handling complex domains. Could one DBMS handle all your needs, including speed and scalability, in the times of Big Data? Luca will walk you through the benefits and trade-offs of multi-model DBMSs and will show you how easy it is to set up one open source database to handle many different use cases, saving you time and money.
MySQL Indexing Best Practices
Peter Zaitsev - Percona
Proper indexing is a key ingredient of database performance and MySQL is no exception. In this session we will talk about how MySQL uses indexes for query execution, how to come up with an optimal index strategy, how to decide when you need to add an index, and how to discover indexes which are not needed.
Blending Tools and Data in KNIME: From .csv, R, and Python to Spark, MLlib and Hive
Michael Berthold - KNIME
The open source KNIME Analytics Platform allows blending of many data sources and tools within a single workflow. In this talk, I will walk through a real world use case and build one integrative workflow to read data from text files and Hive, and integrate, transform, and aggregate the data locally and on Hadoop. I will then demo how to control a parameter sweep using Spark and how models can be trained locally (with native KNIME nodes and Python/R) and directly on Hadoop using MLlib. The final model is deployed using KNIME workflows, web services, and Spark.
Elevating Your Data Platform
Kurt Brown - Netflix
Are you getting the most out of your data platform? The technologies you choose are important, but even more so is how you put them into practice. Part philosophy and part pragmatic reality, Kurt will dive into the thinking at Netflix on technology selection and trade-offs, challenging everything (constructively), providing building blocks and paved paths, staffing, and more. Kurt will also talk through the Netflix tech stack, which includes many big data technologies (e.g. Hadoop, Spark, and Presto), traditional BI tools (e.g. Teradata, MicroStrategy, and Tableau), and custom tools / services (e.g. Netflix's big data portal and API). Expect to leave with an arsenal of new ideas on the best way to get things done.
Kurt's talk will be followed by an extended Q&A / office hour.
Open Source Lambda Architecture with Kafka, Samza, Hadoop, and Druid
Fangjin Yang - Stealth
The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a real-time analytics stack using Kafka, Samza, and Druid.
Analytics pipelines running purely on Hadoop can suffer from hours of data lag. Initial attempts to solve this problem often lead to inflexible solutions, where the queries must be known ahead of time, or fragile solutions where the integrity of the data cannot be assured. Combining Hadoop with Kafka, Samza, and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries.
In the described system, Kafka provides a fast message bus and is the delivery point for machine-generated event streams. Samza and Hadoop work together to load data into Druid. Samza handles near-real-time data and Hadoop handles historical data and data corrections. Druid provides flexible, highly available, low-latency queries.
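As a rough illustration of the split described above, the following Python sketch merges a batch view (historical data) with a real-time view (recent data) at query time. It uses plain in-memory stand-ins rather than the Kafka, Samza, Hadoop, or Druid APIs; the event shape, the `cutoff` value, and all function names are assumptions made for this example.

```python
from collections import Counter

def batch_view(historical_events):
    """Batch layer stand-in: recompute counts per key from historical events."""
    return Counter(e["key"] for e in historical_events)

def realtime_view(recent_events):
    """Speed layer stand-in: counts for events the batch has not yet covered."""
    return Counter(e["key"] for e in recent_events)

def query(key, batch, realtime):
    """Serving layer stand-in: merge both views when a query arrives."""
    return batch.get(key, 0) + realtime.get(key, 0)

# Hypothetical event stream; "ts" is an illustrative timestamp field.
events = [
    {"key": "click", "ts": 1},
    {"key": "view", "ts": 2},
    {"key": "click", "ts": 3},
    {"key": "click", "ts": 4},
]
cutoff = 3  # assumption: the batch layer has processed events with ts < cutoff

batch = batch_view(e for e in events if e["ts"] < cutoff)
realtime = realtime_view(e for e in events if e["ts"] >= cutoff)
total_clicks = query("click", batch, realtime)
```

When the batch layer later reprocesses the recent events, the cutoff moves forward and the real-time counts for that period can be discarded, which is how data corrections reach the serving layer.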
Building Recommendations at Scale: Lessons Learned
Preetha Appan - Indeed.com
The recommendations engine at Indeed processes billions of input signals daily and drives millions of weekly page views. In this talk, we will delve into how we leveraged probabilistic data structures to build a hybrid (online+offline learning) recommendation pipeline. To address scaling challenges, we incrementally modified the system architecture, model output format, and A/B testing mechanisms. We'll describe these changes and highlight the impact each had on product metrics. We will conclude with lessons learned in system design that apply to any high traffic machine learning application.
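The abstract does not say which probabilistic data structures are used. As one common example of the family, here is a minimal Bloom filter sketch in Python that could track which items a user has already been shown. The class, its parameters, and the `job-…` identifiers are hypothetical and are not taken from Indeed's system.

```python
import hashlib

class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

# Hypothetical usage: remember which job ids a user has already seen.
seen = BloomFilter()
seen.add("job-1")
seen.add("job-2")
already_seen = "job-1" in seen
```

The trade-off is fixed, small memory per user in exchange for a tunable false-positive rate, which is the usual reason to reach for such structures at high traffic volumes.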
Running Agile Data Science Teams
John Akred - Silicon Valley Data Science
What’s the best way to pursue data-driven projects? Drawing from our experience with cross-functional teams of engineering, quantitative, and visualization skills, we will highlight the benefits of collaborative teams of experts working iteratively, across disciplines, and explain how to manage these teams to successfully and efficiently deliver data analytics projects.
Choosing an HDFS data storage format: Avro vs. Parquet and more
Stephen O'Sullivan - Silicon Valley Data Science
Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
Creating a Data-Driven Organization
Carl Anderson - Warby Parker
While many organizations may claim to be data-driven, the reality is that relatively few genuinely are and that we could all do better. Many organizations think that simply because they generate a lot of reports or have many dashboards, they are data-driven. While those activities are part of what a data-driven organization does, fundamentally they are descriptive and backwards-looking. They state what happened but they are not prescriptive. As such, they have limited upside. Having reports and dashboards is nothing without considering what happens to the information they contain.
In this talk, Carl will discuss the hallmarks of great data-driven organizations. He’ll cover the infrastructure, skills, and culture needed to create organizations that take data, treat it as a core asset, and use it to drive and inform critical business decisions and ultimately make an impact. He will also cover some common anti-patterns: behaviors that inhibit a business from making the most of its data.
Spelunking the Web with Python: Writing Scrapers for Any Situation
Ryan Mitchell - LinkeDrive
This talk will cover techniques to overcome almost any web scraping obstacle, as well as several case studies where these techniques have been successfully applied. We’ll work with a repository of dozens of ready-to-use code samples, and cover some of Python’s most useful scraping tools -- BeautifulSoup, Scrapy, Selenium, NLP, Requests, PIL, and Tesseract.
We’ll also discuss modern best practices, legal challenges and myths, and common patterns in web scraping and crawling.
Time series analytics for Big Data and IoT
Fintan Quill - Kx Systems
Trying to solve the data riddle purely through the lens of architecture is missing a vital point: The unifying factor across all data is a dependency on time. The ability to capture and factor in time is the key to unlocking real cost efficiencies.
Whether it’s streaming sensor data, financial market data, chat logs, emails, SMS or the P&L, each piece of data exists and changes in real time, earlier today or further in the past. Unless they are linked together in a way that firms can analyze, there is no way of providing a meaningful overview of the business at any point in time.
This talk will demonstrate how kdb+, a columnar relational time-series database with a tightly integrated query language called q, can perform aggregations and consolidations on billions of streaming, real-time, and historical records for complex analytics.
Detecting Outliers and Anomalies in Real-Time at Datadog
Homin Lee - Datadog
Monitoring even a modestly-sized systems infrastructure quickly becomes untenable without automated alerting. For many metrics it is nontrivial to define ahead of time what constitutes “normal” versus “abnormal” values. This is especially true for metrics whose baseline value fluctuates over time. To make this problem more tractable, Datadog provides: 1) outlier detection functionality to automatically identify any host (or group of hosts) that is behaving abnormally compared to its peers; and 2) anomaly detection to alert when any single metric is behaving differently than its past history would suggest.
In this talk Homin will discuss the algorithms Datadog uses for outlier and anomaly detection. He will discuss the lessons they've learned from using these alerts on their own systems, along with some real-life examples on how to avoid false positives and negatives.
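To make the idea of flagging a host that behaves abnormally compared to its peers concrete, here is a minimal Python sketch based on the median absolute deviation (MAD). It is an illustration under assumed names and thresholds, not a description of the algorithms covered in the talk; the host names, the latency values, and the `threshold` default are hypothetical.

```python
from statistics import median

def mad_outliers(values_by_host, threshold=3.0):
    """Return hosts whose value is more than `threshold` MADs from the median."""
    values = list(values_by_host.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All hosts are (nearly) identical, so nothing can be ranked as an outlier.
        return []
    return [
        host
        for host, value in values_by_host.items()
        if abs(value - med) / mad > threshold
    ]

# Hypothetical per-host latencies in milliseconds.
latency_ms = {"web-1": 50, "web-2": 52, "web-3": 49, "web-4": 51, "web-5": 95}
outliers = mad_outliers(latency_ms)
```

Using the median rather than the mean keeps a single misbehaving host from shifting the baseline it is being compared against.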
Delivering Real Time Analytics for Mobile Ad Serving
Ben Reiter - Vungle
At Vungle, we process over 100 million events per day, and that number continues to grow. Using tools like Kafka, Spark, and MemSQL, we were able to rapidly prototype and test various configurations until we settled on the formula that worked for us. In this talk I will describe how we switched from a monolithic, vertically scalable batch reporting process to a real-time, horizontally scalable system.
Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Doug Daniels - Datadog
Datadog collects hundreds of billions of metric data points per day from hosts, services, and customers all over the world. In addition to charting and monitoring this data in real time, they also run many large-scale offline jobs to apply algorithms and compute aggregations on the data. In the past months, they've migrated their largest data sets over to Apache Parquet, an efficient, portable columnar storage format.
In this talk, Doug will dive into why they selected Parquet, as well as their experience ingesting their data into the format and issues they saw at scale. He’ll also discuss their architecture for consuming Parquet data from Hadoop (Pig / HCatalog / Hive) and Presto, as well as what they see next for Parquet.
Under the Hood of Idibon’s Scalable NLP Services
Michelle Casbon - Idibon
This talk dives under the hood of a modern text analytics system, exploring the infrastructure. It covers the tools that Idibon uses as basic building blocks, focusing on the power of integrating human-generated input into the statistical models generated by standard Machine Learning toolkits such as Spark’s MLlib. We will talk about the types of scalability problems that we encountered and how we addressed them, balancing requirements such as maintaining (human) language independence, training machine-learning models on demand, and serving predictive models with minimal latency at scale in distributed cloud-based systems. We’ll discuss the pros and cons of specific technology choices and what ultimately motivated us when choosing one over another.