Sean G. Carver's Current Research and Data Science Projects
Recent Projects:
- Effectively Automating Data Science: At Data Machines Corp., I worked on a project involving automation of data science tasks through a DARPA-funded competition. Different performers each submitted up to 20 automatically generated data science pipelines for solving various data science problems. I was asked to evaluate whether "diverse" collections of primitives performed better than collections of similar primitives with different hyperparameters. Using hierarchical regression, I found a statistically significant effect, but concluded that the effect size was small enough that it did not warrant changing the instructions given to performers. (A sketch of the hierarchical-regression approach appears after this list.)
- Overlapping software projects: KLI-R (R/GitHub/Git) and KLI (Python/Bitbucket/Mercurial). KLI stands for "Kullback-Leibler Interactive." These projects involve, among other things, computing the number of samples needed to reject an alternative model, using the likelihood ratio test, in favor of the true model that generated the data.
- Conference Proceeding: I presented this work at the Joint Statistical Meetings (JSM) in Baltimore, August 2017.
- Baseball: How many innings must be played by a model of the Baltimore Orioles (fitted to actual Orioles home games) before we can reject the model that the New York Yankees are playing? This statistic provides an interpretable way of quantifying the similarity of two models. (A toy version of this calculation appears after this list.)
- Student Collaborators: Rebeca Berger (graduated 2017), Jake Berberian (Class of '22), and Kingsley Iyawe (Master's student, expected to graduate May 2020).
- Motor Control: With my student collaborators, I have been looking at continuation tapping, an experimental paradigm in which a subject taps along with a metronome; after some time the metronome stops, and the subject must keep the same rhythm. In an effort to perform system identification of the internal clock for motor control, we found that the inter-tap intervals are fit equally well by the Normal and the Inverse Gaussian distributions, and that both fit much better than the Laplace distribution. Contrary to this finding, the only relevant study we found in the literature, a review paper, reported that inter-tap intervals follow a Laplace distribution; that paper gave few details about how its data were collected and analyzed. I am working with Daniel Scanlan to explore, through simulations, when models of continuation tapping produce data that fit the Inverse Gaussian/Normal distributions (the two are nearly identical at our parameters) and when they produce data that fit the Laplace distribution. (A sketch of the fitting comparison appears after this list.)
- Student Collaborators: (former) Daniel Scanlan, (former) Wasim Ashshowaf, and (former) Alexander Spinos
- Ion Channels in Neuroscience: Ion channels provide much of the molecular basis for neural signaling. Models of ion channels are continuous-time Markov chains with hidden states, far more complicated than any of the applications above. Much of the KLI Python code involves these models of ion channels. PDF of a poster presented at the Society for Neuroscience meeting, 2015. (A toy channel simulation appears after this list.)
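Below is a minimal sketch of the hierarchical-regression idea from the pipeline-evaluation project above, not the actual DARPA analysis. The file name and the columns "score", "diverse", and "problem" are hypothetical; a mixed-effects model (one common form of hierarchical regression) is fit with statsmodels, with a fixed effect for diversity and a random intercept per problem.

```python
# A minimal sketch of a hierarchical (mixed-effects) regression like the one
# described in the pipeline-evaluation item.  The data frame, file name, and
# column names are hypothetical: each row is one submitted pipeline, "score"
# is its evaluation metric, "diverse" flags whether its collection of
# primitives was diverse, and "problem" identifies the data science problem.
import pandas as pd
import statsmodels.formula.api as smf

pipelines = pd.read_csv("pipeline_scores.csv")  # hypothetical file

# Fixed effect of diversity, random intercept for each problem, because
# scores from pipelines run on the same problem are not independent.
model = smf.mixedlm("score ~ diverse", pipelines, groups=pipelines["problem"])
result = model.fit()
print(result.summary())   # inspect the coefficient and p-value for "diverse"
```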
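The next sketch is a toy version of the baseball sample-size question. It treats runs per inning as Poisson with made-up rates (not fitted Orioles or Yankees models). Under the true model, the expected log-likelihood ratio per inning equals the Kullback-Leibler divergence, which gives a crude estimate of the innings needed to reject the alternative.

```python
# Toy version of the baseball example: how many innings until the likelihood
# ratio test rejects the Yankees model when the Orioles model is true?
# The Poisson rates below are made up.  The expected log-likelihood ratio
# per inning under the true model is D_KL(Orioles || Yankees), so a crude
# sample-size estimate at level alpha is log(1/alpha) / D_KL.
import numpy as np

lam_orioles = 0.55   # hypothetical runs per inning
lam_yankees = 0.62   # hypothetical runs per inning

# Closed-form KL divergence between two Poisson distributions.
kl = lam_orioles * np.log(lam_orioles / lam_yankees) + lam_yankees - lam_orioles

alpha = 0.05
innings_needed = np.log(1 / alpha) / kl
print(f"D_KL = {kl:.5f} nats/inning, roughly {innings_needed:.0f} innings")
```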
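The following sketch illustrates the fitting comparison from the motor-control item, run on simulated inter-tap intervals because the actual data are not included here. Each candidate distribution is fit by maximum likelihood with scipy and compared by AIC.

```python
# Sketch of the distribution comparison for inter-tap intervals, using
# simulated data (hypothetical ~0.5 s taps with small jitter).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
intervals = rng.normal(loc=0.5, scale=0.02, size=500)   # seconds

candidates = {"normal": stats.norm,
              "inverse gaussian": stats.invgauss,
              "laplace": stats.laplace}

for name, dist in candidates.items():
    params = dist.fit(intervals)                 # maximum-likelihood fit
    loglik = dist.logpdf(intervals, *params).sum()
    aic = 2 * len(params) - 2 * loglik           # lower AIC is better
    print(f"{name:17s} log-likelihood = {loglik:8.2f}  AIC = {aic:8.2f}")
```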
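Finally for this list, a toy version of the kind of model mentioned in the ion-channel item: a two-state (closed/open) continuous-time Markov chain whose state is hidden behind a noisy current measurement. The rates and noise level are made up; the real KLI models have many more states.

```python
# Toy two-state ion channel: exponential dwell times in each state, then a
# noisy "current" sampled on a regular grid so the state itself is hidden.
import numpy as np

rng = np.random.default_rng(0)
k_open, k_close = 50.0, 200.0      # transition rates (1/s), hypothetical
dt, t_end = 1e-4, 0.1              # sampling step and duration (s)

times, states = [0.0], [0]         # 0 = closed, 1 = open
t, s = 0.0, 0
while t < t_end:
    rate = k_open if s == 0 else k_close
    t += rng.exponential(1.0 / rate)   # exponential dwell time in state s
    s = 1 - s
    times.append(t)
    states.append(s)

# Sample the hidden state on a regular grid and add measurement noise.
grid = np.arange(0.0, t_end, dt)
state_on_grid = np.array(states)[np.searchsorted(times, grid, side="right") - 1]
current = state_on_grid * 1.0 + rng.normal(scale=0.3, size=grid.size)  # pA
print(f"open fraction ≈ {state_on_grid.mean():.2f}")
```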
Other past projects:
- Analysis of whole-brain data from a larval zebrafish at cellular resolution: Misha Ahrens, a collaborator from the Janelia Research Campus, provided me with a data set consisting of about 100,000 time series, one from each of 90% of the neurons in a larval zebrafish. Each time series consisted of measurements of intracellular calcium, made with a calcium-sensitive dye. The fish were engaged in a closed-loop sensorimotor behavior, similar to swimming "like Neo in the Matrix": the fish were actually immobilized in a microscope but were given a visual stimulus as if they were swimming, and the harder their nervous system commanded the tail to wag, the faster the stimulus moved in the appropriate direction. Each time series had the same length, about 4,000 samples. I looked at the intrinsic dimension of these 4,000 points embedded in 100,000 dimensions and found it to be about 15. I plan to see whether this result is consistent from fish to fish, and also to look at the persistent cohomology of the data. (A sketch of one intrinsic-dimension estimator appears after this list.)
- Server administration, data collection, data storage, and data analysis: I set up a web server in my home for my students in a recent statistics class I taught. I had also wanted to let my students run background processes collecting data from the web for analysis (either through APIs or through scraping and crawling). It became clear that the best way for students to collect data in this way is to have them rent their own server from Amazon Web Services (AWS). This summer I am going to work with Jennifer Schaffer, a student from the class, to collect, archive, and then analyze a large volume of tweets concerning U.S. politics using AWS. We plan to use MongoDB to store the tweets, but we have not yet determined how they will be analyzed. We also plan to look at Twitter's follower network using a Neo4j database. Finally, I plan to migrate the web server from my own machine to a (separate) AWS host and keep it active for students, past, present, and future, to use. (A sketch of the MongoDB storage step appears after this list.)
- Student Collaborator (just graduated, but still working with me): Jennifer Schaffer
- Twitter and Viral Tweets: I worked with one of my students on posing a project for analysis. Using word embeddings, we attempted to quantify tweets on a love/fear axis, and we planned to see how that measure correlated with retweets. The project never reached a conclusion, but we learned a lot in the process. (A sketch of the scoring idea appears after this list.)
- Cashflow: As a fun project to learn more about SQL and regular expressions, I set up my computer to ingest data from my bank and credit card company. I plan to archive the data in a database and produce regular reports. The code is presently in a private repository. (A minimal sketch of the ingestion step appears below.)
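For the zebrafish item above, here is one common intrinsic-dimension estimator (the Levina-Bickel maximum-likelihood estimator), sketched as an illustration; it is not necessarily the estimator actually used. The array shape and the synthetic example data are assumptions.

```python
# Levina-Bickel MLE of intrinsic dimension.  X is assumed to be a
# (time points x neurons) array, e.g. 4000 x 100000 in the zebrafish case.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension_mle(X, k=20):
    """Average the local MLE dimension estimates over all points."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    dists = dists[:, 1:]                              # drop zero self-distance
    ratios = np.log(dists[:, -1:] / dists[:, :-1])    # log(T_k / T_j), j < k
    local_dim = (k - 1) / ratios.sum(axis=1)
    return local_dim.mean()

# Synthetic check: points on a 15-dimensional linear subspace embedded in a
# higher-dimensional space should give an estimate near 15.
rng = np.random.default_rng(0)
low = rng.normal(size=(4000, 15))
embed = low @ rng.normal(size=(15, 200))              # 200 ambient dimensions
print(f"estimated intrinsic dimension ≈ {intrinsic_dimension_mle(embed):.1f}")
```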
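For the tweet-archiving project, here is a minimal sketch of the MongoDB storage step, assuming a local MongoDB instance and tweets that arrive as JSON dictionaries from whatever collection script is used. The database and collection names are hypothetical.

```python
# Archive tweet documents in MongoDB, skipping duplicates by tweet id.
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017")
tweets = client["politics"]["tweets"]        # hypothetical database/collection

# A unique index on the tweet id keeps re-collected tweets from being
# stored twice.
tweets.create_index("id", unique=True)

def archive(tweet: dict) -> None:
    """Insert one tweet document, silently skipping duplicates."""
    try:
        tweets.insert_one(tweet)
    except DuplicateKeyError:
        pass

# Example usage with a fabricated document:
archive({"id": 1, "text": "example tweet", "retweet_count": 0})
```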
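For the viral-tweets project, this sketch shows one way the love/fear scoring could work. It assumes word vectors are already available as a plain dictionary mapping words to numpy arrays (however they were trained), and it uses only trivial whitespace tokenization.

```python
# Score tweets by projecting their average word vector onto a love-fear axis,
# then rank-correlate the scores with retweet counts.
import numpy as np
from scipy.stats import spearmanr

def tweet_score(text, vectors, axis):
    """Projection of a tweet's mean word vector onto the love-fear axis."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return 0.0
    mean_vec = np.mean([vectors[w] for w in words], axis=0)
    return float(np.dot(mean_vec, axis) / np.linalg.norm(axis))

def love_fear_correlation(tweet_texts, retweet_counts, vectors):
    axis = vectors["love"] - vectors["fear"]        # direction of the axis
    scores = [tweet_score(t, vectors, axis) for t in tweet_texts]
    return spearmanr(scores, retweet_counts)        # rank correlation
```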
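And for the cashflow project, a minimal sketch of the ingestion step with sqlite3 and regular expressions. The CSV column layout and the category patterns are hypothetical, not the actual private code.

```python
# Ingest a bank CSV export into SQLite, categorizing each transaction
# description with regular expressions.
import csv
import re
import sqlite3

CATEGORIES = [                                  # first matching pattern wins
    (re.compile(r"grocer|market", re.I), "groceries"),
    (re.compile(r"airlines?|hotel", re.I), "travel"),
]

def categorize(description: str) -> str:
    for pattern, label in CATEGORIES:
        if pattern.search(description):
            return label
    return "uncategorized"

db = sqlite3.connect("cashflow.db")
db.execute("""CREATE TABLE IF NOT EXISTS transactions
              (date TEXT, description TEXT, amount REAL, category TEXT)""")

with open("bank_export.csv", newline="") as f:          # hypothetical export
    for row in csv.DictReader(f):                       # columns assumed below
        db.execute("INSERT INTO transactions VALUES (?, ?, ?, ?)",
                   (row["Date"], row["Description"],
                    float(row["Amount"]), categorize(row["Description"])))
db.commit()
db.close()
```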