Analyzing Caltrain Delays (svds.com)
64 points by jsweojtj on March 10, 2016 | 31 comments


Data scientists blog about Caltrain data, come up with a convoluted hypothesis about bias in sensors at two stations.

Commenter on blog notices that Caltrain is occasionally single-tracking between those stations due to a bridge replacement. [1]

"Data science" ends up with a bloody neck from Occam's razor.

[1] http://www.caltrain.com/projectsplans/Projects/Caltrain_Capi...


There are a couple of reasons that this isn't the explanation.

To pick one, the random data selection in the blog post showed data from October 2015 -- Feb 2016, and this Caltrain link appears to show the bridge work starting Feb 26th, 2016.

So, no, the just-so story doesn't appear to be just-so.


The data is very much consistent with the bridge work hypothesis. The website indicates a series of bridges are being replaced, starting 9/28/15 with Tilton Ave. and proceeding northward to Monte Diablo Ave, then Santa Inez Ave.


The data was sound. They formed a hypothesis. They got a better explanation. Now they have a better hypothesis for the data anomaly. I don't think it's fair to fault them or to discredit the effort.


I'm criticizing the approach of starting the analysis in a vacuum and coming up with a hypothesis that fits the abstract pattern, when simple domain knowledge (from looking at a website, or a Caltrain station notice board) would have put them on the "right track" from the start.

What the authors did falls into the trap of a very stereotypical criticism of data science and doesn't do data science any favors.


Single-tracking should have a symmetric effect, i.e. the distribution of delay for both northbound and southbound trains through the track segment containing the bridge repair should both be positive. Instead, the southbound (and to a lesser extent the northbound) trains traveling through the segment actually have "negative delay"--completely inconsistent with the alternative hypothesis. You don't need data science to show that single-tracking doesn't explain the observed data--just common sense.
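The symmetry argument can be checked mechanically. A minimal sketch in Python, assuming per-train records that carry a direction and the delay (in minutes) on entering and on leaving the segment. The field names and sample numbers are invented for illustration and are not taken from the blog post's data:

```python
from statistics import median

def segment_delay_change(records, direction):
    """Median change in delay (minutes) across the segment for one direction."""
    changes = [
        r["delay_out"] - r["delay_in"]
        for r in records
        if r["direction"] == direction
    ]
    return median(changes)

def consistent_with_single_tracking(records):
    """Single-tracking should add delay in BOTH directions, so both medians must be positive."""
    return all(segment_delay_change(records, d) > 0 for d in ("NB", "SB"))

# Hypothetical records, made up to mimic the "negative delay" pattern described above.
sample_records = [
    {"direction": "NB", "delay_in": 1.0, "delay_out": 1.5},
    {"direction": "NB", "delay_in": 2.0, "delay_out": 1.0},
    {"direction": "NB", "delay_in": 0.0, "delay_out": -0.5},
    {"direction": "SB", "delay_in": 1.0, "delay_out": -2.0},
    {"direction": "SB", "delay_in": 0.5, "delay_out": -2.0},
    {"direction": "SB", "delay_in": 2.0, "delay_out": -2.0},
]

print(consistent_with_single_tracking(sample_records))
```

With real data you would want more than a median (the whole distribution, split by time of day), but a sign check like this is enough to test the symmetric-effect claim.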


'"Data science" ends up with a bloody neck from Occam's razor.'

Heh, I like the imagery. I wouldn't really call it Data Science with capital letters either. It's just plain-old exploratory analysis.

I wouldn't say the hypothesis is convoluted, especially since we see bizarre data coming out of Caltrain API at times, and experience has taught us to distrust our instruments when observing strong systematic bias. But single-tracking due to construction IS obviously a better explanation... because it's probably true.

What is still confusing is that the train is delayed several minutes, but then recovers at the very next station. That is odd behavior for a train to 'make up' that much time so quickly. Typical behavior is just to keep getting later.

Or maybe Occam will strike again in the comments with an obviously correct explanation!


Train companies often put "buffer time" in schedules to make up for predictable delays like this one. That could show up as a train being late at one station then suddenly on time at the next. Could that be it?
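A toy model of how padding could produce that pattern, in Python. It assumes a train can recover at most the padding built into each segment and never runs ahead of schedule. The padding values are hypothetical, not taken from any Caltrain timetable:

```python
def propagate_delay(initial_delay, paddings):
    """Return the delay (minutes) at each successive station.

    initial_delay: delay at the first station.
    paddings: buffer time (minutes) built into each following segment.
    """
    delays = [initial_delay]
    for pad in paddings:
        # The train absorbs up to `pad` minutes, but cannot arrive early.
        delays.append(max(0.0, delays[-1] - pad))
    return delays

# Hypothetical: 4 minutes late, then a segment with 4 minutes of padding.
print(propagate_delay(4.0, [4.0, 1.0]))
```

With no padding in a segment the delay simply carries over, which matches the "typical behavior" of a train that just keeps being late.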


Seriously, need to break that vicious cycle: you get used to things going bad, people who run them get used to not delivering, breaking due process, posting unrealistic schedules, and BAM - now you think the problem can't be solved, only plotted.

I don't remember the last time I did not see a train at a station at its scheduled time, and the place where I live doesn't inspire confidence. When something is late, it gets in the news. Not every week.


To be (grudgingly) fair to Caltrain, a lot of things are perhaps indirectly out of their control: stuff like suicides (until you get gated platforms to stop people from jumping in front of Baby Bullets) and vehicles stopped on tracks are hard to predict. Likewise, they are a victim of their own success: standing-room-only trains make boarding and unboarding longer than they have to be.

Of course, Caltrain has put off electrification and grade separation for a long time, and even now it's massively expensive and ruinous (see the CBOSS fiasco), so it's not like they are totally off the hook.


The major cause of the problem is the balkanization of transit planning in the Bay Area. Most metropolitan areas this size have some kind of transit authority that plans systematically for the entire region. We are stuck with multiple incompatible systems, and small NIMBY cities like Menlo Park get to block development that would benefit everyone.


Vehicle and person hits cause exceptional delays, but not significantly more so than the string of breakdowns in February. They apologized for this, and seem to have fewer rush hour problems, so maybe this has improved.

They could do more to improve the standing room problem, which adds up quickly. My evening train is now perpetually 3-5 minutes late, which adds 10% to the runtime, due to the train in front being overcrowded. They are putting out new timetables for next month that they claim will improve this. Their fix is to move its departure time closer to the slow train in front, and add 3 minutes to the trip duration. Frustrating.


I can't help but feel that if there was a BART line where the Caltrain currently runs, things would be about a hundred times better than they currently are, on all fronts. Alas...


What I perceive as the good bits of BART could be replicated on Caltrain with much less effort than replacing Caltrain with BART: the off-peak timetable. Caltrain does better for longer trips in my (limited to South Bay) experience, which makes sense since Caltrain average trip length is 1.5x to 2x longer than BART's. But overall, getting improvements seems much less technical and more political, and neither system seems to have much luck.


BART should've taken over SP's Peninsula Commute in the 70s, when SP really wanted to get rid of it. Even better would be if BART ran standard gauge and overhead wire.


BART tried to, but IIRC San Mateo county unwisely pulled out of the project back in the early days. Part of the reason that it took ages to get BART to Millbrae. That Daly City - Colma - SFO extension took way longer than it should have.


Electrified Caltrain is infinitely preferable to BART.


Only if you get rid of the behemoth slowpoke trains, elevate the rails, add more rail lines, and add a corridor to the East Bay. The Millbrae Caltrain/BART connection is a joke.


Why electrified vs normal Caltrain? Will that decrease the variance or door-to-door time of using Caltrain?


There is a lot of info here [1], summarized here [2]. Probably of the most interest to most people is "Improved Train Performance, Increased Ridership Capacity and Increased Service: Electrified trains can accelerate and decelerate more quickly than diesel-powered trains, allowing Caltrain to run more efficiently. In addition, because of their performance advantages, electrified trains will enable more frequent and/or faster train service to more riders".

[1] http://www.caltrain.com/projectsplans/CaltrainModernization/...

[2] http://www.caltrain.com/Assets/Caltrain+Modernization+Progra...


Faster acceleration. Each stop will cost less time, and Caltrain has a surprising number of stops between SF-SJ (22 weekday) for its length (47 miles). You can see this effect already in the number of express schedules that Caltrain has and their varying trip times.
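A back-of-the-envelope sketch of the per-stop cost, in Python. It assumes constant acceleration and braking rates, and that the train reaches cruise speed between stops. Under those assumptions, the time lost by stopping instead of passing at cruise speed v is v/(2a) + v/(2d), excluding dwell time. The cruise speed and the acceleration and braking rates below are placeholder values for illustration, not published figures for any Caltrain equipment:

```python
MPH_TO_MPS = 0.44704

def time_lost_per_stop(cruise_mps, accel, decel):
    """Seconds lost by braking to a stop and accelerating back to cruise speed,
    compared with passing through at cruise speed. Dwell time is excluded."""
    return cruise_mps / (2 * accel) + cruise_mps / (2 * decel)

def trip_saving(cruise_mps, slow_accel, fast_accel, decel, stops):
    """Total seconds saved over `stops` stops by the faster-accelerating train."""
    per_stop = (time_lost_per_stop(cruise_mps, slow_accel, decel)
                - time_lost_per_stop(cruise_mps, fast_accel, decel))
    return per_stop * stops

cruise = 79 * MPH_TO_MPS   # assumed cruise speed, m/s
diesel_accel = 0.3         # assumed, m/s^2
electric_accel = 0.9       # assumed, m/s^2
braking = 0.8              # assumed equal for both, m/s^2

saving = trip_saving(cruise, diesel_accel, electric_accel, braking, stops=22)
print(f"Estimated saving over 22 stops: {saving / 60:.1f} minutes")
```

The saving scales linearly with the number of stops, so under this model a local train gains far more than an express that skips most stations.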


Yeah I couldn't find any estimates about how much time though. It seems like the express train would not be affected much at all.


Not if it still runs at-grade and keeps killing someone every few weeks.


A lot of Caltrain suicides are the result of people stepping off platforms when Baby Bullets pass through or trespassing on the tracks. Grade sep won't help that.


It's not as if Caltrain is required to run at grade. It's just a historical artifact of how the system evolved. If we had the money and political will to replace Caltrain with BART, we'd have the money and will to create a grade-separated Caltrain.

What would be the advantage of BART over electrified, grade-separated Caltrain? I'm hard-pressed to think of one, and there's definitely one major disadvantage: the nonstandard rail gauge.


It'd remove an onboarding/offboarding/payment stop at the Millbrae station.


You don't need to replace Caltrain to achieve that. An easier solution would be to convert Caltrain to a gated fare system (like BART is now), and then put a combined BART and Caltrain station behind one set of turnstiles at Millbrae.


While we're at it, why the hell isn't there all-day, every-day direct service from Millbrae to SFO? Having a massive multi-transit-agency station right by the airport, only to tell people "nope, can't get there from here, you have to go to San Bruno and change trains" is just dumb.


Every day direct BART service between SFO and Millbrae was the original intent, and they provided that level of service when the station opened. But ridership fell well below expectations, and BART eventually limited that service to nights and weekends only.


If only Caltrain had a decent API for developers that produced the data needed for any serious analysis (like train number, speed and lat/long). I hate having to resort to hacky workarounds like scraping for this info. I believe the NextBus API for MUNI has actual positional data and vehicle info. Caltrain, being the mediocre agency it is, has a next to useless API (if I remember the docs correctly on 511.org).


An API would go a long way and realtime information would obviously be awesome. It appears that even if Caltrain provided historical data of actual departure times per train per station, that would be a huge improvement.



