Datetime trouble - Bokeh

Hello everyone,

I am messing about with some beginner Python/Bokeh.

I have built a nice looking graphs with this code, using other CSV files where the Dates must have been in the right format. The trouble I face is that the code returns a time series that isnt sorted after time (I think it is because the format is e.g. 12/1 which can be both the 12th Jan and the 1st Dec which causes the trouble).

My code looks as follows, and I am not sure to fix this (even after extensive google, and trying to use df[‘Date’] = pd.to_datetime(df.Date)).

import pandas as pd
import matplotlib.pyplot as plt
import csv

# Bokeh libraries
from bokeh.io import output_file, output_notebook
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.layouts import row, column, gridplot
from bokeh.models.widgets import Tabs, Panel
from bokeh.models import HoverTool

#Import data-->
df = pd.read_csv("baltic_data.csv", parse_dates=['Date'])
data_source = ColumnDataSource(df)

df['Date'] = pd.to_datetime(df.Date) 

# define axes
x=df['Date']
y=df['BHSI']

y=df['HS1']
y=df['HS2']
y=df['HS3']
y=df['HS4']

y=df['HS5']
y=df['HS6']


# output to static HTML file
output_file("handy.html")


#Take data  and present in a graph
fig = figure(title="HANDY", 
            x_axis_label='Date', 
            y_axis_label='USD/DAY', 
            x_axis_type='datetime',
            plot_height=600, plot_width=1200)


# define the layout of the line
fig.line(x='Date', y='BHSI', source=data_source, line_width=1, legend='Average INDEX', color="black")

fig.line(x='Date', y='HS1', source=data_source, line_width=1, legend='Route 1', color="navy")
fig.line(x='Date', y='HS2', source=data_source, line_width=1, legend='Route 2', color="green")
fig.line(x='Date', y='HS3', source=data_source, line_width=1, legend='Route 3', color="orange")
fig.line(x='Date', y='HS4', source=data_source, line_width=1, legend='Route 4', color="purple")

fig.line(x='Date', y='HS5', source=data_source, line_width=1, legend='Route 5', color="pink")
fig.line(x='Date', y='HS6', source=data_source, line_width=1, legend='Route 6', color="red")

fig.legend.location = "top_left"


#show
show(fig)


A picture of how the result looks.

At least it’s pretty xD

got that csv file somewhere so people can try things out with it?
maybe take a look at the dataframe and see if they got parsed into something sensible?
maybe make your own date parser and map that over that column instead of letting pandas do it. There’s also a to_datetime method you could call on that column and specify a format to use when reading them (without being familiar with pandas my guess would be that you’re trying to parse as date twice but if it doesn’t work the first time there’s no reason to expect it to work the second - instead remove both instances of the parsing and then call to_datetime with a format specifier so that there’s no guessing going on).

Hello,

Yesterday I couldn’t attach files due to my junior status on these forums.

I wouldn’t mind sharing, how do I do that smartest, upload the file here and paste a download link?

any pastebin, or try uploading again (I bumped up your … trust thing)

Download the CSV file

.

Kay. So. Let me feel real smart here.
Ready?
Sure?
Want sorted values?
Sort them. \o/

2019-08-06-085422_372x177_scrot
looked better before. just saying. there was a pattern to it, maybe something like shortest distance? Sorted high to low or something like that… idk. Something though.

1 Like

Thank you.

The data is historical market data for price of renting ships.

X axis will be the date - Y will be the daily price.

This should show a time series of the market going up and down.

It should look like this

That’s the green one in my screenshot, not sure you saw it since I edited it in and you’re looking at emails.

Thank you for your response.

Let me take a look when I get home from work and let you know if I have any questions.

One thing I don’t quite follow, how do sort the values in python? The CSV comes pre sorted, but would love to add it digitally to my code.

Thank you for all the help so far.

That’s what I meant about patterns. Maybe it’s sorted by something else?

You’d tell the dataframe to sort by the Date column so that bokeh reads the data in order and plots it in order

And also make sure to sort before sending it to bokeh, the code is slightly out of order. And probably remove the extra date-parse since that’s being done twice (when reading and then again after).

Aha now I follow you, read you loud and clear. Thank you.

Let me try to play with it tonight and see if it works.

Also, I fear that blank dates (due to no market report on weekends) will return 0 and therefore making the curve ziggy zaggy - do you know how/if this can be fixed?

The spikes don’t go all the way down to zero do they? Those aren’t NaN’s are they? If they were you could filter them out.

Just grabbing a few rows, there’s 20k right next to 30k and 25k, the data is spiky.

 hs4     date
24972.0 2010-02-26
24972.0 2010-03-02
26106.0 2010-03-03
29289.0 2010-03-06
20356.0 2010-03-08
19906.0 2010-03-09
17111.0 2010-03-11
17006.0 2010-03-12
29711.0 2010-03-15
30372.0 2010-03-16
30817.0 2010-03-17
31344.0 2010-03-18
32133.0 2010-03-19
32611.0 2010-03-22

So is there something wrong with the spiky data points? And if so, what. A problem with the source input? Do they actually mean something else?

I guess this is what data scientists entertain themselves with.
I don’t have a clue so I couldn’t tell you what’s okay to do with them.

26/02/2010 24972
1/3/2010 25333
2/3/2010 25717
3/3/2010 26106
4/3/2010 26622
5/3/2010 27178
8/3/2010 27461
9/3/2010 27833
10/3/2010 28083
11/3/2010 28717
12/3/2010 29444
15/03/2010 29711
16/03/2010 30372
17/03/2010 30817
18/03/2010 31344
19/03/2010 32133
22/03/2010 32611

These are the same dates taken from the CSV file opened in Excel. Here the data shows a normal trend.

I think there is something wrong with the formatting of the dates. I tried adding the following:

#Import data-->
df = pd.read_csv("baltic_data.csv", parse_dates=['Date'])
data_source = ColumnDataSource(df)

# Sort by date
df.sort_values(by=['Date'])

However, it didnt help.

If it’s the date formatting then you could tell pandas all about that.
Or maybe pandas and excel have different ideas about what csv looks like (not a great format, it’s exactly what its name suggets: comma separated values. separated how? separated. yay.)
perhaps if excel is able to read it as intended, it can export it in some sane format. or maybe even write it back out as a better(?) csv.