Dootaframe

Dootaframe is a pure-Python dataframe library. It is designed to be simple to write, understand and extend. Performance is not the main focus of this project.

This is the worst thing since sliced bread.

Overview of concepts

Similar to Pandas, we have two main concepts, Series and DataFrame.

Series

A Series is a one-dimensional container of items. While it can hold any type of Python object, it is particularly good at holding numerical data.

If you are familiar with database systems or spreadsheet applications, you might be inclined to store each row of your data inside a Series, similar to this.

>>> john = Series(["John", "Doe", 26])
>>> jane = Series(["Jane", "Doe", 25])
>>> jack = Series(["Jack", "Daniels", 38])

This would technically work, but it is the wrong approach with this library. What we want to do is to flip this around, so each column is stored in its own Series, like this.

>>> fnames = Series(["John", "Jane", "Jack"])
>>> lnames = Series(["Doe", "Doe", "Daniels"])
>>> ages   = Series([26, 25, 38])

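At first glance, this seems a little inconvenient. The benefit is that a question about a whole column, such as the minimum or maximum age, becomes a single operation on one Series.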

>>> ages = Series([26, 25, 38])
>>> ages.min
25
>>> ages.max
38
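The payoff of the column layout can be sketched in plain Python, without dootaframe at all. The snippet below is only an illustration: the `rows` list and the variable names are assumptions made for this example. It transposes row records into columns and compares how each layout answers a per-column question.

```python
# Row-oriented records, as in the first example above.
rows = [
    ("John", "Doe", 26),
    ("Jane", "Doe", 25),
    ("Jack", "Daniels", 38),
]

# zip(*rows) transposes the rows into one tuple per column.
fnames, lnames, ages = (list(column) for column in zip(*rows))

# With a column layout, a question about ages touches only the ages column.
youngest = min(ages)
oldest = max(ages)

# With a row layout, the age has to be picked out of every record first.
oldest_from_rows = max(row[2] for row in rows)
```

Both layouts give the same answer, but the column layout keeps each kind of value together, which is what the Series utilities operate on.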

DataFrame

A DataFrame is a collection of Series objects.

Series Reference

We can now take a more detailed look at the Series class. As we discussed above, a Series is a one-dimensional collection of elements. When you provide this collection to the Series class, it adds a bunch of convenient utilities to it.

Series protocol

As we want to be flexible, there are no checks in dootaframe for particular collections. Instead, we rely on something we call “the Series protocol”. That’s a really fancy name for a trivial concept.

All we require from an input to Series is that it implements the __len__ and __getitem__ methods. This is a very lax requirement that covers nearly any collection you might want to use, including tuples, lists and NumPy arrays.

>>> Series((1, 2, 3))
Series(1, 2, 3)
>>> Series([1, 2, 3])
Series(1, 2, 3)
>>> import numpy as np
>>> Series(np.array([1, 2, 3]))
Series(1, 2, 3)
>>> d = {"name": "Leo", "age": 24}
>>> Series(d)
Series('name', 'age')
>>> Series(list(d)).apply(lambda x: d[x])
Series('Leo', 24)
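The requirement can also be written down as a structural type. The sketch below is an assumption-laden illustration, not dootaframe code: `SeriesLike` and `Squares` are hypothetical names, and the text above says dootaframe itself performs no such check. It uses `typing.Protocol` to test which objects offer both methods.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SeriesLike(Protocol):
    """Hypothetical name for 'anything with __len__ and __getitem__'."""

    def __len__(self) -> int: ...

    def __getitem__(self, index: int) -> Any: ...


class Squares:
    """A custom collection that computes its items on request."""

    def __init__(self, count: int) -> None:
        self.count = count

    def __len__(self) -> int:
        return self.count

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < self.count:
            raise IndexError(index)
        return index * index


# isinstance() on a runtime-checkable Protocol only checks that the
# methods exist; it does not check signatures or behaviour.
candidates = [(1, 2, 3), [1, 2, 3], {"a": 1}, Squares(4), {1, 2, 3}, iter([1, 2])]
results = [isinstance(candidate, SeriesLike) for candidate in candidates]
```

In this sketch the tuple, list, dict and `Squares` instance pass, while the set (no `__getitem__`) and the iterator (neither method) do not.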

Laziness

There is no rule against more dynamic structures either. Series does not load the entire dataset into memory, so you can even use some on-the-fly collections.

>>> class PowersOfTwo:
...   def __len__(self):
...     return 9999999999
...
...   def __getitem__(self, index):
...     return 2 ** index
...
>>> s1 = Series(PowersOfTwo())
>>> s2 = s1.apply(lambda x: x + 123)
>>> s2[50]
1125899906842747

Now, we definitely did not calculate 9999999999 powers of two and then add 123 to them. In dootaframe, a lot of the operations on Series are lazy. That means until we need the result, or use an operation that requires looking at the whole collection, computation will only happen on the rows that are requested.
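One way such laziness could work is a thin wrapper that stores the function and only calls it when an item is read. The following is a minimal sketch under that assumption; `LazyApply` and its `calls` counter are hypothetical and are not taken from the dootaframe source.

```python
class LazyApply:
    """Hypothetical lazy view: applies func to an item only when it is read."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.calls = 0  # counts how many items were actually computed

    def __len__(self):
        # The length comes from the source; no item is computed for it.
        return len(self.source)

    def __getitem__(self, index):
        self.calls += 1
        return self.func(self.source[index])


class PowersOfTwo:
    def __len__(self):
        return 9999999999

    def __getitem__(self, index):
        return 2 ** index


shifted = LazyApply(PowersOfTwo(), lambda x: x + 123)
value = shifted[50]  # computes exactly one power of two
```

After the single lookup, `shifted.calls` is 1: building the view costs nothing, and work is done per requested item. The wrapper itself satisfies the Series protocol, so views like this can be stacked.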

Optimized storage

As we discussed above, the Series API is very flexible and accepts pretty much anything you give to it. In most use cases, however, a Series holds a large number of items of the same data type.

>>> s = Series([1, 2, 3, 4, 5])
>>> s.underlying_storage
[1, 2, 3, 4, 5]

In the normal case, we can see that the input we gave is being stored as a Python list. This is okay for doing small explorations, but it’s not very efficient for storing a lot of data.

In cases like this, we can ask dootaframe to optimize the backing storage of a Series.

>>> s = Series([1, 2, 3, 4, 5]).optimize_storage
>>> s.underlying_storage
b'\x01\x02\x03\x04\x05'

Instead of a dynamic Python list that has one item for each number, dootaframe was able to convert the backing storage into an immutable bytes object. This is okay because bytes satisfies both the __len__ and the __getitem__ parts of the Series protocol.

That’s not the only storage optimization either. We have the same thing for collections of byte arrays, collections of unsigned 16-bit integers and more.
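The idea behind this kind of optimization can be sketched with the standard library alone. The snippet below is not how dootaframe decides what to pack; the range check and the use of `array("H")` for unsigned 16-bit values are assumptions chosen for illustration.

```python
import sys
from array import array

values = [1, 2, 3, 4, 5]

# Every value fits in one unsigned byte, so the list can become bytes.
if all(isinstance(v, int) and 0 <= v <= 255 for v in values):
    packed = bytes(values)
else:
    packed = values  # leave the storage alone if it does not qualify

# Rough memory comparison: the list plus its int objects vs. one bytes object.
list_size = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
packed_size = sys.getsizeof(packed)

# array("H") holds unsigned shorts (at least 16 bits each) in one buffer.
wide = array("H", [1000, 2000, 65535])
```

Both `packed` and `wide` still support `len()` and integer indexing, so either could stand behind a Series without the caller noticing the change.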

Optimization of byte arrays

>>> s = Series([b"He", b"ll", b"o wor", b"l", b"d"])
>>> s.underlying_storage
[b'He', b'll', b'o wor', b'l', b'd']
>>> s = s.optimize_storage
>>> s.underlying_storage
<BytesSeriesStorage with 5 items and 11-byte buffer>
>>> s.underlying_storage.buf
b'Hello world'
>>> s.underlying_storage.begins
Series(0, 2, 4, 9, 10)
>>> s.underlying_storage.lens
Series(2, 2, 5, 1, 1)
>>> s
Series(b'He', b'll', b'o wor', b'l', b'd')

Instead of a Python list that has 5 buffers in it, dootaframe compacts everything into a single buffer and stores the spans of the individual chunks. The end result is the same though.
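The layout shown in the session above can be reproduced in a few lines. This is a sketch under assumptions: `PackedBytes` is a hypothetical stand-in, it only borrows the attribute names `buf`, `begins` and `lens` from the output above, and it uses plain lists where dootaframe shows Series.

```python
from itertools import accumulate


class PackedBytes:
    """Hypothetical stand-in for the compact storage shown above."""

    def __init__(self, chunks):
        chunks = list(chunks)
        self.buf = b"".join(chunks)
        self.lens = [len(chunk) for chunk in chunks]
        # Each chunk begins where the previous chunks end.
        self.begins = [0, *accumulate(self.lens)][:-1]

    def __len__(self):
        return len(self.lens)

    def __getitem__(self, index):
        begin = self.begins[index]
        return self.buf[begin:begin + self.lens[index]]


chunks = [b"He", b"ll", b"o wor", b"l", b"d"]
storage = PackedBytes(chunks)
restored = [storage[i] for i in range(len(storage))]
```

Here `restored` equals the original `chunks`, while the data itself lives in the single buffer `storage.buf`.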

API Documentation

Below is some auto-generated documentation.

Simple dataframe library in Pure Python.

class dootaframe.Series(s)

Bases: object

One-dimensional container of values.

This class includes whatevers.

apply(func: Callable) → dootaframe.Series

Apply a function to each member of the series.

Parameters
func : Callable

The function whose output will be used as the new value.

Returns
Series

The new series.

Examples

>>> s = Series([1, 2, 3, 4])
>>> s.apply(lambda x: x + 1)
Series(2, 3, 4, 5)

property as_list: List

Turn the Series into a Python list.

Examples

>>> s = Series([1, 2, 3, 4])
>>> s.as_list
[1, 2, 3, 4]

property as_numpy
property asc: dootaframe.Series

Sort the items in this series from lowest-to-highest.

concat(other: dootaframe.Series) → dootaframe.Series

Append another series to the end of this one.

Parameters
other : Series

The other series to append to this one.

Returns
Series

A new series where other is appended to this one.

property desc
property enumerate
property floats
property ints

Convert every item of the Series into an integer.

This uses the built-in int() function to convert.

property length: int

The number of items in this series.

property max

The maximum value contained within this Series.

property mean
property median
property min

The minimum value contained within this Series.

property optimize_storage: dootaframe.Series

Create a storage-optimized version of this series.

A Series needs to accommodate storing arbitrary data, including data of different types. While this is very flexible, it means that all the book-keeping data can take a significant amount of memory.

This property tries to optimize certain kinds of storage patterns so that they use a lot less memory.

order_by(fn)
order_by_desc(func: Callable)
property solidify: dootaframe.Series
property sorted
property strings

Turn every item of the Series into a string.

Returns
Series

The same series, but every item is converted into a string.

property sum
property underlying_storage
property uniq
property uniq_count