I deliberately didn't start with a title that included the word "documentation", but stay with me for a second.
If you haven't heard, there is a project in the world of Spark called Project Zen. This isn't breaking news; it's been going on since at least May of 2020, but it's surprising how many people still haven't heard of it.
According to Databricks in June of 2020, PySpark accounted for 68% of notebook commands, and PySpark sees over 5 million monthly downloads on PyPI. That's a lot of people using Python with Spark, so it's not surprising that there's now a push to make PySpark more usable, from type hinting support to making errors more understandable.
At the same time as these changes are being implemented, Databricks is enhancing its user interface to not only take advantage of these new features, but also to bring in its own support for Python users.
So, why did I mention documentation? Well, for two reasons. First, the PySpark documentation is being rewritten to make it easier to navigate and generally more user friendly. Second, to help drive this, the PySpark docstrings are being rewritten using numpydoc, which makes the generated API docs easier to read.
Databricks is helping to make this available by allowing data engineers, analysts, and data scientists to view these docstrings while working in notebooks. That means it's now easier to see the documentation directly in the notebook, rather than having to call "help" or go searching the Internet for the information. And everyone can make life easier for everyone else by documenting their code (see, I got to the reason in the end).
For example, the following is a simple Pandas UDF that takes a column containing strings and returns the initials of each string.
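Here's a minimal sketch of what such a UDF might look like with a numpydoc-style docstring, using the Python type hint style introduced in Spark 3.0; the function name initials, the exact output format, and the sample data are illustrative, and the usage at the end assumes an active SparkSession named spark:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def initials(names: pd.Series) -> pd.Series:
    """
    Return the initials of each name in a string column.

    Parameters
    ----------
    names : pd.Series
        Full names, e.g. "Ada Lovelace".

    Returns
    -------
    pd.Series
        The initials of each name, e.g. "A.L.".

    Examples
    --------
    >>> df.select(initials("full_name")).show()
    """
    def to_initials(name):
        # Guard against nulls, then join the first letter of each word.
        if not name:
            return None
        return ".".join(word[0].upper() for word in name.split()) + "."

    return names.map(to_initials)


# Illustrative usage, assuming an active SparkSession named `spark`:
df = spark.createDataFrame([("Ada Lovelace",), ("Grace Hopper",)], ["full_name"])
df.select(initials("full_name").alias("initials")).show()
```

Because the docstring follows numpydoc's sections, help(initials) prints it cleanly in any Python session, and it's the same text the notebook shortcut below surfaces.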
Using numpydoc makes the docstring readable when we're looking at the code directly, but it also looks nice when we use the new SHIFT+TAB keyboard shortcut in Databricks Runtime 7.4 and above.
People often forget about documentation, or just write something short and simple, but good documentation not only helps when you revisit the code later, but also helps anyone else who might be using your code.
This is just a small part of what Project Zen is bringing to the Python community for Spark, and what Databricks is doing on top of that. So keep an eye out for new and needed features as they arrive in new releases.