Due to the distributed nature of Apache Spark, you can't specify a name for the output file when writing results; Spark generates part-file names like `part-00000-<uuid>.parquet`. That makes the output path hard to predict, which I need for my process orchestration. In my case, I need to write the result to S3, and I finally found a way to do this within a reasonable amount of time by using awswrangler (the AWS SDK for pandas), pandas, and optionally Apache Arrow. I basically convert the Spark DataFrame to a pandas DataFrame and have awswrangler write it to S3 under a specific name.
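Here's a minimal sketch of the idea. The bucket name, prefix, and filename are placeholders, and the sample data is made up; the key point is that when you give `awswrangler.s3.to_parquet` a path ending in a filename (with the default `dataset=False`), it writes a single object with exactly that key.

```python
import awswrangler as wr
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fixed-filename-demo").getOrCreate()

# Optional: enable Apache Arrow to speed up the toPandas() conversion.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Placeholder data standing in for a real Spark result.
spark_df = spark.createDataFrame(
    [(1, "alpha"), (2, "beta")],
    ["id", "label"],
)

# Collect the Spark DataFrame to the driver as a pandas DataFrame...
pandas_df = spark_df.toPandas()

# ...then let awswrangler write it to S3 under an exact key.
# Because the path ends in a filename (dataset=False by default),
# awswrangler writes a single object with exactly this name.
wr.s3.to_parquet(
    df=pandas_df,
    path="s3://my-bucket/output/result.parquet",  # hypothetical bucket/key
)
```

One caveat worth noting: `toPandas()` collects the entire result onto the driver, so this approach only makes sense when the result is small enough to fit in driver memory, which was true in my case.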
Here's a link to my full sample: https://github.com/nik-yo/PySparkFilename