<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Spark on KK's Blog (fromkk)</title><link>https://fromkk.com/tags/spark/</link><description>Recent content in Spark on KK's Blog (fromkk)</description><generator>Hugo</generator><language>en</language><managingEditor>bebound@gmail.com (KK)</managingEditor><webMaster>bebound@gmail.com (KK)</webMaster><lastBuildDate>Sun, 10 Aug 2025 18:44:06 +0800</lastBuildDate><atom:link href="https://fromkk.com/tags/spark/index.xml" rel="self" type="application/rss+xml"/><item><title>Dynamic Allocate Executors when Executing Jobs in Spark</title><link>https://fromkk.com/posts/dynamic-allocate-executors-when-executing-jobs-in-spark/</link><pubDate>Sun, 18 Jul 2021 16:52:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/dynamic-allocate-executors-when-executing-jobs-in-spark/</guid><description>&lt;p&gt;I wrote a Spark program to process logs. The log volume varies over time. To ensure logs are processed promptly, the number of executors is sized for the peak number of logs per minute. As a consequence, executor CPU usage is low most of the time. To reduce this waste, I looked for a way to allocate executors dynamically while the program is running.&lt;/p&gt;
&lt;p&gt;As shown below, the maximum number of logs per minute can be a dozen times greater than the minimum within a single day.&lt;/p&gt;</description></item><item><title>Program Crash Caused by CPU Instruction</title><link>https://fromkk.com/posts/program-crash-caused-by-cpu-instruction/</link><pubDate>Sun, 17 May 2020 17:36:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/program-crash-caused-by-cpu-instruction/</guid><description>&lt;p&gt;Dealing with bugs is inevitable in a coding career. The main parts of coding are implementing new features, fixing bugs, and improving performance. For me, two kinds of bugs are difficult to tackle: those that are hard to reproduce, and those that occur in code you did not write.&lt;/p&gt;
&lt;p&gt;Recently, I ran into a bug with both of these traits. I wrote a Spark program to analyse logs and cluster them. Last week I updated the code to use Facebook&amp;rsquo;s &lt;a href="https://github.com/facebookresearch/faiss" target="_blank" rel="noopener noreffer "&gt;faiss&lt;/a&gt; library to speed up the search for similar vectors. After I pushed the new code to Spark, the program crashed. I found this log on the Spark driver:&lt;/p&gt;</description></item><item><title>Import custom package or module in PySpark</title><link>https://fromkk.com/posts/import-custom-package-or-module-in-pyspark/</link><pubDate>Thu, 02 Apr 2020 22:24:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/import-custom-package-or-module-in-pyspark/</guid><description>&lt;p&gt;First, zip all of the dependencies into a zip file like this. Then you can use one of the following methods to import it.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-nil" data-lang="nil"&gt;|-- kk.zip
| |-- kk.py
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="using-py-files-in-spark-submit"&gt;Using --py-files in spark-submit&lt;/h2&gt;
&lt;p&gt;When submitting a Spark job, add the &lt;code&gt;--py-files=kk.zip&lt;/code&gt; parameter. &lt;code&gt;kk.zip&lt;/code&gt; will be distributed with the main script file and inserted at the beginning of the &lt;code&gt;PYTHONPATH&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;p&gt;Then you can use &lt;code&gt;import kk&lt;/code&gt; in your main script file.&lt;/p&gt;</description></item></channel></rss>