
Timezone in JVM

I wrote a small piece of Scala code to get the current time. However, the output differs between my development server and Docker.

import java.util.Calendar

// Prints the current time formatted in the JVM's default timezone.
println(Calendar.getInstance().getTime)

On my development server it outputs Sun Oct 18 18:01:01 CST 2020, but in Docker it prints a UTC time.

I guessed it was related to the timezone setting and did some research; here is the result.

How Does the JVM Detect the Timezone

All of the detection logic can be found in this JDK function: private static synchronized TimeZone setDefaultZone()

Using cibuildwheel to Create Python Wheels

Have you ever tried to install MySQL-python? It contains C code that has to be compiled while installing the package, so you have to follow the steps in this article: Install MySQL and MySQLClient(Python) in MacOS. Things get worse if you are using Windows.
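To make that concrete: a package "contains C code" when its build script declares a C extension, which pip has to compile on the target machine during a source install. Here is a minimal sketch (the package and file names are made up for illustration, not MySQL-python's actual layout):

# setup.py for a hypothetical package that ships a C extension.
# Installing from source invokes a C compiler on the target machine;
# installing a prebuilt wheel skips this step entirely.
from setuptools import Extension, setup

setup(
    name="fastmod",  # hypothetical package name
    version="0.1.0",
    ext_modules=[
        Extension("fastmod._native", sources=["src/native.c"]),
    ],
)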

Luckily, a new distribution format, the wheel, has been published in PEP 427.

The wheel binary package format frees installers from having to know about the build system, saves time by amortizing compile time over many installations, and removes the need to install a build system in the target environment.

Retrieve Large Dataset in Elasticsearch

It's easy to get a small dataset from Elasticsearch by using size and from. However, it's impossible to retrieve a large dataset in the same way.
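For example, with the Python client a from/size page looks roughly like this; the logs index, the match_all query, and the 7.x-style body parameter are all assumptions for illustration.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Page 3 with 10 hits per page: skip the first 20 hits, return the next 10.
resp = es.search(
    index="logs",
    body={"from": 20, "size": 10, "query": {"match_all": {}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"])

This breaks down for deep pages: by default, Elasticsearch rejects any request where from + size exceeds index.max_result_window, which is 10,000.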

Deep Paging Problem

As we know, Elasticsearch data is organised into indices. An index is a logical namespace; the real data lives in physical shards, each of which is an instance of Lucene. There are two kinds of shards: primary shards and replica shards. A replica shard is a copy of a primary shard, kept in case a node or shard fails. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch ensures redundancy and scalability. By default, Elasticsearch creates 5 primary shards and one replica for each primary shard. This layout is exactly what makes deep paging expensive: to answer a query with from=10000 and size=10, each of the 5 primary shards must produce its top 10,010 hits, and the coordinating node must sort all 50,050 candidates just to discard everything but the final 10.
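One common way to pull a large dataset despite this is the scroll API; here is a rough sketch using the same hypothetical index and client as above (whether the original post settles on scroll or on another mechanism such as search_after is not shown in this excerpt).

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keep a consistent snapshot of the index alive for two minutes per round
# trip and walk through it in fixed-size batches.
resp = es.search(
    index="logs",
    scroll="2m",
    size=1000,
    body={"query": {"match_all": {}}},
)
while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])  # stand-in for real processing
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")

es.clear_scroll(scroll_id=resp["_scroll_id"])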

Program Crash Caused by CPU Instruction

Dealing with bugs is inevitable in a coding career. The main parts of coding are implementing new features, fixing bugs, and improving performance. For me, there are two kinds of bugs that are difficult to tackle: those that are hard to reproduce, and those that occur in code you did not write.

Recently, I met a bug with both of the features mentioned above. I had written a Spark program to analyse logs and cluster them. Last week I updated the code to use Facebook's faiss library to accelerate the search for similar vectors. After I pushed the new code to Spark, the program crashed, and I found a suspicious log on the Spark driver.
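For context, the faiss part of the pipeline looked roughly like the minimal sketch below; the dimension, the random data, and the choice of IndexFlatL2 are illustrative assumptions, not the post's actual code.

import numpy as np
import faiss  # pip install faiss-cpu

d = 64  # vector dimension (illustrative)
xb = np.random.random((100_000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")        # query vectors

index = faiss.IndexFlatL2(d)  # exact L2 nearest-neighbour index
index.add(xb)                 # index the database vectors
D, I = index.search(xq, 4)    # distances and ids of the 4 nearest neighbours
print(I)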

C-m, RET and Return Key in Emacs

I use Emacs to write my blog. After a recent update, I found that M-RET no longer behaved as the leader key in Org mode but instead ran org-meta-return. Even stranger, in other modes it still behaved as the leader key, and M-RET also worked as the leader key in Org mode in the terminal. In the GUI, pressing C-M-m could trigger the leader key.

So I opened this issue, and with the help of some friends it has been fixed. Here is the cause of the bug.

Import custom package or module in PySpark

First, zip all of the dependencies into a zip file laid out like this. Then you can use one of the following methods to import it.

|-- kk.zip
|   |-- kk.py
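If you prefer building that archive from Python instead of the zip command line, here is a small sketch (assuming kk.py sits in the current directory):

import zipfile

# Create kk.zip with kk.py at the root of the archive,
# matching the layout shown above.
with zipfile.ZipFile("kk.zip", "w") as zf:
    zf.write("kk.py", arcname="kk.py")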

Using --py-files in spark-submit

When submitting the Spark job, add the --py-files=kk.zip parameter. kk.zip will be distributed along with the main script file and inserted at the beginning of the PYTHONPATH, so the Python workers can import from it.

Then you can use import kk in your main script file.
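Putting it together, a minimal main script might look like the sketch below; the app name and the helper kk.transform are hypothetical, and only the import kk line comes from the post.

# main.py, submitted with: spark-submit --py-files=kk.zip main.py
from pyspark.sql import SparkSession

import kk  # resolved from kk.zip on the driver and on every executor

spark = SparkSession.builder.appName("kk-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(kk.transform).collect())  # kk.transform: hypothetical helper in kk.py

spark.stop()

A related option at runtime is spark.sparkContext.addPyFile("kk.zip"), which ships the same archive without the --py-files flag.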